Apply vs transform on a group object

























Consider the following dataframe:



     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922


The following commands work:



> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())


but none of the following work:



> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)

> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
TypeError: cannot concatenate a non-NDFrame object


Why? The example in the documentation seems to suggest that calling transform on a group allows row-wise processing:



# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)


In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?



For reference, below is the construction of the original dataframe above:



import pandas as pd
from numpy.random import randn

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': randn(8), 'D': randn(8)})









python pandas














edited Dec 22 '14 at 18:30







Amelio Vazquez-Reina

















asked Dec 17 '14 at 2:27









Amelio Vazquez-Reina












  • The function passed to transform must return a number, a row, or something with the same shape as its argument. If it returns a number, that number is assigned to every element in the group; if it returns a row, the row is broadcast to every row in the group. In your code, the lambda function returns a column, which can't be broadcast to the group.

    – HYRY
    Dec 17 '14 at 4:24











  • Thanks @HYRY, but I am confused. If you look at the example in the documentation that I copied above (i.e. with zscore), transform receives a lambda function that assumes each x is an item within the group, and also returns a value per item in the group. What am I missing?

    – Amelio Vazquez-Reina
    Dec 17 '14 at 14:01











  • For those looking for an extremely detailed solution, see this one below.

    – Ted Petrou
    Nov 25 '17 at 17:37





























3 Answers














I was similarly confused by the .transform operation versus .apply, and I found a few answers that shed some light on the issue. This answer, for example, was very helpful.



My takeaway so far is that .transform works on (or deals with) Series (columns) in isolation from each other. What this means is that in your last two calls:



df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())


You asked .transform to take values from two columns, but it does not actually 'see' both of them at the same time (so to speak). transform looks at the dataframe columns one by one and returns a series (or group of series) that either has the same length as the input column or is 'made' of a scalar repeated len(input_column) times.



That scalar, which .transform uses to build the Series, is the result of some reduction function applied to an input Series (and only to ONE series/column at a time).
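
One way around this, sketched here on a small made-up frame rather than your data, is to combine the columns before grouping, so that each group is a single Series. The column values and the names diff, per_group and per_row are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [0.5, 1.0, 1.5, 5.0],
})

# Combine the two columns first, so every group is one Series.
diff = df['C'] - df['D']
grouped = diff.groupby(df['A'])

per_group = grouped.mean()            # one value per group (reduced)
per_row = grouped.transform('mean')   # group mean repeated for each row
```

per_row has one entry per original row, so it could be assigned straight back as a new column.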



Consider this example (on your dataframe):



zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)


will yield:



       C      D
0  0.989  0.128
1 -0.478  0.489
2  0.889 -0.589
3 -0.671 -1.150
4  0.034 -0.285
5  1.149  0.662
6 -1.404 -0.907
7 -0.509  1.653


This is exactly the same as using it on only one column at a time:



df.groupby('A')['C'].transform(zscore)


yielding:



0 0.989
1 -0.478
2 0.889
3 -0.671
4 0.034
5 1.149
6 -1.404
7 -0.509


Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:



df.groupby('A').apply(zscore)


gives error:



ValueError: operands could not be broadcast together with shapes (6,) (2,)


So where else is .transform useful? The simplest case is assigning the results of a reduction function back to the original dataframe.



df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort_values('A') # to clearly see the scalar ('sum') applies to the whole column of the group; older pandas versions spelled this df.sort('A')


yielding:



     A      B      C       D  sum_C
1  bar    one  1.998   0.593  3.973
3  bar  three  1.287  -0.639  3.973
5  bar    two  0.687  -1.027  3.973
4  foo    two  0.205   1.274  4.373
2  foo    two  0.128   0.924  4.373
6  foo    one  2.113  -0.516  4.373
7  foo  three  0.657  -1.179  4.373
0  foo    one  1.270   0.201  4.373


Trying the same with .apply would give NaNs in sum_C, because .apply would return a reduced Series, which it does not know how to broadcast back:



df.groupby('A')['C'].apply(sum)


giving:



A
bar 3.973
foo 4.373
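
To see where the NaNs come from, here is a minimal sketch on a made-up three-row frame (the column names sum_direct, sum_mapped and sum_transform are only for illustration). Assigning the reduced Series aligns on index labels, and mapping the group key onto it is one possible alternative to transform:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo'], 'C': [1.0, 2.0, 3.0]})

# Reduced result, indexed by the group labels 'bar' and 'foo'.
sums = df.groupby('A')['C'].sum()

# Assignment aligns on index labels; 0, 1, 2 never match 'bar'/'foo'.
df['sum_direct'] = sums

# Looking up each row's group key in the reduced Series broadcasts it back.
df['sum_mapped'] = df['A'].map(sums)

# transform does the broadcasting in one step.
df['sum_transform'] = df.groupby('A')['C'].transform('sum')
```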


There are also cases when .transform is used to filter the data:



df[df.groupby(['B'])['D'].transform(sum) < -1]

     A      B      C       D
3  bar  three  1.287  -0.639
7  foo  three  0.657  -1.179
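
For comparison, the same kind of selection can probably also be written with groupby(...).filter. The sketch below uses made-up values (not the random data above) and checks that both spellings keep the same rows:

```python
import pandas as pd

df = pd.DataFrame({
    'B': ['one', 'one', 'two', 'three', 'three'],
    'D': [0.5, 0.25, -0.5, -1.0, -0.75],
})

# transform broadcasts each group's sum to its rows, giving a boolean mask.
mask = df.groupby('B')['D'].transform('sum') < -1
via_transform = df[mask]

# filter keeps or drops whole groups based on one boolean per group.
via_filter = df.groupby('B').filter(lambda g: g['D'].sum() < -1)
```

The transform version has the advantage that the mask itself is available, for example to combine with other row-level conditions.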


I hope this adds a bit more clarity.


























  • OMG. The difference is so subtle.

    – Dawei
    Jul 10 '18 at 11:43











  • .transform() could also be used for filling missing values, especially if you want to broadcast a group mean or other group statistic to the NaN values in that group. Unfortunately, the pandas documentation was not helpful to me either.

    – cyber-math
    Jan 20 at 4:48
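
A minimal sketch of the missing-value idea from the comment above, on made-up data (the column name C_filled is an assumption for illustration): the group mean is broadcast to every row, and fillna only uses it where the original value is NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'bar', 'bar'],
    'C': [1.0, np.nan, 3.0, 10.0, np.nan],
})

# Mean of the non-missing values of each group, repeated for every row.
group_mean = df.groupby('A')['C'].transform('mean')

# Existing values are kept; only the NaN positions take the group mean.
df['C_filled'] = df['C'].fillna(group_mean)
```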
































Two major differences between apply and transform



There are two major differences between the transform and apply groupby methods.




  • apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group as a Series to the custom function.

  • The custom function passed to apply can return a scalar, a Series, or a DataFrame (or a numpy array or even a list). The custom function passed to transform must return a sequence (a one-dimensional Series, array, or list) the same length as the group, or a single scalar, which is then repeated for every row of the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.



Inspecting the custom function



It can help quite a bit to inspect the input to your custom function passed to apply or transform.



Examples



Let's create some sample data and inspect the groups so that you can see what I am talking about:



import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})
df


Let's create a simple custom function that prints out the type of the implicitly passed object and then raises an error so that execution can be stopped.



def inspect(x):
    print(type(x))
    raise  # a bare raise with no active exception triggers a RuntimeError


Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:



df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError


As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. Pandas, at least in the version used here, runs the first group twice to determine whether there is a fast way to complete the computation. This is a minor detail that you shouldn't worry about.



Now, let's do the same thing with transform



df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError


It is passed a Series - a totally different Pandas object.



So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try to subtract column b from column a inside our custom function, we get an error with transform. See below:



def subtract_two(x):
    return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')


We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:



df.groupby('State').apply(subtract_two)

State
Florida  2   -2
         3   -8
Texas    0   -2
         1   -5
dtype: int64


The output is a Series and a little confusing as the original index is kept, but we have access to all columns.




Displaying the passed pandas object



It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating with. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames are rendered nicely as HTML in a Jupyter notebook:



from IPython.display import display

def subtract_two(x):
    display(x)
    return x['a'] - x['b']






Transform must return a single dimensional sequence the same size as the group



The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:



import numpy as np

def return_three(x):
    return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group


The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:



def rand_group_len(x):
    return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

          a         b
0  0.962070  0.151440
1  0.440956  0.782176
2  0.642218  0.483257
3  0.056047  0.238208



Returning a single scalar object also works for transform



If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:



def group_sum(x):
    return x.sum()

df.groupby('State').transform(group_sum)

   a   b
0  9  16
1  9  16
2  4  14
3  4  14
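
Both allowed return shapes can be seen side by side in a small sketch on a single column of the sample data (the names by_state, offset and group_max are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3]})
by_state = df.groupby('State')['a']

# Same-length result: each value relative to its group's minimum.
offset = by_state.transform(lambda s: s - s.min())

# Scalar result: the group's maximum is repeated for every row of the group.
group_max = by_state.transform(lambda s: s.max())
```

In both cases the output has the same index as df, which is what makes it safe to assign back as a column.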



















































    I am going to use a very simple snippet to illustrate the difference:



    import pandas as pd

    test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3], 'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})
    grouping = test.groupby('id')['price']


    The DataFrame looks like this:



       id  price
    0   1      1
    1   2      2
    2   3      3
    3   1      2
    4   2      3
    5   3      1
    6   1      3
    7   2      1
    8   3      2


    There are 3 customer IDs in this table; each customer made three transactions and paid 1, 2, and 3 dollars, one amount per transaction.



    Now, I want to find the minimum payment made by each customer. There are two ways of doing it:




    1. Using apply (written here as the equivalent built-in aggregation):



      grouping.min()



    The return looks like this:



    id
    1 1
    2 1
    3 1
    Name: price, dtype: int64

    pandas.core.series.Series # return type
    Int64Index([1, 2, 3], dtype='int64', name='id') #The returned Series' index
    # length is 3



    2. Using transform:



      grouping.transform(min)



    The return looks like this:



    0 1
    1 1
    2 1
    3 1
    4 1
    5 1
    6 1
    7 1
    8 1
    Name: price, dtype: int64

    pandas.core.series.Series # return type
    RangeIndex(start=0, stop=9, step=1) # The returned Series' index
    # length is 9


    Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.



    If you want to answer the question 'What is the minimum price paid by each customer?', then the apply method is the more suitable one to choose.



    If you want to answer the question 'What is the difference between the amount paid for each transaction and the minimum payment?', then you want to use transform, because:



    test['minimum'] = grouping.transform(min) # creates an extra column filled with the minimum payment
    test.price - test.minimum # returns the difference for each row


    Apply does not work here simply because it returns a Series of size 3, but the original df's length is 9. You cannot integrate it back to the original df easily.
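
To make the size mismatch concrete, here is a sketch using the same data. Integrating the reduced result back takes an explicit join on the key, while transform returns a column that is already aligned (the names minimum, joined, via_transform and difference are assumptions for illustration):

```python
import pandas as pd

test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})

# Reduced result: one row per id, so it must be joined back on the key.
minimum = test.groupby('id')['price'].min().rename('minimum')
joined = test.join(minimum, on='id')

# transform gives the same column directly, already aligned with test.
via_transform = test.groupby('id')['price'].transform('min')
difference = test['price'] - via_transform
```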


























    • I think this is a great answer! Thanks for taking the time to write an answer more than four years after the question was asked!

      – Benjamin Dubreu
      Feb 18 at 6:08










    Your Answer






    StackExchange.ifUsing("editor", function ()
    StackExchange.using("externalEditor", function ()
    StackExchange.using("snippets", function ()
    StackExchange.snippets.init();
    );
    );
    , "code-snippets");

    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "1"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f27517425%2fapply-vs-transform-on-a-group-object%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    133














    As I felt similarly confused with .transform operation vs. .apply I found a few answers shedding some light on the issue. This answer for example was very helpful.



    My takeout so far is that .transform will work (or deal) with Series (columns) in isolation from each other. What this means is that in your last two calls:



    df.groupby('A').transform(lambda x: (x['C'] - x['D']))
    df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())


    You asked .transform to take values from two columns and 'it' actually does not 'see' both of them at the same time (so to speak). transform will look at the dataframe columns one by one and return back a series (or group of series) 'made' of scalars which are repeated len(input_column) times.



    So this scalar, that should be used by .transform to make the Series is a result of some reduction function applied on an input Series (and only on ONE series/column at a time).



    Consider this example (on your dataframe):



    zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
    df.groupby('A').transform(zscore)


    will yield:



     C D
    0 0.989 0.128
    1 -0.478 0.489
    2 0.889 -0.589
    3 -0.671 -1.150
    4 0.034 -0.285
    5 1.149 0.662
    6 -1.404 -0.907
    7 -0.509 1.653


    Which is exactly the same as if you would use it on only on one column at a time:



    df.groupby('A')['C'].transform(zscore)


    yielding:



    0 0.989
    1 -0.478
    2 0.889
    3 -0.671
    4 0.034
    5 1.149
    6 -1.404
    7 -0.509


    Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:



    df.groupby('A').apply(zscore)


    gives error:



    ValueError: operands could not be broadcast together with shapes (6,) (2,)


    So where else is .transform useful? The simplest case is trying to assign results of reduction function back to original dataframe.



    df['sum_C'] = df.groupby('A')['C'].transform(sum)
    df.sort('A') # to clearly see the scalar ('sum') applies to the whole column of the group


    yielding:



     A B C D sum_C
    1 bar one 1.998 0.593 3.973
    3 bar three 1.287 -0.639 3.973
    5 bar two 0.687 -1.027 3.973
    4 foo two 0.205 1.274 4.373
    2 foo two 0.128 0.924 4.373
    6 foo one 2.113 -0.516 4.373
    7 foo three 0.657 -1.179 4.373
    0 foo one 1.270 0.201 4.373


    Trying the same with .apply would give NaNs in sum_C.
    Because .apply would return a reduced Series, which it does not know how to broadcast back:



    df.groupby('A')['C'].apply(sum)


    giving:



    A
    bar 3.973
    foo 4.373


    There are also cases when .transform is used to filter the data:



    df[df.groupby(['B'])['D'].transform(sum) < -1]

    A B C D
    3 bar three 1.287 -0.639
    7 foo three 0.657 -1.179


    I hope this adds a bit more clarity.






    share|improve this answer




















    • 1





      OMG. The difference is so subtle.

      – Dawei
      Jul 10 '18 at 11:43











    • .transform() could be also used for filling missing values. Especially if you want to broadcast group mean or group statistic to NaN values in that group. Unfortunately, pandas documentation was not helpful to me as well.

      – cyber-math
      Jan 20 at 4:48















    133














    As I felt similarly confused with .transform operation vs. .apply I found a few answers shedding some light on the issue. This answer for example was very helpful.



    My takeout so far is that .transform will work (or deal) with Series (columns) in isolation from each other. What this means is that in your last two calls:



    df.groupby('A').transform(lambda x: (x['C'] - x['D']))
    df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())


    You asked .transform to take values from two columns and 'it' actually does not 'see' both of them at the same time (so to speak). transform will look at the dataframe columns one by one and return back a series (or group of series) 'made' of scalars which are repeated len(input_column) times.



    So this scalar, that should be used by .transform to make the Series is a result of some reduction function applied on an input Series (and only on ONE series/column at a time).



    Consider this example (on your dataframe):



    zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
    df.groupby('A').transform(zscore)


    will yield:



     C D
    0 0.989 0.128
    1 -0.478 0.489
    2 0.889 -0.589
    3 -0.671 -1.150
    4 0.034 -0.285
    5 1.149 0.662
    6 -1.404 -0.907
    7 -0.509 1.653


    Which is exactly the same as if you would use it on only on one column at a time:



    df.groupby('A')['C'].transform(zscore)


    yielding:



    0 0.989
    1 -0.478
    2 0.889
    3 -0.671
    4 0.034
    5 1.149
    6 -1.404
    7 -0.509


    Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:



    df.groupby('A').apply(zscore)


    gives error:



    ValueError: operands could not be broadcast together with shapes (6,) (2,)


    So where else is .transform useful? The simplest case is trying to assign results of reduction function back to original dataframe.



    df['sum_C'] = df.groupby('A')['C'].transform(sum)
    df.sort('A') # to clearly see the scalar ('sum') applies to the whole column of the group


    yielding:



     A B C D sum_C
    1 bar one 1.998 0.593 3.973
    3 bar three 1.287 -0.639 3.973
    5 bar two 0.687 -1.027 3.973
    4 foo two 0.205 1.274 4.373
    2 foo two 0.128 0.924 4.373
    6 foo one 2.113 -0.516 4.373
    7 foo three 0.657 -1.179 4.373
    0 foo one 1.270 0.201 4.373


    Trying the same with .apply would give NaNs in sum_C.
    Because .apply would return a reduced Series, which it does not know how to broadcast back:



    df.groupby('A')['C'].apply(sum)


    giving:



    A
    bar 3.973
    foo 4.373


    There are also cases when .transform is used to filter the data:



    df[df.groupby(['B'])['D'].transform(sum) < -1]

    A B C D
    3 bar three 1.287 -0.639
    7 foo three 0.657 -1.179


    I hope this adds a bit more clarity.






    share|improve this answer




















    • 1





      OMG. The difference is so subtle.

      – Dawei
      Jul 10 '18 at 11:43











    • .transform() could be also used for filling missing values. Especially if you want to broadcast group mean or group statistic to NaN values in that group. Unfortunately, pandas documentation was not helpful to me as well.

      – cyber-math
      Jan 20 at 4:48













    133












    133








    133







    As I felt similarly confused with .transform operation vs. .apply I found a few answers shedding some light on the issue. This answer for example was very helpful.



    My takeout so far is that .transform will work (or deal) with Series (columns) in isolation from each other. What this means is that in your last two calls:



    df.groupby('A').transform(lambda x: (x['C'] - x['D']))
    df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())


    You asked .transform to take values from two columns and 'it' actually does not 'see' both of them at the same time (so to speak). transform will look at the dataframe columns one by one and return back a series (or group of series) 'made' of scalars which are repeated len(input_column) times.



    So this scalar, that should be used by .transform to make the Series is a result of some reduction function applied on an input Series (and only on ONE series/column at a time).



    Consider this example (on your dataframe):



    zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
    df.groupby('A').transform(zscore)


    will yield:



     C D
    0 0.989 0.128
    1 -0.478 0.489
    2 0.889 -0.589
    3 -0.671 -1.150
    4 0.034 -0.285
    5 1.149 0.662
    6 -1.404 -0.907
    7 -0.509 1.653


    Which is exactly the same as if you would use it on only on one column at a time:



    df.groupby('A')['C'].transform(zscore)


    yielding:



    0 0.989
    1 -0.478
    2 0.889
    3 -0.671
    4 0.034
    5 1.149
    6 -1.404
    7 -0.509


    Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:



    df.groupby('A').apply(zscore)


    gives error:



    ValueError: operands could not be broadcast together with shapes (6,) (2,)


    So where else is .transform useful? The simplest case is assigning the results of a reduction function back to the original dataframe.



    df['sum_C'] = df.groupby('A')['C'].transform('sum')
    df.sort_values('A') # to clearly see that the scalar ('sum') applies to the whole column of the group


    yielding:



     A B C D sum_C
    1 bar one 1.998 0.593 3.973
    3 bar three 1.287 -0.639 3.973
    5 bar two 0.687 -1.027 3.973
    4 foo two 0.205 1.274 4.373
    2 foo two 0.128 0.924 4.373
    6 foo one 2.113 -0.516 4.373
    7 foo three 0.657 -1.179 4.373
    0 foo one 1.270 0.201 4.373


    Trying the same with .apply would give NaNs in sum_C, because .apply returns a reduced Series indexed by the group keys, which cannot be aligned back to the original rows:



    df.groupby('A')['C'].apply(sum)


    giving:



    A
    bar 3.973
    foo 4.373
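The alignment problem can be reproduced on a small frame. This is only a sketch with made-up values (not the random data above), and the column names sum_apply, sum_transform and sum_map are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# The reduced result is indexed by the group keys ('bar', 'foo') ...
reduced = df.groupby('A')['C'].apply(lambda s: s.sum())

# ... so assigning it aligns on labels that do not exist in df's index (0..3).
df['sum_apply'] = reduced

# transform returns a result indexed like df, so the assignment lines up.
df['sum_transform'] = df.groupby('A')['C'].transform('sum')

# A reduced result can still be brought back by looking up the keys explicitly.
df['sum_map'] = df['A'].map(reduced)
print(df)
```

The sum_apply column comes out as all NaN, while sum_transform and sum_map hold the group totals on every row.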


    There are also cases when .transform is used to filter the data:



    df[df.groupby(['B'])['D'].transform(sum) < -1]

    A B C D
    3 bar three 1.287 -0.639
    7 foo three 0.657 -1.179
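The filtering pattern can be checked on a small deterministic frame. The values below are invented for illustration; the sketch also compares the boolean mask with GroupBy.filter, which keeps or drops whole groups based on a per-group condition.

```python
import pandas as pd

df = pd.DataFrame({'B': ['one', 'one', 'two', 'two', 'three'],
                   'D': [0.5, 0.2, -0.8, -0.7, -2.0]})

# Per-row boolean mask: True where the row's group total of D is below -1.
mask = df.groupby('B')['D'].transform('sum') < -1
by_mask = df[mask]

# The same selection, expressed as a per-group predicate.
by_filter = df.groupby('B').filter(lambda g: g['D'].sum() < -1)
print(by_mask)
```

Both selections keep the rows of groups 'two' and 'three'. The transform version has the advantage that the mask can be combined with other row-level conditions.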


    I hope this adds a bit more clarity.














    edited May 23 '17 at 12:34
    Community

    answered Jan 14 '15 at 20:34
    Primer







    • 1





      OMG. The difference is so subtle.

      – Dawei
      Jul 10 '18 at 11:43











    • .transform() can also be used for filling missing values, especially if you want to broadcast the group mean or another group statistic to the NaN values in that group. Unfortunately, the pandas documentation was not helpful to me either.

      – cyber-math
      Jan 20 at 4:48
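As a rough sketch of the idea in the comment above, with invented data and hypothetical column names (group, value, filled), the per-row group mean from transform can be passed straight to fillna:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'a', 'b', 'b'],
                   'value': [1.0, np.nan, 3.0, 10.0, np.nan]})

# transform('mean') skips NaN and returns one group mean per original row.
group_mean = df.groupby('group')['value'].transform('mean')

# Use the row-aligned means only where the original value is missing.
df['filled'] = df['value'].fillna(group_mean)
print(df['filled'].tolist())  # [1.0, 2.0, 3.0, 10.0, 10.0]
```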

























    61














    Two major differences between apply and transform



    There are two major differences between the transform and apply groupby methods.




    • apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group as a Series to the custom function

    • The custom function passed to apply can return a scalar, a Series, or a DataFrame (or a numpy array or even a list). The custom function passed to transform must return a sequence (a one-dimensional Series, array, or list) the same length as the group, or a single scalar that is then repeated for every row of the group.

    So, transform works on just one Series at a time and apply works on the entire DataFrame at once.



    Inspecting the custom function



    It can help quite a bit to inspect the input to your custom function passed to apply or transform.



    Examples



    Let's create some sample data and inspect the groups so that you can see what I am talking about:



    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                       'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})
    df


    Let's create a simple custom function that prints out the type of the implicitly passed object and then raises an error so that execution stops.



    def inspect(x):
        print(type(x))
        raise  # a bare raise with no active exception triggers a RuntimeError


    Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:



    df.groupby('State').apply(inspect)

    <class 'pandas.core.frame.DataFrame'>
    <class 'pandas.core.frame.DataFrame'>
    RuntimeError


    As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. In the pandas version used here, the first group is run twice to determine whether there is a fast way to complete the computation; newer releases may print the type only once. This is a minor detail that you shouldn't worry about.



    Now, let's do the same thing with transform



    df.groupby('State').transform(inspect)
    <class 'pandas.core.series.Series'>
    <class 'pandas.core.series.Series'>
    RuntimeError


    It is passed a Series - a totally different Pandas object.



    So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try to subtract column b from a inside our custom function, we get an error with transform. See below:



    def subtract_two(x):
        return x['a'] - x['b']

    df.groupby('State').transform(subtract_two)
    KeyError: ('a', 'occurred at index a')


    We get a KeyError because pandas is looking for the label a in the index of the Series, and it does not exist. You can complete this operation with apply, as it has the entire DataFrame:



    df.groupby('State').apply(subtract_two)

    State
    Florida  2   -2
             3   -8
    Texas    0   -2
             1   -5
    dtype: int64


    The output is a Series and a little confusing as the original index is kept, but we have access to all columns.
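If the extra State level in that output is unwanted, one option (a sketch, not part of the original answer) is to do the column arithmetic on the whole frame and use the groupby only for the per-group step. The names diff and diff_centered are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

# Row-wise arithmetic on two columns needs no groupby at all.
diff = df['a'] - df['b']

# Anything group-dependent can then be done on that single Series.
diff_centered = diff - diff.groupby(df['State']).transform('mean')
print(diff.tolist())           # [-2, -5, -2, -8]
print(diff_centered.tolist())  # [1.5, -1.5, 3.0, -3.0]
```

Both results keep the original row index, so they can be assigned back to df as new columns directly.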




    Displaying the passed pandas object



    It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames are rendered nicely as HTML in a Jupyter notebook:



    from IPython.display import display

    def subtract_two(x):
        display(x)
        return x['a'] - x['b']






    Transform must return a single dimensional sequence the same size as the group



    The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:



    def return_three(x):
        return np.array([1, 2, 3])

    df.groupby('State').transform(return_three)
    ValueError: transform must return a scalar value for each group


    The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:



    def rand_group_len(x):
        return np.random.rand(len(x))

    df.groupby('State').transform(rand_group_len)

              a         b
    0  0.962070  0.151440
    1  0.440956  0.782176
    2  0.642218  0.483257
    3  0.056047  0.238208
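A deterministic variant of the same idea, as a sketch: the hypothetical helper offset_from_min returns one value per row of the group, so transform accepts it. It is applied to a single selected column here, so the function only ever receives a Series.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

def offset_from_min(s):
    # s is one column of one group; the result has the same length as s.
    return s - s.min()

result = df.groupby('State')['a'].transform(offset_from_min)
print(result.tolist())  # [0, 1, 0, 2]
```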



    Returning a single scalar object also works for transform



    If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:



    def group_sum(x):
        return x.sum()

    df.groupby('State').transform(group_sum)

       a   b
    0  9  16
    1  9  16
    2  4  14
    3  4  14
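Scalar broadcasting is what makes it easy to combine a group statistic with the original rows. A minimal sketch, assuming we want each row's share of its state's total for column a (the column name a_share is made up for this example):

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3],
                   'b': [6, 10, 3, 11]})

# The group total is repeated on every row of its group ...
total_a = df.groupby('State')['a'].transform('sum')

# ... so it can be combined element-wise with the original column.
df['a_share'] = df['a'] / total_a
print(total_a.tolist())  # [9, 9, 4, 4]
print(df)
```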













        edited Nov 6 '17 at 18:09

        answered Nov 6 '17 at 18:03
        Ted Petrou





















            3














            I am going to use a very simple snippet to illustrate the difference:



            import pandas as pd

            test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3], 'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})
            grouping = test.groupby('id')['price']


            The DataFrame looks like this:



             id price 
            0 1 1
            1 2 2
            2 3 3
            3 1 2
            4 2 3
            5 3 1
            6 1 3
            7 2 1
            8 3 2


            There are 3 customer IDs in this table. Each customer made three transactions and paid 1, 2, and 3 dollars, in varying order.



            Now, I want to find the minimum payment made by each customer. There are two ways of doing it:




            1. Using apply (here via the built-in aggregation, which gives the same result as grouping.apply(min)):



              grouping.min()



            The return looks like this:



            id
            1 1
            2 1
            3 1
            Name: price, dtype: int64

            pandas.core.series.Series # return type
            Int64Index([1, 2, 3], dtype='int64', name='id') # the returned Series' index
            # length is 3



            2. Using transform:



              grouping.transform(min)



            The return looks like this:



            0 1
            1 1
            2 1
            3 1
            4 1
            5 1
            6 1
            7 1
            8 1
            Name: price, dtype: int64

            pandas.core.series.Series # return type
            RangeIndex(start=0, stop=9, step=1) # The returned Series' index
            # length is 9


            Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.



            If you want to answer "What is the minimum price paid by each customer?", then the apply method is the more suitable one to choose.



            If you want to answer "What is the difference between the amount paid for each transaction and the minimum payment?", then you want to use transform, because:



            test['minimum'] = grouping.transform(min) # creates an extra column filled with the minimum payment
            test.price - test.minimum # returns the difference for each row


            Apply does not work here simply because it returns a Series of size 3, but the original df's length is 9. You cannot integrate it back to the original df easily.
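For comparison, here is a sketch of what integrating the reduced result back would take: each row's id has to be looked up explicitly, for example with Series.map. The column name minimum_via_map is made up for this illustration.

```python
import pandas as pd

test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})

# Reduced result: one minimum per customer id (length 3).
min_per_id = test.groupby('id')['price'].min()

# Look up each row's id in the reduced result to get a length-9 column.
test['minimum_via_map'] = test['id'].map(min_per_id)

# transform produces the same column in a single step.
test['minimum'] = test.groupby('id')['price'].transform('min')
print((test['price'] - test['minimum']).tolist())  # [0, 1, 2, 1, 2, 0, 2, 0, 1]
```

So the reduced Series can be brought back with an extra lookup, but transform skips that step because its result already has the original index.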


























            • 1





              I think this is a great answer! Thanks for taking the time to write an answer more than four years after the question was asked!

              – Benjamin Dubreu
              Feb 18 at 6:08






























            edited Feb 25 at 3:10

























            answered Dec 30 '18 at 3:27









            ChengCheng
























