Odtnhj

Question

Consider the following dataframe:

 A B C D
0 foo one 0.162003 0.087469
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
3 bar three -2.026673 -0.322057
4 foo two 0.411452 -0.954371
5 bar two 0.765878 -0.095968
6 foo one -0.654890 0.678091
7 foo three -1.789842 -1.130922

The following commands work:

> df.groupby('A').apply(lambda x: (x['C'] - x['D']))
> df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

but none of the following work:

> df.groupby('A').transform(lambda x: (x['C'] - x['D']))
ValueError: could not broadcast input array from shape (5) into shape (5,3)

> df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
 TypeError: cannot concatenate a non-NDFrame object

Why? The example on the documentation seems to suggest that calling transform on a group allows one to do row-wise operation processing:

# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

df = pd.DataFrame('A' : ['foo', 'bar', 'foo', 'bar',
 'foo', 'bar', 'foo', 'foo'],
 'B' : ['one', 'one', 'two', 'three',
 'two', 'two', 'one', 'three'],
 'C' : randn(8), 'D' : randn(8))

The function passed to transform must return a number, a row, or the same shape as the argument. if it's a number then the number will be set to all the elements in the group, if it's a row, it will be broadcasted to all the rows in the group. In your code, the lambda function return a column which can't be broadcasted to the group. — Dec 17 '14 at 4:24
Thanks @HYRY, but I am confused. If you look at the example in the documentation that I copied above (i.e. with zscore), transform receives a lambda function that assumes each x is an item within the group, and also returns a value per item in the group. What am I missing? — Dec 17 '14 at 14:01
For those looking for an extremely detailed solution, see this one below. — Nov 25 '17 at 17:37

Community♦ 11 · Accepted Answer · 2017-05-23 12:34:42Z

As I felt similarly confused with .transform operation vs. .apply I found a few answers shedding some light on the issue. This answer for example was very helpful.

My takeout so far is that .transform will work (or deal) with Series (columns) in isolation from each other. What this means is that in your last two calls:

df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

You asked .transform to take values from two columns and 'it' actually does not 'see' both of them at the same time (so to speak). transform will look at the dataframe columns one by one and return back a series (or group of series) 'made' of scalars which are repeated len(input_column) times.

So this scalar, that should be used by .transform to make the Series is a result of some reduction function applied on an input Series (and only on ONE series/column at a time).

Consider this example (on your dataframe):

zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)

will yield:

 C D
0 0.989 0.128
1 -0.478 0.489
2 0.889 -0.589
3 -0.671 -1.150
4 0.034 -0.285
5 1.149 0.662
6 -1.404 -0.907
7 -0.509 1.653

Which is exactly the same as if you would use it on only on one column at a time:

df.groupby('A')['C'].transform(zscore)

yielding:

Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:

df.groupby('A').apply(zscore)

gives error:

ValueError: operands could not be broadcast together with shapes (6,) (2,)

So where else is .transform useful? The simplest case is trying to assign results of reduction function back to original dataframe.

df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort('A') # to clearly see the scalar ('sum') applies to the whole column of the group

yielding:

 A B C D sum_C
1 bar one 1.998 0.593 3.973
3 bar three 1.287 -0.639 3.973
5 bar two 0.687 -1.027 3.973
4 foo two 0.205 1.274 4.373
2 foo two 0.128 0.924 4.373
6 foo one 2.113 -0.516 4.373
7 foo three 0.657 -1.179 4.373
0 foo one 1.270 0.201 4.373

Trying the same with .apply would give NaNs in sum_C.
Because .apply would return a reduced Series, which it does not know how to broadcast back:

df.groupby('A')['C'].apply(sum)

giving:

A
bar 3.973
foo 4.373

There are also cases when .transform is used to filter the data:

df[df.groupby(['B'])['D'].transform(sum) < -1]

 A B C D
3 bar three 1.287 -0.639
7 foo three 0.657 -1.179

I hope this adds a bit more clarity.

.transform() could be also used for filling missing values. Especially if you want to broadcast group mean or group statistic to NaN values in that group. Unfortunately, pandas documentation was not helpful to me as well. — Jan 20 at 4:48

score 61 · Accepted Answer · 2017-11-06 18:09:13Z

Two major differences between `apply` and `transform`

There are two major differences between the transform and apply groupby methods.

apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group as a Series to the custom function

The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list). The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.

Inspecting the custom function

It can help quite a bit to inspect the input to your custom function passed to apply or transform.

Examples

Let's create some sample data and inspect the groups so that you can see what I am talking about:

df = pd.DataFrame('State':['Texas', 'Texas', 'Florida', 'Florida'], 
 'a':[4,5,1,3], 'b':[6,10,3,11])
df

Let's create a simple custom function that prints out the type of the implicitly passed object and then raised an error so that execution can be stopped.

def inspect(x):
 print(type(x))
 raise

Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:

df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError

As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. Pandas runs the first group twice. It does this to determine if there is a fast way to complete the computation or not. This is a minor detail that you shouldn't worry about.

Now, let's do the same thing with transform

df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError

It is passed a Series - a totally different Pandas object.

So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try and subtract column a from b inside of our custom function we would get an error with transform. See below:

def subtract_two(x):
 return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')

We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:

df.groupby('State').apply(subtract_two)

State 
Florida 2 -2
 3 -8
Texas 0 -2
 1 -5
dtype: int64

The output is a Series and a little confusing as the original index is kept, but we have access to all columns.

Displaying the passed pandas object

It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating with. You can use print statements by I like to use the display function from the IPython.display module so that the DataFrames get nicely outputted in HTML in a jupyter notebook:

from IPython.display import display
def subtract_two(x):
 display(x)
 return x['a'] - x['b']

Screenshot:
enter image description here

Transform must return a single dimensional sequence the same size as the group

The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:

def return_three(x):
 return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group

The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:

def rand_group_len(x):
 return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

 a b
0 0.962070 0.151440
1 0.440956 0.782176
2 0.642218 0.483257
3 0.056047 0.238208

Returning a single scalar object also works for `transform`

If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:

def group_sum(x):
 return x.sum()

df.groupby('State').transform(group_sum)

 a b
0 9 16
1 9 16
2 4 14
3 4 14

score 3 · Accepted Answer · 2019-02-25 03:10:23Z

I am going to use a very simple snippet to illustrate the difference:

test = pd.DataFrame('id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2])
grouping = test.groupby('id')['price']

The DataFrame looks like this:

There are 3 customer IDs in this table, each customer made three transactions and paid 1,2,3 dollars each time.

Now, I want to find the minimum payment made by each customer. There are two ways of doing it:

Using apply:

grouping.min()

The return looks like this:

id
1 1
2 1
3 1
Name: price, dtype: int64

pandas.core.series.Series # return type
Int64Index([1, 2, 3], dtype='int64', name='id') #The returned Series' index
# lenght is 3

Using transform:

grouping.transform(min)

The return looks like this:

0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
Name: price, dtype: int64

pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9

Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.

If you want to answer What is the minimum price paid by each customer, then the apply method is the more suitable one to choose.

If you want to answer What is the difference between the amount paid for each transaction vs the minimum payment, then you want to use transform, because:

test['minimum'] = grouping.transform(min) # ceates an extra column filled with minimum payment
test.price - test.minimum # returns the difference for each row

Apply does not work here simply because it returns a Series of size 3, but the original df's length is 9. You cannot integrate it back to the original df easily.

I think this is a great answer ! Thanks for taking the time to make an answer more than four years after the question was asked ! — Feb 18 at 6:08

Community♦ 11 · Accepted Answer · 2017-05-23 12:34:42Z

As I felt similarly confused with .transform operation vs. .apply I found a few answers shedding some light on the issue. This answer for example was very helpful.

My takeout so far is that .transform will work (or deal) with Series (columns) in isolation from each other. What this means is that in your last two calls:

df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

You asked .transform to take values from two columns and 'it' actually does not 'see' both of them at the same time (so to speak). transform will look at the dataframe columns one by one and return back a series (or group of series) 'made' of scalars which are repeated len(input_column) times.

So this scalar, that should be used by .transform to make the Series is a result of some reduction function applied on an input Series (and only on ONE series/column at a time).

Consider this example (on your dataframe):

zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)

will yield:

 C D
0 0.989 0.128
1 -0.478 0.489
2 0.889 -0.589
3 -0.671 -1.150
4 0.034 -0.285
5 1.149 0.662
6 -1.404 -0.907
7 -0.509 1.653

Which is exactly the same as if you would use it on only on one column at a time:

df.groupby('A')['C'].transform(zscore)

yielding:

Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe:

df.groupby('A').apply(zscore)

gives error:

ValueError: operands could not be broadcast together with shapes (6,) (2,)

So where else is .transform useful? The simplest case is trying to assign results of reduction function back to original dataframe.

df['sum_C'] = df.groupby('A')['C'].transform(sum)
df.sort('A') # to clearly see the scalar ('sum') applies to the whole column of the group

yielding:

 A B C D sum_C
1 bar one 1.998 0.593 3.973
3 bar three 1.287 -0.639 3.973
5 bar two 0.687 -1.027 3.973
4 foo two 0.205 1.274 4.373
2 foo two 0.128 0.924 4.373
6 foo one 2.113 -0.516 4.373
7 foo three 0.657 -1.179 4.373
0 foo one 1.270 0.201 4.373

Trying the same with .apply would give NaNs in sum_C.
Because .apply would return a reduced Series, which it does not know how to broadcast back:

df.groupby('A')['C'].apply(sum)

giving:

A
bar 3.973
foo 4.373

There are also cases when .transform is used to filter the data:

df[df.groupby(['B'])['D'].transform(sum) < -1]

 A B C D
3 bar three 1.287 -0.639
7 foo three 0.657 -1.179

I hope this adds a bit more clarity.

.transform() could be also used for filling missing values. Especially if you want to broadcast group mean or group statistic to NaN values in that group. Unfortunately, pandas documentation was not helpful to me as well. — Jan 20 at 4:48

score 61 · Accepted Answer · 2017-11-06 18:09:13Z

Two major differences between `apply` and `transform`

There are two major differences between the transform and apply groupby methods.

apply implicitly passes all the columns for each group as a DataFrame to the custom function, while transform passes each column for each group as a Series to the custom function

The custom function passed to apply can return a scalar, or a Series or DataFrame (or numpy array or even list). The custom function passed to transform must return a sequence (a one dimensional Series, array or list) the same length as the group.

So, transform works on just one Series at a time and apply works on the entire DataFrame at once.

Inspecting the custom function

It can help quite a bit to inspect the input to your custom function passed to apply or transform.

Examples

Let's create some sample data and inspect the groups so that you can see what I am talking about:

df = pd.DataFrame('State':['Texas', 'Texas', 'Florida', 'Florida'], 
 'a':[4,5,1,3], 'b':[6,10,3,11])
df

Let's create a simple custom function that prints out the type of the implicitly passed object and then raised an error so that execution can be stopped.

def inspect(x):
 print(type(x))
 raise

Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:

df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError

As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. Pandas runs the first group twice. It does this to determine if there is a fast way to complete the computation or not. This is a minor detail that you shouldn't worry about.

Now, let's do the same thing with transform

df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError

It is passed a Series - a totally different Pandas object.

So, transform is only allowed to work with a single Series at a time. It is impossible for it to act on two columns at the same time. So, if we try and subtract column a from b inside of our custom function we would get an error with transform. See below:

def subtract_two(x):
 return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')

We get a KeyError as pandas is attempting to find the Series index a which does not exist. You can complete this operation with apply as it has the entire DataFrame:

df.groupby('State').apply(subtract_two)

State 
Florida 2 -2
 3 -8
Texas 0 -2
 1 -5
dtype: int64

The output is a Series and a little confusing as the original index is kept, but we have access to all columns.

Displaying the passed pandas object

It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating with. You can use print statements by I like to use the display function from the IPython.display module so that the DataFrames get nicely outputted in HTML in a jupyter notebook:

from IPython.display import display
def subtract_two(x):
 display(x)
 return x['a'] - x['b']

Screenshot:
enter image description here

Transform must return a single dimensional sequence the same size as the group

The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not then an error is raised:

def return_three(x):
 return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group

The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work:

def rand_group_len(x):
 return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

 a b
0 0.962070 0.151440
1 0.440956 0.782176
2 0.642218 0.483257
3 0.056047 0.238208

Returning a single scalar object also works for `transform`

If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group:

def group_sum(x):
 return x.sum()

df.groupby('State').transform(group_sum)

 a b
0 9 16
1 9 16
2 4 14
3 4 14

score 3 · Accepted Answer · 2019-02-25 03:10:23Z

I am going to use a very simple snippet to illustrate the difference:

test = pd.DataFrame('id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2])
grouping = test.groupby('id')['price']

The DataFrame looks like this:

There are 3 customer IDs in this table, each customer made three transactions and paid 1,2,3 dollars each time.

Now, I want to find the minimum payment made by each customer. There are two ways of doing it:

Using apply:

grouping.min()

The return looks like this:

id
1 1
2 1
3 1
Name: price, dtype: int64

pandas.core.series.Series # return type
Int64Index([1, 2, 3], dtype='int64', name='id') #The returned Series' index
# lenght is 3

Using transform:

grouping.transform(min)

The return looks like this:

0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
Name: price, dtype: int64

pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9

Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.

If you want to answer What is the minimum price paid by each customer, then the apply method is the more suitable one to choose.

If you want to answer What is the difference between the amount paid for each transaction vs the minimum payment, then you want to use transform, because:

test['minimum'] = grouping.transform(min) # ceates an extra column filled with minimum payment
test.price - test.minimum # returns the difference for each row

Apply does not work here simply because it returns a Series of size 3, but the original df's length is 9. You cannot integrate it back to the original df easily.

I think this is a great answer ! Thanks for taking the time to make an answer more than four years after the question was asked ! — Feb 18 at 6:08

Apply vs transform on a group object

3 Answers 3

Two major differences between apply and transform

Inspecting the custom function

Examples

Displaying the passed pandas object

Transform must return a single dimensional sequence the same size as the group

Returning a single scalar object also works for transform

Your Answer

Sign up or log in

Post as a guest

Post as a guest

3 Answers 3

3 Answers 3

Two major differences between apply and transform

Inspecting the custom function

Examples

Displaying the passed pandas object

Transform must return a single dimensional sequence the same size as the group

Returning a single scalar object also works for transform

Two major differences between apply and transform

Inspecting the custom function

Examples

Displaying the passed pandas object

Transform must return a single dimensional sequence the same size as the group

Returning a single scalar object also works for transform

Two major differences between apply and transform

Inspecting the custom function

Examples

Displaying the passed pandas object

Transform must return a single dimensional sequence the same size as the group

Returning a single scalar object also works for transform

Two major differences between apply and transform

Inspecting the custom function

Examples

Displaying the passed pandas object

Transform must return a single dimensional sequence the same size as the group

Returning a single scalar object also works for transform

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

這個網誌中的熱門文章

How to read a connectionString WITH PROVIDER in .NET Core?

Node.js Script on GitHub Pages or Amazon S3

Museum of Modern and Contemporary Art of Trento and Rovereto

3 Answers
3

Two major differences between `apply` and `transform`

Returning a single scalar object also works for `transform`

3 Answers
3

3 Answers
3

Two major differences between `apply` and `transform`

Returning a single scalar object also works for `transform`

Two major differences between `apply` and `transform`

Returning a single scalar object also works for `transform`

Two major differences between `apply` and `transform`

Returning a single scalar object also works for `transform`

Two major differences between `apply` and `transform`

Returning a single scalar object also works for `transform`