How to pivot a dataframe
up vote
171
down vote
favorite
- What is pivot?
- How do I pivot?
- Is this a pivot?
- Long format to wide format?
I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting....
... But I'm going to give it a go.
The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task)
Look a few examples from my google search
How to pivot a dataframe in Pandas?- Good question and answer. But the answer only answers the specific question with little explanation.
pandas pivot table to data frame- In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn't very helpful for pandas users.
pandas pivoting a dataframe, duplicate rows- Another decent question but the answer focuses on one method, namely
pd.DataFrame.pivot
- Another decent question but the answer focuses on one method, namely
So whenever someone searches for pivot
they get sporadic results that are likely not going to answer their specific question.
Setup
You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below. Pay attention so that you get familiar with where which column names go where to get the results you're looking for.
import numpy as np
import pandas as pd
from numpy.core.defchararray import add
np.random.seed([3,1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
df = pd.DataFrame(
add(cols, arr1), columns=cols
).join(
pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)
key row item col val0 val1
0 key0 row3 item1 col3 0.81 0.04
1 key1 row2 item1 col2 0.44 0.07
2 key1 row0 item1 col0 0.77 0.01
3 key0 row4 item0 col2 0.15 0.59
4 key1 row0 item2 col1 0.81 0.64
5 key1 row2 item2 col4 0.13 0.88
6 key2 row4 item1 col3 0.88 0.39
7 key1 row4 item1 col1 0.10 0.07
8 key1 row0 item2 col4 0.65 0.02
9 key1 row2 item0 col2 0.35 0.61
10 key2 row0 item2 col1 0.40 0.85
11 key2 row4 item1 col2 0.64 0.25
12 key0 row2 item2 col3 0.50 0.44
13 key0 row4 item1 col4 0.24 0.46
14 key1 row3 item2 col3 0.28 0.11
15 key0 row3 item1 col1 0.31 0.23
16 key0 row0 item2 col3 0.86 0.01
17 key0 row4 item0 col3 0.64 0.21
18 key2 row2 item2 col0 0.13 0.45
19 key0 row2 item0 col4 0.37 0.70
Question(s)
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
How do I pivot
df
such that thecol
values are columns,row
values are the index, and mean ofval0
are the values?col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24Can I get something other than
mean
, like maybesum
?col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24Can I do more that one aggregation at a time?
sum mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24Can I aggregate over multiple value columns?
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46Can Subdivide by multiple columns?
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00Or
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
python pandas group-by pivot pandas-groupby
add a comment |
up vote
171
down vote
favorite
- What is pivot?
- How do I pivot?
- Is this a pivot?
- Long format to wide format?
I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting....
... But I'm going to give it a go.
The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task)
Look a few examples from my google search
How to pivot a dataframe in Pandas?- Good question and answer. But the answer only answers the specific question with little explanation.
pandas pivot table to data frame- In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn't very helpful for pandas users.
pandas pivoting a dataframe, duplicate rows- Another decent question but the answer focuses on one method, namely
pd.DataFrame.pivot
- Another decent question but the answer focuses on one method, namely
So whenever someone searches for pivot
they get sporadic results that are likely not going to answer their specific question.
Setup
You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below. Pay attention so that you get familiar with where which column names go where to get the results you're looking for.
import numpy as np
import pandas as pd
from numpy.core.defchararray import add
np.random.seed([3,1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
df = pd.DataFrame(
add(cols, arr1), columns=cols
).join(
pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)
key row item col val0 val1
0 key0 row3 item1 col3 0.81 0.04
1 key1 row2 item1 col2 0.44 0.07
2 key1 row0 item1 col0 0.77 0.01
3 key0 row4 item0 col2 0.15 0.59
4 key1 row0 item2 col1 0.81 0.64
5 key1 row2 item2 col4 0.13 0.88
6 key2 row4 item1 col3 0.88 0.39
7 key1 row4 item1 col1 0.10 0.07
8 key1 row0 item2 col4 0.65 0.02
9 key1 row2 item0 col2 0.35 0.61
10 key2 row0 item2 col1 0.40 0.85
11 key2 row4 item1 col2 0.64 0.25
12 key0 row2 item2 col3 0.50 0.44
13 key0 row4 item1 col4 0.24 0.46
14 key1 row3 item2 col3 0.28 0.11
15 key0 row3 item1 col1 0.31 0.23
16 key0 row0 item2 col3 0.86 0.01
17 key0 row4 item0 col3 0.64 0.21
18 key2 row2 item2 col0 0.13 0.45
19 key0 row2 item0 col4 0.37 0.70
Question(s)
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
How do I pivot
df
such that thecol
values are columns,row
values are the index, and mean ofval0
are the values?col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24Can I get something other than
mean
, like maybesum
?col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24Can I do more that one aggregation at a time?
sum mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24Can I aggregate over multiple value columns?
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46Can Subdivide by multiple columns?
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00Or
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
python pandas group-by pivot pandas-groupby
add a comment |
up vote
171
down vote
favorite
up vote
171
down vote
favorite
- What is pivot?
- How do I pivot?
- Is this a pivot?
- Long format to wide format?
I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting....
... But I'm going to give it a go.
The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task)
Look a few examples from my google search
How to pivot a dataframe in Pandas?- Good question and answer. But the answer only answers the specific question with little explanation.
pandas pivot table to data frame- In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn't very helpful for pandas users.
pandas pivoting a dataframe, duplicate rows- Another decent question but the answer focuses on one method, namely
pd.DataFrame.pivot
- Another decent question but the answer focuses on one method, namely
So whenever someone searches for pivot
they get sporadic results that are likely not going to answer their specific question.
Setup
You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below. Pay attention so that you get familiar with where which column names go where to get the results you're looking for.
import numpy as np
import pandas as pd
from numpy.core.defchararray import add
np.random.seed([3,1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
df = pd.DataFrame(
add(cols, arr1), columns=cols
).join(
pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)
key row item col val0 val1
0 key0 row3 item1 col3 0.81 0.04
1 key1 row2 item1 col2 0.44 0.07
2 key1 row0 item1 col0 0.77 0.01
3 key0 row4 item0 col2 0.15 0.59
4 key1 row0 item2 col1 0.81 0.64
5 key1 row2 item2 col4 0.13 0.88
6 key2 row4 item1 col3 0.88 0.39
7 key1 row4 item1 col1 0.10 0.07
8 key1 row0 item2 col4 0.65 0.02
9 key1 row2 item0 col2 0.35 0.61
10 key2 row0 item2 col1 0.40 0.85
11 key2 row4 item1 col2 0.64 0.25
12 key0 row2 item2 col3 0.50 0.44
13 key0 row4 item1 col4 0.24 0.46
14 key1 row3 item2 col3 0.28 0.11
15 key0 row3 item1 col1 0.31 0.23
16 key0 row0 item2 col3 0.86 0.01
17 key0 row4 item0 col3 0.64 0.21
18 key2 row2 item2 col0 0.13 0.45
19 key0 row2 item0 col4 0.37 0.70
Question(s)
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
How do I pivot
df
such that thecol
values are columns,row
values are the index, and mean ofval0
are the values?col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24Can I get something other than
mean
, like maybesum
?col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24Can I do more that one aggregation at a time?
sum mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24Can I aggregate over multiple value columns?
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46Can Subdivide by multiple columns?
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00Or
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
python pandas group-by pivot pandas-groupby
- What is pivot?
- How do I pivot?
- Is this a pivot?
- Long format to wide format?
I've seen a lot of questions that ask about pivot tables. Even if they don't know that they are asking about pivot tables, they usually are. It is virtually impossible to write a canonical question and answer that encompasses all aspects of pivoting....
... But I'm going to give it a go.
The problem with existing questions and answers is that often the question is focused on a nuance that the OP has trouble generalizing in order to use a number of the existing good answers. However, none of the answers attempt to give a comprehensive explanation (because it's a daunting task)
Look a few examples from my google search
How to pivot a dataframe in Pandas?- Good question and answer. But the answer only answers the specific question with little explanation.
pandas pivot table to data frame- In this question, the OP is concerned with the output of the pivot. Namely how the columns look. OP wanted it to look like R. This isn't very helpful for pandas users.
pandas pivoting a dataframe, duplicate rows- Another decent question but the answer focuses on one method, namely
pd.DataFrame.pivot
- Another decent question but the answer focuses on one method, namely
So whenever someone searches for pivot
they get sporadic results that are likely not going to answer their specific question.
Setup
You may notice that I conspicuously named my columns and relevant column values to correspond with how I'm going to pivot in the answers below. Pay attention so that you get familiar with where which column names go where to get the results you're looking for.
import numpy as np
import pandas as pd
from numpy.core.defchararray import add
np.random.seed([3,1415])
n = 20
cols = np.array(['key', 'row', 'item', 'col'])
arr1 = (np.random.randint(5, size=(n, 4)) // [2, 1, 2, 1]).astype(str)
df = pd.DataFrame(
add(cols, arr1), columns=cols
).join(
pd.DataFrame(np.random.rand(n, 2).round(2)).add_prefix('val')
)
print(df)
key row item col val0 val1
0 key0 row3 item1 col3 0.81 0.04
1 key1 row2 item1 col2 0.44 0.07
2 key1 row0 item1 col0 0.77 0.01
3 key0 row4 item0 col2 0.15 0.59
4 key1 row0 item2 col1 0.81 0.64
5 key1 row2 item2 col4 0.13 0.88
6 key2 row4 item1 col3 0.88 0.39
7 key1 row4 item1 col1 0.10 0.07
8 key1 row0 item2 col4 0.65 0.02
9 key1 row2 item0 col2 0.35 0.61
10 key2 row0 item2 col1 0.40 0.85
11 key2 row4 item1 col2 0.64 0.25
12 key0 row2 item2 col3 0.50 0.44
13 key0 row4 item1 col4 0.24 0.46
14 key1 row3 item2 col3 0.28 0.11
15 key0 row3 item1 col1 0.31 0.23
16 key0 row0 item2 col3 0.86 0.01
17 key0 row4 item0 col3 0.64 0.21
18 key2 row2 item2 col0 0.13 0.45
19 key0 row2 item0 col4 0.37 0.70
Question(s)
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
How do I pivot
df
such that thecol
values are columns,row
values are the index, and mean ofval0
are the values?col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24Can I get something other than
mean
, like maybesum
?col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24Can I do more that one aggregation at a time?
sum mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.00 0.79 0.50 0.50 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.31 0.00 1.09 0.00 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.10 0.79 1.52 0.24 0.00 0.100 0.395 0.760 0.24Can I aggregate over multiple value columns?
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46Can Subdivide by multiple columns?
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00Or
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
python pandas group-by pivot pandas-groupby
python pandas group-by pivot pandas-groupby
edited Nov 13 '17 at 6:20
asked Nov 7 '17 at 8:00
piRSquared
149k21135275
149k21135275
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
up vote
149
down vote
accepted
We start by answering the first question:
Question 1
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
This occurs because pandas is attempting to reindex either a columns
or index
object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot
. I know there are duplicate entries that share the row
and col
values:
df.duplicated(['row', 'col']).any()
True
So when I pivot
using
df.pivot(index='row', columns='col', values='val0')
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
df.set_index(['row', 'col'])['val0'].unstack()
Here is a list of idioms we can use to pivot
pd.DataFrame.groupby
+pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you
unstack
the levels that you want to be in the column index.
pd.DataFrame.pivot_table
- A glorified version of
groupby
with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers. - Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
- A glorified version of
pd.DataFrame.set_index
+pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the
groupby
paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We thenunstack
the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
- Very similar to
set_index
in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values forindex
,columns
,values
. - Similar to the
pivot_table
method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
- Very similar to
pd.crosstab
- This a specialized version of
pivot_table
and in it's purest form is the most intuitive way to perform several tasks.
- This a specialized version of
pd.factorize
+np.bincount
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
pd.get_dummies
+pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.
Examples
What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table
. Then I'll provide alternatives to perform the same task.
Question 3
How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?
pd.DataFrame.pivot_table
fill_value
is not set by default. I tend to set it appropriately. In this case I set it to0
. Notice I skipped question 2 as it's the same as this answer without thefill_value
aggfunc='mean'
is the default and I didn't have to set it. I included it to be explicit.df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='mean')
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='mean').fillna(0)
Question 4
Can I get something other than
mean
, like maybesum
?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='sum')
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='sum').fillna(0)
Question 5
Can I do more that one aggregation at a time?
Notice that for pivot_table
and cross_tab
I needed to pass list of callables. On the other hand, groupby.agg
is able to take strings for a limited number of special functions. groupby.agg
would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc=[np.size, np.mean])
size mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
Question 6
Can I aggregate over multiple value columns?
pd.DataFrame.pivot_table
we passvalues=['val0', 'val1']
but we could've left that off completelydf.pivot_table(
values=['val0', 'val1'], index='row', columns='col',
fill_value=0, aggfunc='mean')
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
Question 7
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00pd.DataFrame.groupby
df.groupby(
['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
Question 8
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index=['key', 'row'], columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00pd.DataFrame.groupby
df.groupby(
['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)pd.DataFrame.set_index
because the set of keys are unique for both rows and columnsdf.set_index(
['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
Question 9
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
pd.DataFrame.pivot_table
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
pd.cross_tab
pd.crosstab(df['row'], df['col'])
pd.factorize
+np.bincount
# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`. Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)
col3 col2 col0 col1 col4
row3 2 0 0 1 0
row2 1 2 1 0 2
row0 1 0 1 2 1
row4 2 2 0 1 1pd.get_dummies
pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
col0 col1 col2 col3 col4
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
5
Woah! I missed this 'better than documents' answer all together. Good thing is, I actually get to upvote twice;)
– Vaishali
Nov 13 '17 at 19:04
20
Could you please consider extending official docs?
– MaxU
Dec 15 '17 at 10:31
add a comment |
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
149
down vote
accepted
We start by answering the first question:
Question 1
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
This occurs because pandas is attempting to reindex either a columns
or index
object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot
. I know there are duplicate entries that share the row
and col
values:
df.duplicated(['row', 'col']).any()
True
So when I pivot
using
df.pivot(index='row', columns='col', values='val0')
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
df.set_index(['row', 'col'])['val0'].unstack()
Here is a list of idioms we can use to pivot
pd.DataFrame.groupby
+pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you
unstack
the levels that you want to be in the column index.
pd.DataFrame.pivot_table
- A glorified version of
groupby
with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers. - Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
- A glorified version of
pd.DataFrame.set_index
+pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the
groupby
paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We thenunstack
the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
- Very similar to
set_index
in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values forindex
,columns
,values
. - Similar to the
pivot_table
method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
- Very similar to
pd.crosstab
- This a specialized version of
pivot_table
and in it's purest form is the most intuitive way to perform several tasks.
- This a specialized version of
pd.factorize
+np.bincount
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
pd.get_dummies
+pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.
Examples
What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table
. Then I'll provide alternatives to perform the same task.
Question 3
How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?
pd.DataFrame.pivot_table
fill_value
is not set by default. I tend to set it appropriately. In this case I set it to0
. Notice I skipped question 2 as it's the same as this answer without thefill_value
aggfunc='mean'
is the default and I didn't have to set it. I included it to be explicit.df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='mean')
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='mean').fillna(0)
Question 4
Can I get something other than
mean
, like maybesum
?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='sum')
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='sum').fillna(0)
Question 5
Can I do more that one aggregation at a time?
Notice that for pivot_table
and cross_tab
I needed to pass list of callables. On the other hand, groupby.agg
is able to take strings for a limited number of special functions. groupby.agg
would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc=[np.size, np.mean])
size mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
Question 6
Can I aggregate over multiple value columns?
pd.DataFrame.pivot_table
we passvalues=['val0', 'val1']
but we could've left that off completelydf.pivot_table(
values=['val0', 'val1'], index='row', columns='col',
fill_value=0, aggfunc='mean')
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
Question 7
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00pd.DataFrame.groupby
df.groupby(
['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
Question 8
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index=['key', 'row'], columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00pd.DataFrame.groupby
df.groupby(
['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)pd.DataFrame.set_index
because the set of keys are unique for both rows and columnsdf.set_index(
['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
Question 9
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
pd.DataFrame.pivot_table
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
pd.cross_tab
pd.crosstab(df['row'], df['col'])
pd.factorize
+np.bincount
# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`. Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)
col3 col2 col0 col1 col4
row3 2 0 0 1 0
row2 1 2 1 0 2
row0 1 0 1 2 1
row4 2 2 0 1 1pd.get_dummies
pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
col0 col1 col2 col3 col4
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
5
Woah! I missed this 'better than documents' answer all together. Good thing is, I actually get to upvote twice;)
– Vaishali
Nov 13 '17 at 19:04
20
Could you please consider extending official docs?
– MaxU
Dec 15 '17 at 10:31
add a comment |
up vote
149
down vote
accepted
We start by answering the first question:
Question 1
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
This occurs because pandas is attempting to reindex either a columns
or index
object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot
. I know there are duplicate entries that share the row
and col
values:
df.duplicated(['row', 'col']).any()
True
So when I pivot
using
df.pivot(index='row', columns='col', values='val0')
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
df.set_index(['row', 'col'])['val0'].unstack()
Here is a list of idioms we can use to pivot
pd.DataFrame.groupby
+pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you
unstack
the levels that you want to be in the column index.
pd.DataFrame.pivot_table
- A glorified version of
groupby
with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers. - Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
- A glorified version of
pd.DataFrame.set_index
+pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the
groupby
paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We thenunstack
the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
- Very similar to
set_index
in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values forindex
,columns
,values
. - Similar to the
pivot_table
method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
- Very similar to
pd.crosstab
- This a specialized version of
pivot_table
and in it's purest form is the most intuitive way to perform several tasks.
- This a specialized version of
pd.factorize
+np.bincount
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
pd.get_dummies
+pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.
Examples
What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table
. Then I'll provide alternatives to perform the same task.
Question 3
How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?
pd.DataFrame.pivot_table
fill_value
is not set by default. I tend to set it appropriately. In this case I set it to0
. Notice I skipped question 2 as it's the same as this answer without thefill_value
aggfunc='mean'
is the default and I didn't have to set it. I included it to be explicit.df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='mean')
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='mean').fillna(0)
Question 4
Can I get something other than
mean
, like maybesum
?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='sum')
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='sum').fillna(0)
Question 5
Can I do more that one aggregation at a time?
Notice that for pivot_table
and cross_tab
I needed to pass list of callables. On the other hand, groupby.agg
is able to take strings for a limited number of special functions. groupby.agg
would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc=[np.size, np.mean])
size mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
Question 6
Can I aggregate over multiple value columns?
pd.DataFrame.pivot_table
we passvalues=['val0', 'val1']
but we could've left that off completelydf.pivot_table(
values=['val0', 'val1'], index='row', columns='col',
fill_value=0, aggfunc='mean')
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
Question 7
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00pd.DataFrame.groupby
df.groupby(
['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
Question 8
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index=['key', 'row'], columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00pd.DataFrame.groupby
df.groupby(
['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)pd.DataFrame.set_index
because the set of keys are unique for both rows and columnsdf.set_index(
['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
Question 9
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
pd.DataFrame.pivot_table
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
pd.cross_tab
pd.crosstab(df['row'], df['col'])
pd.factorize
+np.bincount
# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`. Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)
col3 col2 col0 col1 col4
row3 2 0 0 1 0
row2 1 2 1 0 2
row0 1 0 1 2 1
row4 2 2 0 1 1pd.get_dummies
pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
col0 col1 col2 col3 col4
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
5
Woah! I missed this 'better than documents' answer all together. Good thing is, I actually get to upvote twice;)
– Vaishali
Nov 13 '17 at 19:04
20
Could you please consider extending official docs?
– MaxU
Dec 15 '17 at 10:31
add a comment |
up vote
149
down vote
accepted
up vote
149
down vote
accepted
We start by answering the first question:
Question 1
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
This occurs because pandas is attempting to reindex either a columns
or index
object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot
. I know there are duplicate entries that share the row
and col
values:
df.duplicated(['row', 'col']).any()
True
So when I pivot
using
df.pivot(index='row', columns='col', values='val0')
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
df.set_index(['row', 'col'])['val0'].unstack()
Here is a list of idioms we can use to pivot
pd.DataFrame.groupby
+pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you
unstack
the levels that you want to be in the column index.
pd.DataFrame.pivot_table
- A glorified version of
groupby
with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers. - Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
- A glorified version of
pd.DataFrame.set_index
+pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the
groupby
paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We thenunstack
the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
- Very similar to
set_index
in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values forindex
,columns
,values
. - Similar to the
pivot_table
method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
- Very similar to
pd.crosstab
- This a specialized version of
pivot_table
and in it's purest form is the most intuitive way to perform several tasks.
- This a specialized version of
pd.factorize
+np.bincount
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
pd.get_dummies
+pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.
Examples
What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table
. Then I'll provide alternatives to perform the same task.
Question 3
How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?
pd.DataFrame.pivot_table
fill_value
is not set by default. I tend to set it appropriately. In this case I set it to0
. Notice I skipped question 2 as it's the same as this answer without thefill_value
aggfunc='mean'
is the default and I didn't have to set it. I included it to be explicit.df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='mean')
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='mean').fillna(0)
Question 4
Can I get something other than
mean
, like maybesum
?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='sum')
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='sum').fillna(0)
Question 5
Can I do more that one aggregation at a time?
Notice that for pivot_table
and cross_tab
I needed to pass list of callables. On the other hand, groupby.agg
is able to take strings for a limited number of special functions. groupby.agg
would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc=[np.size, np.mean])
size mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
Question 6
Can I aggregate over multiple value columns?
pd.DataFrame.pivot_table
we passvalues=['val0', 'val1']
but we could've left that off completelydf.pivot_table(
values=['val0', 'val1'], index='row', columns='col',
fill_value=0, aggfunc='mean')
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
Question 7
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00pd.DataFrame.groupby
df.groupby(
['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
Question 8
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index=['key', 'row'], columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00pd.DataFrame.groupby
df.groupby(
['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)pd.DataFrame.set_index
because the set of keys are unique for both rows and columnsdf.set_index(
['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
Question 9
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
pd.DataFrame.pivot_table
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
pd.cross_tab
pd.crosstab(df['row'], df['col'])
pd.factorize
+np.bincount
# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`. Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)
col3 col2 col0 col1 col4
row3 2 0 0 1 0
row2 1 2 1 0 2
row0 1 0 1 2 1
row4 2 2 0 1 1pd.get_dummies
pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
col0 col1 col2 col3 col4
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
We start by answering the first question:
Question 1
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
This occurs because pandas is attempting to reindex either a columns
or index
object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot
. I know there are duplicate entries that share the row
and col
values:
df.duplicated(['row', 'col']).any()
True
So when I pivot
using
df.pivot(index='row', columns='col', values='val0')
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
df.set_index(['row', 'col'])['val0'].unstack()
Here is a list of idioms we can use to pivot
pd.DataFrame.groupby
+pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you
unstack
the levels that you want to be in the column index.
pd.DataFrame.pivot_table
- A glorified version of
groupby
with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers. - Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
- A glorified version of
pd.DataFrame.set_index
+pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the
groupby
paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We thenunstack
the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
- Very similar to
set_index
in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values forindex
,columns
,values
. - Similar to the
pivot_table
method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
- Very similar to
pd.crosstab
- This a specialized version of
pivot_table
and in it's purest form is the most intuitive way to perform several tasks.
- This a specialized version of
pd.factorize
+np.bincount
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
pd.get_dummies
+pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.
Examples
What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table
. Then I'll provide alternatives to perform the same task.
Question 3
How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?
pd.DataFrame.pivot_table
fill_value
is not set by default. I tend to set it appropriately. In this case I set it to0
. Notice I skipped question 2 as it's the same as this answer without thefill_value
aggfunc='mean'
is the default and I didn't have to set it. I included it to be explicit.df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='mean')
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='mean').fillna(0)
Question 4
Can I get something other than
mean
, like maybesum
?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='sum')
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='sum').fillna(0)
Question 5
Can I do more that one aggregation at a time?
Notice that for pivot_table
and cross_tab
I needed to pass list of callables. On the other hand, groupby.agg
is able to take strings for a limited number of special functions. groupby.agg
would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc=[np.size, np.mean])
size mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
Question 6
Can I aggregate over multiple value columns?
pd.DataFrame.pivot_table
we passvalues=['val0', 'val1']
but we could've left that off completelydf.pivot_table(
values=['val0', 'val1'], index='row', columns='col',
fill_value=0, aggfunc='mean')
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
Question 7
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00pd.DataFrame.groupby
df.groupby(
['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
Question 8
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index=['key', 'row'], columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00pd.DataFrame.groupby
df.groupby(
['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)pd.DataFrame.set_index
because the set of keys are unique for both rows and columnsdf.set_index(
['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
Question 9
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
pd.DataFrame.pivot_table
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
pd.cross_tab
pd.crosstab(df['row'], df['col'])
pd.factorize
+np.bincount
# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`. Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)
col3 col2 col0 col1 col4
row3 2 0 0 1 0
row2 1 2 1 0 2
row0 1 0 1 2 1
row4 2 2 0 1 1pd.get_dummies
pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
col0 col1 col2 col3 col4
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
edited Nov 14 '17 at 17:13
answered Nov 7 '17 at 8:00
piRSquared
149k21135275
149k21135275
5
Woah! I missed this 'better than documents' answer all together. Good thing is, I actually get to upvote twice;)
– Vaishali
Nov 13 '17 at 19:04
20
Could you please consider extending official docs?
– MaxU
Dec 15 '17 at 10:31
add a comment |
5
Woah! I missed this 'better than documents' answer all together. Good thing is, I actually get to upvote twice;)
– Vaishali
Nov 13 '17 at 19:04
20
Could you please consider extending official docs?
– MaxU
Dec 15 '17 at 10:31
5
5
Woah! I missed this 'better than documents' answer all together. Good thing is, I actually get to upvote twice;)
– Vaishali
Nov 13 '17 at 19:04
Woah! I missed this 'better than documents' answer all together. Good thing is, I actually get to upvote twice;)
– Vaishali
Nov 13 '17 at 19:04
20
20
Could you please consider extending official docs?
– MaxU
Dec 15 '17 at 10:31
Could you please consider extending official docs?
– MaxU
Dec 15 '17 at 10:31
add a comment |
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f47152691%2fhow-to-pivot-a-dataframe%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown