Reshape pandas dataframe to turn categorical columns into individual columns

I have data that looks like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[list('ABCDE'),
                        ['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
                        ['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
                        ['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
                        ['Gas', 'Water', 'Water', 'Oil', 'Gas'],
                        list(np.random.randint(10, 100, 5)),
                        list(np.random.randint(10, 100, 5))]
                  ).T
df.columns = ['ID', 'Substance1', 'Substance2', 'Category1', 'Category2', 'Quantity1', 'Quantity2']

   ID   Substance1   Substance2  Category1  Category2  Quantity1  Quantity2
0   A    Crude Oil  Natural Gas        Oil        Gas         85         14
1   B  Natural Gas   Salt water        Gas      Water         95         78
2   C     Gasoline  Waste water    Refined      Water         33         25
3   D       Diesel    Motor oil    Refined        Oil         49         54
4   E      Bitumen     Sour Gas        Oil        Gas         92         86

Each Category and Quantity column refers to the Substance column with the same number.

I want to expand the Category columns into a new column for each unique value, with the Quantity value as the cell value. Nonexistent categories would be NaN. So the resulting frame would look like this:

   ID  Oil  Gas  Water  Refined
0   A   85   14    NaN      NaN
1   B  NaN   95     78      NaN
2   C  NaN  NaN     25       33
3   D   54  NaN    NaN       49
4   E   92   86    NaN      NaN

I tried .melt() followed by .pivot_table() but for some reason values get duplicated across the new category columns.

asked Nov 14 '18 at 21:56

robroc


python pandas


2 Answers

You can use pd.wide_to_long, then groupby:

import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame(data=[list('ABCDE'),
                        ['Crude Oil', 'Natural Gas', 'Gasoline', 'Diesel', 'Bitumen'],
                        ['Natural Gas', 'Salt water', 'Waste water', 'Motor oil', 'Sour Gas'],
                        ['Oil', 'Gas', 'Refined', 'Refined', 'Oil'],
                        ['Gas', 'Water', 'Water', 'Oil', 'Gas'],
                        list(np.random.randint(10, 100, 5)),
                        list(np.random.randint(10, 100, 5))]
                  ).T
df.columns = ['ID', 'Substance1', 'Substance2', 'Category1', 'Category2', 'Quantity1', 'Quantity2']

# Reshape to one row per (ID, Num), then spread each Category into its own column.
# The outer parentheses let the method chain span several lines.
(pd.wide_to_long(df, ['Substance', 'Category', 'Quantity'], i='ID', j='Num', sep='', suffix='.+')
   .groupby(['ID', 'Category'])['Quantity'].sum()
   .unstack().reset_index())

Output:

Category  ID   Gas   Oil  Refined  Water
0          A  19.0  54.0      NaN    NaN
1          B  57.0   NaN      NaN   93.0
2          C   NaN   NaN     74.0   31.0
3          D   NaN  46.0     77.0    NaN
4          E  97.0  77.0      NaN    NaN
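
To make the intermediate step visible, here is a minimal sketch that uses the fixed quantities from the question's printed frame instead of random numbers. It is an illustration under assumptions, not the answerer's exact code: the Substance columns are left out, the names long_df and wide are made up for the example, and the default sep and suffix of pd.wide_to_long are used rather than the '.+' suffix above.

```python
import pandas as pd

# Fixed values copied from the frame printed in the question.
df = pd.DataFrame({
    "ID": list("ABCDE"),
    "Category1": ["Oil", "Gas", "Refined", "Refined", "Oil"],
    "Category2": ["Gas", "Water", "Water", "Oil", "Gas"],
    "Quantity1": [85, 95, 33, 49, 92],
    "Quantity2": [14, 78, 25, 54, 86],
})

# Step 1: one row per (ID, Num), so each Category sits next to its own Quantity.
long_df = pd.wide_to_long(
    df, stubnames=["Category", "Quantity"], i="ID", j="Num"
).reset_index()

# Step 2: one column per Category; missing (ID, Category) pairs become NaN.
wide = (
    long_df.groupby(["ID", "Category"])["Quantity"].sum()
    .unstack()
    .reset_index()
)

print(long_df)
print(wide)
```

Because Category and Quantity are reshaped together, each quantity stays paired with its own category, which is the pairing a separate melt of the two column groups can lose.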

answered Nov 14 '18 at 22:05

Scott Boston


This works, but not as is. It was creating a ton of duplicate columns and adding all of the numbers, resulting in inaccurate values. But adding two methods to the chain, reset_index and drop_duplicates, worked:

(pd.wide_to_long(df, ['Substance', 'Category', 'Quantity'], 'ID', 'Num', '', '.+')
   .reset_index().drop_duplicates(subset=['ID', 'Num'])
   .groupby(['ID', 'Category'])['Quantity'].sum()
   .unstack().reset_index())

– robroc
Nov 14 '18 at 22:48

I'll add that this nifty solution also needs some adaptation for the older pandas 0.19.

– kabanus
Nov 14 '18 at 22:53
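
The real data behind the first comment is not shown, so the cause of the inflated values is unknown. One situation that would produce them is a long frame in which the same (ID, Category) pair appears more than once, because sum() then adds those rows together. The sketch below uses a small hypothetical long_df (all values invented for illustration) to show how such rows could be spotted before aggregating.

```python
import pandas as pd

# Hypothetical long-format data: ID "A" lists the category "Oil" twice.
long_df = pd.DataFrame({
    "ID": ["A", "A", "B", "B"],
    "Category": ["Oil", "Oil", "Gas", "Water"],
    "Quantity": [10, 5, 7, 3],
})

# Every row whose (ID, Category) pair occurs more than once.
dupes = long_df[long_df.duplicated(["ID", "Category"], keep=False)]
print(dupes)

# groupby(...).sum() silently adds the repeated pair together.
summed = long_df.groupby(["ID", "Category"])["Quantity"].sum().unstack()
print(summed)
```

If dupes is not empty, it is worth deciding whether those rows should be summed, de-duplicated, or treated as a data error before reshaping.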


Here is my semi-manual approach:

>>> df
   ID   Substance1   Substance2  Category1  Category2  Quantity1  Quantity2
0   A    Crude Oil  Natural Gas        Oil        Gas         74         49
1   B  Natural Gas   Salt water        Gas      Water         75         91
2   C     Gasoline  Waste water    Refined      Water         24         38
3   D       Diesel    Motor oil    Refined        Oil         19         95
4   E      Bitumen     Sour Gas        Oil        Gas         50         35
>>> # One empty column per distinct category; list() because newer pandas may not accept a set here
>>> newdf = pd.DataFrame(columns=list(set(df[['Category1', 'Category2']].values.flatten())), index=df.index)
>>> for name in newdf:
...     newdf[name] = pd.concat([df[df['Category1'] == name]['Quantity1'],
...                              df[df['Category2'] == name]['Quantity2']])
...
>>> newdf
   Gas  Oil  Water  Refined
0   49   74    NaN      NaN
1   75  NaN     91      NaN
2  NaN  NaN     38       24
3  NaN   95    NaN       19
4   35   50    NaN      NaN
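
The same idea can be sketched without the explicit loop by pivoting each Category/Quantity pair separately and merging the two results. This is an alternative illustration rather than the answerer's method: it swaps the loop for DataFrame.pivot and DataFrame.combine_first, reuses the quantities printed above, and the names first, second and result are made up for the example.

```python
import pandas as pd

# Quantities copied from the frame printed in this answer.
df = pd.DataFrame({
    "ID": list("ABCDE"),
    "Category1": ["Oil", "Gas", "Refined", "Refined", "Oil"],
    "Category2": ["Gas", "Water", "Water", "Oil", "Gas"],
    "Quantity1": [74, 75, 24, 19, 50],
    "Quantity2": [49, 91, 38, 95, 35],
})

# Pivot each Category/Quantity pair on its own.
first = df.pivot(index="ID", columns="Category1", values="Quantity1")
second = df.pivot(index="ID", columns="Category2", values="Quantity2")

# Fill the gaps in the first pivot with values from the second.
result = first.combine_first(second)
print(result)
```

One caveat of this design: if a row carries the same category in both Category1 and Category2, combine_first keeps only the first quantity instead of adding the two, so it suits data where each ID lists a category at most once.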

answered Nov 14 '18 at 22:39

kabanus

