Duplicate row of low occurrence in pandas dataframe

In the following dataset what's the best way to duplicate row with groupby(['Type']) count < 3 to 3. df is the input, and df1 is my desired outcome. You see row 3 from df was duplicated by 2 times at the end. This is only an example deck. the real data has approximately 20mil lines and 400K unique Types, thus a method that does this efficiently is desired.

>>> df
 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1

Thought about using something like the following but do not know the best way to write the func.

df.groupby('Type').apply(func)

Thank you in advance.

asked Nov 15 '18 at 7:18

elgnoh

191212

What is append if removed 6 c 1 ? Last value per group c ? there is more columns or only 2 ?

– jezrael
Nov 15 '18 at 7:26

What is the output if the group contains count 2 which value need to duplicate in this case?

– Sandeep Kadapa
Nov 15 '18 at 7:33

Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.

– elgnoh
Nov 15 '18 at 7:35

in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.

– elgnoh
Nov 15 '18 at 7:44

@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?

– Sandeep Kadapa
Nov 15 '18 at 7:52

add a comment |

>>> df
 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1

Thought about using something like the following but do not know the best way to write the func.

df.groupby('Type').apply(func)

Thank you in advance.

asked Nov 15 '18 at 7:18

elgnoh

191212

What is append if removed 6 c 1 ? Last value per group c ? there is more columns or only 2 ?

– jezrael
Nov 15 '18 at 7:26

What is the output if the group contains count 2 which value need to duplicate in this case?

– Sandeep Kadapa
Nov 15 '18 at 7:33

Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.

– elgnoh
Nov 15 '18 at 7:35

in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.

– elgnoh
Nov 15 '18 at 7:44

@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?

– Sandeep Kadapa
Nov 15 '18 at 7:52

add a comment |

>>> df
 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1

Thought about using something like the following but do not know the best way to write the func.

df.groupby('Type').apply(func)

Thank you in advance.

asked Nov 15 '18 at 7:18

elgnoh

191212

>>> df
 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1

Thought about using something like the following but do not know the best way to write the func.

df.groupby('Type').apply(func)

Thank you in advance.

python pandas

asked Nov 15 '18 at 7:18

elgnoh

191212

asked Nov 15 '18 at 7:18

elgnoh

191212

asked Nov 15 '18 at 7:18

elgnoh

191212

asked Nov 15 '18 at 7:18

elgnoh

191212

asked Nov 15 '18 at 7:18

elgnoh

191212

What is append if removed 6 c 1 ? Last value per group c ? there is more columns or only 2 ?

– jezrael
Nov 15 '18 at 7:26

What is the output if the group contains count 2 which value need to duplicate in this case?

– Sandeep Kadapa
Nov 15 '18 at 7:33

Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.

– elgnoh
Nov 15 '18 at 7:35

in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.

– elgnoh
Nov 15 '18 at 7:44

@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?

– Sandeep Kadapa
Nov 15 '18 at 7:52

add a comment |

What is append if removed 6 c 1 ? Last value per group c ? there is more columns or only 2 ?

– jezrael
Nov 15 '18 at 7:26

What is the output if the group contains count 2 which value need to duplicate in this case?

– Sandeep Kadapa
Nov 15 '18 at 7:33

Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.

– elgnoh
Nov 15 '18 at 7:35

in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.

– elgnoh
Nov 15 '18 at 7:44

@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?

– Sandeep Kadapa
Nov 15 '18 at 7:52

What is append if removed 6 c 1 ? Last value per group c ? there is more columns or only 2 ?

– jezrael
Nov 15 '18 at 7:26

What is the output if the group contains count 2 which value need to duplicate in this case?

– Sandeep Kadapa
Nov 15 '18 at 7:33

Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.

– elgnoh
Nov 15 '18 at 7:35

in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.

– elgnoh
Nov 15 '18 at 7:44

@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?

– Sandeep Kadapa
Nov 15 '18 at 7:52

add a comment |

1 Answer
1

active

oldest

votes

Use value_counts with map and repeat:

counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(), 
 sort=False, ignore_index=True)[['Type','Val']]

print(df)

 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1

Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.

EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:

df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(), 
 sort=False, ignore_index=True)

edited Nov 15 '18 at 18:02

answered Nov 15 '18 at 7:37

Sandeep Kadapa

7,308831

Better you use sort=False to sort=True or vice versa as require rather using lower version

– pygo
Nov 15 '18 at 8:22

Thank you for the answer. would this work if my dataset has more than 1 Val column?

– elgnoh
Nov 15 '18 at 17:44

@elgnoh It works for more than 1 val check EDIT.

– Sandeep Kadapa
Nov 15 '18 at 18:03

@elgnoh Please upvote and accept if the solution provided your required result.

– Sandeep Kadapa
Nov 23 '18 at 5:47

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53314228%2fduplicate-row-of-low-occurrence-in-pandas-dataframe%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

Use value_counts with map and repeat:

counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(), 
 sort=False, ignore_index=True)[['Type','Val']]

print(df)

 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1

Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.

EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:

df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(), 
 sort=False, ignore_index=True)

edited Nov 15 '18 at 18:02

answered Nov 15 '18 at 7:37

Sandeep Kadapa

7,308831

Better you use sort=False to sort=True or vice versa as require rather using lower version

– pygo
Nov 15 '18 at 8:22

Thank you for the answer. would this work if my dataset has more than 1 Val column?

– elgnoh
Nov 15 '18 at 17:44

@elgnoh It works for more than 1 val check EDIT.

– Sandeep Kadapa
Nov 15 '18 at 18:03

@elgnoh Please upvote and accept if the solution provided your required result.

– Sandeep Kadapa
Nov 23 '18 at 5:47

add a comment |

Use value_counts with map and repeat:

counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(), 
 sort=False, ignore_index=True)[['Type','Val']]

print(df)

 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1

Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.

EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:

df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(), 
 sort=False, ignore_index=True)

edited Nov 15 '18 at 18:02

answered Nov 15 '18 at 7:37

Sandeep Kadapa

7,308831

Better you use sort=False to sort=True or vice versa as require rather using lower version

– pygo
Nov 15 '18 at 8:22

Thank you for the answer. would this work if my dataset has more than 1 Val column?

– elgnoh
Nov 15 '18 at 17:44

@elgnoh It works for more than 1 val check EDIT.

– Sandeep Kadapa
Nov 15 '18 at 18:03

@elgnoh Please upvote and accept if the solution provided your required result.

– Sandeep Kadapa
Nov 23 '18 at 5:47

add a comment |

Use value_counts with map and repeat:

counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(), 
 sort=False, ignore_index=True)[['Type','Val']]

print(df)

 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1

Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.

EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:

df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(), 
 sort=False, ignore_index=True)

edited Nov 15 '18 at 18:02

answered Nov 15 '18 at 7:37

Sandeep Kadapa

7,308831

Use value_counts with map and repeat:

counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(), 
 sort=False, ignore_index=True)[['Type','Val']]

print(df)

 Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1

Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.

EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:

df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(), 
 sort=False, ignore_index=True)

edited Nov 15 '18 at 18:02

answered Nov 15 '18 at 7:37

Sandeep Kadapa

7,308831

edited Nov 15 '18 at 18:02

answered Nov 15 '18 at 7:37

Sandeep Kadapa

7,308831

answered Nov 15 '18 at 7:37

Sandeep Kadapa

7,308831

answered Nov 15 '18 at 7:37

Sandeep Kadapa

7,308831

Better you use sort=False to sort=True or vice versa as require rather using lower version

– pygo
Nov 15 '18 at 8:22

Thank you for the answer. would this work if my dataset has more than 1 Val column?

– elgnoh
Nov 15 '18 at 17:44

@elgnoh It works for more than 1 val check EDIT.

– Sandeep Kadapa
Nov 15 '18 at 18:03

@elgnoh Please upvote and accept if the solution provided your required result.

– Sandeep Kadapa
Nov 23 '18 at 5:47

add a comment |

Better you use sort=False to sort=True or vice versa as require rather using lower version

– pygo
Nov 15 '18 at 8:22

Thank you for the answer. would this work if my dataset has more than 1 Val column?

– elgnoh
Nov 15 '18 at 17:44

@elgnoh It works for more than 1 val check EDIT.

– Sandeep Kadapa
Nov 15 '18 at 18:03

@elgnoh Please upvote and accept if the solution provided your required result.

– Sandeep Kadapa
Nov 23 '18 at 5:47

Better you use sort=False to sort=True or vice versa as require rather using lower version

– pygo
Nov 15 '18 at 8:22

Thank you for the answer. would this work if my dataset has more than 1 Val column?

– elgnoh
Nov 15 '18 at 17:44

@elgnoh It works for more than 1 val check EDIT.

– Sandeep Kadapa
Nov 15 '18 at 18:03

@elgnoh Please upvote and accept if the solution provided your required result.

– Sandeep Kadapa
Nov 23 '18 at 5:47

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Odtnhj