Duplicate row of low occurrence in pandas dataframe










0















In the following dataset what's the best way to duplicate row with groupby(['Type']) count < 3 to 3. df is the input, and df1 is my desired outcome. You see row 3 from df was duplicated by 2 times at the end. This is only an example deck. the real data has approximately 20mil lines and 400K unique Types, thus a method that does this efficiently is desired.



>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1


Thought about using something like the following but do not know the best way to write the func.



df.groupby('Type').apply(func)


Thank you in advance.










share|improve this question






















  • What is append if removed 6 c 1 ? Last value per group c ? there is more columns or only 2 ?

    – jezrael
    Nov 15 '18 at 7:26











  • What is the output if the group contains count 2 which value need to duplicate in this case?

    – Sandeep Kadapa
    Nov 15 '18 at 7:33











  • Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.

    – elgnoh
    Nov 15 '18 at 7:35











  • in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.

    – elgnoh
    Nov 15 '18 at 7:44











  • @elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?

    – Sandeep Kadapa
    Nov 15 '18 at 7:52















0















In the following dataset what's the best way to duplicate row with groupby(['Type']) count < 3 to 3. df is the input, and df1 is my desired outcome. You see row 3 from df was duplicated by 2 times at the end. This is only an example deck. the real data has approximately 20mil lines and 400K unique Types, thus a method that does this efficiently is desired.



>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1


Thought about using something like the following but do not know the best way to write the func.



df.groupby('Type').apply(func)


Thank you in advance.










share|improve this question






















  • What is append if removed 6 c 1 ? Last value per group c ? there is more columns or only 2 ?

    – jezrael
    Nov 15 '18 at 7:26











  • What is the output if the group contains count 2 which value need to duplicate in this case?

    – Sandeep Kadapa
    Nov 15 '18 at 7:33











  • Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.

    – elgnoh
    Nov 15 '18 at 7:35











  • in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.

    – elgnoh
    Nov 15 '18 at 7:44











  • @elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?

    – Sandeep Kadapa
    Nov 15 '18 at 7:52













0












0








0








In the following dataset what's the best way to duplicate row with groupby(['Type']) count < 3 to 3. df is the input, and df1 is my desired outcome. You see row 3 from df was duplicated by 2 times at the end. This is only an example deck. the real data has approximately 20mil lines and 400K unique Types, thus a method that does this efficiently is desired.



>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1


Thought about using something like the following but do not know the best way to write the func.



df.groupby('Type').apply(func)


Thank you in advance.










share|improve this question














In the following dataset what's the best way to duplicate row with groupby(['Type']) count < 3 to 3. df is the input, and df1 is my desired outcome. You see row 3 from df was duplicated by 2 times at the end. This is only an example deck. the real data has approximately 20mil lines and 400K unique Types, thus a method that does this efficiently is desired.



>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1


Thought about using something like the following but do not know the best way to write the func.



df.groupby('Type').apply(func)


Thank you in advance.







python pandas






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 15 '18 at 7:18









elgnohelgnoh

191212




191212












  • What is append if removed 6 c 1 ? Last value per group c ? there is more columns or only 2 ?

    – jezrael
    Nov 15 '18 at 7:26











  • What is the output if the group contains count 2 which value need to duplicate in this case?

    – Sandeep Kadapa
    Nov 15 '18 at 7:33











  • Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.

    – elgnoh
    Nov 15 '18 at 7:35











  • in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.

    – elgnoh
    Nov 15 '18 at 7:44











  • @elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?

    – Sandeep Kadapa
    Nov 15 '18 at 7:52

















  • What is append if removed 6 c 1 ? Last value per group c ? there is more columns or only 2 ?

    – jezrael
    Nov 15 '18 at 7:26











  • What is the output if the group contains count 2 which value need to duplicate in this case?

    – Sandeep Kadapa
    Nov 15 '18 at 7:33











  • Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.

    – elgnoh
    Nov 15 '18 at 7:35











  • in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.

    – elgnoh
    Nov 15 '18 at 7:44











  • @elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?

    – Sandeep Kadapa
    Nov 15 '18 at 7:52
















What is append if removed 6 c 1 ? Last value per group c ? there is more columns or only 2 ?

– jezrael
Nov 15 '18 at 7:26





What is append if removed 6 c 1 ? Last value per group c ? there is more columns or only 2 ?

– jezrael
Nov 15 '18 at 7:26













What is the output if the group contains count 2 which value need to duplicate in this case?

– Sandeep Kadapa
Nov 15 '18 at 7:33





What is the output if the group contains count 2 which value need to duplicate in this case?

– Sandeep Kadapa
Nov 15 '18 at 7:33













Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.

– elgnoh
Nov 15 '18 at 7:35





Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.

– elgnoh
Nov 15 '18 at 7:35













in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.

– elgnoh
Nov 15 '18 at 7:44





in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.

– elgnoh
Nov 15 '18 at 7:44













@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?

– Sandeep Kadapa
Nov 15 '18 at 7:52





@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?

– Sandeep Kadapa
Nov 15 '18 at 7:52












1 Answer
1






active

oldest

votes


















0














Use value_counts with map and repeat:



counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]

print(df)

Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1


Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.



EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:



df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(), 
sort=False, ignore_index=True)





share|improve this answer

























  • Better you use sort=False to sort=True or vice versa as require rather using lower version

    – pygo
    Nov 15 '18 at 8:22












  • Thank you for the answer. would this work if my dataset has more than 1 Val column?

    – elgnoh
    Nov 15 '18 at 17:44











  • @elgnoh It works for more than 1 val check EDIT.

    – Sandeep Kadapa
    Nov 15 '18 at 18:03











  • @elgnoh Please upvote and accept if the solution provided your required result.

    – Sandeep Kadapa
    Nov 23 '18 at 5:47










Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53314228%2fduplicate-row-of-low-occurrence-in-pandas-dataframe%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









0














Use value_counts with map and repeat:



counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]

print(df)

Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1


Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.



EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:



df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(), 
sort=False, ignore_index=True)





share|improve this answer

























  • Better you use sort=False to sort=True or vice versa as require rather using lower version

    – pygo
    Nov 15 '18 at 8:22












  • Thank you for the answer. would this work if my dataset has more than 1 Val column?

    – elgnoh
    Nov 15 '18 at 17:44











  • @elgnoh It works for more than 1 val check EDIT.

    – Sandeep Kadapa
    Nov 15 '18 at 18:03











  • @elgnoh Please upvote and accept if the solution provided your required result.

    – Sandeep Kadapa
    Nov 23 '18 at 5:47















0














Use value_counts with map and repeat:



counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]

print(df)

Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1


Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.



EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:



df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(), 
sort=False, ignore_index=True)





share|improve this answer

























  • Better you use sort=False to sort=True or vice versa as require rather using lower version

    – pygo
    Nov 15 '18 at 8:22












  • Thank you for the answer. would this work if my dataset has more than 1 Val column?

    – elgnoh
    Nov 15 '18 at 17:44











  • @elgnoh It works for more than 1 val check EDIT.

    – Sandeep Kadapa
    Nov 15 '18 at 18:03











  • @elgnoh Please upvote and accept if the solution provided your required result.

    – Sandeep Kadapa
    Nov 23 '18 at 5:47













0












0








0







Use value_counts with map and repeat:



counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]

print(df)

Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1


Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.



EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:



df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(), 
sort=False, ignore_index=True)





share|improve this answer















Use value_counts with map and repeat:



counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]

print(df)

Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1


Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.



EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:



df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(), 
sort=False, ignore_index=True)






share|improve this answer














share|improve this answer



share|improve this answer








edited Nov 15 '18 at 18:02

























answered Nov 15 '18 at 7:37









Sandeep KadapaSandeep Kadapa

7,308831




7,308831












  • Better you use sort=False to sort=True or vice versa as require rather using lower version

    – pygo
    Nov 15 '18 at 8:22












  • Thank you for the answer. would this work if my dataset has more than 1 Val column?

    – elgnoh
    Nov 15 '18 at 17:44











  • @elgnoh It works for more than 1 val check EDIT.

    – Sandeep Kadapa
    Nov 15 '18 at 18:03











  • @elgnoh Please upvote and accept if the solution provided your required result.

    – Sandeep Kadapa
    Nov 23 '18 at 5:47

















  • Better you use sort=False to sort=True or vice versa as require rather using lower version

    – pygo
    Nov 15 '18 at 8:22












  • Thank you for the answer. would this work if my dataset has more than 1 Val column?

    – elgnoh
    Nov 15 '18 at 17:44











  • @elgnoh It works for more than 1 val check EDIT.

    – Sandeep Kadapa
    Nov 15 '18 at 18:03











  • @elgnoh Please upvote and accept if the solution provided your required result.

    – Sandeep Kadapa
    Nov 23 '18 at 5:47
















Better you use sort=False to sort=True or vice versa as require rather using lower version

– pygo
Nov 15 '18 at 8:22






Better you use sort=False to sort=True or vice versa as require rather using lower version

– pygo
Nov 15 '18 at 8:22














Thank you for the answer. would this work if my dataset has more than 1 Val column?

– elgnoh
Nov 15 '18 at 17:44





Thank you for the answer. would this work if my dataset has more than 1 Val column?

– elgnoh
Nov 15 '18 at 17:44













@elgnoh It works for more than 1 val check EDIT.

– Sandeep Kadapa
Nov 15 '18 at 18:03





@elgnoh It works for more than 1 val check EDIT.

– Sandeep Kadapa
Nov 15 '18 at 18:03













@elgnoh Please upvote and accept if the solution provided your required result.

– Sandeep Kadapa
Nov 23 '18 at 5:47





@elgnoh Please upvote and accept if the solution provided your required result.

– Sandeep Kadapa
Nov 23 '18 at 5:47



















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53314228%2fduplicate-row-of-low-occurrence-in-pandas-dataframe%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







這個網誌中的熱門文章

What does pagestruct do in Eviews?

Dutch intervention in Lombok and Karangasem

Channel Islands