Duplicate row of low occurrence in pandas dataframe
In the following dataset what's the best way to duplicate row with groupby(['Type']) count < 3 to 3. df is the input, and df1 is my desired outcome. You see row 3 from df was duplicated by 2 times at the end. This is only an example deck. the real data has approximately 20mil lines and 400K unique Types, thus a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Thought about using something like the following but do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
python pandas
add a comment |
In the following dataset what's the best way to duplicate row with groupby(['Type']) count < 3 to 3. df is the input, and df1 is my desired outcome. You see row 3 from df was duplicated by 2 times at the end. This is only an example deck. the real data has approximately 20mil lines and 400K unique Types, thus a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Thought about using something like the following but do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
python pandas
What is append if removed6 c 1? Last value per groupc? there is more columns or only 2 ?
– jezrael
Nov 15 '18 at 7:26
What is the output if the group contains count 2 which value need to duplicate in this case?
– Sandeep Kadapa
Nov 15 '18 at 7:33
Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.
– elgnoh
Nov 15 '18 at 7:35
in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.
– elgnoh
Nov 15 '18 at 7:44
@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?
– Sandeep Kadapa
Nov 15 '18 at 7:52
add a comment |
In the following dataset what's the best way to duplicate row with groupby(['Type']) count < 3 to 3. df is the input, and df1 is my desired outcome. You see row 3 from df was duplicated by 2 times at the end. This is only an example deck. the real data has approximately 20mil lines and 400K unique Types, thus a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Thought about using something like the following but do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
python pandas
In the following dataset what's the best way to duplicate row with groupby(['Type']) count < 3 to 3. df is the input, and df1 is my desired outcome. You see row 3 from df was duplicated by 2 times at the end. This is only an example deck. the real data has approximately 20mil lines and 400K unique Types, thus a method that does this efficiently is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Thought about using something like the following but do not know the best way to write the func.
df.groupby('Type').apply(func)
Thank you in advance.
python pandas
python pandas
asked Nov 15 '18 at 7:18
elgnohelgnoh
191212
191212
What is append if removed6 c 1? Last value per groupc? there is more columns or only 2 ?
– jezrael
Nov 15 '18 at 7:26
What is the output if the group contains count 2 which value need to duplicate in this case?
– Sandeep Kadapa
Nov 15 '18 at 7:33
Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.
– elgnoh
Nov 15 '18 at 7:35
in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.
– elgnoh
Nov 15 '18 at 7:44
@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?
– Sandeep Kadapa
Nov 15 '18 at 7:52
add a comment |
What is append if removed6 c 1? Last value per groupc? there is more columns or only 2 ?
– jezrael
Nov 15 '18 at 7:26
What is the output if the group contains count 2 which value need to duplicate in this case?
– Sandeep Kadapa
Nov 15 '18 at 7:33
Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.
– elgnoh
Nov 15 '18 at 7:35
in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.
– elgnoh
Nov 15 '18 at 7:44
@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?
– Sandeep Kadapa
Nov 15 '18 at 7:52
What is append if removed
6 c 1 ? Last value per group c ? there is more columns or only 2 ?– jezrael
Nov 15 '18 at 7:26
What is append if removed
6 c 1 ? Last value per group c ? there is more columns or only 2 ?– jezrael
Nov 15 '18 at 7:26
What is the output if the group contains count 2 which value need to duplicate in this case?
– Sandeep Kadapa
Nov 15 '18 at 7:33
What is the output if the group contains count 2 which value need to duplicate in this case?
– Sandeep Kadapa
Nov 15 '18 at 7:33
Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.
– elgnoh
Nov 15 '18 at 7:35
Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.
– elgnoh
Nov 15 '18 at 7:35
in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.
– elgnoh
Nov 15 '18 at 7:44
in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.
– elgnoh
Nov 15 '18 at 7:44
@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?
– Sandeep Kadapa
Nov 15 '18 at 7:52
@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?
– Sandeep Kadapa
Nov 15 '18 at 7:52
add a comment |
1 Answer
1
active
oldest
votes
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.
EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)
Better you usesort=Falsetosort=Trueor vice versa as require rather using lower version
– pygo
Nov 15 '18 at 8:22
Thank you for the answer. would this work if my dataset has more than 1 Val column?
– elgnoh
Nov 15 '18 at 17:44
@elgnoh It works for more than 1 val check EDIT.
– Sandeep Kadapa
Nov 15 '18 at 18:03
@elgnoh Pleaseupvote and acceptif the solution provided your required result.
– Sandeep Kadapa
Nov 23 '18 at 5:47
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53314228%2fduplicate-row-of-low-occurrence-in-pandas-dataframe%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.
EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)
Better you usesort=Falsetosort=Trueor vice versa as require rather using lower version
– pygo
Nov 15 '18 at 8:22
Thank you for the answer. would this work if my dataset has more than 1 Val column?
– elgnoh
Nov 15 '18 at 17:44
@elgnoh It works for more than 1 val check EDIT.
– Sandeep Kadapa
Nov 15 '18 at 18:03
@elgnoh Pleaseupvote and acceptif the solution provided your required result.
– Sandeep Kadapa
Nov 23 '18 at 5:47
add a comment |
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.
EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)
Better you usesort=Falsetosort=Trueor vice versa as require rather using lower version
– pygo
Nov 15 '18 at 8:22
Thank you for the answer. would this work if my dataset has more than 1 Val column?
– elgnoh
Nov 15 '18 at 17:44
@elgnoh It works for more than 1 val check EDIT.
– Sandeep Kadapa
Nov 15 '18 at 18:03
@elgnoh Pleaseupvote and acceptif the solution provided your required result.
– Sandeep Kadapa
Nov 23 '18 at 5:47
add a comment |
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.
EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note : sort=False for append is present in pandas>=0.23.0, remove if using lower version.
EDIT : If data contains multiple val columns then make all columns columns as index expcept one column and repeat and then reset_index as:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
sort=False, ignore_index=True)
edited Nov 15 '18 at 18:02
answered Nov 15 '18 at 7:37
Sandeep KadapaSandeep Kadapa
7,308831
7,308831
Better you usesort=Falsetosort=Trueor vice versa as require rather using lower version
– pygo
Nov 15 '18 at 8:22
Thank you for the answer. would this work if my dataset has more than 1 Val column?
– elgnoh
Nov 15 '18 at 17:44
@elgnoh It works for more than 1 val check EDIT.
– Sandeep Kadapa
Nov 15 '18 at 18:03
@elgnoh Pleaseupvote and acceptif the solution provided your required result.
– Sandeep Kadapa
Nov 23 '18 at 5:47
add a comment |
Better you usesort=Falsetosort=Trueor vice versa as require rather using lower version
– pygo
Nov 15 '18 at 8:22
Thank you for the answer. would this work if my dataset has more than 1 Val column?
– elgnoh
Nov 15 '18 at 17:44
@elgnoh It works for more than 1 val check EDIT.
– Sandeep Kadapa
Nov 15 '18 at 18:03
@elgnoh Pleaseupvote and acceptif the solution provided your required result.
– Sandeep Kadapa
Nov 23 '18 at 5:47
Better you use
sort=False to sort=True or vice versa as require rather using lower version– pygo
Nov 15 '18 at 8:22
Better you use
sort=False to sort=True or vice versa as require rather using lower version– pygo
Nov 15 '18 at 8:22
Thank you for the answer. would this work if my dataset has more than 1 Val column?
– elgnoh
Nov 15 '18 at 17:44
Thank you for the answer. would this work if my dataset has more than 1 Val column?
– elgnoh
Nov 15 '18 at 17:44
@elgnoh It works for more than 1 val check EDIT.
– Sandeep Kadapa
Nov 15 '18 at 18:03
@elgnoh It works for more than 1 val check EDIT.
– Sandeep Kadapa
Nov 15 '18 at 18:03
@elgnoh Please
upvote and accept if the solution provided your required result.– Sandeep Kadapa
Nov 23 '18 at 5:47
@elgnoh Please
upvote and accept if the solution provided your required result.– Sandeep Kadapa
Nov 23 '18 at 5:47
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53314228%2fduplicate-row-of-low-occurrence-in-pandas-dataframe%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
What is append if removed
6 c 1? Last value per groupc? there is more columns or only 2 ?– jezrael
Nov 15 '18 at 7:26
What is the output if the group contains count 2 which value need to duplicate in this case?
– Sandeep Kadapa
Nov 15 '18 at 7:33
Good question Sandeep, for now let's assume if a group has a count 2, we duplicate both thus ending up with total 4 rows.
– elgnoh
Nov 15 '18 at 7:35
in my real case, the threshold for duplication is around 1000 and those low count group's count can vary from 1-999, idealy I want to duplicate ceil(1000/grpCnt)-1 times then fill the remaining gap to 1000 with randomly sampled from the group.
– elgnoh
Nov 15 '18 at 7:44
@elgnoh You mean say that overall number of duplications cannot be greater than 1000, but each group should contains atleast 3 values?
– Sandeep Kadapa
Nov 15 '18 at 7:52