Pandas: check whether at least one value in duplicate rows is 1
This problem may be rather specific, but I bet many may encounter this as well.
So I have a DataFrame in a form like:
import pandas as pd

asd = pd.DataFrame({'Col1': ['a', 'b', 'b', 'a', 'a'], 'Col2': [0, 0, 0, 1, 1]})
The resulting table looks like this:
I -- Col1 -- Col2
1 -- a -- 0
2 -- b -- 0
3 -- b -- 0
4 -- a -- 1
5 -- a -- 1
What I am trying to do: if at least one "a" row in Col1 has a corresponding value of 1 in Col2, then put 1 in Col3 for all rows where Col1 is "a"; otherwise (if not even one "a" row has a value of 1), put 0 in Col3 for all "a" rows. Then repeat for every other value in Col1.
The result of the operation should look like this:
I -- Col1 -- Col2 -- Col3
1 -- a -- 0 -- 1 (because "a" has a value of 1 in the 4th and 5th rows)
2 -- b -- 0 -- 0 (because all "b" rows have a value of 0)
3 -- b -- 0 -- 0
4 -- a -- 1 -- 1
5 -- a -- 1 -- 1
Currently I am doing this:
asd['Col3'] = 0
col1_uniques = asd.drop_duplicates(subset='Col1')['Col1']
small_dataframes = []

for i in col1_uniques:
    small_df = asd.loc[asd.Col1 == i]
    if small_df.Col2.max() == 1:
        small_df['Col3'] = 1
    small_dataframes.append(small_df)
I then reassemble the DataFrame from these pieces.
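A minimal sketch of that reassembly step, assuming a plain pd.concat (the original code does not show it):

# Assumption: stitch the per-value pieces back together and restore the original row order
asd = pd.concat(small_dataframes).sort_index()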
However, that takes too much time (I have about 80,000 unique values in Col1). In fact, while I was writing this, it hadn't finished even a quarter of the job.
Is there a better way to do it?
python pandas
3 Answers
Another, faster method without groupby, using np.where and isin:
import numpy as np

v = asd.loc[asd['Col2'].eq(1), 'Col1'].unique()
asd['Col3'] = np.where(asd['Col1'].isin(v), 1, 0)
print(asd)
Col1 Col2 Col3
0 a 0 1
1 b 0 0
2 b 0 0
3 a 1 1
4 a 1 1
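A minor variant of the same idea, staying in plain pandas without np.where (my rewording, not part of the original answer): the boolean mask from isin can simply be cast to int.

# Equivalent result: True/False from isin becomes 1/0
asd['Col3'] = asd['Col1'].isin(v).astype(int)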
By far the fastest method and the most intuitive. Thanks a lot.
– Askar Akhmedov
Nov 15 '18 at 5:33
My understanding is that you need to repeat the process for all unique values in Col1, so you will need groupby:
asd['Col3'] = asd.groupby('Col1').Col2.transform(lambda x: x.eq(1).any().astype(int))
Col1 Col2 Col3
0 a 0 1
1 b 0 0
2 b 0 0
3 a 1 1
4 a 1 1
Option 2: A similar solution to the above, but using map:
d = asd.groupby('Col1').Col2.apply(lambda x: x.eq(1).any().astype(int)).to_dict()
asd['Col3'] = asd['Col1'].map(d)
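To see what the grouped expression produces before it is written back to the rows, here is an illustrative snippet (assuming the sample asd from the question):

# One boolean per unique Col1 value: does any row of that group have Col2 == 1?
print(asd.groupby('Col1').Col2.apply(lambda x: x.eq(1).any()))
# Col1
# a     True
# b    False

transform broadcasts that per-group value back onto every row of the group, while map pushes the per-group dictionary d onto Col1.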
Thank you! Both methods worked and finished in 34 and 24 seconds, respectively.
– Askar Akhmedov
Nov 15 '18 at 5:01
@AskarAkhmedov, that's great. The second solution works faster as it has to do the grouping only once per unique value in Col1.
– Vaishali
Nov 15 '18 at 5:03
You can do this with a groupby and an if statement. First group all items by Col1:
lists = asd.groupby("Col1").agg(lambda x: tuple(x))
This gives you:
Col2
Col1
a (0, 1, 1)
b (0, 0)
You can then iterate through the unique index values in lists, masking the original DataFrame and setting Col3 to 1 if a 1 is found in lists["Col2"].
asd["Col3"] = 0
for i in lists.index:
if 1 in lists.loc[i, "Col2"]:
asd.loc[asd["Col1"]==i, "Col3"] = 1
This results in:
Col1 Col2 Col3
0 a 0 1
1 b 0 0
2 b 0 0
3 a 1 1
4 a 1 1
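A loop-free variant of the same grouping idea, shown here only as a sketch (it is not the answerer's code): since Col2 contains only 0 and 1, the per-group max already encodes "at least one 1", and map pushes it back onto the rows.

# Per-group max (1 if any row of the group has Col2 == 1, else 0), mapped back to each row
asd['Col3'] = asd['Col1'].map(asd.groupby('Col1')['Col2'].max())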
Abhi's answer is better, and likely much faster. I also didn't realize you could use np.where within Pandas.
– jake.nunemaker
Nov 15 '18 at 4:33
You can use all np functions in pandas, as pandas is built on NumPy.
– d_kennetz
Nov 15 '18 at 4:35
In general, if you're writing loops for dataframes, there's a better way :)
– Evan
Nov 15 '18 at 4:39