Pandas: check whether at least one value in duplicate rows is 1
This problem may be rather specific, but I bet many may encounter this as well.
So I have a DataFrame in a form like:
import pandas as pd

asd = pd.DataFrame({'Col1': ['a', 'b', 'b', 'a', 'a'], 'Col2': [0, 0, 0, 1, 1]})
The resulting table looks like this:
I -- Col1 -- Col2
1 -- a -- 0
2 -- b -- 0
3 -- b -- 0
4 -- a -- 1
5 -- a -- 1
What I am trying to do: if at least one "a" row in Col1 has a corresponding value of 1 in Col2, then put 1 in Col3 for all rows where Col1 is "a"; otherwise (if not even one "a" row has a value of 1), put 0 in Col3 for all "a" rows. Then repeat for every other value in Col1.
The result of the operation should look like this:
I -- Col1 -- Col2 -- Col3
1 -- a -- 0 -- 1 (because "a" has a value of 1 in the 4th and 5th rows)
2 -- b -- 0 -- 0 (because all "b" rows have a value of 0)
3 -- b -- 0 -- 0
4 -- a -- 1 -- 1
5 -- a -- 1 -- 1
Currently I am doing this:
asd['Col3'] = 0
col1_uniques = asd.drop_duplicates(subset='Col1')['Col1']
small_dataframes = []

for i in col1_uniques:
    small_df = asd.loc[asd.Col1 == i]
    if small_df.Col2.max() == 1:
        small_df['Col3'] = 1
    small_dataframes.append(small_df)
I then reassemble the DataFrame from these pieces.
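A minimal sketch of that reassembly step, assuming a plain pd.concat (the original code does not show it):

# Assumption: stitch the per-value pieces back together and restore the original row order
asd = pd.concat(small_dataframes).sort_index()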
However, that takes too much time (I have about 80,000 unique values in Col1). In fact, while I was writing this, it hadn't finished even a quarter of the job.
Is there a better way to do it?
python pandas
3 Answers
Another, faster method without groupby, using np.where and isin:
import numpy as np

v = asd.loc[asd['Col2'].eq(1), 'Col1'].unique()
asd['Col3'] = np.where(asd['Col1'].isin(v), 1, 0)
print(asd)
Col1 Col2 Col3
0 a 0 1
1 b 0 0
2 b 0 0
3 a 1 1
4 a 1 1
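A minor variant of the same idea, staying in plain pandas without np.where (my rewording, not part of the original answer): the boolean mask from isin can simply be cast to int.

# Equivalent result: True/False from isin becomes 1/0
asd['Col3'] = asd['Col1'].isin(v).astype(int)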
By far the fastest method and the most intuitive. Thanks a lot.
– Askar Akhmedov
Nov 15 '18 at 5:33
My understanding is that you need to repeat the process for all unique values in Col1, so you will need groupby:
asd['Col3'] = asd.groupby('Col1').Col2.transform(lambda x: x.eq(1).any().astype(int))
Col1 Col2 Col3
0 a 0 1
1 b 0 0
2 b 0 0
3 a 1 1
4 a 1 1
Option 2: A similar solution to the above, but using map:
d = asd.groupby('Col1').Col2.apply(lambda x: x.eq(1).any().astype(int)).to_dict()
asd['Col3'] = asd['Col1'].map(d)
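To see what the grouped expression produces before it is written back to the rows, here is an illustrative snippet (assuming the sample asd from the question):

# One boolean per unique Col1 value: does any row of that group have Col2 == 1?
print(asd.groupby('Col1').Col2.apply(lambda x: x.eq(1).any()))
# Col1
# a     True
# b    False

transform broadcasts that per-group value back onto every row of the group, while map pushes the per-group dictionary d onto Col1.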
Thank you! Both methods worked and finished in 34 and 24 seconds, respectively.
– Askar Akhmedov
Nov 15 '18 at 5:01
@AskarAkhmedov, that's great. The second solution works faster as it has to do the grouping only once per unique value in Col1.
– Vaishali
Nov 15 '18 at 5:03
You can do this with a groupby and an if statement. First group all items by Col1:
lists = asd.groupby("Col1").agg(lambda x: tuple(x))
This gives you:
Col2
Col1
a (0, 1, 1)
b (0, 0)
You can then iterate through the unique index values in lists, masking the original DataFrame and setting Col3 to 1 if a 1 is found in lists["Col2"].
asd["Col3"] = 0
for i in lists.index:
if 1 in lists.loc[i, "Col2"]:
asd.loc[asd["Col1"]==i, "Col3"] = 1
This results in:
Col1 Col2 Col3
0 a 0 1
1 b 0 0
2 b 0 0
3 a 1 1
4 a 1 1
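A loop-free variant of the same grouping idea, shown here only as a sketch (it is not the answerer's code): since Col2 contains only 0 and 1, the per-group max already encodes "at least one 1", and map pushes it back onto the rows.

# Per-group max (1 if any row of the group has Col2 == 1, else 0), mapped back to each row
asd['Col3'] = asd['Col1'].map(asd.groupby('Col1')['Col2'].max())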
Abhi's answer is better, and likely much faster. I also didn't realize you could use np.where within Pandas.
– jake.nunemaker
Nov 15 '18 at 4:33
You can use all np functions in pandas, as pandas is built on NumPy.
– d_kennetz
Nov 15 '18 at 4:35
In general, if you're writing loops for dataframes, there's a better way :)
– Evan
Nov 15 '18 at 4:39