Python DataFrames: finding *almost" identical rows

I have a DF loaded with orders. Some of them contains negative quantities, and the reason for that is that they are actually cancellations of prior orders.

Problem, there is no unique key that can help me find back which order corresponds to which cancellation.

So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):

for i, item in cancelations.iterrows(): 
 #find a row similar to the cancelation we are currently studying: 
 #We use item[1] to access second value of the tuple given back by iterrows()
 mask1 = (copy['CustomerID'] == item['CustomerID']) 
 mask2 = (copy['Quantity'] == item['Quantity'])
 mask3 = (copy['Description'] == item['Description']) 
 subset = copy[ mask1 & mask2 & mask3]
 if subset.shape[0] >0: #if we find one or several corresponding orders :
 print('possible corresponding orders:', subset.index.tolist())
 copy = copy.drop(subset.index.tolist()[0]) #retrieve only the first ot them from the copy of the data

So, this works, but :
first, it takes forever to run; and second, I read somewhere that whenever you find yourself writing complex code to manipulate dataframes, there's already a method for it.
So perhaps one of you know something that could help me ?

thank you for your time !

edit : note that sometimes, we can have several orders that could correspond to the cancelation at hand. This is why I didn't use drop_duplicates with only some columns specified... because it eliminates all duplicates (or all but one) : I need to drop only one of them.

edited Nov 14 '18 at 17:19

asked Nov 14 '18 at 17:08

Benjamin Dubreu

526

Why won't you sort the orders by these columns and then use sliding window (rolling) to consider pairs of orders at a time?

– sophros
Nov 14 '18 at 17:11

I never heard about rolling, I'll look into that and see if it works. Thank you !

– Benjamin Dubreu
Nov 14 '18 at 17:19

1

you can probably use merge too. If you give some input and output data (few rows and necessary columns), you will get more answer :)

– Ben.T
Nov 14 '18 at 17:37

add a comment |

I have a DF loaded with orders. Some of them contains negative quantities, and the reason for that is that they are actually cancellations of prior orders.

Problem, there is no unique key that can help me find back which order corresponds to which cancellation.

So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):

for i, item in cancelations.iterrows(): 
 #find a row similar to the cancelation we are currently studying: 
 #We use item[1] to access second value of the tuple given back by iterrows()
 mask1 = (copy['CustomerID'] == item['CustomerID']) 
 mask2 = (copy['Quantity'] == item['Quantity'])
 mask3 = (copy['Description'] == item['Description']) 
 subset = copy[ mask1 & mask2 & mask3]
 if subset.shape[0] >0: #if we find one or several corresponding orders :
 print('possible corresponding orders:', subset.index.tolist())
 copy = copy.drop(subset.index.tolist()[0]) #retrieve only the first ot them from the copy of the data

thank you for your time !

edited Nov 14 '18 at 17:19

asked Nov 14 '18 at 17:08

Benjamin Dubreu

526

Why won't you sort the orders by these columns and then use sliding window (rolling) to consider pairs of orders at a time?

– sophros
Nov 14 '18 at 17:11

I never heard about rolling, I'll look into that and see if it works. Thank you !

– Benjamin Dubreu
Nov 14 '18 at 17:19

1

you can probably use merge too. If you give some input and output data (few rows and necessary columns), you will get more answer :)

– Ben.T
Nov 14 '18 at 17:37

add a comment |

I have a DF loaded with orders. Some of them contains negative quantities, and the reason for that is that they are actually cancellations of prior orders.

Problem, there is no unique key that can help me find back which order corresponds to which cancellation.

So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):

for i, item in cancelations.iterrows(): 
 #find a row similar to the cancelation we are currently studying: 
 #We use item[1] to access second value of the tuple given back by iterrows()
 mask1 = (copy['CustomerID'] == item['CustomerID']) 
 mask2 = (copy['Quantity'] == item['Quantity'])
 mask3 = (copy['Description'] == item['Description']) 
 subset = copy[ mask1 & mask2 & mask3]
 if subset.shape[0] >0: #if we find one or several corresponding orders :
 print('possible corresponding orders:', subset.index.tolist())
 copy = copy.drop(subset.index.tolist()[0]) #retrieve only the first ot them from the copy of the data

thank you for your time !

edited Nov 14 '18 at 17:19

asked Nov 14 '18 at 17:08

Benjamin Dubreu

526

I have a DF loaded with orders. Some of them contains negative quantities, and the reason for that is that they are actually cancellations of prior orders.

Problem, there is no unique key that can help me find back which order corresponds to which cancellation.

So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):

for i, item in cancelations.iterrows(): 
 #find a row similar to the cancelation we are currently studying: 
 #We use item[1] to access second value of the tuple given back by iterrows()
 mask1 = (copy['CustomerID'] == item['CustomerID']) 
 mask2 = (copy['Quantity'] == item['Quantity'])
 mask3 = (copy['Description'] == item['Description']) 
 subset = copy[ mask1 & mask2 & mask3]
 if subset.shape[0] >0: #if we find one or several corresponding orders :
 print('possible corresponding orders:', subset.index.tolist())
 copy = copy.drop(subset.index.tolist()[0]) #retrieve only the first ot them from the copy of the data

thank you for your time !

python pandas dataframe

edited Nov 14 '18 at 17:19

asked Nov 14 '18 at 17:08

Benjamin Dubreu

526

edited Nov 14 '18 at 17:19

asked Nov 14 '18 at 17:08

Benjamin Dubreu

526

edited Nov 14 '18 at 17:19

asked Nov 14 '18 at 17:08

Benjamin Dubreu

526

asked Nov 14 '18 at 17:08

Benjamin Dubreu

526

asked Nov 14 '18 at 17:08

Benjamin Dubreu

526

Why won't you sort the orders by these columns and then use sliding window (rolling) to consider pairs of orders at a time?

– sophros
Nov 14 '18 at 17:11

I never heard about rolling, I'll look into that and see if it works. Thank you !

– Benjamin Dubreu
Nov 14 '18 at 17:19

1

you can probably use merge too. If you give some input and output data (few rows and necessary columns), you will get more answer :)

– Ben.T
Nov 14 '18 at 17:37

add a comment |

Why won't you sort the orders by these columns and then use sliding window (rolling) to consider pairs of orders at a time?

– sophros
Nov 14 '18 at 17:11

I never heard about rolling, I'll look into that and see if it works. Thank you !

– Benjamin Dubreu
Nov 14 '18 at 17:19

1

you can probably use merge too. If you give some input and output data (few rows and necessary columns), you will get more answer :)

– Ben.T
Nov 14 '18 at 17:37

Why won't you sort the orders by these columns and then use sliding window (rolling) to consider pairs of orders at a time?

– sophros
Nov 14 '18 at 17:11

I never heard about rolling, I'll look into that and see if it works. Thank you !

– Benjamin Dubreu
Nov 14 '18 at 17:19

you can probably use merge too. If you give some input and output data (few rows and necessary columns), you will get more answer :)

– Ben.T
Nov 14 '18 at 17:37

add a comment |

0

active

oldest

votes

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53305438%2fpython-dataframes-finding-almost-identical-rows%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

0

active

oldest

votes

0

active

oldest

votes

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Odtnhj