Python DataFrames: finding *almost" identical rows










1















I have a DF loaded with orders. Some of them contains negative quantities, and the reason for that is that they are actually cancellations of prior orders.



Problem, there is no unique key that can help me find back which order corresponds to which cancellation.



So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):



for i, item in cancelations.iterrows(): 
#find a row similar to the cancelation we are currently studying:
#We use item[1] to access second value of the tuple given back by iterrows()
mask1 = (copy['CustomerID'] == item['CustomerID'])
mask2 = (copy['Quantity'] == item['Quantity'])
mask3 = (copy['Description'] == item['Description'])
subset = copy[ mask1 & mask2 & mask3]
if subset.shape[0] >0: #if we find one or several corresponding orders :
print('possible corresponding orders:', subset.index.tolist())
copy = copy.drop(subset.index.tolist()[0]) #retrieve only the first ot them from the copy of the data


So, this works, but :
first, it takes forever to run; and second, I read somewhere that whenever you find yourself writing complex code to manipulate dataframes, there's already a method for it.
So perhaps one of you know something that could help me ?



thank you for your time !



edit : note that sometimes, we can have several orders that could correspond to the cancelation at hand. This is why I didn't use drop_duplicates with only some columns specified... because it eliminates all duplicates (or all but one) : I need to drop only one of them.










share|improve this question
























  • Why won't you sort the orders by these columns and then use sliding window (rolling) to consider pairs of orders at a time?

    – sophros
    Nov 14 '18 at 17:11












  • I never heard about rolling, I'll look into that and see if it works. Thank you !

    – Benjamin Dubreu
    Nov 14 '18 at 17:19






  • 1





    you can probably use merge too. If you give some input and output data (few rows and necessary columns), you will get more answer :)

    – Ben.T
    Nov 14 '18 at 17:37
















1















I have a DF loaded with orders. Some of them contains negative quantities, and the reason for that is that they are actually cancellations of prior orders.



Problem, there is no unique key that can help me find back which order corresponds to which cancellation.



So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):



for i, item in cancelations.iterrows(): 
#find a row similar to the cancelation we are currently studying:
#We use item[1] to access second value of the tuple given back by iterrows()
mask1 = (copy['CustomerID'] == item['CustomerID'])
mask2 = (copy['Quantity'] == item['Quantity'])
mask3 = (copy['Description'] == item['Description'])
subset = copy[ mask1 & mask2 & mask3]
if subset.shape[0] >0: #if we find one or several corresponding orders :
print('possible corresponding orders:', subset.index.tolist())
copy = copy.drop(subset.index.tolist()[0]) #retrieve only the first ot them from the copy of the data


So, this works, but :
first, it takes forever to run; and second, I read somewhere that whenever you find yourself writing complex code to manipulate dataframes, there's already a method for it.
So perhaps one of you know something that could help me ?



thank you for your time !



edit : note that sometimes, we can have several orders that could correspond to the cancelation at hand. This is why I didn't use drop_duplicates with only some columns specified... because it eliminates all duplicates (or all but one) : I need to drop only one of them.










share|improve this question
























  • Why won't you sort the orders by these columns and then use sliding window (rolling) to consider pairs of orders at a time?

    – sophros
    Nov 14 '18 at 17:11












  • I never heard about rolling, I'll look into that and see if it works. Thank you !

    – Benjamin Dubreu
    Nov 14 '18 at 17:19






  • 1





    you can probably use merge too. If you give some input and output data (few rows and necessary columns), you will get more answer :)

    – Ben.T
    Nov 14 '18 at 17:37














1












1








1


0






I have a DF loaded with orders. Some of them contains negative quantities, and the reason for that is that they are actually cancellations of prior orders.



Problem, there is no unique key that can help me find back which order corresponds to which cancellation.



So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):



for i, item in cancelations.iterrows(): 
#find a row similar to the cancelation we are currently studying:
#We use item[1] to access second value of the tuple given back by iterrows()
mask1 = (copy['CustomerID'] == item['CustomerID'])
mask2 = (copy['Quantity'] == item['Quantity'])
mask3 = (copy['Description'] == item['Description'])
subset = copy[ mask1 & mask2 & mask3]
if subset.shape[0] >0: #if we find one or several corresponding orders :
print('possible corresponding orders:', subset.index.tolist())
copy = copy.drop(subset.index.tolist()[0]) #retrieve only the first ot them from the copy of the data


So, this works, but :
first, it takes forever to run; and second, I read somewhere that whenever you find yourself writing complex code to manipulate dataframes, there's already a method for it.
So perhaps one of you know something that could help me ?



thank you for your time !



edit : note that sometimes, we can have several orders that could correspond to the cancelation at hand. This is why I didn't use drop_duplicates with only some columns specified... because it eliminates all duplicates (or all but one) : I need to drop only one of them.










share|improve this question
















I have a DF loaded with orders. Some of them contains negative quantities, and the reason for that is that they are actually cancellations of prior orders.



Problem, there is no unique key that can help me find back which order corresponds to which cancellation.



So I've built the following code ('cancelations' is a subset of the original data containing only the rows that correspond to... well... cancelations):



for i, item in cancelations.iterrows(): 
#find a row similar to the cancelation we are currently studying:
#We use item[1] to access second value of the tuple given back by iterrows()
mask1 = (copy['CustomerID'] == item['CustomerID'])
mask2 = (copy['Quantity'] == item['Quantity'])
mask3 = (copy['Description'] == item['Description'])
subset = copy[ mask1 & mask2 & mask3]
if subset.shape[0] >0: #if we find one or several corresponding orders :
print('possible corresponding orders:', subset.index.tolist())
copy = copy.drop(subset.index.tolist()[0]) #retrieve only the first ot them from the copy of the data


So, this works, but :
first, it takes forever to run; and second, I read somewhere that whenever you find yourself writing complex code to manipulate dataframes, there's already a method for it.
So perhaps one of you know something that could help me ?



thank you for your time !



edit : note that sometimes, we can have several orders that could correspond to the cancelation at hand. This is why I didn't use drop_duplicates with only some columns specified... because it eliminates all duplicates (or all but one) : I need to drop only one of them.







python pandas dataframe






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 14 '18 at 17:19







Benjamin Dubreu

















asked Nov 14 '18 at 17:08









Benjamin DubreuBenjamin Dubreu

526




526












  • Why won't you sort the orders by these columns and then use sliding window (rolling) to consider pairs of orders at a time?

    – sophros
    Nov 14 '18 at 17:11












  • I never heard about rolling, I'll look into that and see if it works. Thank you !

    – Benjamin Dubreu
    Nov 14 '18 at 17:19






  • 1





    you can probably use merge too. If you give some input and output data (few rows and necessary columns), you will get more answer :)

    – Ben.T
    Nov 14 '18 at 17:37


















  • Why won't you sort the orders by these columns and then use sliding window (rolling) to consider pairs of orders at a time?

    – sophros
    Nov 14 '18 at 17:11












  • I never heard about rolling, I'll look into that and see if it works. Thank you !

    – Benjamin Dubreu
    Nov 14 '18 at 17:19






  • 1





    you can probably use merge too. If you give some input and output data (few rows and necessary columns), you will get more answer :)

    – Ben.T
    Nov 14 '18 at 17:37

















Why won't you sort the orders by these columns and then use sliding window (rolling) to consider pairs of orders at a time?

– sophros
Nov 14 '18 at 17:11






Why won't you sort the orders by these columns and then use sliding window (rolling) to consider pairs of orders at a time?

– sophros
Nov 14 '18 at 17:11














I never heard about rolling, I'll look into that and see if it works. Thank you !

– Benjamin Dubreu
Nov 14 '18 at 17:19





I never heard about rolling, I'll look into that and see if it works. Thank you !

– Benjamin Dubreu
Nov 14 '18 at 17:19




1




1





you can probably use merge too. If you give some input and output data (few rows and necessary columns), you will get more answer :)

– Ben.T
Nov 14 '18 at 17:37






you can probably use merge too. If you give some input and output data (few rows and necessary columns), you will get more answer :)

– Ben.T
Nov 14 '18 at 17:37













0






active

oldest

votes











Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53305438%2fpython-dataframes-finding-almost-identical-rows%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























0






active

oldest

votes








0






active

oldest

votes









active

oldest

votes






active

oldest

votes















draft saved

draft discarded
















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53305438%2fpython-dataframes-finding-almost-identical-rows%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







這個網誌中的熱門文章

What does pagestruct do in Eviews?

Dutch intervention in Lombok and Karangasem

Channel Islands