Fuzzy matching a sorted column with itself using python









up vote
0
down vote

favorite
1












I have a dataset of 200k rows with two columns: 1 - Unique customer id and address combination and 2 - revenue. The table is sorted by revenue and the goal is to clean up column 1 by doing a fuzzy match with itself to check if there are any close enough customer-address combinations with higher revenue that can be used to replace combinations with lesser revenue which most likely resulted from spelling differences.



Example:



Output



In the above example the third row is very similar to the first row so I want it to take the value of the first row.



I have a working python code but it is too slow:



import pandas as pd
import datetime
import time
import numpy as np
from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance, normalized_damerau_levenshtein_distance_ndarray

data = pd.read_csv("CustomerMaster.csv", encoding="ISO-8859-1")

# Create lookup column from the dataframe itself:
lookup_data=data['UNIQUE_ID']
lookup_data=pd.Series.to_frame(lookup_data)

# Start iterating on row by row on lookup data to find the first closest fuzzy match and write that back into dataframe:
start = time.time()
for index,row in data.iterrows():
if index%5000==0:print(index, time.time()-start)
for index2, row2 in lookup_data.iterrows():
ratio_val=normalized_damerau_levenshtein_distance(row['UNIQUE_ID'],row2['UNIQUE_ID'])
if ratio_val<0.15:
data.set_value(index,'UPDATED_ID',row2['UNIQUE_ID'])
data.set_value(index,'Ratio_Val',ratio_val)
break


Currently this fuzzy matching block of code is taking too long to run - about 8 hours for the first 15k rows with time increasing exponentially as one would expect. Any suggestion on how to more efficiently write this code?










share|improve this question



























    up vote
    0
    down vote

    favorite
    1












    I have a dataset of 200k rows with two columns: 1 - Unique customer id and address combination and 2 - revenue. The table is sorted by revenue and the goal is to clean up column 1 by doing a fuzzy match with itself to check if there are any close enough customer-address combinations with higher revenue that can be used to replace combinations with lesser revenue which most likely resulted from spelling differences.



    Example:



    Output



    In the above example the third row is very similar to the first row so I want it to take the value of the first row.



    I have a working python code but it is too slow:



    import pandas as pd
    import datetime
    import time
    import numpy as np
    from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance, normalized_damerau_levenshtein_distance_ndarray

    data = pd.read_csv("CustomerMaster.csv", encoding="ISO-8859-1")

    # Create lookup column from the dataframe itself:
    lookup_data=data['UNIQUE_ID']
    lookup_data=pd.Series.to_frame(lookup_data)

    # Start iterating on row by row on lookup data to find the first closest fuzzy match and write that back into dataframe:
    start = time.time()
    for index,row in data.iterrows():
    if index%5000==0:print(index, time.time()-start)
    for index2, row2 in lookup_data.iterrows():
    ratio_val=normalized_damerau_levenshtein_distance(row['UNIQUE_ID'],row2['UNIQUE_ID'])
    if ratio_val<0.15:
    data.set_value(index,'UPDATED_ID',row2['UNIQUE_ID'])
    data.set_value(index,'Ratio_Val',ratio_val)
    break


    Currently this fuzzy matching block of code is taking too long to run - about 8 hours for the first 15k rows with time increasing exponentially as one would expect. Any suggestion on how to more efficiently write this code?










    share|improve this question

























      up vote
      0
      down vote

      favorite
      1









      up vote
      0
      down vote

      favorite
      1






      1





      I have a dataset of 200k rows with two columns: 1 - Unique customer id and address combination and 2 - revenue. The table is sorted by revenue and the goal is to clean up column 1 by doing a fuzzy match with itself to check if there are any close enough customer-address combinations with higher revenue that can be used to replace combinations with lesser revenue which most likely resulted from spelling differences.



      Example:



      Output



      In the above example the third row is very similar to the first row so I want it to take the value of the first row.



      I have a working python code but it is too slow:



      import pandas as pd
      import datetime
      import time
      import numpy as np
      from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance, normalized_damerau_levenshtein_distance_ndarray

      data = pd.read_csv("CustomerMaster.csv", encoding="ISO-8859-1")

      # Create lookup column from the dataframe itself:
      lookup_data=data['UNIQUE_ID']
      lookup_data=pd.Series.to_frame(lookup_data)

      # Start iterating on row by row on lookup data to find the first closest fuzzy match and write that back into dataframe:
      start = time.time()
      for index,row in data.iterrows():
      if index%5000==0:print(index, time.time()-start)
      for index2, row2 in lookup_data.iterrows():
      ratio_val=normalized_damerau_levenshtein_distance(row['UNIQUE_ID'],row2['UNIQUE_ID'])
      if ratio_val<0.15:
      data.set_value(index,'UPDATED_ID',row2['UNIQUE_ID'])
      data.set_value(index,'Ratio_Val',ratio_val)
      break


      Currently this fuzzy matching block of code is taking too long to run - about 8 hours for the first 15k rows with time increasing exponentially as one would expect. Any suggestion on how to more efficiently write this code?










      share|improve this question















      I have a dataset of 200k rows with two columns: 1 - Unique customer id and address combination and 2 - revenue. The table is sorted by revenue and the goal is to clean up column 1 by doing a fuzzy match with itself to check if there are any close enough customer-address combinations with higher revenue that can be used to replace combinations with lesser revenue which most likely resulted from spelling differences.



      Example:



      Output



      In the above example the third row is very similar to the first row so I want it to take the value of the first row.



      I have a working python code but it is too slow:



      import pandas as pd
      import datetime
      import time
      import numpy as np
      from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance, normalized_damerau_levenshtein_distance_ndarray

      data = pd.read_csv("CustomerMaster.csv", encoding="ISO-8859-1")

      # Create lookup column from the dataframe itself:
      lookup_data=data['UNIQUE_ID']
      lookup_data=pd.Series.to_frame(lookup_data)

      # Start iterating on row by row on lookup data to find the first closest fuzzy match and write that back into dataframe:
      start = time.time()
      for index,row in data.iterrows():
      if index%5000==0:print(index, time.time()-start)
      for index2, row2 in lookup_data.iterrows():
      ratio_val=normalized_damerau_levenshtein_distance(row['UNIQUE_ID'],row2['UNIQUE_ID'])
      if ratio_val<0.15:
      data.set_value(index,'UPDATED_ID',row2['UNIQUE_ID'])
      data.set_value(index,'Ratio_Val',ratio_val)
      break


      Currently this fuzzy matching block of code is taking too long to run - about 8 hours for the first 15k rows with time increasing exponentially as one would expect. Any suggestion on how to more efficiently write this code?







      python






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Dec 3 '16 at 19:22









      user7245889

      175




      175










      asked Dec 3 '16 at 18:20









      D.S

      1




      1






















          2 Answers
          2






          active

          oldest

          votes

















          up vote
          1
          down vote













          One immediate suggestion: Since matching is symmetric, you need to match each row only to the rows that have not been matched yet. Rewrite the inner loop to skip over the previously visited rows. E.g., add this:



          if index2 <= index:
          continue


          This alone will speed up the matching by the factor of 2.






          share|improve this answer



























            up vote
            0
            down vote













            I had the same issue and resolved it with a combination of the levenshtein package (to create a distance matrix) and scikit's DBSCAN to cluster similar strings and to assing the same value to every element within the cluster.



            You can check it out here: https://github.com/ebravofm/e_utils (homog_lev())



            >>> from e_utils.utils import clean_df
            >>> from e_utils.utils import homog_lev

            >>> series
            0 Bad Bunny
            1 bad buny
            2 bag bunny
            3 Ozuna
            4 De La Ghetto
            5 de la geto
            6 Daddy Yankee
            7 dade yankee
            8 Nicky Jam
            9 nicky jam
            10 J Balvin
            11 jbalvin
            12 Maluma
            13 maluma
            14 Anuel AA

            >>> series2 = clean_df(series)
            >>> series2 = homog_lev(series2, eps=3)
            >>> pd.concat([series, series2.str.title()], axis=1, keys=['*Original*', '*Fixed*'])

            *Original* *Fixed*
            0 Bad Bunny Bad Bunny
            1 bad buny Bad Bunny
            2 bag bunny Bad Bunny
            3 Ozuna Ozuna
            4 De La Ghetto De La Ghetto
            5 de la geto De La Ghetto
            6 Daddy Yankee Daddy Yankee
            7 dade yankee Daddy Yankee
            8 Nicky Jam Nicky Jam
            9 nicky jam Nicky Jam
            10 J Balvin J Balvin
            11 jbalvin J Balvin
            12 Maluma Maluma
            13 maluma Maluma
            14 Anuel AA Anuel Aa





            share|improve this answer






















              Your Answer






              StackExchange.ifUsing("editor", function ()
              StackExchange.using("externalEditor", function ()
              StackExchange.using("snippets", function ()
              StackExchange.snippets.init();
              );
              );
              , "code-snippets");

              StackExchange.ready(function()
              var channelOptions =
              tags: "".split(" "),
              id: "1"
              ;
              initTagRenderer("".split(" "), "".split(" "), channelOptions);

              StackExchange.using("externalEditor", function()
              // Have to fire editor after snippets, if snippets enabled
              if (StackExchange.settings.snippets.snippetsEnabled)
              StackExchange.using("snippets", function()
              createEditor();
              );

              else
              createEditor();

              );

              function createEditor()
              StackExchange.prepareEditor(
              heartbeatType: 'answer',
              autoActivateHeartbeat: false,
              convertImagesToLinks: true,
              noModals: true,
              showLowRepImageUploadWarning: true,
              reputationToPostImages: 10,
              bindNavPrevention: true,
              postfix: "",
              imageUploader:
              brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
              contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
              allowUrls: true
              ,
              onDemand: true,
              discardSelector: ".discard-answer"
              ,immediatelyShowMarkdownHelp:true
              );



              );













              draft saved

              draft discarded


















              StackExchange.ready(
              function ()
              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f40951029%2ffuzzy-matching-a-sorted-column-with-itself-using-python%23new-answer', 'question_page');

              );

              Post as a guest















              Required, but never shown

























              2 Answers
              2






              active

              oldest

              votes








              2 Answers
              2






              active

              oldest

              votes









              active

              oldest

              votes






              active

              oldest

              votes








              up vote
              1
              down vote













              One immediate suggestion: Since matching is symmetric, you need to match each row only to the rows that have not been matched yet. Rewrite the inner loop to skip over the previously visited rows. E.g., add this:



              if index2 <= index:
              continue


              This alone will speed up the matching by the factor of 2.






              share|improve this answer
























                up vote
                1
                down vote













                One immediate suggestion: Since matching is symmetric, you need to match each row only to the rows that have not been matched yet. Rewrite the inner loop to skip over the previously visited rows. E.g., add this:



                if index2 <= index:
                continue


                This alone will speed up the matching by the factor of 2.






                share|improve this answer






















                  up vote
                  1
                  down vote










                  up vote
                  1
                  down vote









                  One immediate suggestion: Since matching is symmetric, you need to match each row only to the rows that have not been matched yet. Rewrite the inner loop to skip over the previously visited rows. E.g., add this:



                  if index2 <= index:
                  continue


                  This alone will speed up the matching by the factor of 2.






                  share|improve this answer












                  One immediate suggestion: Since matching is symmetric, you need to match each row only to the rows that have not been matched yet. Rewrite the inner loop to skip over the previously visited rows. E.g., add this:



                  if index2 <= index:
                  continue


                  This alone will speed up the matching by the factor of 2.







                  share|improve this answer












                  share|improve this answer



                  share|improve this answer










                  answered Dec 3 '16 at 18:26









                  DYZ

                  25.2k61948




                  25.2k61948






















                      up vote
                      0
                      down vote













                      I had the same issue and resolved it with a combination of the levenshtein package (to create a distance matrix) and scikit's DBSCAN to cluster similar strings and to assing the same value to every element within the cluster.



                      You can check it out here: https://github.com/ebravofm/e_utils (homog_lev())



                      >>> from e_utils.utils import clean_df
                      >>> from e_utils.utils import homog_lev

                      >>> series
                      0 Bad Bunny
                      1 bad buny
                      2 bag bunny
                      3 Ozuna
                      4 De La Ghetto
                      5 de la geto
                      6 Daddy Yankee
                      7 dade yankee
                      8 Nicky Jam
                      9 nicky jam
                      10 J Balvin
                      11 jbalvin
                      12 Maluma
                      13 maluma
                      14 Anuel AA

                      >>> series2 = clean_df(series)
                      >>> series2 = homog_lev(series2, eps=3)
                      >>> pd.concat([series, series2.str.title()], axis=1, keys=['*Original*', '*Fixed*'])

                      *Original* *Fixed*
                      0 Bad Bunny Bad Bunny
                      1 bad buny Bad Bunny
                      2 bag bunny Bad Bunny
                      3 Ozuna Ozuna
                      4 De La Ghetto De La Ghetto
                      5 de la geto De La Ghetto
                      6 Daddy Yankee Daddy Yankee
                      7 dade yankee Daddy Yankee
                      8 Nicky Jam Nicky Jam
                      9 nicky jam Nicky Jam
                      10 J Balvin J Balvin
                      11 jbalvin J Balvin
                      12 Maluma Maluma
                      13 maluma Maluma
                      14 Anuel AA Anuel Aa





                      share|improve this answer


























                        up vote
                        0
                        down vote













                        I had the same issue and resolved it with a combination of the levenshtein package (to create a distance matrix) and scikit's DBSCAN to cluster similar strings and to assing the same value to every element within the cluster.



                        You can check it out here: https://github.com/ebravofm/e_utils (homog_lev())



                        >>> from e_utils.utils import clean_df
                        >>> from e_utils.utils import homog_lev

                        >>> series
                        0 Bad Bunny
                        1 bad buny
                        2 bag bunny
                        3 Ozuna
                        4 De La Ghetto
                        5 de la geto
                        6 Daddy Yankee
                        7 dade yankee
                        8 Nicky Jam
                        9 nicky jam
                        10 J Balvin
                        11 jbalvin
                        12 Maluma
                        13 maluma
                        14 Anuel AA

                        >>> series2 = clean_df(series)
                        >>> series2 = homog_lev(series2, eps=3)
                        >>> pd.concat([series, series2.str.title()], axis=1, keys=['*Original*', '*Fixed*'])

                        *Original* *Fixed*
                        0 Bad Bunny Bad Bunny
                        1 bad buny Bad Bunny
                        2 bag bunny Bad Bunny
                        3 Ozuna Ozuna
                        4 De La Ghetto De La Ghetto
                        5 de la geto De La Ghetto
                        6 Daddy Yankee Daddy Yankee
                        7 dade yankee Daddy Yankee
                        8 Nicky Jam Nicky Jam
                        9 nicky jam Nicky Jam
                        10 J Balvin J Balvin
                        11 jbalvin J Balvin
                        12 Maluma Maluma
                        13 maluma Maluma
                        14 Anuel AA Anuel Aa





                        share|improve this answer
























                          up vote
                          0
                          down vote










                          up vote
                          0
                          down vote









                          I had the same issue and resolved it with a combination of the levenshtein package (to create a distance matrix) and scikit's DBSCAN to cluster similar strings and to assing the same value to every element within the cluster.



                          You can check it out here: https://github.com/ebravofm/e_utils (homog_lev())



                          >>> from e_utils.utils import clean_df
                          >>> from e_utils.utils import homog_lev

                          >>> series
                          0 Bad Bunny
                          1 bad buny
                          2 bag bunny
                          3 Ozuna
                          4 De La Ghetto
                          5 de la geto
                          6 Daddy Yankee
                          7 dade yankee
                          8 Nicky Jam
                          9 nicky jam
                          10 J Balvin
                          11 jbalvin
                          12 Maluma
                          13 maluma
                          14 Anuel AA

                          >>> series2 = clean_df(series)
                          >>> series2 = homog_lev(series2, eps=3)
                          >>> pd.concat([series, series2.str.title()], axis=1, keys=['*Original*', '*Fixed*'])

                          *Original* *Fixed*
                          0 Bad Bunny Bad Bunny
                          1 bad buny Bad Bunny
                          2 bag bunny Bad Bunny
                          3 Ozuna Ozuna
                          4 De La Ghetto De La Ghetto
                          5 de la geto De La Ghetto
                          6 Daddy Yankee Daddy Yankee
                          7 dade yankee Daddy Yankee
                          8 Nicky Jam Nicky Jam
                          9 nicky jam Nicky Jam
                          10 J Balvin J Balvin
                          11 jbalvin J Balvin
                          12 Maluma Maluma
                          13 maluma Maluma
                          14 Anuel AA Anuel Aa





                          share|improve this answer














                          I had the same issue and resolved it with a combination of the levenshtein package (to create a distance matrix) and scikit's DBSCAN to cluster similar strings and to assing the same value to every element within the cluster.



                          You can check it out here: https://github.com/ebravofm/e_utils (homog_lev())



                          >>> from e_utils.utils import clean_df
                          >>> from e_utils.utils import homog_lev

                          >>> series
                          0 Bad Bunny
                          1 bad buny
                          2 bag bunny
                          3 Ozuna
                          4 De La Ghetto
                          5 de la geto
                          6 Daddy Yankee
                          7 dade yankee
                          8 Nicky Jam
                          9 nicky jam
                          10 J Balvin
                          11 jbalvin
                          12 Maluma
                          13 maluma
                          14 Anuel AA

                          >>> series2 = clean_df(series)
                          >>> series2 = homog_lev(series2, eps=3)
                          >>> pd.concat([series, series2.str.title()], axis=1, keys=['*Original*', '*Fixed*'])

                          *Original* *Fixed*
                          0 Bad Bunny Bad Bunny
                          1 bad buny Bad Bunny
                          2 bag bunny Bad Bunny
                          3 Ozuna Ozuna
                          4 De La Ghetto De La Ghetto
                          5 de la geto De La Ghetto
                          6 Daddy Yankee Daddy Yankee
                          7 dade yankee Daddy Yankee
                          8 Nicky Jam Nicky Jam
                          9 nicky jam Nicky Jam
                          10 J Balvin J Balvin
                          11 jbalvin J Balvin
                          12 Maluma Maluma
                          13 maluma Maluma
                          14 Anuel AA Anuel Aa






                          share|improve this answer














                          share|improve this answer



                          share|improve this answer








                          edited Nov 12 at 3:30









                          Stephen Rauch

                          27.6k153256




                          27.6k153256










                          answered Nov 12 at 3:11









                          ebravo

                          2018




                          2018



























                              draft saved

                              draft discarded
















































                              Thanks for contributing an answer to Stack Overflow!


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid


                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.

                              To learn more, see our tips on writing great answers.





                              Some of your past answers have not been well-received, and you're in danger of being blocked from answering.


                              Please pay close attention to the following guidance:


                              • Please be sure to answer the question. Provide details and share your research!

                              But avoid


                              • Asking for help, clarification, or responding to other answers.

                              • Making statements based on opinion; back them up with references or personal experience.

                              To learn more, see our tips on writing great answers.




                              draft saved


                              draft discarded














                              StackExchange.ready(
                              function ()
                              StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f40951029%2ffuzzy-matching-a-sorted-column-with-itself-using-python%23new-answer', 'question_page');

                              );

                              Post as a guest















                              Required, but never shown





















































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown

































                              Required, but never shown














                              Required, but never shown












                              Required, but never shown







                              Required, but never shown







                              這個網誌中的熱門文章

                              How to read a connectionString WITH PROVIDER in .NET Core?

                              In R, how to develop a multiplot heatmap.2 figure showing key labels successfully

                              Museum of Modern and Contemporary Art of Trento and Rovereto