How do I create a new column in pandas from the difference of two string columns?










How can I create a new column in pandas that holds the difference between two other columns of strings?

I have one column titled "Good_Address" with entries like "123 Fake Street Apt 101" and another column titled "Bad_Address" with entries like "123 Fake Street". I want the output in column "Address_Difference" to be " Apt 101". (The code below refers to these columns as GOOD_ADR1 and BAD_ADR1.)

I've tried:

    import pandas as pd
    data = pd.read_csv("AddressFile.csv")
    data['Address Difference'] = data['GOOD_ADR1'].replace(data['BAD_ADR1'], '')
    data['Address Difference']

but this does not work. The result is just equal to the good address ("123 Fake Street Apt 101" in the example above).

I've also tried:

    data['Address Difference'] = data['GOOD_ADR1'].str.replace(data['BAD_ADR1'], '')

but this raises an error saying that 'Series' objects are mutable, thus they cannot be hashed.

Any help would be appreciated.

Thanks










python regex pandas

asked Nov 13 '18 at 20:19 by L. Taylor (edited Nov 16 '18 at 19:30 by Vaishali)

3 Answers

Use replace with a regex. The (?i) prefix makes the match case-insensitive:

    data['Address Difference'] = data['GOOD_ADR1'].replace(regex=r'(?i)' + data['BAD_ADR1'], value="")
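
For readers who want the row-by-row behaviour spelled out, here is a minimal sketch of the same idea that does not pass a Series to replace. It pairs each good address with the bad address from the same row and removes the latter with re.sub, ignoring case. The helper name remove_ignore_case and the sample rows are hypothetical, and re.escape is added on the assumption that the addresses should be matched as literal text rather than as regex patterns.

```python
import re

import pandas as pd


def remove_ignore_case(good: str, bad: str) -> str:
    """Remove every case-insensitive occurrence of `bad` from `good`."""
    # re.escape treats `bad` as literal text, so characters such as "." or "("
    # in an address are not interpreted as regex syntax.
    return re.sub(re.escape(bad), "", good, flags=re.IGNORECASE)


# Hypothetical sample data using the column names from the question.
data = pd.DataFrame(
    {
        "GOOD_ADR1": ["123 Fake Street Apt 101", "9 ELM ROAD Unit B"],
        "BAD_ADR1": ["123 Fake Street", "9 Elm Road"],
    }
)

# Pair the two columns row by row and build the new column.
data["Address Difference"] = [
    remove_ignore_case(good, bad)
    for good, bad in zip(data["GOOD_ADR1"], data["BAD_ADR1"])
]
```

The leading space is kept, as in the question's expected output; append .str.strip() to the new column if it is not wanted.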






answered Nov 13 '18 at 20:25 by W-B












• Seems to work pretty well in most cases. Let me append a question to this: if I have a good address like 123 N Main St Apt 101 and a bad address like 123 North Main St, why does this code return a value equal to the good address (i.e. 123 N Main St Apt 101)?
  – L. Taylor, Nov 13 '18 at 20:39

• @L.Taylor Since there is no match, it returns what you have in the good address column; pandas treats it as one string, '123 N Main St Apt 101'.
  – W-B, Nov 13 '18 at 20:41
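
When the bad address is not a literal substring of the good one, as in the abbreviation case raised in these comments, a substring replacement has nothing to remove. One possible workaround, sketched here as an assumption rather than anything proposed in the thread, is to compare the addresses word by word and keep the words of the good address that do not occur in the bad one. The function name token_difference is hypothetical.

```python
def token_difference(good: str, bad: str) -> str:
    """Return the words of `good` that do not appear in `bad`, in order."""
    # Compare case-insensitively, word by word.
    bad_words = {word.lower() for word in bad.split()}
    kept = [word for word in good.split() if word.lower() not in bad_words]
    return " ".join(kept)


# Exact-substring case from the question.
exact = token_difference("123 Fake Street Apt 101", "123 Fake Street")

# Abbreviation case from the comment: "N" is kept because it differs from "North".
abbreviated = token_difference("123 N Main St Apt 101", "123 North Main St")

print(exact)
print(abbreviated)
```

Because the comparison is set-based, a word that appears in the bad address is dropped everywhere in the good address, which may or may not be acceptable for a given dataset.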


















I'd use a function that we can map across the inputs. This should be fast.

The function uses str.find to check whether the other string is a substring. If str.find returns -1, the substring was not found and the original string is returned unchanged. Otherwise, cut the substring out using the position where it was found and its length.

    def rm(x, y):
        # Position of y inside x, or -1 if y does not occur in x
        i = x.find(y)
        if i > -1:
            j = len(y)
            # Keep everything before and after the matched substring
            return x[:i] + x[i+j:]
        else:
            return x

    df['Address Difference'] = [*map(rm, df.GOOD_ADR1, df.BAD_ADR1)]

    df

              BAD_ADR1                GOOD_ADR1 Address Difference
    0  123 Fake Street  123 Fake Street Apt 101            Apt 101
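
As a usage illustration, the sketch below runs the same str.find approach on a few made-up rows: a match, a non-match, and a missing bad address. The isinstance guard is an addition that is not part of the answer; it assumes that empty cells read from a CSV may arrive as non-string values such as NaN, on which str.find would fail.

```python
import pandas as pd


def rm(x, y):
    """Return x with the first occurrence of y cut out, or x unchanged."""
    # Guard against missing values, which are not str objects.
    # (This check is an assumption added for illustration.)
    if not isinstance(x, str) or not isinstance(y, str):
        return x
    i = x.find(y)
    if i > -1:
        # Keep everything before and after the matched substring.
        return x[:i] + x[i + len(y):]
    return x


# Hypothetical sample rows using the column names from the question.
df = pd.DataFrame(
    {
        "GOOD_ADR1": ["123 Fake Street Apt 101", "123 N Main St Apt 101", "5 Oak Ave"],
        "BAD_ADR1": ["123 Fake Street", "123 North Main St", None],
    }
)

df["Address Difference"] = [*map(rm, df["GOOD_ADR1"], df["BAD_ADR1"])]
```

Rows without a match, or without a bad address, keep the good address unchanged.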






answered Nov 13 '18 at 20:26 by piRSquared (edited Nov 13 '18 at 20:41)












• Very cool. I guess this would be very expensive, computationally speaking?
  – Datanovice, Nov 13 '18 at 20:28

• No. You should try it. In fact, I clock it at 1000 times faster on a dataframe of length 10,000.
  – piRSquared, Nov 13 '18 at 20:29

• Will do! Thanks, sir. I will add this to my code base for reference!
  – Datanovice, Nov 13 '18 at 21:28

















You can remove the bad address part from the good address:

    df['Address_Difference'] = df['Good_Address'].replace(df['Bad_Address'], '', regex=True).str.strip()

           Bad_Address             Good_Address Address_Difference
    0  123 Fake Street  123 Fake Street Apt 101            Apt 101
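
One caveat with any regex-based replacement is that the bad address is interpreted as a pattern, so characters such as parentheses or dots change what is matched. The following sketch illustrates this with Python's re module directly; the sample addresses are made up, and whether this matters depends on the data.

```python
import re

# Hypothetical addresses containing regex metacharacters.
good = "12 Main St (Rear) Unit 5"
bad = "12 Main St (Rear)"

# Used as a raw pattern, the parentheses form a regex group, so the pattern
# looks for the text "12 Main St Rear" and finds nothing to remove.
unescaped = re.sub(bad, "", good)

# re.escape turns the address into a literal pattern, so the match succeeds.
escaped = re.sub(re.escape(bad), "", good).strip()

print(repr(unescaped))
print(repr(escaped))
```

In pandas, the bad addresses could be escaped in the same way, for example with df['Bad_Address'].map(re.escape), before they are used as patterns.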






answered Nov 13 '18 at 20:25 by Vaishali


























