How do I create a new column in pandas from the difference of two string columns?

How can I create a new column in pandas that is the result of the difference of two other columns consisting of strings?

I have one column titled "Good_Address" which has entries like "123 Fake Street Apt 101" and another column titled "Bad_Address" which has entries like "123 Fake Street". I want the output in column "Address_Difference" to be " Apt 101".

I've tried doing:

import pandas as pd
data = pd.read_csv("AddressFile.csv")
data['Address Difference'] = data['GOOD_ADR1'].replace(data['BAD_ADR1'],'') 
data['Address Difference']

but this does not work. The result is just equal to "123 Fake Street Apt 101" (the good address in the example above).

I've also tried:

data['Address Difference'] = data['GOOD_ADR1'].str.replace(data['BAD_ADR1'],'')

but this yields an error saying 'Series' objects are mutable, thus they cannot be hashed.

Any help would be appreciated.

Thanks

edited Nov 16 '18 at 19:30

Vaishali

19.4k41030

asked Nov 13 '18 at 20:19

L. Taylor

112

python regex pandas

3 Answers

Using replace with regex

data['Address Difference']=data['GOOD_ADR1'].replace(regex=r'(?i)'+ data['BAD_ADR1'],value="")

answered Nov 13 '18 at 20:25

W-B

107k83265

Seems to work pretty well in most cases. Let me append a question to this: if I see a good address like 123 N Main St Apt 101 and a bad address like 123 North Main St, why does this code return a value equal to the good address (i.e. 123 N Main St Apt 101)?

– L. Taylor
Nov 13 '18 at 20:39

@L.Taylor Since there is no match, it returns what you have in the good address column; pandas treats it as one string, '123 N Main St Apt 101'.

– W-B
Nov 13 '18 at 20:41
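The no-match behaviour described in the comment above can be checked on a single pair of strings. This is only a sketch: plain `re.sub` stands in for the pandas call, on the assumption that both follow the same matching rules, and the variable names are illustrative.

```python
import re

good = "123 N Main St Apt 101"
bad = "123 North Main St"

# (?i) makes the match case-insensitive, as in the answer above.
pattern = "(?i)" + bad

# "123 North Main St" does not occur inside "123 N Main St Apt 101",
# so there is nothing to replace and the input comes back unchanged.
no_match = re.sub(pattern, "", good)

# When the bad address occurs literally, only the remainder is left.
match = re.sub("(?i)" + "123 Fake Street", "", "123 Fake Street Apt 101")

print(repr(no_match))
print(repr(match))
```

Getting "Apt 101" out of the "N" versus "North" pair would need the two addresses to be normalised first; the replacement itself only removes text that actually matches.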

I'd use a function that we can map across inputs. This should be fast.

The function uses str.find to check whether the other string is a substring. If str.find returns -1, the substring was not found and the original string is returned. Otherwise, the substring is cut out using the position where it was found and its length.

def rm(x, y):
    i = x.find(y)
    if i > -1:
        j = len(y)
        return x[:i] + x[i+j:]
    else:
        return x

df['Address Difference'] = [*map(rm, df.GOOD_ADR1, df.BAD_ADR1)]

df

          BAD_ADR1                GOOD_ADR1 Address Difference
0  123 Fake Street  123 Fake Street Apt 101            Apt 101
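To try the function without the CSV file, here is a minimal sketch on plain lists. The lists and their contents are made-up stand-ins for df.GOOD_ADR1 and df.BAD_ADR1; the second row is the "N" versus "North" case from the comments, and the third contains characters that would be special in a regex.

```python
def rm(x, y):
    """Remove the first occurrence of y from x; return x unchanged if absent."""
    i = x.find(y)
    if i > -1:
        return x[:i] + x[i + len(y):]
    return x

# Hypothetical sample data standing in for the two DataFrame columns.
good = [
    "123 Fake Street Apt 101",
    "123 N Main St Apt 101",
    "9 Elm St. (Rear) Unit 2",
]
bad = [
    "123 Fake Street",
    "123 North Main St",
    "9 Elm St. (Rear)",
]

# map pairs the lists element by element, so each good address is only
# compared against the bad address from the same row.
diff = [*map(rm, good, bad)]

for row in diff:
    print(repr(row))
```

Because str.find does a literal search, the "." and "(Rear)" in the third row need no escaping. The leading space is kept; add .strip() to the result if it is not wanted.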

edited Nov 13 '18 at 20:41

answered Nov 13 '18 at 20:26

piRSquared

154k22146288

Very cool. I guess this would be very expensive, computationally speaking?

– Datanovice
Nov 13 '18 at 20:28

No. You should try it. In fact, I clock it at 1000 times faster on a dataframe of length 10,000

– piRSquared
Nov 13 '18 at 20:29

Will do! Thanks, sir. I will add this to my code base for reference!

– Datanovice
Nov 13 '18 at 21:28

You can remove the bad-address part from the good address:

df['Address_Difference'] = df['Good_Address'].replace(df['Bad_Address'], '', regex=True).str.strip()

       Bad_Address             Good_Address Address_Difference
0  123 Fake Street  123 Fake Street Apt 101            Apt 101
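One thing to watch with regex-based replacement: if a bad address contains characters such as "." or "(", they are read as regex syntax rather than literal text. The sketch below illustrates this on one made-up address, with plain `re.sub` standing in for the pandas call (an assumption, not the answer's code) and `re.escape` used to force a literal match.

```python
import re

# Hypothetical address containing regex metacharacters.
good = "9 Elm St. (Rear) Unit 2"
bad = "9 Elm St. (Rear)"

# Unescaped, the parentheses form a group instead of matching "(" and ")",
# so the pattern does not match and the good address is returned as-is.
unescaped = re.sub(bad, "", good).strip()

# re.escape turns every special character into a literal one.
escaped = re.sub(re.escape(bad), "", good).strip()

print(repr(unescaped))
print(repr(escaped))
```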

answered Nov 13 '18 at 20:25

Vaishali

19.4k41030
