How do I create a new column in pandas from the difference of two string columns?
How can I create a new column in pandas that is the result of the difference of two other columns consisting of strings?
I have one column titled "Good_Address" which has entries like "123 Fake Street Apt 101" and another column titled "Bad_Address" which has entries like "123 Fake Street". I want the output in column "Address_Difference" to be " Apt 101".
I've tried doing:
    import pandas as pd
    data = pd.read_csv("AddressFile.csv")
    data['Address Difference'] = data['GOOD_ADR1'].replace(data['BAD_ADR1'],'')
    data['Address Difference']
but this does not work. It seems that the result is just equal to "123 Fake Street Apt 101" (the good address in the example above).
I've also tried:
    data['Address Difference'] = data['GOOD_ADR1'].str.replace(data['BAD_ADR1'],'')
but this yields an error saying 'Series' objects are mutable, thus they cannot be hashed.
Any help would be appreciated.
Thanks
python regex pandas
edited Nov 16 '18 at 19:30 by Vaishali
asked Nov 13 '18 at 20:19 by L. Taylor
3 Answers
Using replace with regex:
    data['Address Difference'] = data['GOOD_ADR1'].replace(regex=r'(?i)' + data['BAD_ADR1'], value="")
answered Nov 13 '18 at 20:25 by W-B
Seems to work pretty well in most cases. Let me append a question to this: if I see a good address like 123 N Main St Apt 101 and a bad address like 123 North Main St, why does this code return a value equal to the good address (i.e. 123 N Main St Apt 101)?
– L. Taylor, Nov 13 '18 at 20:39
@L.Taylor Since there is no match, it returns what you have in the good address column; pandas treats it as one string, '123 N Main St Apt 101'.
– W-B, Nov 13 '18 at 20:41
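The behaviour discussed in these comments can be reproduced with a small row-wise sketch. It is not the answer's own code: it swaps in Python's re.sub inside a list comprehension, assumes both columns hold plain strings (no missing values), and uses made-up sample rows and a hypothetical helper name, remove_ignoring_case. re.escape is added so that characters such as "." in an address are matched literally rather than as regex syntax.

```python
import re

import pandas as pd

# Hypothetical sample rows; the column names follow the question's code.
data = pd.DataFrame({
    "GOOD_ADR1": [
        "123 Fake Street Apt 101",
        "123 N Main St Apt 101",
        "9 ELM RD. Unit 2",
    ],
    "BAD_ADR1": [
        "123 Fake Street",
        "123 North Main St",
        "9 Elm Rd.",
    ],
})


def remove_ignoring_case(good, bad):
    """Remove `bad` from `good`, ignoring case (hypothetical helper)."""
    # re.escape makes regex metacharacters in the address match literally.
    return re.sub(re.escape(bad), "", good, flags=re.IGNORECASE)


# Pair each good address with the bad address from the same row.
data["Address Difference"] = [
    remove_ignoring_case(good, bad)
    for good, bad in zip(data["GOOD_ADR1"], data["BAD_ADR1"])
]

print(data["Address Difference"].tolist())
```

In the second row "123 North Main St" never occurs inside "123 N Main St Apt 101", so nothing is removed and the good address comes back unchanged, which is the case asked about above. Matching "N" with "North" would need the addresses to be normalized first.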
I'd use a function that we can map across inputs. This should be fast.
The function will use str.find to see if the other string is a substring. If the result of str.find is -1, then the substring could not be found. Otherwise, cut out the substring, given the position where it was found and the length of the substring.
    def rm(x, y):
        # Position of y inside x, or -1 if y does not occur
        i = x.find(y)
        if i > -1:
            j = len(y)
            # Keep everything before and after the matched part
            return x[:i] + x[i+j:]
        else:
            return x
    df['Address Difference'] = [*map(rm, df.GOOD_ADR1, df.BAD_ADR1)]
    df
              BAD_ADR1                GOOD_ADR1 Address Difference
    0  123 Fake Street  123 Fake Street Apt 101            Apt 101
edited Nov 13 '18 at 20:41
answered Nov 13 '18 at 20:26 by piRSquared
Very cool. I guess this would be very expensive, computationally speaking?
– Datanovice, Nov 13 '18 at 20:28
No. You should try it. In fact, I clock it at 1000 times faster on a dataframe of length 10,000.
– piRSquared, Nov 13 '18 at 20:29
Will do! Thanks, sir, I will add this to my code base for reference!
– Datanovice, Nov 13 '18 at 21:28
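For anyone who wants to try the str.find idea on a small frame, here is a minimal sketch. The sample rows and the name rm_builtin are assumptions, not part of the answer, and both columns are assumed to contain only strings. It also shows that Python's built-in str.replace with a count of 1 appears to give the same result as rm, since it removes only the first occurrence and returns the string unchanged when nothing is found.

```python
import pandas as pd

# Hypothetical sample rows; the column names follow the answer above.
df = pd.DataFrame({
    "BAD_ADR1": ["123 Fake Street", "123 North Main St"],
    "GOOD_ADR1": ["123 Fake Street Apt 101", "123 N Main St Apt 101"],
})


def rm(x, y):
    """The answer's function: cut the first occurrence of y out of x."""
    i = x.find(y)
    if i > -1:
        return x[:i] + x[i + len(y):]
    return x


def rm_builtin(x, y):
    """Hypothetical variant: str.replace with count=1 removes one occurrence."""
    return x.replace(y, "", 1)


# Pair each good address with the bad address from the same row.
df["Address Difference"] = [
    rm_builtin(x, y) for x, y in zip(df["GOOD_ADR1"], df["BAD_ADR1"])
]

print(df["Address Difference"].tolist())
```

Both versions are plain Python string operations applied row by row, with no regular expressions, so characters such as "." in an address need no escaping.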
You can remove the bad address part from the good address:
    df['Address_Difference'] = df['Good_Address'].replace(df['Bad_Address'], '', regex=True).str.strip()
           Bad_Address             Good_Address Address_Difference
    0  123 Fake Street  123 Fake Street Apt 101            Apt 101
answered Nov 13 '18 at 20:25 by Vaishali