Python Regex: Find specific phrase in any form in text (including if followed by . or ,)

I'm trying to find when a specific product name is mentioned in customer notes (i.e. un-standardized, messy text). The product name is "Lending QB". Within the text, the product name can appear in any of the following ways:

str1 = 'Lending QB is a great product.'
str2 = 'lending qb is great.'
str3 = "I don't think lendingqb is great."
str4 = 'I like Lending QB, but not always.'
str5 = 'The best product is Lending qb.'

Here is the regex that mostly works:

df['lendingQB'] = df['Text'].str.findall(r'(?i)(?<!\S)lending\s?qb(?!\S)', re.IGNORECASE)

Using regex101.com to test, and confirming within my Python program, I can capture the product name in strings (str) 1-3, but not 4 and 5, which makes me believe the issue is with not finding the product name when it's followed by a punctuation mark.

My understanding is that \S would include commas and periods.

I tried adding |[,.] to the regex but then nothing matches:

r'(?i)(?<!\S)lending\s?qb(?!\S|[,.])'

(I realize the IGNORECASE is redundant, but to test with regex101.com, I added the "(?i)")

Any suggestions?

asked Nov 15 '18 at 20:12

Amanda


Just to note: if you use any boundary, it is possible to miss a product name. The regex without boundaries is (?i)lending\s?qb. Using a boundary actually qualifies what you want to match. So, in that sense, no answer here is even close to your objective. Just saying .... Also, a product name with a simple underscore _ directly in front of or behind it will not be matched using (?<!\S) and \b. So beware: when you think something is robust, it may not be.

– sln
Nov 15 '18 at 22:07
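The underscore caveat in the comment above can be checked with a small sketch. The variable names and the sample strings here are made up for illustration; the two patterns are the bounded one proposed in the answers and the unbounded one quoted in the comment.

```python
import re

# Bounded pattern from the answers: whitespace boundary on the left, \b on the right.
bounded = re.compile(r'(?<!\S)lending\s*qb\b', re.IGNORECASE)
# Unbounded pattern quoted in the comment.
unbounded = re.compile(r'lending\s?qb', re.IGNORECASE)

# Hypothetical messy inputs (not from the original question).
samples = ['_Lending QB_ rocks', 'see Lending QB_v2', '(lendingqb)']

for text in samples:
    # "_" is a word character and also non-whitespace, so it defeats both
    # (?<!\S) on the left and \b on the right; "(" defeats (?<!\S).
    print(repr(text), bounded.findall(text), unbounded.findall(text))
```

With these inputs the bounded pattern finds nothing, while the unbounded pattern finds the name in all three. Which behavior is preferable depends on how much noise the notes contain.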

python regex


4 Answers

You have correctly identified one issue in the regex (punctuation immediately after QB), but there is a second edge case to consider given that the input is messy: what if there are multiple spaces in "Lending  QB"?

I believe the most robust solution to your problem is:

(?i)(?<!\S)lending\s*qb\b

\b enforces that QB occurs at the end of a word, which automatically handles trailing punctuation.

\s? was replaced with \s* to allow any amount of whitespace (including none) to match, rather than at most one whitespace character.

PS. Another point to consider: \b terminates on all punctuation, whereas (?=\s|[,.]) only terminates on whitespace or the given punctuation (, or . in this case). Given the wide range of possible punctuation (colon, semicolon, dash, hyphen, em dash...) I would strongly recommend \b over (?=\s|[,.]), unless you want precise control over the allowable terminating punctuation, of course.

PPS. Further test cases to illustrate my points:

str6 = 'Lending Qb: simply the best'
str7 = "I'm a fan of lending QB"
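As a quick check, the pattern can be run over all seven test strings. This is a minimal sketch using plain re; the names pattern, notes and results are illustrative, and the last string has extra spaces added (my variation, not in the original) to exercise \s*.

```python
import re

pattern = re.compile(r'(?<!\S)lending\s*qb\b', re.IGNORECASE)

notes = [
    'Lending QB is a great product.',
    'lending qb is great.',
    "I don't think lendingqb is great.",
    'I like Lending QB, but not always.',
    'The best product is Lending qb.',
    'Lending Qb: simply the best',
    "I'm a fan of lending   QB",  # extra spaces added here to exercise \s*
]

# findall returns the matched text exactly as it was typed in each note.
results = [pattern.findall(note) for note in notes]
for note, found in zip(notes, results):
    print(found, '<-', note)
```

Every note yields exactly one match, including the ones where the name is followed by a comma, period, colon, or the end of the string.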

answered Nov 15 '18 at 20:57

Mark_Anderson


The pattern (?!\S) uses a negative lookahead to check that what follows is not a non-whitespace character. A comma or a period is a non-whitespace character, so the match fails when the name is directly followed by one.

What you could do is replace the (?!\S) with a word boundary \b, so that the name cannot be part of a larger word:

(?i)(?<!\S)lending\s?qb\b

Another way could be to use a positive lookahead to check for a whitespace character, a comma, a period, or the end of the string, using (?=[\s,.]|$)

For example:

import re
str5 = "The best product is Lending qb."
print(re.findall(r'(?<!\S)lending\s?qb(?=[\s,.]|$)', str5, re.IGNORECASE)) # ['Lending qb']
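To see the difference between the variants side by side, here is a small sketch. The helper hits and the variable names are made up for illustration; the patterns are the ones from the question and from this answer.

```python
import re

# From the question: qb must be followed by whitespace or the end of the string.
original = r'(?<!\S)lending\s?qb(?!\S)'
# The attempted fix: "," and "." are already covered by \S, so adding them
# to a NEGATIVE lookahead does not allow anything new.
attempted = r'(?<!\S)lending\s?qb(?!\S|[,.])'
# The two alternatives suggested in this answer.
word_boundary = r'(?<!\S)lending\s?qb\b'
lookahead = r'(?<!\S)lending\s?qb(?=[\s,.]|$)'

def hits(pattern, text):
    """Return all case-insensitive matches of pattern in text."""
    return re.findall(pattern, text, re.IGNORECASE)

comma = 'I like Lending QB, but not always.'
colon = 'Lending Qb: simply the best'

print(hits(original, comma))       # []
print(hits(attempted, comma))      # []
print(hits(word_boundary, comma))  # ['Lending QB']
print(hits(lookahead, comma))      # ['Lending QB']
print(hits(word_boundary, colon))  # ['Lending Qb']
print(hits(lookahead, colon))      # []
```

The last two lines show the trade-off: \b accepts any punctuation after the name, while the positive lookahead accepts only the characters listed in it.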

edited Nov 15 '18 at 21:04

answered Nov 15 '18 at 20:18

The fourth bird


This (?!\S) is a forward whitespace boundary.

It is really this (?![^\s]), a negative of a negative, with the added benefit of it matching at the EOS (end of string).

What that means is you can use the negated class form to add characters that qualify as a boundary.

So, just put the period and comma in with the whitespace.

(?i)(?<![^\s,.])lending\s?qb(?![^\s,.])

https://regex101.com/r/BrOj2J/1

As a tutorial point, this concept encapsulates multiple assertions in a single one. It is basic engine Boolean class logic, which speeds up the engine by roughly a factor of ten by comparison.
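A minimal sketch of this boundary in Python follows. The names boundary and extended are illustrative, and the extended variant with additional punctuation is my own assumption about how the class might be widened, not part of the answer.

```python
import re

# Whitespace, comma and period (or start/end of string) count as boundaries.
boundary = re.compile(r'(?<![^\s,.])lending\s?qb(?![^\s,.])', re.IGNORECASE)

print(boundary.findall('I like Lending QB, but not always.'))  # ['Lending QB']
print(boundary.findall('The best product is Lending qb.'))     # ['Lending qb']
print(boundary.findall('lendingqb'))                            # ['lendingqb']
print(boundary.findall('Lending Qb: simply the best'))          # []

# Assumed extension: more boundary characters are added inside the negated class.
extended = re.compile(r'(?<![^\s,.:;!?])lending\s?qb(?![^\s,.:;!?])', re.IGNORECASE)
print(extended.findall('Lending Qb: simply the best'))          # ['Lending Qb']
```

Because the allowed boundary characters live in one character class, widening or narrowing the boundary is a matter of editing that class on both sides.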

edited Nov 15 '18 at 21:07

answered Nov 15 '18 at 20:56

sln


Thank you "The fourth bird", "sln", and "Mark_Anderson". Your answers provided solutions and also were very educational. I went with Mark's answer since it seemed to be the most robust, which is where I'm trying to get to. Ideally, I do want to capture all cases when the product name is mentioned, no matter how messy it's typed.

I changed my code to this:

df['lendingQB'] = df['Text'].str.findall(r'(?i)(?<!\S)lending\s*qb\b', re.IGNORECASE)

answered Nov 15 '18 at 21:25

Amanda


You're welcome. One further thought: findall will just return the literal characters that matched, e.g. Lending QB. From the code snippet I presume a boolean flag might be more useful for you? In that case .str.contains() is a straight replacement for .str.findall(), e.g. df['Text'].str.contains(r'(?i)(?<!\S)lending\s*qb\b'). Note that .str.match() only matches at the start of each string, so it would miss mentions later in the text.

– Mark_Anderson
Nov 15 '18 at 21:41

Thanks, Mark_Anderson. This is very helpful info!

– Amanda
Nov 29 '18 at 14:43
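The boolean-flag idea from the comments can be sketched with plain re, which avoids assuming anything about the DataFrame. The list notes and the names mentions and anchored are made up for illustration. In pandas, Series.str.contains corresponds to re.search and Series.str.match to re.match.

```python
import re

pattern = re.compile(r'(?<!\S)lending\s*qb\b', re.IGNORECASE)

# Hypothetical customer notes standing in for df['Text'].
notes = [
    'The best product is Lending qb.',
    'Customer asked about pricing only.',
    "I don't think lendingqb is great.",
]

# search scans the whole note, so it flags a mention anywhere in the text.
mentions = [bool(pattern.search(note)) for note in notes]
# match only tries position 0, so it misses mentions later in the note.
anchored = [bool(pattern.match(note)) for note in notes]

print(mentions)  # [True, False, True]
print(anchored)  # [False, False, False]
```

The second list shows why an anchored match is not a drop-in replacement for findall when the name can appear anywhere in the note.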
