Regex patterns causing StackOverflowError
I'm working on a project in Java 8 where I need to read an HTML file from either a directory or a link, remove all style and script tags from it, and return what is left. This runs iteratively over a very large number of files.
These are the two regex patterns I currently use to remove those tags:
// remove style tags and their content
update = update.replaceAll("<style\\b[^<]*(?:(?!</style>)<[^<]*)*</style>", "");
// remove script tags and their content
update = update.replaceAll("<script[\\s\\S]*?>[\\s\\S]*?</script>", "");
This works for a while, but occasionally I get a java.lang.StackOverflowError.
I believe this happens when the file is too large. From some research, I found that this can happen when a pattern uses "|", because the alternation operator is matched recursively, which can use a lot of stack depending on how many levels are traversed.
I've run these patterns iteratively over different test files thousands of times without a problem.
My question is: can anyone see whether these patterns use recursion, or anything else that suggests the pattern itself is what's causing the overflow?
If not, perhaps there is a way to reduce the string to a size that wouldn't cause the overflow.
Using print statements, it seems the overflow may be happening while matching this pattern:
"<script[\s\S]*?>[\s\S]*?</script>"
Additionally, I was told I could use this instead:
"<script[\s\S]+?>[\s\S]+?</script>"
because it doesn't look ahead as far. This pattern works in RegExr but did not give the same output once implemented in the Java application.
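One possible reason the `+?` variant gives different output: `+?` must consume at least one character, so an empty `<script></script>` element cannot match on its own and the match runs on to a later `</script>`. The sketch below illustrates this with both patterns from the question (backslashes doubled for a Java string literal). The class name `QuantifierDemo` and the sample input are made up for illustration and are not taken from the original application.

```java
import java.util.regex.Pattern;

final class QuantifierDemo {
    // The two script patterns from the question, written as Java string literals.
    static final Pattern STAR = Pattern.compile("<script[\\s\\S]*?>[\\s\\S]*?</script>");
    static final Pattern PLUS = Pattern.compile("<script[\\s\\S]+?>[\\s\\S]+?</script>");

    // Removes every match of the given pattern from the input.
    static String strip(Pattern p, String html) {
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical input: one empty script element, one non-empty one.
        String html = "<p>a</p><script></script><p>b</p><script>x()</script><p>c</p>";
        System.out.println(strip(STAR, html)); // <p>a</p><p>b</p><p>c</p>
        System.out.println(strip(PLUS, html)); // <p>a</p><p>c</p>
    }
}
```

With `+?`, the first match starts at the empty element and only ends at the second `</script>`, so the `<p>b</p>` between them is removed as well.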
Here is the stack trace I receive:
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$Curly.match0(Pattern.java:4252)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3800)
at java.util.regex.Pattern$Neg.match(Pattern.java:5099)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)
I'm open to any and all advice. Thank you in advance.
java regex string stack-overflow
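A hedged reading of the trace above: the frames `Neg`, `GroupHead`, `Loop` and `GroupTail` correspond to a negative lookahead inside a repeated group, and only the style pattern contains one, in `(?:(?!</style>)<[^<]*)*`. In java.util.regex each iteration of a repeated group adds stack frames, so the depth grows with the number of `<` characters the group has to step over. The sketch below is one way to avoid repeated groups, assuming a lazy quantifier over a single character class is acceptable (as far as I can tell from the Java 8 source, that case is matched in a loop rather than by recursion). The class name `TagStripper` is hypothetical, and the patterns remain a simplification of real HTML: a `>` inside an attribute value, for example, would confuse them.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // DOTALL lets '.' cross line breaks; CASE_INSENSITIVE also matches <STYLE> and <Script>.
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // No repeated group and no alternation: opening tag, lazy body, closing tag.
    private static final Pattern STYLE =
            Pattern.compile("<style\\b[^>]*>.*?</style\\s*>", FLAGS);
    private static final Pattern SCRIPT =
            Pattern.compile("<script\\b[^>]*>.*?</script\\s*>", FLAGS);

    // Removes style elements first, then script elements.
    static String strip(String html) {
        String withoutStyle = STYLE.matcher(html).replaceAll("");
        return SCRIPT.matcher(withoutStyle).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>a</p><style>b{}</style><script src=\"x.js\">y()</script><p>c</p>";
        System.out.println(strip(html)); // <p>a</p><p>c</p>
    }
}
```

Compiling the patterns once into static fields also avoids recompiling them for every file, which `String.replaceAll` does on each call.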
1
What version of Java are you using? There were many regex updates in Java 9, so I'd update if you're using Java 8 (or below) and let us know if the problem persists.
– Jacob G.
Nov 15 '18 at 15:22
Apologies for the lack of info. I'm using Java 8 for this application.
– Jonathan Hinds
Nov 15 '18 at 15:23
2
This is a common problem with regex. Using the library below can help you overcome it: github.com/google/re2j. Quoting the library documentation: "In the worst case, the java.util.regex matcher may run forever, or exceed the available stack space and fail; this will never happen with RE2/J."
– Sofo Gial
Nov 15 '18 at 15:23
Parsing HTML with a regular expression is not advisable. See stackoverflow.com/questions/701166/….
– VGR
Nov 15 '18 at 15:28
1
@JonathanHinds What VGR and most others will tell you is that you are better off treating your document as XML (or really HTML) and using a parser for that in your language to find the element and remove it rather than treating it as a complex string and trying to regex your way through the same process.
– Matthew Green
Nov 15 '18 at 16:46
edited Nov 15 '18 at 16:14 by Nicholas K
asked Nov 15 '18 at 15:20 by Jonathan Hinds
1 Answer
I ended up using a combination of the suggestions from VGR and Matthew Green. RE2/J solved my regex problem and improved the performance of the matching. Ultimately, though, I decided to depend less on regex for this and instead use JSoup for parsing, with regex only to extract what I wanted from the document after the unwanted elements were removed.
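To illustrate the "less regex" direction without adding a dependency, here is a minimal sketch that removes an element by scanning for its opening and closing tags with plain string operations. This is a stand-in technique, not the JSoup code used above and not a replacement for a real HTML parser; the class name `ElementRemover` and its methods are hypothetical.

```java
final class ElementRemover {
    // Removes every <tag ...>...</tag> range by index scanning; no regex, no recursion.
    static String remove(String html, String tag) {
        String open = "<" + tag;
        String close = "</" + tag;
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (true) {
            int start = indexOfIgnoreCase(html, open, pos);
            // Skip longer names that merely begin with the tag, e.g. <styles>.
            while (start >= 0 && !endsTagName(html, start + open.length())) {
                start = indexOfIgnoreCase(html, open, start + 1);
            }
            if (start < 0) break;
            int closeStart = indexOfIgnoreCase(html, close, start);
            if (closeStart < 0) break;            // unclosed element: keep the rest as-is
            int closeEnd = html.indexOf('>', closeStart);
            if (closeEnd < 0) break;
            out.append(html, pos, start);         // keep the text before the element
            pos = closeEnd + 1;                   // continue after the closing tag
        }
        out.append(html, pos, html.length());
        return out.toString();
    }

    // True if the character at index i ends a tag name ('>', '/' or whitespace).
    private static boolean endsTagName(String s, int i) {
        if (i >= s.length()) return false;
        char c = s.charAt(i);
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }

    // Case-insensitive indexOf that never changes string length (unlike toLowerCase).
    private static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<p>a</p><STYLE type=\"x\">b{}</Style ><p>c</p>";
        System.out.println(remove(remove(html, "style"), "script")); // <p>a</p><p>c</p>
    }
}
```

Because the scan is a simple loop, its stack use does not depend on the size of the file; the trade-off is that it knows nothing about comments, CDATA, or attribute values that contain tag-like text.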
answered Nov 17 '18 at 18:00 by Jonathan Hinds