Regex patterns causing a StackOverflowError










I'm working on a project in Java 8 where I fetch an HTML file from either a directory or a link, remove all style and script tags from it, and return what is left. This is done iteratively over a very large number of files.



Right now these are the two different regex patterns I'm using to remove the specified tags.



// remove style tags and style tag content
update = update.replaceAll("<style\\b[^<]*(?:(?!</style>)<[^<]*)*</style>", "");

// remove script tags and script tag content
update = update.replaceAll("<script[\\s\\S]*?>[\\s\\S]*?</script>", "");


This works for a while, but occasionally I get a java.lang.StackOverflowError.



I believe that this happens when the file is too large. I've done some research and found that this can happen if you use "|" in your pattern, because this operator uses recursion which can be memory intensive depending on how many levels are traversed.



I've managed to run these patterns iteratively on different test files thousands of times.



My question is: does anyone see where these patterns would be using recursion, or anything else that would suggest the pattern itself is what's causing the overflow?



If not, perhaps there's a way for me to reduce the string to a size that wouldn't cause this overflow.



Using print statements it seems that the overflow may be happening when trying to match the pattern:



"<script[\s\S]*?>[\s\S]*?</script>"


Additionally, I was told I could use this instead:



"<script[\s\S]+?>[\s\S]+?</script>"


The reasoning was that this doesn't look ahead as far. This pattern works in RegExr but did not give the same output once implemented in the Java application.



Here is the stack trace I receive:



Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$Curly.match0(Pattern.java:4252)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3800)
at java.util.regex.Pattern$Neg.match(Pattern.java:5099)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)


I'm open to any and all advice. Thank you in advance.










java regex string stack-overflow






asked Nov 15 '18 at 15:20 by Jonathan Hinds
edited Nov 15 '18 at 16:14 by Nicholas K







  • What version of Java are you using? There were many regex updates in Java 9, so I'd update if you're using Java 8 (or below) and let us know if the problem persists.

    – Jacob G.
    Nov 15 '18 at 15:22

  • Apologies for the lack of info. I'm using Java 8 for this application.

    – Jonathan Hinds
    Nov 15 '18 at 15:23

  • This is a common problem with regex. Using the library below can help you overcome it: github.com/google/re2j . Quoting the library documentation: "In the worst case, the java.util.regex matcher may run forever, or exceed the available stack space and fail; this will never happen with RE2/J."

    – Sofo Gial
    Nov 15 '18 at 15:23

  • Parsing HTML with a regular expression is not advisable. See stackoverflow.com/questions/701166/….

    – VGR
    Nov 15 '18 at 15:28

  • @JonathanHinds What VGR and most others will tell you is that you are better off treating your document as XML (or really HTML) and using a parser for that in your language to find the element and remove it, rather than treating it as a complex string and trying to regex your way through the same process.

    – Matthew Green
    Nov 15 '18 at 16:46
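
The frames in the question's trace (Curly, Neg, GroupHead, Loop, GroupTail) look like a repeated group containing a negative lookahead, a shape that appears in the style pattern rather than the script pattern. java.util.regex seems to match such a group recursively, so stack depth can grow with the number of iterations. The sketch below is one way to probe that idea, not a confirmed diagnosis. The class name RegexDepthDemo, both helper methods, and the synthetic input (an unclosed <style> followed by many tags) are assumptions made for illustration. Whether the large case overflows depends on the JVM version and the thread stack size (-Xss), so no output is claimed for it.

```java
import java.util.regex.Pattern;

class RegexDepthDemo {
    // The style pattern from the question, with backslashes escaped for a Java string literal.
    static final Pattern STYLE =
            Pattern.compile("<style\\b[^<]*(?:(?!</style>)<[^<]*)*</style>");

    // Returns true if replaceAll threw StackOverflowError, false if it completed normally.
    static boolean overflows(Pattern pattern, String input) {
        try {
            pattern.matcher(input).replaceAll("");
            return false;
        } catch (StackOverflowError e) {
            return true;
        }
    }

    // Hypothetical worst case: a <style> tag that is never closed, followed by many
    // tags, so the repeated group gets one iteration per '<' in the rest of the input.
    static String unclosedStyleFollowedByTags(int tagCount) {
        StringBuilder sb = new StringBuilder("<style>body{}");
        for (int i = 0; i < tagCount; i++) {
            sb.append("<p>x</p>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Well-formed, small input: the group never needs to iterate.
        System.out.println(overflows(STYLE, "<style>a{}</style><p>hi</p>"));
        // Large malformed input: the result depends on the JVM and on -Xss.
        System.out.println(overflows(STYLE, unclosedStyleFollowedByTags(200_000)));
    }
}
```

If the second call reports an overflow on a given JVM, that would be consistent with the failures showing up only on some files, namely those where the group has to iterate many times.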












1 Answer
I ended up using a combination of the suggestions from VGR and Matthew Green. RE2/J solved my regex problem and improved the performance of the matching. Ultimately, though, I decided to depend less on regex for this: I now use JSoup for parsing and for removing the unwanted elements, and regex only to extract what I want from the document afterwards.
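
To illustrate the "depend less on regex" idea without adding a dependency, here is a minimal sketch that removes whole elements with plain index scanning instead of a pattern. It is not JSoup and not a real HTML parser; the class TagStripper and its methods are hypothetical names, and it ignores cases such as a closing tag that appears inside a script string literal. With JSoup itself the removal would presumably be along the lines of parsing with Jsoup.parse(html) and calling select("script, style").remove() on the document.

```java
class TagStripper {
    // Case-insensitive indexOf that keeps indices aligned with the original string.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i <= s.length() - needle.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Removes every <name ...>...</name> block by scanning, so no regex recursion is involved.
    static String stripElement(String html, String name) {
        String open = "<" + name;
        String close = "</" + name + ">";
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;     // start of the text not yet copied to the output
        int search = 0;  // where to look for the next opening tag
        while (true) {
            int start = indexOfIgnoreCase(html, open, search);
            if (start < 0) break;
            int after = start + open.length();
            // Require a tag-name boundary so "<style" does not match "<styles".
            if (after < html.length() && Character.isLetterOrDigit(html.charAt(after))) {
                search = after;
                continue;
            }
            int end = indexOfIgnoreCase(html, close, after);
            if (end < 0) break; // unclosed element: leave the remainder untouched
            out.append(html, pos, start);
            pos = end + close.length();
            search = pos;
        }
        out.append(html, pos, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>a</p><SCRIPT type=\"x\">var s = '<b>';</script><p>b</p>";
        System.out.println(stripElement(stripElement(html, "script"), "style"));
    }
}
```

Because the scan is a simple loop, the stack depth stays constant regardless of file size. The trade-off is that it knows nothing about HTML structure, which is the part a parser such as JSoup handles.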






answered Nov 17 '18 at 18:00 by Jonathan Hinds




























