Extract numerical value before a string in R










I have been mucking around with regex strings and strsplit but can't figure out how to solve my problem.



I have a collection of HTML documents that will always contain the phrase "people own these". I want to extract the number immediately preceding this phrase, e.g. from '732,234 people own these' I'm hoping to capture the number 732,234 (including the comma, though I don't care if it's removed).
The number and phrase are always wrapped in an HTML tag, as in the example string below. I tried using XPath but that seemed even harder than a regex. Any help or advice is greatly appreciated!



example string: >742,811 people own these<



-> 742,811















edited Nov 14 '18 at 2:45









hrbrmstr











asked Nov 14 '18 at 2:19









Permafrost






    Please do not use regular expressions to work with HTML. Can you please post a representative sample of the actual HTML or a link to the source? You should be using XML operations. It's kind of sad two folks are aiding this path fraught with peril.

    – hrbrmstr
    Nov 14 '18 at 2:46
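
For reference, the XML-based route suggested in the comment above could look roughly like the sketch below. The real markup was never posted, so the `<span>` wrapper and the sample document are assumptions made purely for illustration; the idea is to let the xml2 package locate the text node containing the phrase and only then pull the number out of that short string.

```r
library(xml2)

# Hypothetical document: the actual HTML was not shown in the question,
# so the <span> wrapper and surrounding structure are assumptions.
html <- "<html><body><div><span>742,811 people own these</span></div></body></html>"
doc <- read_html(html)

# Select every text node that contains the phrase, whatever element holds it.
nodes <- xml_find_all(doc, "//text()[contains(., 'people own these')]")
txt <- xml_text(nodes)

# Keep what precedes the phrase, then drop the thousands separators.
num_chr <- trimws(sub("people own these.*$", "", txt))
owners <- as.numeric(gsub(",", "", num_chr))
```

This assumes the text node holds nothing but the number before the phrase; if other text can precede the number, a regex on `txt` would still be needed for that last step.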


























2 Answers
Try using str_extract_all from the stringr package:

    library(stringr)
    str_extract_all(data, "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)")
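
A minimal sketch of how this kind of lookahead pattern behaves on a small input, assuming the document has been read in as one string per line. The `page_lines` vector is made up for illustration, and the pattern is written here with quantifier braces and doubled backslashes as R string literals need.

```r
library(stringr)

# Hypothetical input: one element per line of a scraped page.
page_lines <- c("<div>", "<span>742,811 people own these</span>", "</div>")

# 1-3 digits, optional ",ddd" groups, optional decimals, then a lookahead
# that requires the phrase without including it in the match.
pattern <- "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)"

# One list element per input line; lines without a match give character(0).
matches <- str_extract_all(page_lines, pattern)

# unlist() drops the empty elements and leaves only the matched text.
owners <- unlist(matches)
```

Because `str_extract_all` returns one list element per input string, most elements are empty when only one line contains the phrase.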














    edited Nov 14 '18 at 2:32

























    answered Nov 14 '18 at 2:26









    Tim Biegeleisen













    • It works nearly perfectly: it returns a large list of empty character vectors along with the value I'm after. Is there any reason it also seems to return a bunch of NAs?

      – Permafrost
      Nov 14 '18 at 3:02






    • @Permafrost: How could anyone know the answer to that question, given the lack of any example?

      – 42-
      Nov 14 '18 at 3:09











    • @Permafrost I'm not in front of an R console at the moment, but I would suggest starting with a smaller test text, and see if you can do some debugging.

      – Tim Biegeleisen
      Nov 14 '18 at 3:11











    • Thanks very much. I was just after a quick fix, so I used str_extract to return a character vector rather than a list, then used x[!is.na(x)] (where x is the character vector) to remove the NAs and get the final value. I'll keep reading up on XPath, but this is a fast way to get what I need.

      – Permafrost
      Nov 14 '18 at 4:31
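
The quick fix described in the last comment can be sketched as follows. The `page_lines` vector is hypothetical and the final conversion to a number is an optional extra, since the question did not say whether a numeric value was needed.

```r
library(stringr)

# Hypothetical input: one element per line of a scraped page.
page_lines <- c("<div>", "<span>742,811 people own these</span>", "</div>")
pattern <- "\\d{1,3}(?:,\\d{3})*(?:\\.\\d+)?(?= people own these)"

# str_extract() returns a character vector with NA where nothing matched.
x <- str_extract(page_lines, pattern)

# Drop the NAs to keep only the matched value(s).
x <- x[!is.na(x)]

# Optional: strip the thousands separators and convert to a number.
count <- as.numeric(gsub(",", "", x, fixed = TRUE))
```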

















Could you please try the following.

    val <- "742,811 people own these"
    gsub(' [a-zA-Z]+', "", val)

The output will be as follows.

    [1] "742,811"

Explanation: this uses R's gsub (global substitution) function. The pattern matches every occurrence of a space followed by one or more lowercase or uppercase letters and replaces it with an empty string in val.
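
Note that the gsub call above removes every space-plus-letters run, so it only works when the string contains nothing but the number and the phrase. A base R sketch that ties the match to the phrase instead is shown below; the longer `val` string is a made-up example, and the pattern assumes the number consists only of digits and commas.

```r
# Hypothetical input with extra text around the part of interest.
val <- "Some text: 742,811 people own these, apparently"

# A digit followed by any run of digits or commas, then the fixed phrase.
pattern <- "([0-9][0-9,]*) people own these"

# Pull out the whole "number + phrase" match first ...
m <- regmatches(val, regexpr(pattern, val))

# ... then keep only the captured number.
result <- sub(pattern, "\\1", m)
```

If the phrase is absent, `m` is an empty character vector and so is `result`, which makes the no-match case easy to detect.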

















        answered Nov 14 '18 at 2:25









RavinderSingh13
