pdftools: Embedded NUL in string










I'm trying to download a file and read its info automatically from the following link:



http://www.leyes.congreso.gob.pe/Documentos/2016_2021/Proyectos_de_Ley_y_de_Resoluciones_Legislativas/PL0361420181108.pdf



The problem is that when I try to read the information in the PDF, I get an error. It happens intermittently, and I can't see a good reason why. The error appears to be Linux-only.



library(pdftools)
link = "http://www.leyes.congreso.gob.pe/Documentos/2016_2021/Proyectos_de_Ley_y_de_Resoluciones_Legislativas/PL0361420181108.pdf"
download.file(link, "somefile.pdf")
pdf_info("somefile.pdf")
Error in poppler_pdf_info(loadfile(pdf), opw, upw) :
Embedded NUL in string.


What else I've tried:



  • Tried downloading using mode = "wb"

  • Tried downloading with httr using the write_disk method

  • Tried downloading manually on Windows and it works! :(

My suspicion is that it has to do with the way I'm downloading the file, but I don't know what alternatives I should be trying.
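
One way to rule the download in or out is to check that the saved file really starts with the PDF signature. The following is a minimal sketch in base R, not a definitive fix: looks_like_pdf() and download_pdf() are hypothetical helper names introduced here for illustration, and the only assumption about the file is that a valid PDF begins with the bytes "%PDF-".

```r
# Hypothetical helper: TRUE if the file begins with the PDF signature "%PDF-".
looks_like_pdf <- function(path) {
  # Read the first five bytes as raw, so no text decoding is involved.
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Hypothetical wrapper: download in binary mode, then sanity-check the result.
download_pdf <- function(url, dest) {
  download.file(url, dest, mode = "wb", quiet = TRUE)
  if (!looks_like_pdf(dest)) {
    stop("Downloaded file does not start with '%PDF-': ", dest)
  }
  invisible(dest)
}

# Usage (requires network access):
# download_pdf(link, "somefile.pdf")
```

If this check passes and pdf_info() still fails, the download is probably not the culprit.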










      r poppler






asked Nov 14 '18 at 4:16 (edited Nov 14 '18 at 4:26)

Brandon Bertelsen






















          1 Answer
          So, this isn't going to work at all. There is no text in that document except for page break characters. It's all images.



          If rJava works on your system and you are comfortable installing packages from untrusted sources such as GitHub, then you can install pdfbox to validate this, since it's less fragile than pdftools. (Note the security warning there: I haven't updated the pdfbox JARs, but the only vulnerability is a potential process denial of service.)

          I used the httr::write_disk() or curl::curl_download() methods to get the PDF (boy, that takes a while in the U.S., too) and then ran:



          pdfbox::extract_text("~/Downloads/ill-bet-this-is-all-images.pdf")
          ## # A tibble: 14 x 2
          ##     page text
          ##    <int> <chr>
          ##  1     1 "\n"
          ##  2     2 "\n"
          ##  3     3 "\n"
          ##  4     4 "\n"
          ##  5     5 "\n"
          ##  6     6 "\n"
          ##  7     7 "\n"
          ##  8     8 "\n"
          ##  9     9 "\n"
          ## 10    10 "\n"
          ## 11    11 "\n"
          ## 12    12 "\n"
          ## 13    13 "\n"
          ## 14    14 "\n"


          Boom: no text.



          You'll need to use some of the rOpenSci image-to-text OCR tools to get anything meaningful out of that document.

















          answered Nov 14 '18 at 14:01

          hrbrmstr












          • I am indeed using OCR. However, tesseract calls pdf_info() on the document before going through the OCR routine, probably because it needs to know how many pages are in the document. I simplified the problem to this particular error because I thought it was related to how I was downloading (given its intermittent nature).

            – Brandon Bertelsen
            Nov 14 '18 at 21:10











          • No, I'm 99% sure it's pdftools. If you can get rJava working pdfbox is def an alternative.

            – hrbrmstr
            Nov 14 '18 at 21:11











          • The Embedded NUL in string bug has been fixed in pdftools 2.0.

            – Jeroen
            Dec 12 '18 at 14:59
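
          Given the comment above that the bug is fixed in pdftools 2.0, a quick way to tell whether an installation predates the fix is to compare version numbers. This is only a sketch: predates_fix() is a hypothetical helper name, and the "2.0" threshold is taken from that comment rather than verified independently.

```r
# Version in which the bug is reported fixed (taken from the comment above).
fixed_in <- package_version("2.0")

# Hypothetical helper: TRUE when the installed version predates the fix.
# Accepts a version string or the object returned by packageVersion().
predates_fix <- function(installed, fixed = fixed_in) {
  as.package_version(installed) < fixed
}

# Usage, assuming pdftools is installed:
# if (predates_fix(packageVersion("pdftools"))) install.packages("pdftools")
```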
















