pdftools: Embedded NUL in string

I'm trying to download a file and read its info automatically from the following link:

http://www.leyes.congreso.gob.pe/Documentos/2016_2021/Proyectos_de_Ley_y_de_Resoluciones_Legislativas/PL0361420181108.pdf

The problem is that when I try to read the PDF's information, I get an error. It happens intermittently, and I can't see a good reason why. The error appears to be Linux-only.

library(pdftools)
link = "http://www.leyes.congreso.gob.pe/Documentos/2016_2021/Proyectos_de_Ley_y_de_Resoluciones_Legislativas/PL0361420181108.pdf"
download.file(link, "somefile.pdf")
pdf_info("somefile.pdf")
Error in poppler_pdf_info(loadfile(pdf), opw, upw) : 
 Embedded NUL in string.
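
For context, the same class of error can be reproduced in base R, because R character strings cannot hold a zero (NUL) byte. This is only an analogy for what the message means, not a claim about what pdftools does internally; the byte values below are made up for illustration.

```r
# Three bytes: "a", a zero (NUL) byte, "b" (illustrative values only)
bytes <- as.raw(c(0x61, 0x00, 0x62))

# R character strings cannot contain NUL, so the conversion fails;
# capture the error message instead of stopping the script
msg <- tryCatch(
  rawToChar(bytes),
  error = function(e) conditionMessage(e)
)
print(msg)

# Dropping the NUL bytes first lets the conversion succeed
clean <- rawToChar(bytes[bytes != as.raw(0)])
print(clean)
```

If the PDF's metadata fields contain such bytes, any code that turns them into R strings without filtering could fail this way regardless of how the file was downloaded.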

What else I've tried:

Tried downloading using mode = "wb"

Tried downloading with httr using the write_disk method

Tried downloading manually on Windows, and it works! :(

My suspicion is that it has to do with the way I'm downloading the file, but I don't know what alternatives I should be trying.
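
One way to test the download suspicion is to inspect the file on disk before handing it to pdftools. The sketch below uses base R only; `looks_like_pdf()` is a hypothetical helper name, and it assumes the usual layout where a PDF starts with `%PDF-` and carries an `%%EOF` marker within its last kilobyte.

```r
# Hypothetical helper (not part of pdftools): checks whether a file on disk
# looks like a complete PDF by inspecting its first and last bytes.
looks_like_pdf <- function(path) {
  size <- file.info(path)$size
  if (is.na(size) || size < 8) return(FALSE)

  # Read the whole file as raw bytes so NUL bytes cause no trouble
  bytes <- readBin(path, what = "raw", n = size)

  # Most PDFs start with the magic bytes "%PDF-"
  has_header <- identical(bytes[1:5], charToRaw("%PDF-"))

  # A complete file should have "%%EOF" somewhere in its last 1024 bytes
  tail_bytes <- utils::tail(bytes, 1024)
  has_eof <- length(grepRaw("%%EOF", tail_bytes, fixed = TRUE)) > 0

  has_header && has_eof
}

# Usage (commented out because it needs network access):
# download.file(link, "somefile.pdf", mode = "wb")
# looks_like_pdf("somefile.pdf")
```

If this returns TRUE on the machine where `pdf_info()` fails, a truncated or text-mode download becomes an unlikely cause, and the parser is the more likely suspect.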

edited Nov 14 '18 at 4:26

asked Nov 14 '18 at 4:16

Brandon Bertelsen

r poppler


1 Answer

So, this isn't going to work at all. There is no text in that document except for page break characters. It's all images.

If rJava works on your system and you are comfortable installing packages from untrusted sources such as GitHub, then you can install pdfbox to validate this, since it's less fragile than pdftools. (Note the security warning there: I haven't updated the pdfbox JARs, but the only vulnerability is a potential process denial of service.)

After fetching the PDF with httr::write_disk() or curl::curl_download() (boy, that takes a while from the U.S., too), I ran:

pdfbox::extract_text("~/Downloads/ill-bet-this-is-all-images.pdf")
## # A tibble: 14 x 2
##     page text
##    <int> <chr>
##  1     1 "\n"
##  2     2 "\n"
##  3     3 "\n"
##  4     4 "\n"
##  5     5 "\n"
##  6     6 "\n"
##  7     7 "\n"
##  8     8 "\n"
##  9     9 "\n"
## 10    10 "\n"
## 11    11 "\n"
## 12    12 "\n"
## 13    13 "\n"
## 14    14 "\n"

Boom: no text.

You'll need to use some of the rOpenSci image-to-text OCR tools to get anything meaningful out of that document.
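
A rough sketch of that OCR route follows. It assumes the pdftools and tesseract packages are installed, that `pdf_info()` succeeds on the file in the installed pdftools version (`pdf_convert()` may need the page count), and that the Spanish language data was fetched once with `tesseract::tesseract_download("spa")`. The function name `ocr_scanned_pdf()`, the language default, and the 300 dpi setting are illustrative choices, not something from the original answer.

```r
# Hypothetical wrapper: render each page to PNG, then OCR the images.
ocr_scanned_pdf <- function(pdf_path, language = "spa", dpi = 300) {
  stopifnot(file.exists(pdf_path))

  # pdf_convert() writes one image per page (into the working directory
  # by default) and returns the file names
  png_files <- pdftools::pdf_convert(pdf_path, format = "png", dpi = dpi)
  on.exit(unlink(png_files))  # remove the temporary page images afterwards

  engine <- tesseract::tesseract(language)

  # OCR each page image; the result is one character string per page
  vapply(
    png_files,
    function(f) tesseract::ocr(f, engine = engine),
    character(1),
    USE.NAMES = FALSE
  )
}

# Usage (commented out because it needs the packages and the file):
# pages <- ocr_scanned_pdf("somefile.pdf")
# cat(pages[1])
```

A higher dpi generally gives the OCR engine more to work with, at the cost of slower rendering.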

answered Nov 14 '18 at 14:01

hrbrmstr


I am indeed using OCR. However, tesseract calls pdf_info() on the document before going through the OCR routine, probably because it needs to know how many pages are in the document. I simplified the problem to this particular error because I thought it was related to how I was downloading (given its intermittent nature).

– Brandon Bertelsen
Nov 14 '18 at 21:10

No, I'm 99% sure it's pdftools. If you can get rJava working, pdfbox is def an alternative.

– hrbrmstr
Nov 14 '18 at 21:11

The Embedded NUL in string bug has been fixed in pdftools 2.0.

– Jeroen
Dec 12 '18 at 14:59
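
To check whether an installed copy already includes that fix, a small version guard along these lines may help. `has_fixed_pdftools()` is a hypothetical helper name; the "2.0" threshold comes from the comment above.

```r
# Returns TRUE when pdftools is installed and at least `min_version`.
has_fixed_pdftools <- function(min_version = "2.0") {
  requireNamespace("pdftools", quietly = TRUE) &&
    utils::packageVersion("pdftools") >= min_version
}

if (!has_fixed_pdftools()) {
  message("pdftools is missing or older than 2.0; ",
          "consider install.packages(\"pdftools\")")
}
```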

