pdftools: Embedded NUL in string
I'm trying to download a file and read its info automatically, from the following link:
http://www.leyes.congreso.gob.pe/Documentos/2016_2021/Proyectos_de_Ley_y_de_Resoluciones_Legislativas/PL0361420181108.pdf
The problem is that when I try to read the information in the PDF, I get an error. It seems to happen on and off, and I can't see a good reason why. The error appears to be Linux only.
library(pdftools)
link = "http://www.leyes.congreso.gob.pe/Documentos/2016_2021/Proyectos_de_Ley_y_de_Resoluciones_Legislativas/PL0361420181108.pdf"
download.file(link, "somefile.pdf")
pdf_info("somefile.pdf")
Error in poppler_pdf_info(loadfile(pdf), opw, upw) :
Embedded NUL in string.
What else I've tried:
- Tried downloading using mode = "wb"
- Tried downloading with httr using the write_disk method
- Tried downloading manually on Windows and it works! :(
My suspicion is that it has to do with the way I'm downloading the file, but I don't know what alternatives I should be trying.
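One way to test the download suspicion, as a minimal sketch: a PDF file starts with the bytes `%PDF-` and normally carries a `%%EOF` marker near its end, so a truncated or mangled download can often be spotted without calling pdftools at all. The helper name `looks_like_complete_pdf` and the 1 KB tail window are assumptions made for illustration; they are not part of pdftools or any other package.

```r
# Hypothetical helper (not part of pdftools): checks the PDF start and end markers.
looks_like_complete_pdf <- function(path) {
  size <- file.size(path)                      # NA when the file does not exist
  if (is.na(size) || size < 8) return(FALSE)
  bytes <- readBin(path, what = "raw", n = size)

  # A PDF starts with the ASCII bytes "%PDF-".
  has_header <- identical(bytes[1:5], charToRaw("%PDF-"))

  # Look for "%%EOF" within the last 1 KB of the file (assumed window size).
  tail_bytes <- bytes[max(1, length(bytes) - 1023):length(bytes)]
  marker <- charToRaw("%%EOF")
  n <- length(tail_bytes) - length(marker) + 1
  has_eof <- n >= 1 && any(vapply(seq_len(n), function(i) {
    identical(tail_bytes[i:(i + length(marker) - 1)], marker)
  }, logical(1)))

  has_header && has_eof
}

# Usage against the file from the question (not run here):
# download.file(link, "somefile.pdf", mode = "wb")
# looks_like_complete_pdf("somefile.pdf")
```

If this returns TRUE on the Linux machine, the download itself is probably not the culprit and the parser is the more likely suspect.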
r poppler
edited Nov 14 '18 at 4:26
Brandon Bertelsen
asked Nov 14 '18 at 4:16
Brandon Bertelsen
1 Answer
So, this isn't going to work at all. There is no text in that document except for page break characters. It's all images.
If rJava works on your system and you are comfortable installing packages from untrusted sources such as GitHub, then you can install pdfbox to validate this, since it's less fragile than pdftools. (Note the security warning there: I haven't updated the pdfbox JARs, but the only vuln is a potential process denial of service.)
I used the httr::write_disk() or curl::curl_download() methods to get the PDF (boy, that takes a while in the U.S., too) and then did:
pdfbox::extract_text("~/Downloads/ill-bet-this-is-all-images.pdf")
## # A tibble: 14 x 2
##     page text
##    <int> <chr>
##  1     1 "\n"
##  2     2 "\n"
##  3     3 "\n"
##  4     4 "\n"
##  5     5 "\n"
##  6     6 "\n"
##  7     7 "\n"
##  8     8 "\n"
##  9     9 "\n"
## 10    10 "\n"
## 11    11 "\n"
## 12    12 "\n"
## 13    13 "\n"
## 14    14 "\n"
Boom: no text.
You'll need to use some of the rOpenSci image-to-text OCR tools to get anything meaningful out of that document.
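As a rough sketch of that OCR route: pdftools (from 2.0 on) provides `pdf_ocr_text()`, which renders each page and passes it to the tesseract package. The wrapper name `ocr_pdf_pages` and its defaults are made up for illustration, and the Spanish language code in the usage note is an assumption that only works once that tesseract training data is installed.

```r
# Hypothetical wrapper; the name and defaults are illustrative only.
ocr_pdf_pages <- function(path, language = "eng", dpi = 300) {
  stopifnot(file.exists(path))
  for (pkg in c("pdftools", "tesseract")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Package '", pkg, "' is required for OCR.")
    }
  }
  # pdf_ocr_text() returns one character string per page.
  text <- pdftools::pdf_ocr_text(path, language = language, dpi = dpi)
  data.frame(page = seq_along(text), text = text, stringsAsFactors = FALSE)
}

# Usage (not run here; assumes the Spanish training data is installed):
# tesseract::tesseract_download("spa")
# pages <- ocr_pdf_pages("somefile.pdf", language = "spa")
```

Higher `dpi` values tend to give better recognition on scanned pages at the cost of speed, so 300 is only a starting point.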
answered Nov 14 '18 at 14:01
hrbrmstr
I am indeed using OCR. However, tesseract calls pdf_info() on the document before going through the OCR routine, probably because it needs to know how many pages are in the document. I simplified the problem to this particular error as I thought it was related to how I was downloading (given its intermittent nature).
– Brandon Bertelsen
Nov 14 '18 at 21:10
No, I'm 99% sure it's pdftools. If you can get rJava working, pdfbox is def an alternative.
– hrbrmstr
Nov 14 '18 at 21:11
The Embedded NUL in string bug has been fixed in pdftools 2.0.
– Jeroen
Dec 12 '18 at 14:59
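Given that last comment, a quick check of the installed version may save some debugging. This is a small sketch; the function name `pdftools_has_nul_fix` is hypothetical, and the "2.0" threshold is taken from the comment above rather than from a changelog.

```r
# Hypothetical helper: TRUE when pdftools is installed at version 2.0 or later.
pdftools_has_nul_fix <- function(min_version = "2.0") {
  requireNamespace("pdftools", quietly = TRUE) &&
    utils::packageVersion("pdftools") >= min_version
}

# Usage (not run here):
# if (!pdftools_has_nul_fix()) install.packages("pdftools")
```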