Scrapy crawl and downloading particular Type files
I'm very new to this. My task: I want Scrapy to search for and download contracts whose TYPE is EX-10.1, EX-10.2, and so on, up to EX-10.99. The contracts are available in .htm and .txt format.
1. How do I scrape the links and download the files?
2. How do I filter by file type before downloading?
Scrapy should follow this path: start URL -> enter each CIK link -> find and download the EX-10 type files.
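For the type filter (point 2), a small helper could restrict downloads to the EX-10.1 .. EX-10.99 range; a minimal sketch (the function name and regex are my own, not from the question):

```python
import re

# Matches exhibit types EX-10.1 through EX-10.99 (hypothetical helper).
EX10_RE = re.compile(r'^EX-10\.([1-9][0-9]?)$')

def is_ex10_type(doc_type):
    """Return True only for types EX-10.1 .. EX-10.99."""
    return bool(EX10_RE.match(doc_type.strip()))
```

The spider could call such a helper on the Type column of each filing's document table before yielding a download request.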
My code
import urlparse

from scrapy.http import Request
from scrapy.spiders import BaseSpider

class legco(BaseSpider):
    name = "sec_gov"
    allowed_domains = ["www.sec.gov", "search.usa.gov", "secsearch.sec.gov"]
    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

    # Extract the CIK links from the search results page.
    def parse(self, response):
        for link in response.xpath('//div[@id="seriesDiv"]//table[@class="tableFile2"]//a/@href').extract():
            # The hrefs are relative, so resolve them against the response URL.
            yield Request(url=response.urljoin(link), callback=self.parse_page)

    # This method was also named parse, which silently overrode the one
    # above; renamed to parse_page so both callbacks actually run.
    def parse_page(self, response):
        base_url = 'http://www.sec.gov/cgi-bin/browse-edgar'
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.htm'):
                link = urlparse.urljoin(base_url, link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)
What changes do I have to make? Can anyone help me with this issue? Thank you in advance.
python scrapy web-crawler scrapy-spider
What behaviour are you seeing from the above code? Is it erroring out? If so, please add the error to your question. Or is it running fine but producing an empty file? Thank you and welcome to SO.
– Biswanath
Nov 12 at 6:46
Thanks for your answer, Biswanath. When I run this code it downloads the current page's .htm files, but I want Scrapy to go to the next page and download those files too.
– Revathi
Nov 12 at 7:11
Interesting, the posted code does not even work for me. Maybe you posted the wrong code?
– Biswanath
Nov 12 at 7:39
Yes, I'm sorry, I have updated the code. Can you try now? It downloads for the current page, but once I change the URL (to the previous/overall search page) it doesn't get to this page. Can you help me with this?
– Revathi
Nov 12 at 7:59
Was there a reason parse_page was there previously? Currently your code is trying to download all .htm files instead of the files reachable via the Documents button.
– Biswanath
Nov 12 at 8:23
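The pagination issue raised above could be handled by building the next results-page URL directly, since EDGAR's browse results paginate with start/count query parameters; a sketch (the helper name is hypothetical):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_page_url(url, page_size=40):
    """Build the next results-page URL by advancing the 'start' query
    parameter (EDGAR browse results page with start/count)."""
    parts = urlparse(url)
    # Flatten parse_qs's list values to single values.
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    start = int(params.get('start', 0))
    params['start'] = str(start + page_size)
    params['count'] = str(page_size)
    return urlunparse(parts._replace(query=urlencode(params)))
```

The spider's parse method could yield a Request for next_page_url(response.url) after processing each results page, stopping when a page yields no new CIK links.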
edited Nov 13 at 6:30
asked Nov 12 at 5:20
Revathi
63