Scrapy crawl and downloading particular Type files
I'm very new to this. My task: I want Scrapy to search for and download contracts whose TYPE is EX-10.1, EX-10.2, etc., up to EX-10.99. The contracts are available in both .htm and .txt format.
1. How do I scrape the links and download the files?
2. How do I filter by file type before downloading?
The spider should follow this path: start URL -> enter each CIK link -> find and download the EX-10-type files.



My code:



import urlparse

from scrapy.http import Request
from scrapy.spiders import BaseSpider


class legco(BaseSpider):
    name = "sec_gov"

    allowed_domains = ["www.sec.gov", "search.usa.gov", "secsearch.sec.gov"]
    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

    # extract the search results and follow each CIK link
    def parse(self, response):
        for link in response.xpath('//div[@id="seriesDiv"]//table[@class="tableFile2"]//a/@href').extract():
            # the hrefs are relative, so join them against the page URL
            yield Request(url=urlparse.urljoin(response.url, link), callback=self.parse_page)

    # on each filing page, queue every .htm document for download
    # (this was a second method also named "parse", which shadowed the first)
    def parse_page(self, response):
        for link in response.xpath('//a[@href]/@href').extract():
            if link.endswith('.htm'):
                yield Request(urlparse.urljoin(response.url, link), callback=self.save_file)

    # write the response body to a file named after the last URL segment
    def save_file(self, response):
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)


What changes do I have to make? Can anyone help me with this issue, please? Thank you in advance.
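For point 2 (filtering by TYPE), one option is a small helper that accepts only types in the EX-10.1 through EX-10.99 range. This is a minimal sketch, not tied to any particular EDGAR markup; it assumes you have already extracted the TYPE string (e.g. from the Type column of a filing-index table):

```python
import re

# Matches EX-10.1 .. EX-10.99: "EX-10.", then one or two digits
# with no leading zero.
EX10_RE = re.compile(r'^EX-10\.(?:[1-9][0-9]?)$')

def is_ex10(doc_type):
    """Return True when a filing document TYPE falls in EX-10.1..EX-10.99."""
    return bool(EX10_RE.match(doc_type.strip()))
```

The download callback can then be guarded with `if is_ex10(row_type): yield Request(...)` so only the wanted exhibit types are fetched.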
  • What behaviour are you seeing from the above code? Is it erroring out? If so, please add the error to your question. Or is it running fine but producing an empty file? Thank you, and welcome to SO.
    – Biswanath
    Nov 12 at 6:46










  • Thanks for your answer, Biswanath. When I run this code, it downloads the current page's .htm files, but I want Scrapy to go to the next page and download the files there too.
    – Revathi
    Nov 12 at 7:11










  • Interesting, the code posted above does not even work for me. Maybe you posted the wrong code?
    – Biswanath
    Nov 12 at 7:39










  • Yes, I'm sorry, I have updated the code; can you try now? It downloads from the current page, but once I change the URL (to the previous, overall search page) it does not get to this page. Can you help me with this?
    – Revathi
    Nov 12 at 7:59











  • Was there a reason parse_page was there previously? Currently your code tries to download all .htm files instead of the files reachable through the Documents button.
    – Biswanath
    Nov 12 at 8:23
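On the pagination question raised in the comments: EDGAR's company-search results are paged with `start` and `count` query parameters, so one option (a sketch assuming that parameter scheme; written for Python 3, where these functions live in `urllib.parse` rather than `urlparse`) is to compute the next page's URL and request it with the same callback:

```python
import urllib.parse

def next_page_url(url, page_size=40):
    """Build the next EDGAR results-page URL by advancing the 'start'
    query parameter and pinning 'count' to the page size."""
    parts = urllib.parse.urlsplit(url)
    query = dict(urllib.parse.parse_qsl(parts.query, keep_blank_values=True))
    start = int(query.get('start', '0') or 0)
    query['start'] = str(start + page_size)
    query['count'] = str(page_size)
    return urllib.parse.urlunsplit(
        parts._replace(query=urllib.parse.urlencode(query)))
```

Inside `parse`, the spider could then `yield Request(next_page_url(response.url), callback=self.parse)` for as long as the current page still produced result rows.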
python scrapy web-crawler scrapy-spider
edited Nov 13 at 6:30
asked Nov 12 at 5:20
Revathi