Scrapy crawl and downloading particular Type files
I'm very new to this. My task: I want Scrapy to search for and download contracts whose TYPE is EX-10.1, EX-10.2, etc., up to EX-10.99. The contracts are available in both .htm and .txt format.
1. How do I scrape the links and download the files?
2. How do I filter by file type before downloading?
The spider should follow this path: start URL -> enter each CIK link -> find and download the EX-10-type files.



My code:



import urlparse

from scrapy.http import Request
from scrapy.spiders import BaseSpider


class legco(BaseSpider):
    name = "sec_gov"

    allowed_domains = ["www.sec.gov", "search.usa.gov", "secsearch.sec.gov"]
    start_urls = ["https://www.sec.gov/cgi-bin/browse-edgar?company=&match=&CIK=&filenum=&State=&Country=&SIC=2834&owner=exclude&Find=Find+Companies&action=getcompany"]

    # extract the search results and follow each CIK link
    def parse(self, response):
        for link in response.xpath('//div[@id="seriesDiv"]//table[@class="tableFile2"]//a/@href').extract():
            # the hrefs are relative, so join them against the page URL
            yield Request(url=urlparse.urljoin(response.url, link), callback=self.parse_page)

    # on each filing page, queue every .htm document for download
    # (this was a second method also named "parse", which shadowed the first)
    def parse_page(self, response):
        for link in response.xpath('//a[@href]/@href').extract():
            if link.endswith('.htm'):
                yield Request(urlparse.urljoin(response.url, link), callback=self.save_file)

    # write the response body to a file named after the last URL segment
    def save_file(self, response):
        path = response.url.split('/')[-1]
        with open(path, 'wb') as f:
            f.write(response.body)


What changes do I have to make? Can anyone help me with this issue, please? Thank you in advance.
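For point 2 (filtering by TYPE), one option is a small helper that accepts only types in the EX-10.1 through EX-10.99 range. This is a minimal sketch, not tied to any particular EDGAR markup; it assumes you have already extracted the TYPE string (e.g. from the Type column of a filing-index table):

```python
import re

# Matches EX-10.1 .. EX-10.99: "EX-10.", then one or two digits
# with no leading zero.
EX10_RE = re.compile(r'^EX-10\.(?:[1-9][0-9]?)$')

def is_ex10(doc_type):
    """Return True when a filing document TYPE falls in EX-10.1..EX-10.99."""
    return bool(EX10_RE.match(doc_type.strip()))
```

The download callback can then be guarded with `if is_ex10(row_type): yield Request(...)` so only the wanted exhibit types are fetched.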
  • What behaviour are you seeing from the above code? Is it erroring out? If so, please add the error to your question. Or is it running fine but producing an empty file? Thank you, and welcome to SO.
    – Biswanath
    Nov 12 at 6:46










  • Thanks for your answer, Biswanath. When I run this code, it downloads the current page's .htm files, but I want Scrapy to go to the next page and download the files there too.
    – Revathi
    Nov 12 at 7:11










  • Interesting, the code posted above does not even work for me. Maybe you posted the wrong code?
    – Biswanath
    Nov 12 at 7:39










  • Yes, I'm sorry, I have updated the code; can you try now? It downloads from the current page, but once I change the URL (to the previous, overall search page) it does not get to this page. Can you help me with this?
    – Revathi
    Nov 12 at 7:59











  • Was there a reason parse_page was there previously? Currently your code tries to download all .htm files instead of the files reachable through the Documents button.
    – Biswanath
    Nov 12 at 8:23
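On the pagination question raised in the comments: EDGAR's company-search results are paged with `start` and `count` query parameters, so one option (a sketch assuming that parameter scheme; written for Python 3, where these functions live in `urllib.parse` rather than `urlparse`) is to compute the next page's URL and request it with the same callback:

```python
import urllib.parse

def next_page_url(url, page_size=40):
    """Build the next EDGAR results-page URL by advancing the 'start'
    query parameter and pinning 'count' to the page size."""
    parts = urllib.parse.urlsplit(url)
    query = dict(urllib.parse.parse_qsl(parts.query, keep_blank_values=True))
    start = int(query.get('start', '0') or 0)
    query['start'] = str(start + page_size)
    query['count'] = str(page_size)
    return urllib.parse.urlunsplit(
        parts._replace(query=urllib.parse.urlencode(query)))
```

Inside `parse`, the spider could then `yield Request(next_page_url(response.url), callback=self.parse)` for as long as the current page still produced result rows.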
python scrapy web-crawler scrapy-spider
edited Nov 13 at 6:30
asked Nov 12 at 5:20
Revathi