Implementing scrapy rules by overriding CrawlSpider __init__() method
I'm trying to override the __init__() method of a CrawlSpider so that I can pass in the domain name and start page. However, I can't seem to get the rules to take effect.
I have tried the approach suggested here (Scrapy: Rules set inside __init__ are ignored by CrawlSpider) and defined the rules before the super() call, but it doesn't seem to work.
Here is my spider:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SomeSpider(CrawlSpider):
    name = 'some_s'

    def __init__(self, *args, **kwargs):
        self.allowed_domains = kwargs.get('FIRST_DOMAIN')[1:-1]
        self.start_urls = [kwargs.get('FIRST_PAGE')[1:-1]]
        self.rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True),
        )
        super(SomeSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        i = {}  # collect the scraped fields in a plain dict
        i['url'] = response.url
        return i
I pass these values on the command line, but the crawl stops after the first page:
$ scrapy crawl some_s -a FIRST_PAGE='https://www.wikipedia.org/' -a FIRST_DOMAIN='wikipedia.org'
This is the log:
2018-11-10 14:07:26 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: foo)
2018-11-10 14:07:26 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:06:47) [MSC v.1914 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2018-11-10 14:07:26 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'foo', 'NEWSPIDER_MODULE': 'foo.spiders', 'SPIDER_MODULES': ['foo.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0'}
2018-11-10 14:07:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-11-10 14:07:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'foo.middlewares.FooDownloaderMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-10 14:07:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-10 14:07:26 [scrapy.middleware] INFO: Enabled item pipelines:
['foo.pipelines.FooPipeline']
2018-11-10 14:07:26 [scrapy.core.engine] INFO: Spider opened
2018-11-10 14:07:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-10 14:07:26 [some_s] INFO: Spider opened: some_s
2018-11-10 14:07:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-10 14:07:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.wikipedia.org/> (referer: None)
2018-11-10 14:07:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'en.wikipedia.org': <GET https://en.wikipedia.org/>
.
.
.
2018-11-10 14:07:27 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'creativecommons.org': <GET https://creativecommons.org/licenses/by-sa/3.0/>
2018-11-10 14:07:27 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-10 14:07:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 260,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 19485,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 10, 13, 7, 27, 704476),
'log_count/DEBUG': 298,
'log_count/INFO': 8,
'offsite/domains': 296,
'offsite/filtered': 310,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 11, 10, 13, 7, 26, 767970)}
2018-11-10 14:07:27 [scrapy.core.engine] INFO: Spider closed (finished)
python scrapy scrapy-spider super
asked Nov 10 at 13:14 by T the shirt
1 Answer
It's probably something wrong with your allowed_domains; make sure it's a well-formed list. If I try this, it works fine:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class AnimeSpider(CrawlSpider):
    name = "Anime"

    def __init__(self, *args, **kwargs):
        self.allowed_domains = ['myanimelist.net']
        self.start_urls = ['https://myanimelist.net/anime.php']
        self.rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True),
        )
        super(AnimeSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        yield {'url': response.url}

answered Nov 10 at 14:19 by Guillaume
Right you are, after changing it to: self.allowed_domains = [kwargs.get('FIRST_DOMAIN')] it worked. Thanks
– T the shirt
Nov 10 at 14:39
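For reference, here is a minimal sketch of the question's spider with the fix from the comment applied, i.e. allowed_domains built as a proper list from the FIRST_DOMAIN argument. Whether the [1:-1] quote-stripping from the original code is still needed depends on how your shell passes the quoted arguments, so treat that slice as an assumption rather than a requirement.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class SomeSpider(CrawlSpider):
    name = 'some_s'

    def __init__(self, *args, **kwargs):
        # allowed_domains must be a list of domain strings; a bare string here is
        # the likely reason every extracted link was filtered as offsite in the log.
        self.allowed_domains = [kwargs.get('FIRST_DOMAIN')]
        # start_urls kept as in the question; the [1:-1] slice strips surrounding
        # quote characters if the shell passes them through (assumption).
        self.start_urls = [kwargs.get('FIRST_PAGE')[1:-1]]
        self.rules = (
            Rule(LinkExtractor(), callback='parse_item', follow=True),
        )
        super(SomeSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        yield {'url': response.url}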