How to Run Multiple Scrapy Spiders at the Same Time

A project sometimes contains more than one spider. By defining a custom project command you can run all of them with a single invocation.

scrapy list (run from the project root) shows which spiders the current project contains.
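For reference, the same list can be produced from Python. This is a minimal sketch, assuming it is run from the project root so the project settings can be found; the printed names depend on your own project:

    from scrapy.spiderloader import SpiderLoader
    from scrapy.utils.project import get_project_settings

    # Build a spider loader from the project settings and print the spider names,
    # which is essentially what "scrapy list" does.
    spider_loader = SpiderLoader.from_settings(get_project_settings())
    print(spider_loader.list())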

  1. Create a commands directory
    Create a commands directory under the project root.
  2. Create a crawlall.py file
    The code is as follows:

    from scrapy.commands import ScrapyCommand, UsageError
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.conf import arglist_to_dict


    class Command(ScrapyCommand):
        requires_project = True

        def syntax(self):
            return '[options]'

        def short_desc(self):
            return 'Runs all of the spiders'

        def add_options(self, parser):
            ScrapyCommand.add_options(self, parser)
            parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                              help="set spider argument (may be repeated)")
            parser.add_option("-o", "--output", metavar="FILE",
                              help="dump scraped items into FILE (use - for stdout)")
            parser.add_option("-t", "--output-format", metavar="FORMAT",
                              help="format to use for dumping items with -o")

        def process_options(self, args, opts):
            ScrapyCommand.process_options(self, args, opts)
            try:
                opts.spargs = arglist_to_dict(opts.spargs)
            except ValueError:
                raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

        def run(self, args, opts):
            # settings = get_project_settings()
            spider_loader = self.crawler_process.spider_loader
            # Schedule every spider in the project (or only those named on the command line).
            for spidername in args or spider_loader.list():
                print("*********crawlall spidername************" + spidername)
                self.crawler_process.crawl(spidername, **opts.spargs)
            # Start all scheduled crawls; this blocks until they have all finished.
            self.crawler_process.start()

    The key point is self.crawler_process.spider_loader.list(), which returns all of the spiders in the project; each one is then scheduled with self.crawler_process.crawl (a standalone-script version of the same idea is sketched after this list).

  3. Create an __init__.py file in the commands directory (it can be empty)
  4. Add the setting COMMANDS_MODULE = 'cnblogs.commands' to settings.py (here cnblogs is the project package name)
  5. Run scrapy crawlall from the command line
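The custom command above simply combines spider_loader.list() with crawler_process.crawl(). The same mechanism also works from a plain standalone script instead of a custom command. This is a minimal sketch, assuming it is run from the project root so get_project_settings() can locate settings.py:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Load the project settings and create a process that can run several crawlers.
    process = CrawlerProcess(get_project_settings())

    # Schedule every spider found in the project, then start them all;
    # start() blocks until every scheduled spider has finished.
    for spidername in process.spider_loader.list():
        process.crawl(spidername)
    process.start()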

Adapted from: Several ways to run multiple Scrapy spiders at the same time (custom Scrapy project commands)