
How to integrate Flask & Scrapy?


Question: How do I integrate Flask & Scrapy?

I'm using Scrapy to fetch data, and I want to use the Flask web framework to display the results on a web page, but I don't know how to call the spiders from the Flask application. I've tried using CrawlerProcess to call my spiders, but I get an error like this:

ValueError: signal only works in main thread

Traceback (most recent call last)
  File "/Library/Python/2.7/site-packages/flask/app.py", line 1836, in __call__
    return self.wsgi_app(environ, start_response)
  File "/Library/Python/2.7/site-packages/flask/app.py", line 1820, in wsgi_app
    response = self.make_response(self.handle_exception(e))
  File "/Library/Python/2.7/site-packages/flask/app.py", line 1403, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/Library/Python/2.7/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/Library/Python/2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Library/Python/2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/Library/Python/2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Library/Python/2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/Users/Rabbit/PycharmProjects/Flask_template/FlaskTemplate.py", line 102, in index
    process = CrawlerProcess()
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 210, in __init__
    install_shutdown_handlers(self._signal_shutdown)
  File "/Library/Python/2.7/site-packages/scrapy/utils/ossignal.py", line 21, in install_shutdown_handlers
    reactor._handleSignals()
  File "/Library/Python/2.7/site-packages/twisted/internet/posixbase.py", line 295, in _handleSignals
    _SignalReactorMixin._handleSignals(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1154, in _handleSignals
    signal.signal(signal.SIGINT, self.sigInt)
ValueError: signal only works in main thread

My Scrapy code looks like this:

from scrapy import Item, Field, Spider, Request
from scrapy.selector import Selector


class EPGD(Item):
    genID = Field()
    genID_url = Field()
    taxID = Field()
    taxID_url = Field()
    familyID = Field()
    familyID_url = Field()
    chromosome = Field()
    symbol = Field()
    description = Field()


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=" + term + "&submit=Feeling+Lucky"]

    # DB_Con is the asker's own database helper (not shown)
    db = DB_Con()
    collection = db.getcollection(name, term)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url + map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url + map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            self.collection.update({"genID": item['genID']}, dict(item), upsert=True)
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i + 1 < len(url_list[0]):
                    print url_list[0][i + 1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i + 1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

My Flask code looks like this:

@app.route('/', methods=['GET', 'POST'])
def index():
    process = CrawlerProcess()
    process.crawl(EPGD_spider)
    return redirect(url_for('details'))


@app.route('/details', methods=['GET'])
def epgd():
    if request.method == 'GET':
        results = db['EPGD_test'].find()
        json_results = []
        for result in results:
            json_results.append(result)
        return toJson(json_results)

How can I call my Scrapy spiders when using the Flask web framework?

Answer:

Putting an HTTP server in front of your spiders is not that easy. The error you are seeing happens because CrawlerProcess tries to install shutdown signal handlers, and signal.signal() may only be called from the main thread, while Flask dispatches each request in a worker thread. There are a couple of options.

1. Python subprocess

If you are really limited to Flask and cannot use anything else, then the only way to integrate Scrapy with Flask is to launch an external process for every spider crawl, as the other answer suggests (note that your subprocess needs to be spawned in the proper Scrapy project directory).

The directory structure for all of the examples should look like this (I'm using the dirbot test project):

> tree -L 1

├── dirbot
├── README.rst
├── scrapy.cfg
├── server.py
└── setup.py

Here's a code sample that launches Scrapy in a new process:

# server.py
import subprocess

from flask import Flask

app = Flask(__name__)


@app.route('/')
def hello_world():
    """
    Run spider in another process and store items in file.
    Simply issue command:

    > scrapy crawl dmoz -o "output.json"

    wait for this command to finish, and read output.json to client.
    """
    spider_name = "dmoz"
    subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
    with open("output.json") as items_file:
        return items_file.read()


if __name__ == '__main__':
    app.run(debug=True)

Save the above as server.py, visit localhost:5000, and you should see the scraped items.
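If your Flask app does not itself live in the Scrapy project directory, a minimal variant of the same idea is sketched below. It is an illustration under stated assumptions, not part of the original answer: the project path is a placeholder, and a per-request temporary output file is used so concurrent requests do not overwrite each other's output.json.

# server_subprocess.py - a sketch, not the answer's exact code
import os
import subprocess
import tempfile
import uuid

from flask import Flask

app = Flask(__name__)

# Placeholder: directory that contains your project's scrapy.cfg
SCRAPY_PROJECT_DIR = "/path/to/dirbot"


@app.route('/')
def crawl():
    # Unique output file per request so parallel crawls don't collide
    output_path = os.path.join(tempfile.gettempdir(), "items-%s.json" % uuid.uuid4())
    try:
        subprocess.check_output(
            ['scrapy', 'crawl', 'dmoz', '-o', output_path],
            cwd=SCRAPY_PROJECT_DIR,  # spawn inside the Scrapy project directory
        )
        with open(output_path) as items_file:
            return items_file.read()
    finally:
        if os.path.exists(output_path):
            os.remove(output_path)


if __name__ == '__main__':
    app.run(debug=True)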

2. Twisted-Klein + Scrapy

A better option is to use one of the existing projects that integrate Twisted with Werkzeug and expose an API similar to Flask, e.g. Twisted-Klein. Twisted-Klein lets you run your spiders asynchronously in the same process as your web server. Better still, it won't block on every request, and it lets you simply return Scrapy/Twisted Deferreds from the HTTP route handler.

The following snippet integrates Twisted-Klein with Scrapy. Note that you need to create your own subclass of CrawlerRunner so that the crawler collects the items and returns them to the caller. This option is a bit more advanced: the Scrapy spiders run in the same process as the Python server, and items are kept in memory instead of being stored in a file (so there is no disk writing/reading as in the previous example). Most importantly, it is asynchronous, and everything runs in one Twisted reactor.

# server.py
import json

from klein import route, run
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from dirbot.spiders.dmoz import DmozSpider


class MyCrawlerRunner(CrawlerRunner):
    """
    Crawler object that collects items and returns output after finishing crawl.
    """
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (Same as in base CrawlerProcess)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create Twisted.Deferred launching crawl
        dfd = self._crawl(crawler, *args, **kwargs)

        # add callback - when crawl is done call return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    """
    :param output: items scraped by CrawlerRunner
    :return: json with list of items
    """
    # this just turns items into dictionaries
    # you may want to use Scrapy JSON serializer here
    return json.dumps([dict(item) for item in output])


@route("/")
def schedule(request):
    runner = MyCrawlerRunner()
    spider = DmozSpider()
    deferred = runner.crawl(spider)
    deferred.addCallback(return_spider_output)
    return deferred


run("localhost", 8080)

Save the above in a file named server.py and place it in your Scrapy project directory. Now open localhost:8080; it will launch the dmoz spider and return the scraped items to the browser as JSON.
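To adapt this to the EPGD_spider from the question, the same runner can be reused. The route below is only a hypothetical sketch meant to be added to the server.py above (before the run("localhost", 8080) line); it assumes EPGD_spider is refactored to build its start URL from a term argument received by its constructor rather than the hard-coded class attribute term = "man", and the import path is a placeholder.

# Sketch: add to the server.py above, before run("localhost", 8080).
# Placeholder import path - adjust to your actual project layout.
from yourproject.spiders.epgd import EPGD_spider


@route("/epgd")
def schedule_epgd(request):
    runner = MyCrawlerRunner()
    # Keyword arguments given to crawl() are forwarded to the spider's
    # __init__, so a refactored EPGD_spider can build its start URL from `term`.
    deferred = runner.crawl(EPGD_spider, term="man")
    deferred.addCallback(return_spider_output)
    return deferred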

3. ScrapyRT

There are other problems that show up when you try to put an HTTP app in front of your spiders: for example, you sometimes need to handle spider logs (you may need them in some cases), you need to handle spider exceptions somehow, and so on. There are projects that let you add an HTTP API to your spiders in a simpler way, e.g. ScrapyRT. This is an app that adds an HTTP server to your Scrapy spiders and handles all of those problems for you (e.g. handling logging, handling spider errors, etc.).

So after installing ScrapyRT you only need to run:

> scrapyrt

in your Scrapy project directory, and it will start an HTTP server listening for requests for you. You then visit http://localhost:9080/crawl.json?spider_name=dmoz&url=http://alfa.com and it should launch your spider to crawl the URL given.
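If you still want Flask as the front end, one rough way to tie the two together (a sketch, not part of the original answer) is a Flask view that forwards the request to the running ScrapyRT server and relays the JSON it returns. It assumes ScrapyRT is listening on its default port 9080 as above and that the requests library is installed.

# flask_frontend.py - sketch of a Flask view that proxies to ScrapyRT
# Assumes ScrapyRT is already running in the Scrapy project directory
# (the `> scrapyrt` command above) and `requests` is installed.
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

SCRAPYRT_URL = "http://localhost:9080/crawl.json"


@app.route('/crawl')
def crawl():
    # e.g. /crawl?url=http://alfa.com  (spider name taken from the answer's example)
    target_url = request.args.get('url', 'http://alfa.com')
    resp = requests.get(
        SCRAPYRT_URL,
        params={'spider_name': 'dmoz', 'url': target_url},
        timeout=300,  # crawling can take a while
    )
    data = resp.json()
    # ScrapyRT's JSON response carries the scraped items; relay them.
    return jsonify(data.get('items', []))


if __name__ == '__main__':
    app.run(debug=True)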

Disclaimer: I'm one of the authors of ScrapyRT.
