
Scrapy (Part 2): Example 2

Source: internet. Collected by: 自由互联. Published: 2022-10-26


Let's continue with Scrapy, the crawling tool.

1. The shell and Selector objects

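The Scrapy shell is an interactive terminal that makes it easy to debug selectors against a live page. Start it with: scrapy shell url.

If the URL contains query parameters, wrap it in quotes, for example scrapy shell "https://hr.tencent.com/position.php?&start=0#a". Otherwise the command-line shell interprets characters such as & itself instead of passing them to Scrapy.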

2. Selectors

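Selector
xpath: select nodes with an XPath expression
css: select nodes with a CSS selector
re: extract text with a regular expression
extract: return the matched data as a list of unicode strings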

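When scraping, the data is sometimes split up. By "split" I mean that it lives on different pages, yet we want to save it together in one file.
In that case we use the technique covered in this post: meta.
We can pass the values scraped from one page to the callback for a second page and save them together with the second page's data.
This is the one method I have learned so far; there may be better ones later. That is the first topic of this post.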

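The second topic is a change to how the pipeline writes the output file. At first we opened the file on every write, so the file was opened as many times as there were items, which hurts the crawler's write performance. The second approach in this post opens the file once when the spider starts and closes it once when the spider finishes.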

The example code follows.

Spider code; I will split it into pieces and explain each one.

# -*- coding: utf-8 -*-
import scrapy
from ..items import TencentItem
class TencentSpider(scrapy.Spider):

This is the file header and needs little explanation.

    name = 'tencent'
    # allowed_domains takes domain names, not full URLs
    # allowed_domains = ['hr.tencent.com']
    start_urls = ['https://hr.tencent.com/position.php?&start=0#a']
    base_url = 'https://hr.tencent.com/'

start_urls lists the URLs of the first pages to crawl. Scrapy's default start-up logic reads this attribute, so it must be named exactly start_urls.

    def parse(self, response):
        names = response.xpath('//tr[@class="even"]/td[1]/a/text() | //tr[@class="odd"]/td[1]/a/text()').extract()
        types = response.xpath('//tr[@class="even"]/td[2]/text() | //tr[@class="odd"]/td[2]/text()').extract()
        nums = response.xpath('//tr[@class="even"]/td[3]/text() | //tr[@class="odd"]/td[3]/text()').extract()
        address = response.xpath('//tr[@class="even"]/td[4]/text() | //tr[@class="odd"]/td[4]/text()').extract()
        times = response.xpath('//tr[@class="even"]/td[5]/text() | //tr[@class="odd"]/td[5]/text()').extract()
        info_urls = response.xpath('//tr[@class="even"]/td[1]/a/@href | //tr[@class="odd"]/td[1]/a/@href').extract()

response is the page returned after requesting start_urls. We use XPath to locate the elements we want and store each column in its own variable.

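        for name, job_type, num, addr, post_time, info_url in zip(names, types, nums, address, times, info_urls):
            # Build a fresh dict per row; one shared dict would be overwritten
            # by later rows before the detail-page callbacks run.
            items = dict()
            items['name'] = name
            items['type'] = job_type
            items['num'] = num
            items['address'] = addr
            items['time'] = post_time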

We store each row's data in a dict so that it is easy to save later.
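One detail worth checking with this pattern: the detail-page callbacks run after the loop has already moved on, so each row needs its own dict. The sketch below uses plain Python and invented rows to show what happens when a single dict is reused across iterations.

```python
# Invented sample rows standing in for the scraped columns.
ROWS = [("Engineer", "Shenzhen"), ("Designer", "Beijing")]

# Reusing one dict: every stored reference points at the same object,
# so all entries show the values of the last row.
shared = {}
aliased = []
for name, address in ROWS:
    shared["name"] = name
    shared["address"] = address
    aliased.append(shared)

# Creating a dict per row keeps each row's values intact.
fresh = []
for name, address in ROWS:
    fresh.append({"name": name, "address": address})

print(aliased)
print(fresh)
```

The same aliasing applies when the dict is stored in request.meta, because meta keeps a reference to the dict rather than a copy.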

            request = scrapy.Request(url=self.base_url + info_url, callback=self.info_pares)
            request.meta['item'] = items  # pass the row data to the next callback
            yield request
        # Next page
        next_url = response.xpath('//a[@id="next"]/@href').extract_first()
        if next_url:
            yield scrapy.Request(url=self.base_url + next_url, callback=self.parse)

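To crawl paginated data you need to know the pagination rule. Following the "next" link, as done here, is a fairly neat way to request each page. See whether you can find other approaches, and leave a comment.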
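One other approach is to build the page URLs directly from the start offset in the query string. The sketch below assumes a page size of 10 and a known total number of records; neither value comes from the original post, so treat both as placeholders. It also uses urljoin from the standard library, which resolves a relative link against a base URL.

```python
from urllib.parse import urljoin

BASE_URL = "https://hr.tencent.com/"


def page_urls(total, page_size=10):
    """Build list-page URLs by stepping the start offset.

    total and page_size are assumptions for illustration only.
    """
    return [
        urljoin(BASE_URL, f"position.php?&start={start}#a")
        for start in range(0, total, page_size)
    ]


urls = page_urls(total=25)
print(urls)
```

Following the "next" link needs no knowledge of the total, while generating offsets lets all list pages be requested up front.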

    def info_pares(self, response):
        info = dict()
        info['squareli'] = response.xpath('//table[@class="tablelist textl"]/tr[3]/td/ul/li/text()').extract()
        info['lightblue'] = response.xpath('//table[@class="tablelist textl"]/tr[4]/td/ul/li/text()').extract()
        items = response.meta['item']  # receive the values passed from parse
        item = TencentItem()
        item['name'] = items['name']
        item['type'] = items['type']
        item['num'] = items['num']
        item['address'] = items['address']
        item['time'] = items['time']
        item['info'] = info
        yield item

Once the values are received, the item is handed to the pipeline, which saves it to a database or a file.

Pipelines code:

# -*- coding: utf-8 -*-
import json
class MyspiderPipeline(object):

Open the file when the spider starts:

    def open_spider(self, spider):
        self.f = open('item.json', 'w', encoding='utf-8')

Every pipeline must implement this method, process_item:

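    def process_item(self, item, spider):
        # Earlier approach: reopen the file for every item.
        # with open('item.json', 'a+', encoding='utf-8') as f:
        #     # f.write(json.dumps(item) + '\n')  # old way
        #     f.write(str(item) + '\n')  # new way
        #     return item
        # The pipeline must be enabled before anything is written to the file;
        # it is enabled in settings (line 67 of the author's settings file).
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item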

Close the file when the spider finishes:

    def close_spider(self, spider):
        self.f.close()
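The open-once, close-once lifecycle can be exercised by hand, since a pipeline is an ordinary Python class. This is a sketch under assumptions: the class name JsonLinesPipeline, the path argument, and the sample items are invented for the demo, and the calls that Scrapy would normally make are issued manually.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Same shape as the pipeline in this post; the path argument is a demo-only addition."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item; writes one JSON object per line.
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.f.close()


def run_demo():
    """Drive the pipeline the way Scrapy would, then read the file back."""
    path = os.path.join(tempfile.mkdtemp(), "item.json")
    pipeline = JsonLinesPipeline(path)
    pipeline.open_spider(spider=None)
    for item in ({"name": "工程师"}, {"name": "Designer"}):
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]


written = run_demo()
print(written)
```

In a real project the pipeline only runs once it is listed in the ITEM_PIPELINES setting, for example ITEM_PIPELINES = {'myspider.pipelines.MyspiderPipeline': 300}, where the module path myspider.pipelines is a guess at the project layout.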