在爬取网页数据之前,需要对网页进行分析,不断翻页我们可以发现网页为GET请求,URL有如下规律:
'https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,4.html?'
'https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,3.html?'
'https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,2.html?'
'https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,1.html?'
[](()数据获取
分析网页后,我们要确立爬取思路、爬取字段、使用工具等,详情如下:
-
爬取思路:先针对某一页数据的一级页面做一个解析,然后再进行二级页面做一个解析,最后再进行翻页操作;
-
爬取字段:公司名、岗位名、工作地址、薪资、发布时间、工作描述、公司类型、员工人数、所属行业;
-
使用工具:Python+requests+lxml+pandas+time;
- 网站解析方式:Xpath;
[](()相关代码
import requests
import pandas as pd
from lxml import etree
import time
import warnings
warnings.filterwarnings("ignore")
job_name = dom.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@title')
2、公司名称
company_name = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t2"]/a[@target="_blank"]/@title')
3、工作地点
address = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t3"]/text()')
4、工资
salary_mid = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t4"]')
salary = [i.text for i in salary_mid]
5、发布日期
release_time = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t5"]/text()')
6、获取二级网址url
deep_url = dom.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@href')
RandomAll = []
JobDescribe = []
CompanyType = []
CompanySize = []
Industry = []
for i in range(len(deep_url)):
web_test = requests.get(deep_url[i], headers=headers)
web_test.encoding = "gbk"
dom_test = etree.HTML(web_test.text)
7、爬取经验、学历信息,先合在一个字段里面,以后再做数据清洗。命名为random_all
random_all = dom_test.xpath('//div[@class="tHeader tHjob"]//div[@class="cn"]/p[@class="msg ltype"]/text()')
8、岗位描述性息
job_describe = dom_test.xpath('//div[@class="tBorderTop_box"]//div[@class="bmsg job_msg inbox"]/p/text()')
9、公司类型
company_type = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[1]/@title')
10、公司规模(人数)
company_size = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[2]/@title')
11、所属行业(公司)
industry = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[3]/@title')