#yyds干货盘点# Why does a Python web scraper return so much useless content? Is the XPath rule wrong?

Source: collected from the internet. Published by 自由互联 on 2022-09-02.

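Hi everyone, I'm 皮皮.

1. Introduction

A few days ago, 【海南菜同学】 asked a Python web-scraping question in the Python Diamond discussion group (Python钻石交流群). The screenshot is below:

[Image: screenshot of the question]

At first glance the code looks fine, but the result is simply wrong.

[Image: screenshot]

The code is as follows: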

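import requests
from lxml import etree

# Assumed placeholder: the original snippet does not show how headers was defined
headers = {"User-Agent": "Mozilla/5.0"}

url = "http://zw.hainan.gov.cn/wssc/ec/jlyhnkj.html"
resp = requests.get(url, headers=headers)
text = resp.text
parse = etree.HTML(text)
# Positional rule: text nodes of the fourth <div> inside every product <li>
price = parse.xpath("//div[@class='productlist']/ul/li/div[4]/text()")
print(type(price), len(price))
# print(price)
for i in price:
    print(i.strip())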

What he wanted at this point was cleaner output, with newlines and similar whitespace removed.
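To see where the stray newlines come from, here is a minimal sketch run against a made-up HTML fragment rather than the real page. The markup in `SAMPLE` is an assumption for illustration only. The point it shows is that `text()` returns every text node sitting directly under the matched element, including the whitespace that surrounds nested tags.

```python
from lxml import etree

# Hypothetical markup, invented for illustration; the real page may differ.
SAMPLE = """
<div class="productlist"><ul>
  <li>
    <div>img</div><div>name</div><div>shop</div>
    <div class="product_price">
      <span>label</span>
      ¥12.00
    </div>
  </li>
</ul></div>
"""

tree = etree.HTML(SAMPLE)
nodes = tree.xpath("//div[@class='productlist']/ul/li/div[4]/text()")

# repr() makes the newlines and indentation inside each text node visible
for n in nodes:
    print(repr(n))

# Trim every node and drop the ones that are empty after trimming
cleaned = [n.strip() for n in nodes if n.strip()]
print(cleaned)
```

Printing with `repr()` is a quick way to check whether a "blank" result is really an empty string or a run of whitespace.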

2. Implementation

Here 【dcpeng】 offered the following code:

url = "http://zw.hainan.gov.cn/wssc/ec/jlyhnkj.html"
resp = requests.get(url, headers=headers)
text = resp.text
parse = etree.HTML(text)
price = parse.xpath("//div[@class='productlist']/ul/li/div[4]/text()")
print(type(price), len(price))
# print(price)
for i in price:
    # strip() trims both ends; replace() also removes newlines inside the string
    print(i.strip().replace('\n', ''))

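Running it made little difference, however. Some of the extracted text nodes are empty (whitespace only), so blank lines still show up in the output. What he really wanted was to drop those empty entries altogether. 【甯同学】 then offered some code, shown here:

[Image: code screenshot from 【甯同学】]

Something still looked off, though: there are 12 products, so why did only 8 prices come out? A whole column was missing.

[Image: screenshot]

It later turned out that the problem was not limited to this step. The root cause was that the XPath extraction rule itself was wrong. 【甯同学】 provided code for this, shown in the image below:

[Image: code screenshot from 【甯同学】]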
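One plausible way a positional rule such as `div[4]` can lose items is sketched below. The fragment in `SAMPLE` is hypothetical and is not the real page: it simply assumes that not every `<li>` has the same number of child `<div>` elements. Under that assumption, selecting by position picks up the wrong node or nothing at all, while selecting by the `product_price` class finds every price.

```python
from lxml import etree

# Hypothetical markup: the three items do not share the same layout.
SAMPLE = """
<div class="productlist"><ul>
  <li><div>img</div><div>name</div><div>shop</div><div class="product_price">¥10.00</div></li>
  <li><div>img</div><div>badge</div><div>name</div><div>shop</div><div class="product_price">¥20.00</div></li>
  <li><div>img</div><div>name</div><div class="product_price">¥30.00</div></li>
</ul></div>
"""

tree = etree.HTML(SAMPLE)


def texts(expr):
    """Run an XPath expression and return the non-blank, trimmed text nodes."""
    return [t.strip() for t in tree.xpath(expr) if t.strip()]


# Fourth <div> of each item: correct for item 1, wrong for item 2, missing for item 3
by_position = texts("//div[@class='productlist']/ul/li/div[4]/text()")
# Any <div> carrying the price class, wherever it sits inside the item
by_class = texts("//div[@class='product_price']/text()")

print(by_position)
print(by_class)
```

Matching on a class or another stable attribute tends to survive layout differences between items better than counting siblings.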

After that, the fan solved the problem without further trouble. The code is as follows:

import requests
from lxml import etree

url = "http://zw.hainan.gov.cn/wssc/ec/jlyhnkj.html"
resp = requests.get(url)
text = resp.text
parse = etree.HTML(text)
# Select the price elements by class instead of by position
price = parse.xpath('//div[@class="product_price"]/text()')
print(type(price), len(price))
# print(price)
# filter() drops whitespace-only strings, map() trims the ones that remain
for i in map(str.strip, filter(str.strip, price)):
    print(i)

[Image: screenshot]

The map call above may be a little hard to follow, so it is broken down here:

[Image: explanation of the map call]
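As a rough breakdown of `map(str.strip, filter(str.strip, price))`, the sketch below runs the two steps separately on a small hand-written list. The sample strings are made up; only the `map`/`filter` pattern comes from the code above.

```python
# Made-up sample data standing in for the XPath result
price = ["\n  ", "¥12.00\n", "   ", "\n¥8.50 "]

# Step 1: filter() calls str.strip on each item and keeps the item
# only if the stripped result is non-empty (truthy).
non_blank = list(filter(str.strip, price))
print(non_blank)

# Step 2: map() calls str.strip on each remaining item and yields the result.
cleaned = list(map(str.strip, non_blank))
print(cleaned)

# The same thing written as a list comprehension
same = [s.strip() for s in price if s.strip()]
print(same == cleaned)
```

Note that `filter` keeps the original, untrimmed strings; it is the outer `map` that actually trims them.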

There are many other ways to solve this problem, and a few more are given here. 【dcpeng】 offered the following code:

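# Assumes the imports, url, and headers from the earlier snippets
resp = requests.get(url, headers=headers)
text = resp.text
parse = etree.HTML(text)
# First select each product <li>, then query the price relative to it
price = parse.xpath("//div[@class='productlist']/ul/li")
print(type(price), len(price))
for i in price:
    # [1] takes the second text node of the price <div>
    print(i.xpath('./div[@class="product_price"]/text()')[1].strip())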

The result is shown in the image below:

[Image: screenshot of the result]
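Indexing with `[1]` raises `IndexError` whenever a price element has fewer than two text nodes. The sketch below is one possible variation, not the code from the group: it keeps the per-item relative XPath but returns `None` when no price text is found, so each product still yields exactly one row. The markup, the `product_name` class, and the helper `first_price` are all hypothetical.

```python
from lxml import etree

# Hypothetical markup; product_name is an invented class name.
SAMPLE = """
<div class="productlist"><ul>
  <li><div class="product_name">A</div><div class="product_price"><span>Price</span> ¥10.00 </div></li>
  <li><div class="product_name">B</div></li>
  <li><div class="product_name">C</div><div class="product_price">¥30.00</div></li>
</ul></div>
"""


def first_price(li):
    """Return the first non-blank price text under this item, or None."""
    parts = [t.strip() for t in li.xpath('./div[@class="product_price"]/text()')]
    parts = [p for p in parts if p]
    return parts[0] if parts else None


tree = etree.HTML(SAMPLE)
rows = []
for li in tree.xpath("//div[@class='productlist']/ul/li"):
    name = li.xpath('./div[@class="product_name"]/text()')[0].strip()
    rows.append((name, first_price(li)))

print(rows)
```

Keeping one row per `<li>` makes it easy to check the count against the number of products on the page.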

【dcpeng】 also provided another approach, as follows:

# Assumes the imports, url, and headers from the earlier snippets
resp = requests.get(url, headers=headers)
text = resp.text
parse = etree.HTML(text)
price = parse.xpath("//div[@class='product_price']/text()")
print(type(price), len(price))
for i in price:
    # Skip whitespace-only text nodes
    if i.strip():
        print(i.replace('\n', '').strip())

The result is shown in the image below:

[Image: screenshot of the result]

Later, 【瑜亮老师】 also provided a version, shown below:

# Assumes requests and lxml.etree are imported as in the earlier snippets
url = "http://zw.hainan.gov.cn/wssc/ec/jlyhnkj.html"
resp = requests.get(url)
text = resp.text
parse = etree.HTML(text)
price = parse.xpath("//div[@class='productlist']/ul/li/div[4]/text()")
# Use a list comprehension directly to drop the redundant entries
price = [i.strip() for i in price if i.strip()]
print(price)
# To make totals easier, strip the ¥ symbol and convert to numbers
# price = [int(float(i.replace('¥', '').replace(',', ''))) for i in price]
# Or remove the extra symbols with re.sub and then convert; pick either of the two
# Requires: import re
# price = [int(float(re.sub(r'[¥,]', '', i))) for i in price]
print(price)

The result is shown in the image:

[Image: screenshot of the result]
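The commented-out conversion above uses `int(float(...))`, which discards any fractional part of a price. If the cents matter, one alternative is `decimal.Decimal`, sketched below on made-up strings. The helper `to_amount` and the sample values are assumptions for illustration, and the pattern also covers the full-width ￥ in case the page uses it.

```python
import re
from decimal import Decimal


def to_amount(text):
    """Strip currency symbols, thousands separators, and whitespace, then parse."""
    return Decimal(re.sub(r"[¥￥,\s]", "", text))


# Made-up sample values standing in for the cleaned XPath result
price = ["¥1,299.50", "￥88.00", " ¥7 "]
amounts = [to_amount(p) for p in price]
total = sum(amounts)

print(amounts)
print(total)
```

`Decimal` parses the digits exactly as written, which avoids the rounding surprises of binary floats when prices are summed.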

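There are plenty of ways to do it!

3. Summary

Hi everyone, I'm 皮皮. This article reviewed a Python web-scraping problem, walked through the analysis and the code for it, and helped the fan resolve the issue.

Finally, thanks to 【海南菜同学】 for the question; to 【dcpeng】, 【甯同学】, and 【瑜亮老师】 for their ideas and code explanations; and to 【Engineer】, 【此类生物】, 【皮皮】, 【心田有垢生荒草】, and others for joining the discussion.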
