
Using Python's re, bs4, and XPath modules

Source: collected from the internet. Published by 自由互联 on 2022-06-18.


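Regular expressions

.      matches any character except a newline
\w     matches a letter, digit, or underscore
\d     matches a digit
\s     matches any whitespace character
a|b    matches a or b
^      matches the start of the string
$      matches the end of the string
()     groups the enclosed pattern into a single unit
[...]  matches any one character listed inside the brackets
[^...] matches any one character not listed inside the brackets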
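
To make the table above concrete, here is a small sketch that runs each metacharacter through re.findall(). The sample strings are made up for illustration and are not from the original article.

```python
import re

text = "id_7 costs 42 yuan"  # hypothetical sample string

words = re.findall(r"\w+", text)         # runs of letters, digits, underscores
digits = re.findall(r"\d", text)         # single digits
spaces = re.findall(r"\s", text)         # whitespace characters
first = re.findall(r"^\w+", text)        # word at the start of the string
last = re.findall(r"\w+$", text)         # word at the end of the string
either = re.findall(r"cat|dog", "cat, dog, bird")   # alternation
vowels = re.findall(r"[aeiou]", "regex")            # character set
non_vowels = re.findall(r"[^aeiou]", "regex")       # negated character set
no_newline = re.findall(r".", "a\nb")               # . skips the newline

print(words)       # ['id_7', 'costs', '42', 'yuan']
print(digits)      # ['7', '4', '2']
print(non_vowels)  # ['r', 'g', 'x']
```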

--------
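*      the preceding item occurs zero or more times
+      the preceding item occurs at least once
?      the preceding item occurs at most once
{n,m}  the preceding item occurs between n and m times
{n,}   the preceding item occurs at least n times
{n}    the preceding item occurs exactly n times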
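
The quantifiers can be compared side by side by testing them against the same strings. This is only a sketch: the helper name matching and the sample list are assumptions made for illustration, and re.fullmatch() is used so that the whole string has to fit the pattern.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc"]  # hypothetical test strings

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ab*c")          # zero or more b
plus = matching(r"ab+c")          # at least one b
optional = matching(r"ab?c")      # at most one b
ranged = matching(r"ab{2,3}c")    # two or three b
at_least = matching(r"ab{2,}c")   # two or more b
exact = matching(r"ab{2}c")       # exactly two b

print(star)      # ['ac', 'abc', 'abbc', 'abbbc']
print(plus)      # ['abc', 'abbc', 'abbbc']
print(optional)  # ['ac', 'abc']
print(exact)     # ['abbc']
```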

-----
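Greedy and lazy matching

.*   greedy: matches as much as possible
.*?  lazy: matches as little as possible (the form used most often with the re module)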
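
The difference is easiest to see on markup with repeated tags. A minimal sketch, using a made-up HTML fragment:

```python
import re

html = "<b>one</b><b>two</b>"  # hypothetical fragment

# Greedy: .* runs to the last </b>, so both tags end up in one match.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: .*? stops at the first </b>, giving one match per tag.
lazy = re.findall(r"<b>.*?</b>", html)

# With a capture group, findall() returns only the captured text.
inner = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['<b>one</b><b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']
print(inner)   # ['one', 'two']
```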

The re module

findall()

Finds every substring that matches the pattern and returns them as a list.

import re

message = "我的电话号码是10086，座机86001"
list1 = re.findall(r"\d+", message)
print(list1)

['10086', '86001']

finditer()

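Finds every match in the string and returns an iterator of Match objects; call group() on each one to get the matched text.

import re

message = "我的电话号码是10086，座机86001"
a = re.finditer(r"\d+", message)

for i in a:
    print(i)
    print(i.group())  # group() returns the matched text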

<re.Match object; span=(7, 12), match='10086'>
10086
<re.Match object; span=(15, 20), match='86001'>
86001

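search()

Scans the string and returns a Match object (not an iterator) for the first match only; it does not continue through the rest of the text. Use group() to get the matched text.

Returns None if nothing matches.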

import re

message = "我的电话号码是10086，座机86001"

b = re.search(r"\d+", message)
print(b.group())

10086
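
Because search() returns None when nothing matches, calling group() directly on the result can raise AttributeError. One way to guard against that is sketched below; the function name first_number is hypothetical.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r"\d+", text)
    return m.group() if m else None

print(first_number("我的电话号码是10086，座机86001"))  # 10086
print(first_number("no digits here"))                   # None
```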

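match()

Matches only at the beginning of the string and returns a Match object; use group() to get the matched text.

import re

message = "123我的电话号码是10086，座机86001"

# If the string does not start with a digit, match() returns None and
# c.group() then raises AttributeError. Equivalent to searching for ^\d+.
c = re.match(r"\d+", message)
print(c.group())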

123
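
The sketch below contrasts match() with search() and fullmatch() on a string that does not start with a digit. The sample string is shortened from the one used above; the variable names are illustrative.

```python
import re

text = "我的电话号码是10086"

at_start = re.match(r"\d+", text)      # None: the string starts with a non-digit
anywhere = re.search(r"\d+", text)     # Match object for '10086'
anchored = re.search(r"^\d+", text)    # None: same behavior as match()
whole = re.fullmatch(r"\d+", "10086")  # Match object: the entire string is digits

if at_start is None:
    print("no digits at the start")
print(anywhere.group())  # 10086
```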

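Precompiling a regular expression

re.compile() builds the pattern once so that it can be reused without writing the regex again.

import re

message = "123我的电话号码是10086，座机86001"

obj = re.compile(r"\d+")

a = obj.finditer(message)

for i in a:
    print(i.group())

123
10086
86001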
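
A compiled pattern exposes the same methods as the module-level functions (findall, finditer, search, match), which is convenient when one regex is applied to many strings. A small sketch with made-up input lines:

```python
import re

DIGITS = re.compile(r"\d+")  # compiled once, reused below

lines = ["order 15", "no numbers here", "items 3 and 27"]  # hypothetical input

found = [DIGITS.findall(line) for line in lines]
has_number = [DIGITS.search(line) is not None for line in lines]

print(found)       # [['15'], [], ['3', '27']]
print(has_number)  # [True, False, True]
```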

Practice

import re

s = """
<dir class="zhang"><span id="1">张三</dir>
<dir class="li"><span id="2">李四</dir>
<dir class="wang"><span id="3">王五</dir>
<dir class="hui"><span id="4">呵呵</dir>

"""

obj = re.compile(r'<dir class=".*?"><span id=".*?">.*?</dir>', re.S)  # re.S lets . match newlines too

a = obj.finditer(s)
for i in a:
    print(i.group())

# (?P<name>...) gives a group a name, so each part can be read back with group("name")
obj1 = re.compile(r'<dir class="(?P<hh>.*?)"><span id="(?P<ee>.*?)">(?P<xx>.*?)</dir>', re.S)

b = obj1.finditer(s)

for i in b:
    print(i.group("hh"))
    print(i.group("ee"))
    print(i.group("xx"))

-----------
<dir class="zhang"><span id="1">张三</dir>
<dir class="li"><span id="2">李四</dir>
<dir class="wang"><span id="3">王五</dir>
<dir class="hui"><span id="4">呵呵</dir>
zhang
1
张三
li
2
李四
wang
3
王五
hui
4
呵呵
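
When every named group is needed, Match.groupdict() returns them all at once as a dictionary. The sketch below reuses the pattern from the practice example on a shortened two-line sample.

```python
import re

s = (
    '<dir class="zhang"><span id="1">张三</dir>\n'
    '<dir class="li"><span id="2">李四</dir>\n'
)

pattern = re.compile(
    r'<dir class="(?P<hh>.*?)"><span id="(?P<ee>.*?)">(?P<xx>.*?)</dir>', re.S
)

# One dictionary per match, keyed by group name.
rows = [m.groupdict() for m in pattern.finditer(s)]

for row in rows:
    print(row)
# {'hh': 'zhang', 'ee': '1', 'xx': '张三'}
# {'hh': 'li', 'ee': '2', 'xx': '李四'}
```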

bs4

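import csv

import requests
from bs4 import BeautifulSoup

url = 'http://www.xinfadi.com.cn/index.html'
res = requests.get(url).text
print(res)

page = BeautifulSoup(res, "html.parser")  # use Python's built-in HTML parser

# Extracting data from the BeautifulSoup object:
# find("tag", attribute="value")      returns the first match only
# find_all("tag", attribute="value")  searches the whole document and returns a list

# table = page.find("table", class_="hg_table")
table = page.find("table", attrs={"class": "hg_table"})  # same meaning as the line above
trs = table.find_all("tr")[1:]  # rows of that table, skipping the header row

# Keep the file open until every row is written; newline="" avoids blank lines on Windows
with open("菜价.csv", "w", newline="", encoding="utf-8") as f:
    csvwrite = csv.writer(f)
    csvwrite.writerow(["name", "low"])  # header row
    for i in trs:
        tds = i.find_all("td")  # cells of the current row
        name = tds[0].text      # .text is the text enclosed by the tag
        low = tds[1].text
        csvwrite.writerow([name, low])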
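
The same find() / find_all() calls can be tried without a network request by parsing an inline HTML string. The table below is invented for illustration (only the class name hg_table comes from the example above); it is a sketch of the technique, not the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded page.
html = """
<table class="hg_table">
  <tr><th>name</th><th>low</th></tr>
  <tr><td>cabbage</td><td>1.2</td></tr>
  <tr><td>potato</td><td>0.8</td></tr>
</table>
"""

page = BeautifulSoup(html, "html.parser")
table = page.find("table", class_="hg_table")  # first matching table

rows = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    tds = tr.find_all("td")
    rows.append([tds[0].text, tds[1].text])

print(rows)  # [['cabbage', '1.2'], ['potato', '0.8']]
```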

xpath

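XPath is a language for locating content in XML documents.

HTML is not strictly a subset of XML, but it has the same kind of tree structure, so lxml can parse HTML and query it with XPath as well.

pip install lxml

from lxml import etree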

xml = """
<book>
<id>1</id>
<name></name>
<price></price>
<nick></nick>
<author>
<nick>1</nick>
<nick>1</nick>
<nick>1</nick>
<nick>1</nick>

<div id="1">
<a href="dapao">大炮</a>
<nick>2</nick>
<div>
<nick>3</nick>
<div>
<nick>4</nick>
</div>
</div>
</div>
<div id="2">
<a href="feiji">飞机</a>
<nick>22</nick>
<div>
<nick>3</nick>
<div>
<nick>4</nick>
</div>
</div>
</div>

</author>

</book>
"""

mes = etree.XML(xml)  # etree.HTML() parses an HTML string; etree.parse() reads from a file
res1 = mes.xpath("/book")  # / separates levels; the leading / is the root node
print(res1)
res2 = mes.xpath("/book/name")
print(res2)
res3 = mes.xpath("/book/name/text()")  # text() returns the text inside <name> (empty here, so the result is [])
print(res3)
res4 = mes.xpath("/book/author//nick/text()")  # // matches nick at any depth below author
print(res4)
res5 = mes.xpath("/book/author/*/nick/text()")  # * is a wildcard for any single element
print(res5)
res6 = mes.xpath("/book/author/div[@id='1']/a/text()")  # filter by attribute value
print(res6)
res7 = mes.xpath("/book/author/div/@id")  # select the attribute values themselves
print(res7)
print(res7)

----------
[<Element book at 0x27f205a33c0>]
[<Element name at 0x27f20616100>]
[]
['1', '1', '1', '1', '2', '3', '4', '22', '3', '4']
['2', '22']
['大炮']
['1', '2']
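
For HTML input, etree.HTML() builds the tree and the same XPath syntax applies. The fragment below is made up for illustration; it also shows a relative query (starting with ./) run from an element that an earlier query returned.

```python
from lxml import etree

# Hypothetical HTML fragment.
html = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'

tree = etree.HTML(html)  # wraps the fragment in <html><body>

hrefs = tree.xpath("//li/a/@href")       # attribute values
texts = tree.xpath("//li/a/text()")      # link texts
second = tree.xpath("//li[2]/a/text()")  # XPath positions start at 1

# Relative query: ./ starts from the current element instead of the root.
per_item = [li.xpath("./a/text()")[0] for li in tree.xpath("//li")]

print(hrefs)     # ['/a', '/b']
print(second)    # ['B']
print(per_item)  # ['A', 'B']
```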
