Python difflib: a text comparison tool worth learning

Source: Internet. Collected by: 自由互联. Published: 2022-09-02

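I. Introduction

The difflib module provides classes and functions for comparing sequences. It can compare files and produce the differences either as plain text or as an HTML comparison page. As part of Python's standard library, it offers powerful text comparison features. This article covers two common uses of difflib: comparing strings and comparing texts.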

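II. Main Content

1. String comparison

1.1 How the similarity is computed

Similarity = 2.0 * M / T, where M is the number of characters that can be matched between the two strings and T is the total number of characters in both strings.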
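As a quick check of the formula, the sketch below recomputes the similarity by hand for the strings "abcd" and "bcde" and compares it with `ratio()`. It assumes that M can be obtained by summing the sizes of the blocks returned by `get_matching_blocks()`; the variable names are illustrative only.

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
matcher = SequenceMatcher(None, a, b)

# M: total size of all matching blocks (the final dummy block has size 0)
matched = sum(block.size for block in matcher.get_matching_blocks())
# T: total number of characters in both strings
total = len(a) + len(b)

manual_ratio = 2.0 * matched / total
print(matched, total, manual_ratio, matcher.ratio())
```

Here the only matching block is "bcd", so M = 3, T = 8, and both values come out as 0.75.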

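1.2 Parameters

SequenceMatcher(lambda x: x in "characters to ignore", "string 1", "string 2")

The first argument (isjunk) is either None or a one-argument function that returns True for elements to be treated as junk. The second and third arguments are the two sequences to compare.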

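1.3 Examples

Example 1

    >>> from difflib import SequenceMatcher
    >>> s = SequenceMatcher(None, "abcd", "bcde")
    >>> s.ratio()
    0.75
    >>> s.quick_ratio()
    0.75

To ignore certain characters while matching, pass a junk function as the first argument.

Example 2

    >>> # Treat "|" and "\" as junk while searching for matching blocks
    >>> s = SequenceMatcher(lambda x: x in "|\\", "abcd|", "dc\\fa")
    >>> s.ratio()
    0.2
    >>> s.quick_ratio()
    0.6

The backslashes have to be escaped ("|\\" and "dc\\fa"); otherwise the first literal is a syntax error and "\f" is read as a form feed. Junk characters are only skipped while searching for matching blocks. They are not removed from the strings and still count toward T, and quick_ratio() does not take the junk function into account at all.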
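Note that the junk function only influences which elements the matcher considers when searching for matching blocks. If the goal is to drop the characters entirely before comparing, one option is to strip them first. A minimal sketch, where `strip_chars` is a hypothetical helper and not part of difflib:

```python
from difflib import SequenceMatcher


def strip_chars(text, unwanted):
    """Return text with every character in `unwanted` removed (hypothetical helper)."""
    return text.translate(str.maketrans("", "", unwanted))


first = strip_chars("abcd|", "|\\")     # "abcd"
second = strip_chars("dc\\fa", "|\\")   # "dcfa"

matcher = SequenceMatcher(None, first, second)
print(first, second, matcher.ratio(), matcher.quick_ratio())
```

With the characters stripped, T shrinks to 8. `ratio()` still respects character order, while `quick_ratio()` only counts shared characters, so the two values differ for these inputs.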

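1.4 Performance of ratio() and quick_ratio()

Measured with repeated calls on a single SequenceMatcher object:

| Function | Speed | Memory overhead |
| --- | --- | --- |
| ratio() | Fast | High |
| quick_ratio() | Slow | Low |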

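Method: run each similarity calculation 1,000,000 times on the same SequenceMatcher object and compare the elapsed time and memory usage. Because ratio() reuses the matching blocks cached by its first call, this setup favors ratio(); on a fresh object, quick_ratio() is intended as the cheaper upper bound.

    import difflib
    import os
    import time

    import psutil  # third-party library


    def show_info():
        """Return the memory (USS) used by the current process, in MB."""
        pid = os.getpid()             # PID of the current process
        p = psutil.Process(pid)       # look up the process by its PID
        info = p.memory_full_info()
        return info.uss / 1024 / 1024


    def func(ratio_func):
        start_time = time.time()      # record the start time
        initial_memory = show_info()  # record the memory before the loop
        if ratio_func == "ratio":
            ratio = [similarity.ratio() for i in range(1000000)]
        else:
            ratio = [similarity.quick_ratio() for i in range(1000000)]
        final_memory = show_info()    # record the memory after the loop
        end_time = time.time()        # record the end time
        print(f"Elapsed: {end_time - start_time}s")
        print(f"Memory used: {final_memory - initial_memory:.2f}MB")


    if __name__ == '__main__':
        similarity = difflib.SequenceMatcher(None, 'string to compare 1', 'string to compare 2')
        func("ratio")
        func("quick_ratio")

Output

    >>> func("ratio")
    Elapsed: 0.9709699153900146s
    Memory used: 36.58MB
    >>> func("quick_ratio")
    Elapsed: 2.730135917663574s
    Memory used: 32.68MB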
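The standard library documents `real_quick_ratio()` and `quick_ratio()` as upper bounds on `ratio()`, so one way to use them is as cheap pre-filters before the full calculation. A minimal sketch, where `is_similar` and the 0.6 threshold are illustrative choices and not part of the benchmark above:

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.6):
    """Return True if ratio() reaches the threshold, checking cheap upper bounds first."""
    matcher = SequenceMatcher(None, a, b)
    # Each check is only evaluated if the previous, cheaper one passed.
    return (matcher.real_quick_ratio() >= threshold
            and matcher.quick_ratio() >= threshold
            and matcher.ratio() >= threshold)


print(is_similar("abcd", "bcde"))
print(is_similar("abcd", "wxyz"))
```

For pairs that are clearly different, the expensive `ratio()` call is skipped because one of the upper bounds already falls below the threshold.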

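2. Text comparison

2.1 Symbols and their meanings

| Symbol | Meaning |
| --- | --- |
| - | Line exists only in sequence 1 |
| + | Line exists only in sequence 2 |
| ' ' (space) | Line exists in both sequences |
| ? | Guide line pointing at the positions of differences within a line |
| ^ | Marks a character that differs |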
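The prefixes in the table can be used to sort delta lines into groups. The following is a minimal sketch using `Differ`; the sample lists and variable names are made up for illustration.

```python
from difflib import Differ

old = ["bacon\n", "eggs\n", "ham\n"]
new = ["bacon\n", "egg\n", "ham\n", "spam\n"]

delta = list(Differ().compare(old, new))

# Every delta line starts with a two-character prefix: "- ", "+ ", "  " or "? "
removed = [line[2:] for line in delta if line.startswith("- ")]
added = [line[2:] for line in delta if line.startswith("+ ")]
unchanged = [line[2:] for line in delta if line.startswith("  ")]
hints = [line for line in delta if line.startswith("? ")]

print(removed, added, unchanged, len(hints))
```

Lines starting with "? " are not part of either input; they exist only to guide the eye, so they are usually dropped when processing a delta programmatically.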

2.2 Differ

Displays the result as plain text.

Sample code

    import difflib

    text1 = '''Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.'''.splitlines(keepends=True)

    text2 = '''Beautifu is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.'''.splitlines(keepends=True)

    # Show the differences between the two texts as plain text
    d = difflib.Differ()
    result = list(d.compare(text1, text2))
    result = "".join(result)  # the lines keep their own line endings
    print(result)

Result

    - Beautiful is better than ugly.
    ?         -
    + Beautifu is better than ugly.
      Explicit is better than implicit.
      Simple is better than complex.
      Complex is better than complicated.

2.3 HtmlDiff

Displays the result as HTML.

Sample code

    import difflib

    text1 = '''Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.'''.splitlines(keepends=True)

    text2 = '''Beautifu is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.'''.splitlines(keepends=True)

    # Write the differences to an HTML file; open it in a browser to view it
    d = difflib.HtmlDiff()
    with open("passwd.html", 'w', encoding="utf-8") as f:
        f.write(d.make_file(text1, text2))

Result

![image.png](http://img.558idc.com/uploadfile/allimg/python/1660911791958174.png?x-oss-process=image/watermark,size_14,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=)
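`HtmlDiff` also accepts a few options that are handy for longer files. The sketch below is an illustration with made-up input lines and column titles; it uses `wrapcolumn` to wrap long lines and the `fromdesc`, `todesc`, `context` and `numlines` parameters of `make_file()` to label the columns and show only the changed regions.

```python
import difflib

before = ["Beautiful is better than ugly.\n", "Simple is better than complex.\n"]
after = ["Beautifu is better than ugly.\n", "Simple is better than complex.\n"]

differ = difflib.HtmlDiff(wrapcolumn=60)  # wrap lines longer than 60 columns
html = differ.make_file(
    before,
    after,
    fromdesc="before.txt",  # title of the left column (illustrative name)
    todesc="after.txt",     # title of the right column (illustrative name)
    context=True,           # show only changed lines plus surrounding context
    numlines=1,             # number of context lines around each change
)
print(len(html))
```

The returned string is a complete HTML document and can be written to a file in the same way as in the sample code above.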

2.4 context_diff

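Returns a generator that yields the lines of a delta in context diff format. Changed lines are marked with "!", added lines with "+" and deleted lines with "-".

Sample code

    import sys
    from difflib import context_diff

    s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
    s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
    for line in context_diff(s1, s2, fromfile='before.py', tofile='after.py'):
        sys.stdout.write(line)

Comparing the two lists of strings shows that only the fourth element is the same. Elements are aligned as sequences rather than matched by index, so the comparison also works when the lists differ in length, for example if s1 were the three-element list ['eggs\n', 'ham\n', 'guido\n'].

Result

```text
*** before.py
--- after.py
***************
*** 1,4 ****
! bacon
! eggs
! ham
  guido
--- 1,4 ----
! python
! eggy
! hamster
  guido
```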
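To illustrate that elements are aligned as sequences and not by index, here is a small sketch that runs the three-element variant of s1 mentioned above against the same s2. The variable names are illustrative.

```python
from difflib import context_diff

s1 = ['eggs\n', 'ham\n', 'guido\n']                   # three elements
s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']   # four elements

lines = list(context_diff(s1, s2, fromfile='before.py', tofile='after.py'))
changed = [line for line in lines if line.startswith('! ')]
common = [line for line in lines if line.startswith('  ')]

print(''.join(lines))
```

Although 'guido' sits at index 2 in s1 and index 3 in s2, it is still reported as a shared context line in both halves of the hunk, and the remaining five lines are reported as changed.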

2.5 get_close_matches

Returns a list of the best "good enough" matches, most similar first.

Sample code

    from difflib import get_close_matches

    d = get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
    print(d)

Result

```text
['apple', 'ape']
```
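`get_close_matches` takes two optional parameters: `n` (maximum number of matches to return, default 3) and `cutoff` (minimum similarity between 0 and 1, default 0.6). A short sketch reusing the same word list, with illustrative variable names:

```python
from difflib import get_close_matches

words = ['ape', 'apple', 'peach', 'puppy']

# Keep only the single best match
best_only = get_close_matches('appel', words, n=1)
# Require a much higher similarity than the default 0.6
strict = get_close_matches('appel', words, cutoff=0.9)

print(best_only, strict)
```

Raising `cutoff` makes the function stricter; when no candidate reaches the cutoff, an empty list is returned.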

2.6 ndiff

Returns a generator that yields a Differ-style delta as text lines.

Sample code

    from difflib import ndiff

    diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
                 'ore\ntree\nemu\n'.splitlines(keepends=True))
    print(''.join(diff))

Result

```text
- one
?  ^
+ ore
?  ^
- two
- three
?  -
+ tree
+ emu
```

2.7 restore

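Given a delta produced by Differ.compare() or ndiff(), returns one of the two sequences that generated it (1 for the first sequence, 2 for the second).

Sample code

    from difflib import ndiff, restore

    diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
                 'ore\ntree\nemu\n'.splitlines(keepends=True))
    diff = list(diff)  # materialize the generated delta into a list
    print(''.join(restore(diff, 1)))

Result

```text
one
two
three
```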
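Because a delta contains every line of both inputs, either input can be recovered from it. The following sketch checks the round trip for both sides; the variable names are illustrative.

```python
from difflib import ndiff, restore

first = 'one\ntwo\nthree\n'.splitlines(keepends=True)
second = 'ore\ntree\nemu\n'.splitlines(keepends=True)

delta = list(ndiff(first, second))  # a list, so it can be iterated twice

recovered_first = list(restore(delta, 1))   # lines from the first sequence
recovered_second = list(restore(delta, 2))  # lines from the second sequence

print(recovered_first == first, recovered_second == second)
```

Converting the delta to a list matters here: `ndiff` returns a generator, which would be exhausted after the first call to `restore`.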
