
Python difflib: a text comparison tool worth learning

Source: Internet · Collected by: 自由互联 · Published: 2022-09-02

@[toc]

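I. Introduction

The difflib module provides classes and functions for comparing sequences. It can compare files and produce the differences either as plain text or as an HTML comparison page. It is part of the Python standard library and offers capable text comparison features. This article introduces two common uses of difflib: string comparison and text comparison.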

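II. Main text

1. String comparison

1.1 How the similarity is computed

similarity = 2.0 * M / T, where M is the number of characters matched between the two strings and T is the total number of characters in both strings.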
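To make the formula concrete, here is a minimal sketch that recomputes the similarity by hand from the matching blocks reported by `SequenceMatcher`. The variable names (`matcher`, `matched`, `total`, `manual_ratio`) are illustrative choices, not taken from the article.

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
matcher = SequenceMatcher(None, a, b)

# M: total size of all matching blocks (the last block is a zero-length sentinel)
matched = sum(block.size for block in matcher.get_matching_blocks())
# T: combined length of both sequences
total = len(a) + len(b)

manual_ratio = 2.0 * matched / total
print(matched, total, manual_ratio)       # 3 8 0.75
print(manual_ratio == matcher.ratio())    # True
```

The only common block is "bcd", so M = 3 and T = 8.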

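1.2 Parameters

SequenceMatcher(isjunk, a, b): isjunk is either None or a one-argument function that returns True for elements to be treated as junk, and a and b are the two sequences to compare. For example:

    SequenceMatcher(lambda x: x in "characters to ignore", "string 1", "string 2")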

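1.3 An example

  • Example 1

    >>> from difflib import SequenceMatcher
    >>> s = SequenceMatcher(None, "abcd", "bcde")
    >>> s.ratio()
    0.75
    >>> s.quick_ratio()
    0.75

    To have certain characters treated as junk during matching, pass a function as the first argument. Junk characters are not stripped from the strings: they are not used as anchor points for matching blocks, and they still count toward T.

  • Example 2

    >>> # Treat "|" and "\" as junk while matching
    >>> s = SequenceMatcher(lambda x: x in "|\\", "abcd|", "dc\\fa")
    >>> s.ratio()
    0.2
    >>> s.quick_ratio()
    0.6

    Only one character can be aligned in order here, so ratio() is 2*1/10 = 0.2. quick_ratio() ignores order and counts the three shared characters a, c and d, giving 2*3/10 = 0.6.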
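The effect of a junk function is easier to see on the matching blocks than on the ratio. The sketch below follows the blank-as-junk example from the Python documentation; the names `plain` and `no_blanks` are illustrative.

```python
from difflib import SequenceMatcher

a, b = " abcd", "abcd abcd"

# Without a junk function, the leading blank takes part in the match.
plain = SequenceMatcher(None, a, b)
print(plain.find_longest_match(0, len(a), 0, len(b)))
# Match(a=0, b=4, size=5)

# With blanks declared as junk, a match can no longer start on a blank.
no_blanks = SequenceMatcher(lambda ch: ch == " ", a, b)
print(no_blanks.find_longest_match(0, len(a), 0, len(b)))
# Match(a=1, b=0, size=4)
```

In both cases the strings themselves are unchanged. Only the choice of matching block differs.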

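1.4 Performance comparison of the ratio functions

| Function | Speed | Memory use |
| --- | --- | --- |
| ratio() | fast | higher |
| quick_ratio() | slow | lower |

These results come from calling each method repeatedly on one SequenceMatcher object. ratio() caches its matching blocks after the first call, so repeated calls on the same object are cheap. On a fresh matcher, quick_ratio() is intended to be the cheaper upper-bound estimate.

  • Method: run the similarity computation 1,000,000 times and compare the elapsed time and the memory use.

    # Standard-library imports plus the third-party psutil package
    import difflib
    import os
    import time

    import psutil


    def show_info():
        pid = os.getpid()            # PID of the current process
        p = psutil.Process(pid)      # look up the process to read its memory usage
        info = p.memory_full_info()
        memory = info.uss / 1024 / 1024
        return memory


    def func(ratio_func):
        start_time = time.time()         # record the start time
        initial_memory = show_info()     # record the initial memory
        if ratio_func == "ratio":
            ratio = [similarity.ratio() for i in range(1000000)]
        else:
            ratio = [similarity.quick_ratio() for i in range(1000000)]
        final_memory = show_info()       # record the final memory
        end_time = time.time()           # record the end time
        print(f"Elapsed: {end_time - start_time}s")
        print(f"Memory used: {final_memory - initial_memory:.2f}MB")


    if __name__ == '__main__':
        similarity = difflib.SequenceMatcher(None, 'first string to compare', 'second string to compare')
        func("ratio")
        func("quick_ratio")

  • Output

    >>> func("ratio")
    Elapsed: 0.9709699153900146s
    Memory used: 36.58MB
    >>> func("quick_ratio")
    Elapsed: 2.730135917663574s
    Memory used: 32.68MB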
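The benchmark above reuses a single matcher, so `ratio()` benefits from its cached matching blocks. As a rough sketch of the uncached case, the code below builds a new matcher for every call and times it with `timeit`. The helper names (`fresh_ratio`, `fresh_quick_ratio`) and the sample strings are assumptions made for illustration, and the timings depend on the machine and the inputs.

```python
import timeit
from difflib import SequenceMatcher

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown cat jumps over the lazy dog"


def fresh_ratio():
    # A new matcher on every call, so no cached matching blocks are reused
    return SequenceMatcher(None, a, b).ratio()


def fresh_quick_ratio():
    return SequenceMatcher(None, a, b).quick_ratio()


m = SequenceMatcher(None, a, b)
# Documented ordering: each cheaper estimate is an upper bound on ratio()
print(m.real_quick_ratio() >= m.quick_ratio() >= m.ratio())  # True

print("ratio:      ", timeit.timeit(fresh_ratio, number=2000))
print("quick_ratio:", timeit.timeit(fresh_quick_ratio, number=2000))
```

Because `quick_ratio()` and `real_quick_ratio()` are upper bounds, they are typically used to rule out poor candidates cheaply before calling `ratio()`.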

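2. Text comparison

2.1 Symbols and their meaning

| Symbol | Meaning |
| --- | --- |
| `-` | line present only in sequence 1 |
| `+` | line present only in sequence 2 |
| `' '` (space) | line present in both sequences |
| `?` | hint line, present in neither input, pointing at positions in the line above |
| `^` | within a `?` line, marks a character that differs |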

2.2 Differ

Displays the result as text.

  • Example code

    import difflib

    text1 = '''
    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.'''.splitlines(keepends=True)
    text2 = '''
    Beautifu is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.'''.splitlines(keepends=True)

    # Show the differences between the two texts as plain text
    d = difflib.Differ()
    result = list(d.compare(text1, text2))
    result = "".join(result)   # each line already ends with its own newline
    print(result)

  • Result (the leading blank line shared by both texts appears first, as a common line)

      
    - Beautiful is better than ugly.
    ?         -
    + Beautifu is better than ugly.
      Explicit is better than implicit.
      Simple is better than complex.
      Complex is better than complicated.
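Since every line produced by `Differ.compare()` starts with one of the symbols from section 2.1, the delta can be post-processed by inspecting the first character of each line. The following is a minimal sketch. The sample lists `old` and `new` and the names `counts` and `changed` are illustrative.

```python
import difflib
from collections import Counter

old = ["Beautiful is better than ugly.\n", "Simple is better than complex.\n"]
new = ["Beautifu is better than ugly.\n", "Simple is better than complex.\n"]

delta = list(difflib.Differ().compare(old, new))

# The first character of each delta line is its symbol: '-', '+', ' ' or '?'
counts = Counter(line[0] for line in delta)
print(counts)

# Keep only the lines that changed, dropping common lines and '?' hint lines
changed = [line for line in delta if line[0] in "-+"]
print("".join(changed), end="")
```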

2.3 HtmlDiff

Displays the result as HTML.

  • Example code

    import difflib

    text1 = '''
    Beautiful is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.'''.splitlines(keepends=True)
    text2 = '''
    Beautifu is better than ugly.
    Explicit is better than implicit.
    Simple is better than complex.
    Complex is better than complicated.'''.splitlines(keepends=True)

    # Show the differences between the two texts as HTML; open the file in a browser
    d = difflib.HtmlDiff()
    with open("passwd.html", 'w', encoding="utf-8") as f:
        f.write(d.make_file(text1, text2))

  • Result

![image.png](http://img.558idc.com/uploadfile/allimg/python/1660911791958174.png?x-oss-process=image/watermark,size_14,text_QDUxQ1RP5Y2a5a6i,color_FFFFFF,t_30,g_se,x_10,y_10,shadow_20,type_ZmFuZ3poZW5naGVpdGk=)
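`HtmlDiff` also accepts a few options that are useful for longer inputs. The sketch below, with made-up sample lines, labels the two columns via `fromdesc`/`todesc`, limits the output to changed regions with `context=True` and `numlines`, and uses `make_table()` to get only the table for embedding in an existing page.

```python
import difflib

before = ["alpha\n", "beta\n", "gamma\n"]
after = ["alpha\n", "beta!\n", "gamma\n"]

differ = difflib.HtmlDiff(wrapcolumn=60)  # wrap long lines at 60 columns

# make_file() returns a complete HTML document
page = differ.make_file(before, after, fromdesc="before", todesc="after",
                        context=True, numlines=1)

# make_table() returns only the comparison table, without the page wrapper
table = differ.make_table(before, after, fromdesc="before", todesc="after")

print(len(page), len(table))
```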
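2.4 context_diff

Returns a generator of delta lines in context diff format. Changed lines are marked with `!`, added lines with `+` and deleted lines with `-`.

  • Example code

```python
import sys
from difflib import context_diff

s1 = ['bacon\n', 'eggs\n', 'ham\n', 'guido\n']
s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']
for line in context_diff(s1, s2, fromfile='before.py', tofile='after.py'):
    sys.stdout.write(line)
```

Comparing the two lists shows that only the fourth element is the same in both. Elements are aligned by matching their content in order, not by index, so the comparison also works when the lists differ in length, for example when s1 = ['eggs\n', 'ham\n', 'guido\n'] has only three elements.

  • Result

```python
*** before.py
--- after.py
***************
*** 1,4 ****
! bacon
! eggs
! ham
  guido
--- 1,4 ----
! python
! eggy
! hamster
  guido
```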
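The unequal-length case mentioned above can be tried directly. This is a small sketch that reuses the three-element `s1` from the text. Collecting the generator into a list (`delta`, an illustrative name) makes the result easy to inspect.

```python
import sys
from difflib import context_diff

s1 = ['eggs\n', 'ham\n', 'guido\n']
s2 = ['python\n', 'eggy\n', 'hamster\n', 'guido\n']

# Materialize the generator so the delta lines can be inspected and printed
delta = list(context_diff(s1, s2, fromfile='before.py', tofile='after.py'))
sys.stdout.writelines(delta)
```

The shared line `guido` is still recognised as common, although it sits at index 2 in `s1` and index 3 in `s2`.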

2.5 get_close_matches

Returns a list of the best "good enough" matches, most similar first.

  • Example code

```python
from difflib import get_close_matches

d = get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
print(d)
```

  • Result

```python
['apple', 'ape']
```
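`get_close_matches` takes two optional parameters: `n` (maximum number of results, default 3) and `cutoff` (minimum similarity between 0 and 1, default 0.6). A short sketch using the same word list follows. The names `best_only` and `strict` are illustrative.

```python
from difflib import get_close_matches

words = ['ape', 'apple', 'peach', 'puppy']

# n=1 keeps only the single best match
best_only = get_close_matches('appel', words, n=1)
print(best_only)   # ['apple']

# A stricter cutoff rejects candidates whose similarity is below 0.9
strict = get_close_matches('appel', words, cutoff=0.9)
print(strict)      # []
```

'apple' scores 0.8 against 'appel', which passes the default cutoff but not 0.9.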

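2.6 ndiff

Returns the differences as a text-format delta (a generator of lines in Differ style).

  • Example code

```python
from difflib import ndiff

diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
             'ore\ntree\nemu\n'.splitlines(keepends=True))
print(''.join(diff))
```

  • Result

```python
- one
?  ^
+ ore
?  ^
- two
- three
?  -
+ tree
+ emu
```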
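When the `?` hint lines are not wanted, for example when the delta is processed by a program rather than read by a person, they can be filtered out by prefix. A minimal sketch, with `compact` as an illustrative name:

```python
from difflib import ndiff

a = 'one\ntwo\nthree\n'.splitlines(keepends=True)
b = 'ore\ntree\nemu\n'.splitlines(keepends=True)

# Drop the "? " hint lines and keep only '-', '+' and common lines
compact = [line for line in ndiff(a, b) if not line.startswith('? ')]
print(''.join(compact), end='')
```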

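2.7 restore

Given a delta produced by ndiff() or Differ.compare(), returns the lines that came from one of the two compared sequences (1 or 2).

  • Example code

```python
from difflib import ndiff, restore

diff = ndiff('one\ntwo\nthree\n'.splitlines(keepends=True),
             'ore\ntree\nemu\n'.splitlines(keepends=True))
diff = list(diff)  # materialize the generated delta into a list
print(''.join(restore(diff, 1)))
```

  • Result

```python
one
two
three
```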
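Passing 2 instead of 1 recovers the second sequence, so a single delta is enough to rebuild both inputs. The sketch below checks that round trip. The names `first` and `second` are illustrative.

```python
from difflib import ndiff, restore

a = 'one\ntwo\nthree\n'.splitlines(keepends=True)
b = 'ore\ntree\nemu\n'.splitlines(keepends=True)

delta = list(ndiff(a, b))

first = list(restore(delta, 1))    # lines that came from sequence 1
second = list(restore(delta, 2))   # lines that came from sequence 2

print(first == a, second == b)     # True True
print(''.join(second), end='')
```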
