Python Unicode in Practice

1. A Brief History of Character Encodings


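As computers spread, other countries needed to represent their own languages as well. Some countries used the values 128 to 255 for additional letters and symbols. For Chinese, however, those remaining 128 positions are nowhere near enough to hold the Chinese characters. Since one byte cannot represent Chinese, two bytes are used instead. To stay compatible with the existing characters, a single byte below 128 still denotes the original ASCII character. When two consecutive bytes are both above 128, or more precisely when the high byte lies in [0xA1, 0xF7] and the low byte lies in [0xA1, 0xFE], the pair denotes one Chinese character (or one of the full-width symbols in the same table). This encoding is GB2312; for details see https://www.wikiwand.com/zh-hans/GB_2312.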
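As a quick check of these byte ranges, the following sketch uses Python's built-in gb2312 codec to encode a few characters and tests each result against the stated intervals. The helper name is_gb2312_pair is made up for this illustration.

```python
def is_gb2312_pair(data: bytes) -> bool:
    """Return True if `data` is a two-byte sequence inside the GB2312 ranges."""
    if len(data) != 2:
        return False
    high, low = data
    return 0xA1 <= high <= 0xF7 and 0xA1 <= low <= 0xFE


for ch in "A日报":
    encoded = ch.encode("gb2312")
    # ASCII characters stay one byte; Chinese characters become two bytes.
    print(ch, encoded.hex(), len(encoded), is_gb2312_pair(encoded))
```

The ASCII letter stays a single byte below 128, while each Chinese character becomes a pair of bytes that both fall in the ranges given above.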

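GB2312 still cannot represent every Chinese character, so the ranges of both bytes were extended: the high byte now lies in [0x81, 0xFE], and the low byte lies in [0x40, 0x7E] or [0x80, 0xFE] (0x7F is skipped). This encoding is GBK; for details see https://www.wikiwand.com/zh-hans/GBK.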
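The extended ranges can be checked the same way with Python's built-in gbk codec. This is only a sketch: in_gbk_range and try_encode are illustrative names, and the character 镕 is used on the assumption that it is one of the characters GBK added on top of GB2312.

```python
def in_gbk_range(data: bytes) -> bool:
    """Return True if `data` is a two-byte sequence inside the GBK ranges."""
    if len(data) != 2:
        return False
    high, low = data
    return 0x81 <= high <= 0xFE and (0x40 <= low <= 0x7E or 0x80 <= low <= 0xFE)


def try_encode(ch: str, codec: str):
    """Encode `ch` with `codec`; return None if the codec lacks the character."""
    try:
        return ch.encode(codec)
    except UnicodeEncodeError:
        return None


for ch in "日镕":
    print(ch, try_encode(ch, "gb2312"), try_encode(ch, "gbk"))
```

Characters that GB2312 already covers keep the same two bytes in GBK, which is what makes GBK a compatible extension.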

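Meanwhile, other countries designed encodings for their own languages, and the result was that, apart from English (ASCII), the encodings of different languages were incompatible with one another. To resolve this, the Unicode Consortium, working in step with the International Organization for Standardization (ISO/IEC 10646), designed a single character set covering the writing systems of all countries: Unicode. To balance performance against storage, the original design used two bytes per character; since 2^16 = 65536, this covers the characters of most languages in use. (Unicode has since grown beyond 16 bits, with code points up to 0x10FFFF, which is why some ranges later in this article exceed 0xFFFF.) Compared with the earlier mix of single-byte and double-byte encodings, how does a two-byte scheme represent the old single-byte characters? Simply by using an all-zero high byte and the original byte as the low byte. The drawback is that every English character now takes two bytes, which wastes space. For example, the two-byte Unicode encoding of "It's 日报" is: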

I 00000000 01001001
t 00000000 01110100
' 00000000 00100111
s 00000000 01110011
(space) 00000000 00100000
日 01100101 11100101
报 01100010 10100101
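The table can be reproduced with ord(), which returns a character's code point. The sketch below prints each code point as 16 bits and compares the result with the big-endian UTF-16 encoding, which for these particular characters is the same two-byte form.

```python
text = "It's 日报"

for ch in text:
    code_point = ord(ch)
    bits = format(code_point, "016b")
    # Split the 16 bits into a high byte and a low byte.
    print(repr(ch), bits[:8], bits[8:])

# For code points below 0x10000 (outside the surrogate range), UTF-16-BE
# stores exactly these two bytes per character.
utf16 = text.encode("utf-16-be")
print(len(text), "characters ->", len(utf16), "bytes")
```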


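To avoid spending two bytes on every English character, Unicode text is usually stored as UTF-8, a variable-length encoding with the following rules:

  • A single-byte character has its first bit set to 0. For English text, UTF-8 therefore uses one byte per character and is identical to ASCII.
  • For a character that takes n bytes (n > 1), the first n bits of the first byte are 1 and bit n+1 is 0; every following byte starts with 10. The remaining free bits of these n bytes are filled with the character's Unicode code point, padded with leading zeros. This gives the following UTF-8 bit patterns:

    0xxxxxxx
    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    The original design continued this pattern to 5 and 6 bytes (111110xx ... and 1111110x ...), but the current standard (RFC 3629) limits UTF-8 to 4 bytes, which is enough for every code point up to 0x10FFFF.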
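The rules above can be turned into a small hand-written encoder. This is an illustrative sketch (the function name utf8_encode_code_point is made up here) that handles one code point and is checked against Python's built-in str.encode; it does not reject surrogates or other invalid input.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by following the bit patterns."""
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110xxxxx 10xxxxxx
        n = 2
    elif cp < 0x10000:                 # 1110xxxx 10xxxxxx 10xxxxxx
        n = 3
    elif cp <= 0x10FFFF:               # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        n = 4
    else:
        raise ValueError("code point out of range")

    # Continuation bytes: take 6 bits at a time from the low end, prefix with 10.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (cp & 0b00111111))
        cp >>= 6
    # Leading byte: n one-bits, a zero bit, then the remaining high bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([prefix | cp] + tail[::-1])


for ch in ["I", "\u00e9", "日", "报", "\U00020000"]:
    encoded = utf8_encode_code_point(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", " ".join(f"{b:08b}" for b in encoded))
```

The last sample, U+20000, lies outside the 16-bit range and needs the four-byte pattern.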

Encoded as UTF-8, "It's 日报" therefore becomes:

I 01001001
t 01110100
' 00100111
s 01110011
(space) 00100000
日 11100110 10010111 10100101
报 11100110 10001010 10100101
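The same table can be printed directly with str.encode("utf-8"). The sketch below also compares the total size with the two-byte form shown earlier.

```python
text = "It's 日报"

for ch in text:
    encoded = ch.encode("utf-8")
    print(repr(ch), " ".join(format(b, "08b") for b in encoded))

utf8_size = len(text.encode("utf-8"))           # 5 ASCII bytes + 2 * 3 bytes
two_byte_size = len(text.encode("utf-16-be"))   # 7 characters * 2 bytes
print(utf8_size, "bytes in UTF-8 vs", two_byte_size, "bytes in the two-byte form")
```

For mostly-English text UTF-8 is the smaller of the two; for text that is mostly Chinese the balance tips the other way, since each Chinese character takes three bytes instead of two.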


2. Python Unicode in Practice

2.1 Working with Single Characters

2.1.1 Determining the Category of a Character


The Unicode general categories used in this article (as returned by unicodedata.category) are:

  • [Cc] Other, Control
  • [Cf] Other, Format
  • [Pc] Punctuation, Connector
  • [Pd] Punctuation, Dash
  • [Pe] Punctuation, Close
  • [Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage)
  • [Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
  • [Po] Punctuation, Other
  • [Ps] Punctuation, Open
  • [Mn] Mark, Nonspacing
  • [Zs] Separator, Space
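Python exposes these categories through unicodedata.category() in the standard library, which takes a single character and returns its two-letter general category. A few sample lookups, one per category in the list (the sample characters are chosen for this illustration):

```python
import unicodedata

samples = {
    "\x00": "Cc",    # NUL control character
    "\u200b": "Cf",  # zero width space (a format character)
    "_": "Pc",
    "-": "Pd",
    ")": "Pe",
    "\u00bb": "Pf",  # right-pointing double angle quotation mark
    "\u00ab": "Pi",  # left-pointing double angle quotation mark
    "!": "Po",
    "(": "Ps",
    "\u0301": "Mn",  # combining acute accent
    "\u3000": "Zs",  # ideographic (full-width) space
}

for char, expected in samples.items():
    category = unicodedata.category(char)
    print(f"U+{ord(char):04X} {category}")
    assert category == expected
```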


2.1.2 Checking Whether a Character Is Chinese

    def is_chinese_char(cp):
        """Checks whether `cp` (an integer code point, e.g. ord(char)) is a CJK character."""
        # This defines a "Chinese character" as anything in the CJK Unicode blocks:
        #   https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
        # Note that the CJK Unicode block is NOT all Japanese and Korean
        # characters, despite its name. The modern Korean Hangul alphabet is a
        # different block, as are Japanese Hiragana and Katakana. Those alphabets
        # are used to write space-separated words, so they are not treated
        # specially and are handled like all of the other languages.
        return ((0x4E00 <= cp <= 0x9FFF) or      # CJK Unified Ideographs
                (0x3400 <= cp <= 0x4DBF) or      # Extension A
                (0x20000 <= cp <= 0x2A6DF) or    # Extension B
                (0x2A700 <= cp <= 0x2B73F) or    # Extension C
                (0x2B740 <= cp <= 0x2B81F) or    # Extension D
                (0x2B820 <= cp <= 0x2CEAF) or    # Extension E
                (0xF900 <= cp <= 0xFAFF) or      # CJK Compatibility Ideographs
                (0x2F800 <= cp <= 0x2FA1F))      # CJK Compatibility Ideographs Supplement
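Because the function takes a code point rather than a character, it is called as is_chinese_char(ord(char)). The sketch below restates the same ranges as a table (CJK_RANGES and is_cjk are names chosen for this example, not part of the code above) and tries a few characters.

```python
# Same ranges as in is_chinese_char, written as (first, last) pairs.
CJK_RANGES = [
    (0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0x20000, 0x2A6DF),
    (0x2A700, 0x2B73F), (0x2B740, 0x2B81F), (0x2B820, 0x2CEAF),
    (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),
]


def is_cjk(char: str) -> bool:
    """Return True if the single character `char` falls in one of the ranges."""
    cp = ord(char)
    return any(first <= cp <= last for first, last in CJK_RANGES)


# Simplified and traditional Chinese, Hiragana, Hangul, Latin, CJK punctuation.
for char in ["日", "報", "あ", "한", "A", "。"]:
    print(char, hex(ord(char)), is_cjk(char))
```

Only the first two samples are ideographs; Hiragana, Hangul, and the ideographic full stop live in other blocks and are not matched.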

2.1.3 Checking Whether a Character Is Whitespace

    import unicodedata

    def is_whitespace(char):
        """Checks whether `char` is a whitespace character."""
        # \t, \n, and \r are technically control characters but we treat them
        # as whitespace since they are generally considered as such.
        if char in (" ", "\t", "\n", "\r"):
            return True
        # "Zs" is the Unicode category "Separator, Space".
        return unicodedata.category(char) == "Zs"
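The category check matters for the many space characters outside ASCII. The sketch below repeats the function in condensed form so that it runs on its own, then applies it to a few code points.

```python
import unicodedata


def is_whitespace(char: str) -> bool:
    """Condensed version of the check above: ASCII whitespace or category Zs."""
    return char in (" ", "\t", "\n", "\r") or unicodedata.category(char) == "Zs"


# space, tab, no-break space, ideographic space, zero width space, vertical tab, letter
for char in [" ", "\t", "\u00a0", "\u3000", "\u200b", "\x0b", "a"]:
    print(f"U+{ord(char):04X}", unicodedata.category(char), is_whitespace(char))
```

The zero width space U+200B is in category Cf and the vertical tab is in Cc, so neither counts as whitespace here; both fall under the control-character check instead.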

2.1.4 Checking Whether a Character Is a Control Character

    import unicodedata

    def is_control(char):
        """Checks whether `char` is a control character."""
        # These are technically control characters but we count them as
        # whitespace characters.
        if char in ("\t", "\n", "\r"):
            return False
        # "Cc" is "Other, Control"; "Cf" is "Other, Format".
        return unicodedata.category(char) in ("Cc", "Cf")
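A short sketch of what this check catches in practice: invisible characters such as the zero width space or a byte order mark are in category Cf, so they are reported as control characters. The function is repeated in condensed form to keep the example self-contained.

```python
import unicodedata


def is_control(char: str) -> bool:
    """Condensed version of the check above."""
    if char in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(char) in ("Cc", "Cf")


# Contains a zero width space, a NUL, a byte order mark, and a tab.
text = "a\u200bb\x00c\ufeffd\te"
visible = "".join(ch for ch in text if not is_control(ch))
print(repr(visible))
```

The tab survives because it is deliberately treated as whitespace, not as a control character.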

2.1.5 Checking Whether a Character Is Punctuation

    import unicodedata

    def is_punctuation(char):
        """Checks whether `char` is a punctuation character."""
        cp = ord(char)
        # We treat all non-letter/number ASCII as punctuation.
        # Characters such as "^", "$", and "`" are not in the Unicode
        # Punctuation class but we treat them as punctuation anyway, for
        # consistency.
        if ((33 <= cp <= 47) or (58 <= cp <= 64) or
                (91 <= cp <= 96) or (123 <= cp <= 126)):
            return True
        # Every Unicode punctuation category (Pc, Pd, Pe, Pf, Pi, Po, Ps) starts with "P".
        return unicodedata.category(char).startswith("P")
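The ASCII ranges exist because symbols such as "$" and "^" are not in any Unicode P category. The sketch below (a condensed copy of the function, so that it runs on its own) prints the raw category next to the function's answer.

```python
import unicodedata


def is_punctuation(char: str) -> bool:
    """Condensed version of the check above."""
    cp = ord(char)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(char).startswith("P")


# ASCII symbols, ASCII and CJK punctuation, a dash (U+2014), letters, a non-ASCII symbol.
for char in ["$", "^", "`", ",", "。", "\u2014", "a", "1", "\u20ac"]:
    print(char, unicodedata.category(char), is_punctuation(char))
```

Note the asymmetry: "$" counts as punctuation because it is ASCII, while the euro sign (category Sc, outside ASCII) does not.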

2.2 String Processing

2.2.1 Converting Text to Unicode

    def convert_to_unicode(text):
        """Converts `text` to Unicode (if it's not already), assuming UTF-8 input."""
        if isinstance(text, str):
            return text
        if isinstance(text, bytes):
            # Bytes that are not valid UTF-8 are silently dropped.
            return text.decode("utf-8", "ignore")
        raise ValueError("Unsupported string type: %s" % type(text))
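The "ignore" error handler means that bytes which are not valid UTF-8 are dropped silently instead of raising an exception. A minimal illustration of the three common handlers of bytes.decode, using a deliberately corrupted byte string:

```python
data = "日报".encode("utf-8") + b"\xff" + b"ok"

# With errors="ignore" the invalid byte 0xFF simply disappears.
ignored = data.decode("utf-8", "ignore")
print(ignored)

# The default (errors="strict") raises instead.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# errors="replace" keeps a visible marker, the replacement character U+FFFD.
replaced = data.decode("utf-8", "replace")
print(replaced)
```

Silently ignoring bad bytes is convenient for a tokenizer, but it can hide encoding problems in the input; "replace" is an alternative when those problems should stay visible.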

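2.2.2 Removing Invalid Characters from Text

    def clean_text(text):
        """Performs invalid character removal and whitespace cleanup on text."""
        output = []
        for char in text:
            cp = ord(char)
            # Drop NUL, the replacement character U+FFFD, and control characters.
            if cp == 0 or cp == 0xfffd or is_control(char):
                continue
            # Normalize every kind of whitespace to a plain space.
            if is_whitespace(char):
                output.append(" ")
            else:
                output.append(char)
        return "".join(output)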
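A self-contained sketch of the cleanup step, with condensed copies of the two helper checks so that it runs on its own. The sample string is made up for this illustration.

```python
import unicodedata


def is_whitespace(char):
    return char in (" ", "\t", "\n", "\r") or unicodedata.category(char) == "Zs"


def is_control(char):
    if char in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(char) in ("Cc", "Cf")


def clean_text(text):
    output = []
    for char in text:
        cp = ord(char)
        if cp == 0 or cp == 0xFFFD or is_control(char):
            continue  # drop NUL, U+FFFD and control characters
        output.append(" " if is_whitespace(char) else char)
    return "".join(output)


# Ideographic space, NUL, tab, zero width space, and a trailing newline.
raw = "Hello\u3000world\x00!\tNew\u200bline\n"
cleaned = clean_text(raw)
print(repr(cleaned))
```

Whitespace is replaced one character at a time, so runs of spaces are not collapsed at this stage.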

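2.2.3 Stripping Accents from Text

    import unicodedata

    def strip_accents(text):
        """Strips accents from a piece of text."""
        # NFD splits accented characters into a base character plus combining marks.
        text = unicodedata.normalize("NFD", text)
        output = []
        for char in text:
            cat = unicodedata.category(char)
            # "Mn" (Mark, Nonspacing) covers the combining accents; skip them.
            if cat == "Mn":
                continue
            output.append(char)
        return "".join(output)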
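NFD normalization splits a precomposed character such as "é" into a base letter plus a combining mark of category Mn, which is what makes the filter work. A self-contained sketch, written with escape sequences so that the input is unambiguous:

```python
import unicodedata


def strip_accents(text):
    """Condensed version of the function above."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


word = "caf\u00e9"                                # "café" with a precomposed é
decomposed = unicodedata.normalize("NFD", word)
print(len(word), len(decomposed))                 # the accent becomes its own code point
print([unicodedata.category(ch) for ch in decomposed])
print(strip_accents(word), strip_accents("Z\u00fcrich"), strip_accents("日报"))
```

One side effect worth knowing: the filter is not limited to Latin accents. The Japanese voiced sound mark is also a nonspacing mark, so a voiced kana such as U+304C loses its mark as well.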

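2.2.4 Splitting a String into Text and Punctuation

    def split_on_punc(text):
        """Splits punctuation on a piece of text."""
        chars = list(text)
        i = 0
        start_new_word = True
        output = []  # list of character lists, one per output piece
        while i < len(chars):
            char = chars[i]
            if is_punctuation(char):
                # Each punctuation character becomes a piece of its own.
                output.append([char])
                start_new_word = True
            else:
                if start_new_word:
                    output.append([])
                start_new_word = False
                output[-1].append(char)
            i += 1

        return ["".join(x) for x in output]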
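A self-contained sketch of the splitting step, using a for loop in place of the index-based while loop and a condensed copy of the punctuation check. The behavior is intended to match the function above.

```python
import unicodedata


def is_punctuation(char):
    cp = ord(char)
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(char).startswith("P")


def split_on_punc(text):
    output = []              # list of character lists, one per piece
    start_new_word = True
    for char in text:
        if is_punctuation(char):
            output.append([char])      # punctuation is always its own piece
            start_new_word = True
        else:
            if start_new_word:
                output.append([])
            start_new_word = False
            output[-1].append(char)
    return ["".join(piece) for piece in output]


print(split_on_punc("don't"))            # ['don', "'", 't']
print(split_on_punc("hello,world!!"))    # ['hello', ',', 'world', '!', '!']
```

Consecutive punctuation marks are not merged: each one becomes a separate piece.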

2.2.5 Tokenizing Text

    def tokenize(text, do_lower_case=True):
        """Tokenizes a piece of text."""
        # whitespace_tokenize() and tokenize_chinese_chars() are helpers that
        # are not listed in the sections above; they must be defined separately.
        text = convert_to_unicode(text)
        text = clean_text(text)
        # Assumed to put spaces around each CJK character so that every
        # character becomes a token of its own.
        text = tokenize_chinese_chars(text)

        orig_tokens = whitespace_tokenize(text)  # str -> list of str
        split_tokens = []
        for token in orig_tokens:
            if do_lower_case:
                token = token.lower()
                token = strip_accents(token)
            split_tokens.extend(split_on_punc(token))  # list of str

        output_tokens = whitespace_tokenize(" ".join(split_tokens))  # list of str
        return output_tokens
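tokenize relies on two helpers, whitespace_tokenize and tokenize_chinese_chars, that are not defined in this article. The following is one plausible sketch of them, under the assumption that whitespace_tokenize strips the text and splits it on whitespace, and that tokenize_chinese_chars pads every CJK character with spaces. The helpers the author actually used may differ.

```python
def is_chinese_char(cp):
    """Same code point ranges as in section 2.1.2."""
    ranges = [
        (0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0x20000, 0x2A6DF),
        (0x2A700, 0x2B73F), (0x2B740, 0x2B81F), (0x2B820, 0x2CEAF),
        (0xF900, 0xFAFF), (0x2F800, 0x2FA1F),
    ]
    return any(first <= cp <= last for first, last in ranges)


def whitespace_tokenize(text):
    """Assumed behavior: strip the text and split it on runs of whitespace."""
    text = text.strip()
    if not text:
        return []
    return text.split()


def tokenize_chinese_chars(text):
    """Assumed behavior: put a space on both sides of every CJK character."""
    output = []
    for char in text:
        if is_chinese_char(ord(char)):
            output.extend([" ", char, " "])
        else:
            output.append(char)
    return "".join(output)


padded = tokenize_chinese_chars("It's 日报")
print(repr(padded))
print(whitespace_tokenize(padded))   # ["It's", '日', '报']
```

Under these assumptions, Chinese text is split character by character, while English words stay whole until the punctuation split.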

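One question is left for the reader: after orig_tokens = whitespace_tokenize(text) has already run, why does the code call output_tokens = whitespace_tokenize(" ".join(split_tokens)) as well? In other words, what is the point of running whitespace_tokenize twice?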
