6.5.1. UTF-8 Encode/Decode

Each of the 128 blocks of 64K characters from this set is called a plane. The first plane agrees with the 16-bit Unicode character set. The following diagram, adapted from the Linux man page by Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de>, shows clearly how the encoding works.
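As a rough summary of that layout, the sketch below tabulates the byte patterns implied by the encoder in this section. The names UTF8_LAYOUT and utf8_length are illustrative only and are not part of the original module; the table is reconstructed from the range tests and bit masks in the code, not copied from the man page.

```python
# Each row: (first value, last value, byte layout); x marks a payload bit.
UTF8_LAYOUT = [
    (0x00000000, 0x0000007F, "0xxxxxxx"),
    (0x00000080, 0x000007FF, "110xxxxx 10xxxxxx"),
    (0x00000800, 0x0000FFFF, "1110xxxx 10xxxxxx 10xxxxxx"),
    (0x00010000, 0x001FFFFF, "11110xxx 10xxxxxx 10xxxxxx 10xxxxxx"),
    (0x00200000, 0x03FFFFFF, "111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx"),
    (0x04000000, 0x7FFFFFFF,
     "1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx"),
]

def utf8_length(i):
    """Return the number of bytes needed to encode the integer i."""
    for first, last, layout in UTF8_LAYOUT:
        if first <= i <= last:
            return len(layout.split())
    raise ValueError("value outside the 31-bit range: %r" % (i,))
```

The lead byte announces the sequence length through its run of high 1 bits, and every following byte starts with the bits 10, so a decoder can always tell a lead byte from a continuation byte.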

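The Python function utf8 encodes an integer as a string using the UTF-8 encoding. The function seq_to_utf8 translates a Unicode string, represented by a sequence of integers, into a UTF-8 string. The function parse_utf8 decodes the character that starts at index i of the string s and returns its integer value together with the index of the next character; a lead byte that does not start a valid sequence is returned unchanged as the value.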

    #line 63 "utf8.ipk"
    def utf8(i):
      # Encode the integer i as a UTF-8 string of one to six bytes.
      if i < 0x80:
        return chr(i)
      if i < 0x800:
        return chr(0xC0 | (i>>6) & 0x1F)+\
          chr(0x80 | i & 0x3F)
      if i < 0x10000:
        return chr(0xE0 | (i>>12) & 0xF)+\
          chr(0x80 | (i>>6) & 0x3F)+\
          chr(0x80 | i & 0x3F)
      if i < 0x200000:
        return chr(0xF0 | (i>>18) & 0x7)+\
          chr(0x80 | (i>>12) & 0x3F)+\
          chr(0x80 | (i>>6) & 0x3F)+\
          chr(0x80 | i & 0x3F)
      if i < 0x4000000:
        return chr(0xF8 | (i>>24) & 0x3)+\
          chr(0x80 | (i>>18) & 0x3F)+\
          chr(0x80 | (i>>12) & 0x3F)+\
          chr(0x80 | (i>>6) & 0x3F)+\
          chr(0x80 | i & 0x3F)
      return chr(0xFC | (i>>30) & 0x1)+\
        chr(0x80 | (i>>24) & 0x3F)+\
        chr(0x80 | (i>>18) & 0x3F)+\
        chr(0x80 | (i>>12) & 0x3F)+\
        chr(0x80 | (i>>6) & 0x3F)+\
        chr(0x80 | i & 0x3F)

    def seq_to_utf8(a):
      # Encode every integer in the sequence a and concatenate the results.
      return ''.join([utf8(ch) for ch in a])

    def parse_utf8(s,i):
      # Decode one character starting at s[i]; return (value, next index).
      # The lead byte mask shrinks as the sequence grows, so that the
      # length marker bits are never copied into the value.
      lead = ord(s[i])
      if lead & 0x80 == 0:
        return lead & 0x7F,i+1 # ASCII
      if lead & 0xE0 == 0xC0:
        return ((lead & 0x1F) << 6)|\
          (ord(s[i+1]) & 0x3F),i+2
      if lead & 0xF0 == 0xE0:
        return ((lead & 0x0F)<<12)|\
          ((ord(s[i+1]) & 0x3F) <<6)|\
          (ord(s[i+2]) & 0x3F),i+3
      if lead & 0xF8 == 0xF0:
        return ((lead & 0x07)<<18)|\
          ((ord(s[i+1]) & 0x3F) <<12)|\
          ((ord(s[i+2]) & 0x3F) <<6)|\
          (ord(s[i+3]) & 0x3F),i+4
      if lead & 0xFC == 0xF8:
        return ((lead & 0x03)<<24)|\
          ((ord(s[i+1]) & 0x3F) <<18)|\
          ((ord(s[i+2]) & 0x3F) <<12)|\
          ((ord(s[i+3]) & 0x3F) <<6)|\
          (ord(s[i+4]) & 0x3F),i+5
      if lead & 0xFE == 0xFC:
        return ((lead & 0x01)<<30)|\
          ((ord(s[i+1]) & 0x3F) <<24)|\
          ((ord(s[i+2]) & 0x3F) <<18)|\
          ((ord(s[i+3]) & 0x3F) <<12)|\
          ((ord(s[i+4]) & 0x3F) <<6)|\
          (ord(s[i+5]) & 0x3F),i+6
      return lead, i+1 # error, just use bad character
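The listing builds its result with chr, which yields text rather than bytes under Python 3. As a cross-check, the same scheme can be sketched on bytes objects and compared with Python's built-in codec. The names utf8_bytes and parse_utf8_bytes are hypothetical and not part of utf8.ipk, and the table-driven loops are a restructuring chosen for brevity, not the author's formulation.

```python
# (exclusive upper limit, lead byte marker, number of continuation bytes)
_ENCODE_ROWS = (
    (0x800, 0xC0, 1), (0x10000, 0xE0, 2), (0x200000, 0xF0, 3),
    (0x4000000, 0xF8, 4), (0x80000000, 0xFC, 5),
)

def utf8_bytes(i):
    """Encode the integer i (0 <= i < 2**31) as a bytes object."""
    if 0 <= i < 0x80:
        return bytes([i])
    for limit, marker, ntrail in _ENCODE_ROWS:
        if 0 <= i < limit:
            out = [marker | (i >> (6 * ntrail))]
            # Emit the remaining bits six at a time, most significant first.
            for k in range(ntrail - 1, -1, -1):
                out.append(0x80 | ((i >> (6 * k)) & 0x3F))
            return bytes(out)
    raise ValueError("value outside the 31-bit range: %r" % (i,))

# (lead byte marker, mask selecting the marker bits, continuation bytes)
_DECODE_ROWS = (
    (0xC0, 0xE0, 1), (0xE0, 0xF0, 2), (0xF0, 0xF8, 3),
    (0xF8, 0xFC, 4), (0xFC, 0xFE, 5),
)

def parse_utf8_bytes(s, i):
    """Decode one character at s[i]; return (value, index of next)."""
    lead = s[i]
    if lead < 0x80:
        return lead, i + 1
    for marker, mask, ntrail in _DECODE_ROWS:
        if lead & mask == marker:
            value = lead & ~mask & 0xFF  # keep only the payload bits
            # Continuation bytes are not validated, and truncated input
            # is not detected in this sketch.
            for b in s[i + 1:i + 1 + ntrail]:
                value = (value << 6) | (b & 0x3F)
            return value, i + 1 + ntrail
    return lead, i + 1  # stray continuation byte: pass it through

# Compare with the built-in codec on a few valid Unicode code points.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_bytes(cp)
    assert encoded == chr(cp).encode("utf-8")
    assert parse_utf8_bytes(encoded, 0) == (cp, len(encoded))
```

The built-in codec accepts only values up to 0x10FFFF and rejects surrogates, so the comparison covers just that range; the five- and six-byte forms can only be checked by round-tripping through the sketch itself.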