6.5.1. Utf-8 Encode/Decode

Interscript uses the UTF-8 encoding of the 31 bit ISO-10646 character set: this encoding is 'Implementation Level 3', in ISO parlance, meaning the complete character set is representable.

The set is divided into 128 groups of 256 planes, and each plane holds 64K characters. The first plane agrees with the 16-bit Unicode character set. The following diagram is adapted from the Linux man page by Markus Kuhn (mskuhn@cip.informatik.uni-erlangen.de) and shows clearly how the encoding works.

       0x00000000 - 0x0000007F:
           0xxxxxxx
           00-7F
           80/7F

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx
           C0-DF    80-BF
           E0/1F    C0/3F

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx
           E0-EF
           F0/0F

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
           F0-F7
           F8/07

       0x00200000 - 0x03FFFFFF:
           111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
           F8-FB
           FC/03

       0x04000000 - 0x7FFFFFFF:
           1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
           FC-FD
           FE/01

In each entry, the first row shows the Unicode range in hex. The second row shows the utf8 encoding in binary: the xxx bit positions are filled with the bits of the character code number in binary representation. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. The third row shows the byte ranges in hex. The fourth row shows the mask required to select the fixed bit positions of each byte, and the mask required to select the variable (xxx) positions. For sequences of three or more bytes only the lead byte is listed in the third and fourth rows; every continuation byte has the range 80-BF and the masks C0/3F.
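
As a worked illustration of the mask rows, the following sketch classifies a lead byte using the fixed/variable mask pairs from the diagram, and then assembles the three-byte encoding of the code number 0x20AC by hand. The names LEAD_TABLE and classify_lead are hypothetical and are not part of Interscript.

```python
# Hypothetical helper, not part of Interscript. Each entry is
# (fixed-bit mask, expected fixed bits, variable-bit mask, sequence length),
# read off the fourth row of each entry in the diagram.
LEAD_TABLE = [
    (0x80, 0x00, 0x7F, 1),
    (0xE0, 0xC0, 0x1F, 2),
    (0xF0, 0xE0, 0x0F, 3),
    (0xF8, 0xF0, 0x07, 4),
    (0xFC, 0xF8, 0x03, 5),
    (0xFE, 0xFC, 0x01, 6),
]

def classify_lead(byte):
    """Return (sequence length, payload bits) for a lead byte, or None."""
    for fixed_mask, fixed_bits, var_mask, length in LEAD_TABLE:
        if byte & fixed_mask == fixed_bits:
            return length, byte & var_mask
    return None  # continuation byte (80-BF) or one of the unused bytes FE, FF

# Encode 0x20AC by hand: it lies in 0x0800 - 0xFFFF, so three bytes are needed.
code = 0x20AC
encoded = [
    0xE0 | (code >> 12) & 0x0F,  # 1110xxxx: top 4 bits
    0x80 | (code >> 6) & 0x3F,   # 10xxxxxx: middle 6 bits
    0x80 | code & 0x3F,          # 10xxxxxx: low 6 bits
]
```

Here encoded holds the bytes E2 82 AC, and classify_lead(0xE2) reports a three-byte sequence whose lead byte carries the payload bits 0010.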

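The python function utf8 encodes an integer in a string using the utf-8 encoding. The function seq_to_utf8 translates a unicode string, represented by a sequence of integers, into a utf8 string. The function parse_utf8 decodes the character that starts at index i of a utf8 string and returns its code number together with the index of the next character.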

Start python section to interscript/encoding/utf8.py[1 /1 ]
#line 63 "utf8.ipk"
# Encode the integer code number i as a utf8 string,
# one chr() per byte of the encoding.
def utf8(i):
  if i < 0x80:
    return chr(i)
  if i < 0x800:
    return chr(0xC0 | (i>>6) & 0x1F)+\
      chr(0x80 | i & 0x3F)
  if i < 0x10000:
    return chr(0xE0 | (i>>12) & 0xF)+\
      chr(0x80 | (i>>6) & 0x3F)+\
      chr(0x80 | i & 0x3F)
  if i < 0x200000:
    return chr(0xF0 | (i>>18) & 0x7)+\
      chr(0x80 | (i>>12) & 0x3F)+\
      chr(0x80 | (i>>6) & 0x3F)+\
      chr(0x80 | i & 0x3F)
  if i < 0x4000000:
    return chr(0xF8 | (i>>24) & 0x3)+\
      chr(0x80 | (i>>18) & 0x3F)+\
      chr(0x80 | (i>>12) & 0x3F)+\
      chr(0x80 | (i>>6) & 0x3F)+\
      chr(0x80 | i & 0x3F)
  return chr(0xFC | (i>>30) & 0x1)+\
    chr(0x80 | (i>>24) & 0x3F)+\
    chr(0x80 | (i>>18) & 0x3F)+\
    chr(0x80 | (i>>12) & 0x3F)+\
    chr(0x80 | (i>>6) & 0x3F)+\
    chr(0x80 | i & 0x3F)

# Encode a sequence of integer code numbers as one utf8 string.
def seq_to_utf8(a):
  s = ''
  for ch in a: s = s + utf8(ch)
  return s

# Decode the character starting at s[i]; return the pair
# (code number, index of the next character).
# The lead byte is masked with the variable-bit mask for its length.
def parse_utf8(s,i):
  lead = ord(s[i])
  if lead & 0x80 == 0:
    return lead & 0x7F,i+1 # ASCII
  if lead & 0xE0 == 0xC0:
    return ((lead & 0x1F) << 6)|\
      (ord(s[i+1]) & 0x3F),i+2
  if lead & 0xF0 == 0xE0:
    return ((lead & 0x0F)<<12)|\
      ((ord(s[i+1]) & 0x3F) <<6)|\
      (ord(s[i+2]) & 0x3F),i+3
  if lead & 0xF8 == 0xF0:
    return ((lead & 0x07)<<18)|\
      ((ord(s[i+1]) & 0x3F) <<12)|\
      ((ord(s[i+2]) & 0x3F) <<6)|\
      (ord(s[i+3]) & 0x3F),i+4
  if lead & 0xFC == 0xF8:
    return ((lead & 0x03)<<24)|\
      ((ord(s[i+1]) & 0x3F) <<18)|\
      ((ord(s[i+2]) & 0x3F) <<12)|\
      ((ord(s[i+3]) & 0x3F) <<6)|\
      (ord(s[i+4]) & 0x3F),i+5
  if lead & 0xFE == 0xFC:
    return ((lead & 0x01)<<30)|\
      ((ord(s[i+1]) & 0x3F) <<24)|\
      ((ord(s[i+2]) & 0x3F) <<18)|\
      ((ord(s[i+3]) & 0x3F) <<12)|\
      ((ord(s[i+4]) & 0x3F) <<6)|\
      (ord(s[i+5]) & 0x3F),i+6
  return lead, i+1 # error, just use bad character
End python section to interscript/encoding/utf8.py[1]
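
The decoder parse_utf8 does not check that the continuation bytes are present and well formed, nor that the shortest possible sequence was used. A minimal sketch of a stricter variant, assuming Python 3 and bytes input, might look like the following. The names strict_parse_utf8 and MIN_VALUE are hypothetical and are not part of Interscript.

```python
# Hypothetical stricter variant of parse_utf8; not part of Interscript.
# Smallest code number that genuinely needs a sequence of each length.
MIN_VALUE = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000, 5: 0x200000, 6: 0x4000000}

def strict_parse_utf8(data, i):
    """Decode one character from the bytes object data at index i.

    Returns (code number, next index); raises ValueError on a bad sequence.
    """
    lead = data[i]
    if lead < 0x80:
        return lead, i + 1  # ASCII
    # Count the leading 1 bits of the lead byte to find the sequence length.
    length = 0
    mask = 0x80
    while lead & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6:
        raise ValueError("invalid lead byte at %d" % i)
    code = lead & (mask - 1)  # variable bits of the lead byte
    tail = data[i + 1:i + length]
    if len(tail) != length - 1:
        raise ValueError("truncated sequence at %d" % i)
    for byte in tail:
        if byte & 0xC0 != 0x80:
            raise ValueError("bad continuation byte at %d" % i)
        code = (code << 6) | (byte & 0x3F)
    if code < MIN_VALUE[length]:
        raise ValueError("overlong encoding at %d" % i)
    return code, i + length
```

For example, strict_parse_utf8(b'\xe2\x82\xac', 0) yields the pair (0x20AC, 3), while the overlong two-byte form b'\xc0\x80' of the code number 0 raises ValueError. Raising an exception is a design choice of this sketch; parse_utf8 instead returns the lead byte as a bad character and carries on.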


6.5.1.1. Test 14: utf8 round trip