This set is divided into 128 groups of 256 planes, where a plane is a block of 64K characters. The first plane agrees with the 16 bit Unicode character set. The following diagram is adapted from the Linux man page by Markus Kuhn <mskuhn@cip.informatik.uni-erlangen.de> and clearly shows how the encoding works.
Code range (hex)         Encoding (binary)                                      Bytes (hex)  Masks (fixed/variable)
0x00000000 - 0x0000007F  0xxxxxxx                                               00-7F        80/7F
0x00000080 - 0x000007FF  110xxxxx 10xxxxxx                                      C0-DF 80-BF  E0/1F C0/3F
0x00000800 - 0x0000FFFF  1110xxxx 10xxxxxx 10xxxxxx                             E0-EF        F0/0F
0x00010000 - 0x001FFFFF  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx                    F0-F7        F8/07
0x00200000 - 0x03FFFFFF  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx           F8-FB        FC/03
0x04000000 - 0x7FFFFFFF  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx  FC-FD        FE/01

The first column shows the Unicode range in hex. The second column shows the UTF-8 encoding in binary: the xxx bit positions are filled with the bits of the character code number in binary representation. Only the shortest possible multibyte sequence which can represent the code number of the character can be used. The third column shows the byte ranges in hex. The fourth column shows the mask required to select the fixed bit positions of each byte, and the mask required to select the variable (xxx) positions. The third and fourth columns describe the lead byte; the range and masks of the continuation bytes (80-BF, C0/3F) are listed once, in the two-byte row, and are the same for the longer sequences.
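The rows of the table can be turned directly into a small table-driven encoder. The following is only a sketch in Python 3 and is not the function defined in this document: the names ROWS and encode_by_table are made up for illustration, and it returns a bytes object rather than a string.

```python
# Rows of the table: (largest code number, byte count, fixed bits of lead byte).
# ROWS and encode_by_table are hypothetical names used only in this sketch.
ROWS = [
    (0x0000007F, 1, 0x00),
    (0x000007FF, 2, 0xC0),
    (0x0000FFFF, 3, 0xE0),
    (0x001FFFFF, 4, 0xF0),
    (0x03FFFFFF, 5, 0xF8),
    (0x7FFFFFFF, 6, 0xFC),
]

def encode_by_table(code):
    """Return the shortest encoding of `code` as a bytes object."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code number out of range")
    # The first row whose upper limit covers the code gives the shortest form.
    for limit, nbytes, marker in ROWS:
        if code <= limit:
            break
    # Each continuation byte carries 6 bits; collect them lowest bits first.
    tail = []
    for _ in range(nbytes - 1):
        tail.append(0x80 | (code & 0x3F))
        code >>= 6
    # The bits that remain fit in the variable positions of the lead byte.
    return bytes([marker | code] + tail[::-1])
```

For example, encode_by_table(0x20AC) falls in the third row and yields the three bytes E2 82 AC. Because the rows are searched from the shortest form upwards, the sketch never produces an overlong sequence.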
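The Python function utf8 encodes an integer as a string using the UTF-8 encoding. The function seq_to_utf8 translates a Unicode string, represented by a sequence of integers, into a UTF-8 string. The function parse_utf8 decodes the character that starts at index i of a UTF-8 string and returns its code number together with the index of the next character.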
#line 63 "utf8.ipk"
# Each byte is built with chr(). Under Python 3 the result is a str whose
# characters carry the byte values; s.encode('latin-1') gives the raw bytes.
def utf8(i):
    if i < 0x80:
        return chr(i)
    if i < 0x800:
        return chr(0xC0 | (i>>6) & 0x1F)+\
            chr(0x80 | i & 0x3F)
    if i < 0x10000:
        return chr(0xE0 | (i>>12) & 0xF)+\
            chr(0x80 | (i>>6) & 0x3F)+\
            chr(0x80 | i & 0x3F)
    if i < 0x200000:
        return chr(0xF0 | (i>>18) & 0x7)+\
            chr(0x80 | (i>>12) & 0x3F)+\
            chr(0x80 | (i>>6) & 0x3F)+\
            chr(0x80 | i & 0x3F)
    if i < 0x4000000:
        return chr(0xF8 | (i>>24) & 0x3)+\
            chr(0x80 | (i>>18) & 0x3F)+\
            chr(0x80 | (i>>12) & 0x3F)+\
            chr(0x80 | (i>>6) & 0x3F)+\
            chr(0x80 | i & 0x3F)
    return chr(0xFC | (i>>30) & 0x1)+\
        chr(0x80 | (i>>24) & 0x3F)+\
        chr(0x80 | (i>>18) & 0x3F)+\
        chr(0x80 | (i>>12) & 0x3F)+\
        chr(0x80 | (i>>6) & 0x3F)+\
        chr(0x80 | i & 0x3F)

def seq_to_utf8(a):
    # Concatenate the encodings of all code numbers in the sequence.
    return ''.join([utf8(ch) for ch in a])

# Decode the character starting at s[i]; return (code number, next index).
# The lead byte is masked with the variable-bit mask of its row in the table.
def parse_utf8(s,i):
    lead = ord(s[i])
    if lead & 0x80 == 0:
        return lead & 0x7F,i+1 # ASCII
    if lead & 0xE0 == 0xC0:
        return ((lead & 0x1F) << 6)|\
            (ord(s[i+1]) & 0x3F),i+2
    if lead & 0xF0 == 0xE0:
        return ((lead & 0x0F)<<12)|\
            ((ord(s[i+1]) & 0x3F) <<6)|\
            (ord(s[i+2]) & 0x3F),i+3
    if lead & 0xF8 == 0xF0:
        return ((lead & 0x07)<<18)|\
            ((ord(s[i+1]) & 0x3F) <<12)|\
            ((ord(s[i+2]) & 0x3F) <<6)|\
            (ord(s[i+3]) & 0x3F),i+4
    if lead & 0xFC == 0xF8:
        return ((lead & 0x03)<<24)|\
            ((ord(s[i+1]) & 0x3F) <<18)|\
            ((ord(s[i+2]) & 0x3F) <<12)|\
            ((ord(s[i+3]) & 0x3F) <<6)|\
            (ord(s[i+4]) & 0x3F),i+5
    if lead & 0xFE == 0xFC:
        return ((lead & 0x01)<<30)|\
            ((ord(s[i+1]) & 0x3F) <<24)|\
            ((ord(s[i+2]) & 0x3F) <<18)|\
            ((ord(s[i+3]) & 0x3F) <<12)|\
            ((ord(s[i+4]) & 0x3F) <<6)|\
            (ord(s[i+5]) & 0x3F),i+6
    return lead, i+1 # error, just use bad character
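parse_utf8 trusts its input: it does not check that the following bytes are continuation bytes, and it accepts sequences longer than necessary. The rule that only the shortest possible sequence may be used can be enforced while decoding. The following is a hedged sketch of one way to do that, written for Python 3 bytes objects; the names LEAD_FORMS and decode_strict are hypothetical and are not part of utf8.ipk.

```python
# One entry per multibyte row of the table:
# (fixed-bit mask, fixed-bit value, variable-bit mask, smallest code number).
# LEAD_FORMS and decode_strict are hypothetical names used only in this sketch.
LEAD_FORMS = [
    (0xE0, 0xC0, 0x1F, 0x00000080),  # 2 bytes
    (0xF0, 0xE0, 0x0F, 0x00000800),  # 3 bytes
    (0xF8, 0xF0, 0x07, 0x00010000),  # 4 bytes
    (0xFC, 0xF8, 0x03, 0x00200000),  # 5 bytes
    (0xFE, 0xFC, 0x01, 0x04000000),  # 6 bytes
]

def decode_strict(data, i=0):
    """Decode one character of `data` (bytes) starting at index i.

    Returns (code number, next index); raises ValueError on malformed input.
    """
    lead = data[i]
    if lead < 0x80:
        return lead, i + 1  # ASCII
    # Find the row whose fixed bits match the lead byte.
    for extra, (fmask, fbits, vmask, minimum) in enumerate(LEAD_FORMS, start=1):
        if lead & fmask == fbits:
            break
    else:
        raise ValueError("invalid lead byte 0x%02X at index %d" % (lead, i))
    tail = data[i + 1 : i + 1 + extra]
    if len(tail) != extra:
        raise ValueError("truncated sequence at index %d" % i)
    code = lead & vmask
    for b in tail:
        # Every continuation byte must have the form 10xxxxxx.
        if b & 0xC0 != 0x80:
            raise ValueError("bad continuation byte 0x%02X" % b)
        code = (code << 6) | (b & 0x3F)
    # A code below the row's lower bound would fit in a shorter sequence.
    if code < minimum:
        raise ValueError("overlong encoding at index %d" % i)
    return code, i + 1 + extra
```

With this sketch, decode_strict(b'\xe2\x82\xac') returns (0x20AC, 3), while the overlong two-byte form b'\xc0\x80' raises ValueError instead of decoding to 0. Raising an exception is a design choice of the sketch; parse_utf8 instead returns the offending byte and moves on.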