6.3. Character set encodings

This subsystem handles character sets in interscript. We define a character set as a subset of the integers, whose elements are called codepoints; in interscript, each codepoint is directly represented by the corresponding Python integer.

A font can be thought of as a relation from codepoints to glyphs. If glyphs can be said to have shape, then where two fonts map a codepoint to two glyphs, the glyphs generally have the same shape.

A codepoint mapping is a partial function between character sets designed to reflect a common notion of character, as might be reflected in the shape of glyphs from appropriate fonts.
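As a minimal sketch, a codepoint mapping can be modelled in Python as a dictionary from integers to integers. The table name and the map_codepoint helper are hypothetical and not part of interscript, and only three entries of a mapping from IBM code page 437 to ISO 10646 are shown.

```python
# Hypothetical fragment of a codepoint mapping from IBM code page 437
# to ISO 10646; only three entries are listed, for illustration.
CP437_TO_ISO10646 = {
    0x41: 0x41,  # LATIN CAPITAL LETTER A
    0x82: 0xE9,  # LATIN SMALL LETTER E WITH ACUTE
    0x9C: 0xA3,  # POUND SIGN
}

def map_codepoint(table, codepoint):
    """Apply a partial function: return None where it is undefined."""
    return table.get(codepoint)

print(map_codepoint(CP437_TO_ISO10646, 0x82))  # 233, that is 0xE9
print(map_codepoint(CP437_TO_ISO10646, 0x00))  # None: not in this fragment
```

Returning None for a missing entry is one way to model partiality; raising an exception would serve equally well.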

An encoding is a relation from codepoint sequences to sequences of bytes (called octets in ISO 10646 parlance); the inverse relation is a decoding. Sequences of codepoints and sequences of octets are called strings. The finite strings of each kind, including the null string, form a category under concatenation, and also form a category under lexicographical ordering.

When the encoding relation is a functor on concatenations, that is, it maps the empty string to the empty string, and maps concatenations of codepoints to corresponding concatenations of octets, then the encoding can be uniquely represented by a mapping from codepoints to octet strings.
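The property above can be sketched in Python as follows. The make_encoder helper and the two-entry table are assumptions made for illustration, not interscript's own code: the string encoder is built entirely from a mapping of codepoints to octet strings, so it preserves the empty string and concatenation by construction.

```python
def make_encoder(table):
    """Build a string encoder from a codepoint -> octet string mapping."""
    def encode(codepoints):
        # The image of a string is the concatenation of the images
        # of its individual codepoints.
        return b"".join(table[c] for c in codepoints)
    return encode

# Hypothetical table: one octet for 0x41, two octets for 0xE9.
table = {0x41: b"\x41", 0xE9: b"\xc3\xa9"}
encode = make_encoder(table)

s, t = [0x41, 0xE9], [0xE9]
assert encode([]) == b""                       # empty maps to empty
assert encode(s + t) == encode(s) + encode(t)  # concatenation is preserved
```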

Interscript provides several popular encoding schemes.

UCS-4 and UCS-4le are invertible 4 octet encodings of 32 bit integers, representing the integers in high to low and low to high octet order respectively. UCS-4 is preferred, because it is also order preserving.
UTF-8 encodes integers of up to 31 bits into strings of 1 to 6 octets. The encoding is order preserving and monic.
DBCS (double byte) encodings map some set of integers to either one or two octets.
A 7 bit encoding maps 7 bit integers to a single octet; ASCII is the most popular such encoding.
An 8 bit encoding maps 8 bit integers to a single octet; ISO 8859 and most Microsoft and IBM code pages fall into this category.
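The following is a rough sketch of the UCS-4, UCS-4le and UTF-8 schemes described above, written with the standard struct module. The function names are hypothetical and do not reflect interscript's actual interface; the UTF-8 routine follows the 1 to 6 octet form for integers of up to 31 bits.

```python
import struct

def ucs4(n):
    """Four octets, high order octet first."""
    return struct.pack(">I", n)

def ucs4le(n):
    """Four octets, low order octet first."""
    return struct.pack("<I", n)

# (sequence length, exclusive upper limit, lead octet marker)
_UTF8_FORMS = [
    (2, 0x800, 0xC0), (3, 0x10000, 0xE0), (4, 0x200000, 0xF0),
    (5, 0x4000000, 0xF8), (6, 0x80000000, 0xFC),
]

def utf8(n):
    """Encode an integer of up to 31 bits into 1 to 6 octets."""
    if not 0 <= n < 0x80000000:
        raise ValueError("codepoint out of range for UTF-8")
    if n < 0x80:
        return bytes([n])
    for length, limit, lead in _UTF8_FORMS:
        if n < limit:
            break
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (n & 0x3F))  # continuation octet, low 6 bits
        n >>= 6
    return bytes([lead | n] + tail[::-1])

# UCS-4 and UTF-8 preserve order under lexicographical comparison;
# UCS-4le does not.
assert ucs4(1) < ucs4(256)
assert ucs4le(1) > ucs4le(256)
assert utf8(0x7F) < utf8(0x80) < utf8(0x7FF) < utf8(0x800)
```

Comparing the octet strings with < uses Python's lexicographical ordering on bytes, which is the ordering the text refers to.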

6.3.1. Representation