6.3. Character set encodings

This subsystem deals with how interscript handles character sets. We define a character set as a subset of the integers, whose members are called codepoints; in interscript, each codepoint is represented directly by the corresponding Python integer.
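As a minimal sketch of this representation, a character set can be modelled as a Python set of integers. The example uses only the Python built-ins ord and chr, not any interscript API, and the name ascii_charset is chosen here purely for illustration.

```python
# A character set modelled as a set of integer codepoints.
# ASCII is used only as an illustration.
ascii_charset = set(range(128))

codepoint = ord("A")           # built-in: character -> integer codepoint
assert codepoint == 65
assert codepoint in ascii_charset
assert chr(codepoint) == "A"   # built-in: integer codepoint -> character

# A codepoint outside the set is simply not a member.
assert ord("\u00e9") not in ascii_charset
```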

A font can be thought of as a relation from codepoints to glyphs. If glyphs can be said to have shape, then where two fonts map a codepoint to two glyphs, the glyphs generally have the same shape.

A codepoint mapping is a partial function between character sets designed to reflect a common notion of character, as might be reflected in the shape of glyphs from appropriate fonts.
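One way to picture a codepoint mapping is as a Python dictionary, which is naturally partial: codepoints with no counterpart are simply absent. The fragment below is a hand-written, hypothetical excerpt of a Windows-1252 to Unicode mapping. The names cp1252_to_unicode and map_codepoint are assumptions for illustration and are not taken from interscript.

```python
# Hypothetical, hand-written fragment of a mapping from Windows-1252
# codepoints to Unicode codepoints. A dict models a partial function:
# codepoints with no image are simply not keys.
cp1252_to_unicode = {
    0x41: 0x0041,  # LATIN CAPITAL LETTER A
    0x80: 0x20AC,  # EURO SIGN
    0xE9: 0x00E9,  # LATIN SMALL LETTER E WITH ACUTE
}

def map_codepoint(mapping, codepoint):
    """Return the image of codepoint, or None where the mapping is undefined."""
    return mapping.get(codepoint)

assert map_codepoint(cp1252_to_unicode, 0x80) == 0x20AC
assert map_codepoint(cp1252_to_unicode, 0x81) is None  # 0x81 is unassigned
```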

An encoding is a relation from codepoint sequences to sequences of bytes (called octets in ISO 10646 parlance); the inverse relation is a decoding. Sequences of codepoints and sequences of octets are both called strings. The finite strings of each kind, including the empty string, form a category under concatenation, and also form a category under lexicographical ordering.

When the encoding relation is a functor on concatenations, that is, when it maps the empty string to the empty string and maps a concatenation of codepoint strings to the concatenation of the corresponding octet strings, the encoding can be uniquely represented by a mapping from individual codepoints to octet strings.
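The following sketch illustrates this property. It assumes UTF-8 as the example encoding and uses Python's built-in str.encode for the per-codepoint mapping. The function names encode_codepoint and encode are hypothetical and do not describe how interscript itself implements encodings.

```python
def encode_codepoint(codepoint):
    """Map one codepoint to its octet string (UTF-8 chosen as an example)."""
    return chr(codepoint).encode("utf-8")

def encode(codepoints):
    """Extend the per-codepoint mapping to whole codepoint strings."""
    return b"".join(encode_codepoint(c) for c in codepoints)

a = [0x48, 0x69]   # the codepoints of "Hi"
b = [0x20AC]       # EURO SIGN

assert encode([]) == b""                       # empty maps to empty
assert encode(a + b) == encode(a) + encode(b)  # concatenation is preserved
assert encode(b) == b"\xe2\x82\xac"
```

Because the whole encoding is recovered from encode_codepoint alone, such an encoding can be stored as a table keyed by codepoint.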

Interscript provides several popular encoding schemes.


6.3.1. Representation