September 22, 2012
Plan for Unicode support
Red is growing up fast, even if just born two weeks ago! It is time we implement basic string support so we can do our first, real, hello-word. ;-)
Red strings will natively support Unicode. In order to achieve that in an efficient and cross-platform way, we need a good plan. Here is the list of Unicode native formats used by our main target platforms API:
Windows : UTF-16
Linux : UTF-8
MacOSX/Cocoa : UTF-16
MacOSX/Darwin : UTF-8
Java : UTF-16
.Net : UTF-16
Syllable : UTF-8
All these formats are variable-width encodings, requiring any indexed access to pay the cost of walking through the string.
Fortunately, there are also fixed-width Unicode encodings that can be used to give us back constant time for indexed accesses. So, in order to make it the most space-efficient, Red strings will internally support only these encoding formats:
Latin-1 (1 byte/codepoint)
UCS-2 (2 bytes/codepoint)
UCS-4 (4 bytes/codepoint)
This is not something new, at least Python 3.3 does it in the same way.
Additionally, UTF-8 and UTF-16 codecs will be supported, in order to deal with I/O accesses on host platforms.
Red will use UTF-8 for exchanging strings with outer world by default, except when accessing a UTF-16 API is necessary. Conversion for input and output strings will be done on-the-fly between one of the internal representation and UTF-8/UTF-16. When reading an input string, Red will select the most space-efficient internal format depending on highest codepoint in the input string. Also users should be able to force the encoding of a string to a given internal format, when possible.
So far, this is the plan for additing Unicode to Red, a prototype implementation will be done quickly, so we can fine-tune it if required.
Comments and suggestions are welcome.