September 22, 2012

Plan for Unicode support


Red is growing up fast, even though it was born just two weeks ago! It is time to implement basic string support so we can do our first real hello-world. ;-)

Red strings will natively support Unicode. To achieve that in an efficient and cross-platform way, we need a good plan. Here are the native Unicode formats used by the APIs of our main target platforms:

        Windows       : UTF-16
        Linux         : UTF-8
        MacOSX/Cocoa  : UTF-16
        MacOSX/Darwin : UTF-8
        Java          : UTF-16
        .Net          : UTF-16
        JavaScript    : UTF-16
        Syllable      : UTF-8
   
All these formats are variable-width encodings, requiring any indexed access to pay the cost of walking through the string.
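To make that cost concrete, here is a minimal Python sketch (not Red code) of indexed access on a UTF-8 buffer. The helper names `utf8_seq_len` and `utf8_codepoint_at` are made up for this illustration, and the input is assumed to be well-formed UTF-8. Reaching code point number `index` means stepping over every sequence before it, so the work grows with the index.

```python
def utf8_seq_len(data: bytes, pos: int) -> int:
    """Length in bytes of the UTF-8 sequence starting at pos (hypothetical helper)."""
    if pos >= len(data):
        raise IndexError("code point index out of range")
    lead = data[pos]
    if lead < 0x80:            # 0xxxxxxx: ASCII, 1 byte
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: 2-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: 3-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: 4-byte sequence
        return 4
    raise ValueError("invalid UTF-8 lead byte")


def utf8_codepoint_at(data: bytes, index: int) -> int:
    """Return the code point at position index by walking from the start."""
    if index < 0:
        raise IndexError("negative index not supported in this sketch")
    pos = 0
    for _ in range(index):     # skip `index` whole sequences: linear in index
        pos += utf8_seq_len(data, pos)
    length = utf8_seq_len(data, pos)
    return ord(data[pos:pos + length].decode("utf-8"))
```

With a fixed-width buffer the same lookup is a single multiplication (`index * width`) followed by one read.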

Fortunately, there are also fixed-width Unicode encodings that give us back constant-time indexed access. So, to be as space-efficient as possible, Red strings will internally support only these encoding formats:

        Latin-1 (1 byte/codepoint)
        UCS-2   (2 bytes/codepoint)
        UCS-4   (4 bytes/codepoint)

This is nothing new: Python 3.3, at least, does it the same way.
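The Python behaviour can be observed from the outside. The sketch below is specific to CPython 3.3 or later and relies on `sys.getsizeof`, whose results are an implementation detail; the helper name `bytes_per_char` is made up for this illustration. It compares two strings built from the same character and divides the size difference by the length difference, which cancels out the fixed object header.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage for strings made only of ch (CPython-specific)."""
    base = ch * 10            # short string of the same kind
    longer = ch * (10 + n)    # same kind, n characters longer
    return (sys.getsizeof(longer) - sys.getsizeof(base)) // n


for ch in ("a", "\u00e9", "\u03a7", "\U0001F600"):
    print("U+%04X" % ord(ch), bytes_per_char(ch))
```

On CPython this should report 1 byte per character for the first two (both fit in Latin-1), 2 for the Greek letter, and 4 for the code point above U+FFFF, matching the three widths listed above.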

Additionally, UTF-8 and UTF-16 codecs will be supported, in order to deal with I/O accesses on host platforms.

Red will use UTF-8 by default for exchanging strings with the outer world, except when accessing a UTF-16 API is necessary. Input and output strings will be converted on the fly between one of the internal representations and UTF-8/UTF-16. When reading an input string, Red will select the most space-efficient internal format based on the highest codepoint in that string. Users should also be able to force a string into a given internal format, when possible.
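The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Red's implementation: the function names `pick_internal_format` and `load_utf8` are invented here, and Python's `latin-1`, `utf-16-le` and `utf-32-le` codecs stand in for the Latin-1, UCS-2 and UCS-4 buffers (for text limited to the BMP, UTF-16 and UCS-2 produce the same bytes).

```python
def pick_internal_format(text: str):
    """Return (format name, bytes per code point) for the narrowest format that fits."""
    highest = max(map(ord, text), default=0)   # an empty string fits in Latin-1
    if highest <= 0xFF:
        return "Latin-1", 1
    if highest <= 0xFFFF:
        return "UCS-2", 2
    return "UCS-4", 4


# Stand-in codecs for the three fixed-width buffers (an assumption of this sketch).
_CODECS = {"Latin-1": "latin-1", "UCS-2": "utf-16-le", "UCS-4": "utf-32-le"}


def load_utf8(raw: bytes):
    """Decode UTF-8 input and re-encode it in the narrowest fixed-width format."""
    text = raw.decode("utf-8")
    name, _width = pick_internal_format(text)
    return name, text.encode(_CODECS[name])


name, buf = load_utf8("Red \u03a7".encode("utf-8"))
print(name, len(buf))   # one Greek letter is enough to widen the whole string
```

A single codepoint above the current range widens the whole string, which is the price paid for constant-time indexing.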

So far, this is the plan for adding Unicode to Red. A prototype implementation will be done quickly, so we can fine-tune it if required.

Comments and suggestions are welcome.

10 comments:

  1. A true lightning start. And very wise not to invent the wheel on fire again.

  2. Have you decided how to handle keyboard input? Will Red expect all keyboard input to be either UTF-8 or UTF-16 or will it handle other encodings (e.g. Windows Codepages)?

  3. That will depend on what encoding for the pressed key the host system gives us (codepoint, UTF-8, UTF-16, ...).

  4. To show unicode support use Greek:
    Χαῖρε κόσμε!
    That is "Hello world!" in Greek (polytonic)

  5. I'm already using Greek in my early Unicode tests!

    greek: "αβγδεζηθικλμνξοπρςστυφχψω"

    I will add your Greek version of HelloWorld to the Chinese one for the final HelloWorld script, thanks! ;-)

  6. nice! :)
    Greek is my native language, but I just did some research to settle a few questions, especially about the comma. I found the phrase even in contexts with no connection to computer programming (for example here: http://agonasax.blogspot.gr/2012_06_01_archive.html )
    About the comma: although I found it written both with and without one, I concluded that it is better to use it. What made up my mind was mostly the Akathist, where it is used (Χαῖρε, νύμφη ἀνύμφευτε)

    So
    Χαῖρε, κόσμε!
    might be better.

    ;-)

  7. I guess you will like this one: http://static.red-lang.org/hello_unicode.png ;-)

  8. When you refer to Latin-1 are you referring specifically to ISO-8859-1 ?

  9. Yes, of course I like it. :)
    The good thing about polytonic in this case is that it "proves" Unicode is in use. Monotonic can also be ISO-8859-7 or windows-1253, but polytonic only lives in unicodeland.

    A funny coincidence: both the English and the Greek "hello, world" have the same number of letters.

    In Python3 you can name variables with non-ASCII characters! In most cases this would be very bad practice, as it may cause confusion. (I have never seen a program like this and nobody mentions it, but you can try it and see.) One case where it may not be bad is when someone is doing Mathematica-like computations rather than programming. I wonder if this is possible in Red too.

    the hello_unicode.png says 2011 and not 2012, why?

  10. "When you refer to Latin-1 are you referring specifically to ISO-8859-1?" Yes.

    "In Python3 you can name variables in non ascii characters!" You can do the same in Red, forgot to include that in the screenshot...next time. ;-) I agree it's a bad practice to use non-ascii for variable/function names. In case of Red and REBOL-like languages, it's an intrinsic feature. The lexer could be restricting it, but at runtime, you would be able to workaround it anyway.

    "the hello_unicode.png says 2011 and not 2012, why?" I have copy/pasted the standard header I use from other files in the github repository. I keep it as is, because I need to update dozens of file headers to set them to 2012 and add other info using a batch script But there's always more important things to do, so, in the meantime, I try to keep file headers the same in order to make the batch processing easier.

