August 4, 2021

Long Time No Blog

It's been almost a year since our last blog post. Sorry about that. Blogging is one of those things that falls off our radar without a person dedicated to it, and we run lean, so we don't have anyone filling that role right now. We know it's important, even if we have many other channels where people can get information. So here we are.

Last year was a tough year all around, even for us. We were already a remote-only team, but the effect the pandemic had on the world, particularly travel, hit us too. We had some team changes, and also split our focus into product development alongside core Red Language development. This is necessary for sustainability, because people don't pay for programming languages, and they don't pay for Open Source software. There's no need to comment on the exceptions to these cases, because they are exceptions. The commercial goal, starting out, is to focus on our core strengths and knowledge, building developer-centric tools. Our first product, DiaGrammar for Windows, was released in December 2020, and we've issued a number of updates to it since then. Our thanks to Toomasv for his ingenuity and dedication in creating DiaGrammar. We are a team, but he really accepted ownership of the project and took it from an idea to a great product. Truly, there is nothing else like it on the market. 

We learned a lot from the process of creating a product, and will apply that experience moving forward. An important lesson is that the product itself is only half the work. As technologists, we're used to writing the code and maybe writing some docs to go with it. We don't think about outreach, marketing, payments, support, upgrade processes for users, web site issues, announcements, and more. The first time you do something is the hardest, and we're excited to improve and learn more as we update DiaGrammar and work on our next product. We'll probably announce what it will be in Q4. One thing we can say right now is that the work on DiaGrammar led to a huge amount of work on a more general diagramming subsystem for Red. It's really exciting, and we'll talk more about that in a future blog post.

So what have we been doing?

Since our last blog post we've logged over 400 fixes and 100 features in Red itself. Some of these are small but important; others are headline-worthy. Some are deep voodoo, and some are visible to every Reducer (what we call Red users). For example, most people use the console (the REPL), so the fixes and improvements there are easy to see. A prime example: the GUI console (but not the CLI console) didn't show output if the UI couldn't process events, which could happen if you printed from a tight loop. The results would only show up when the loop terminated and the system could breathe again. That's been addressed, but it wasn't easy and still isn't perfect. Red is still single threaded, so there's no separate UI thread (pros and cons there). We make these tradeoffs every day, and need feedback from users and real world scenarios to help find the right balance. Less obvious are things like improvements to parse, which not everyone uses; or how fmod works across platforms, and edge cases for lexical forms (e.g. is -1.#NaN valid?). The latter is particularly important, because Red is a data language first.

JSON is widely used, but people may not notice that the decoder is 20x faster now unless they're dealing with extremely large JSON datasets. Still, JSON is so common that we felt the time spent, and the tradeoffs made, were worth it. It also nicely shows one of Red's strengths. Profiling showed that the codec spent a lot of time in its unescape function. @hiiamboris rewrote that as a Red/System routine, tweaked it, and got a massive speedup. No external compiler needed, no need to use C, and the code is inlined so it's all in context. Should your JSON be malformed, you'll also get nicer error information now. As always, Red gives you options: use high level Red as much as possible, for the most concise and flexible code, but drop into Red/System when it makes sense.
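
If you haven't seen that pattern before, here is a minimal, hypothetical sketch of the high-level/low-level split (not the actual codec change): the body of a routine! is Red/System, compiled to native code, yet it is called like any Red function. Note that routines require the compiler, so this won't run in the pure interpreter:

    Red []

    ;-- Hypothetical hot loop factored into a routine. The body below is
    ;-- Red/System, so the loop runs as native code.
    sum-to: routine [
        n       [integer!]
        return: [integer!]
        /local acc [integer!] i [integer!]
    ][
        acc: 0
        i: 1
        while [i <= n][acc: acc + i  i: i + 1]
        acc
    ]

    print sum-to 1000        ;-- 500500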

Some features cross the boundary of what's visible. A huge amount of work went into D2D support on Windows. D2D is Direct2D, the hardware-accelerated backend for vector graphics on Windows. For users, nothing should change, as all the details are hidden. But the rendering behavior is not exactly the same. We try to work around that, but sometimes users have to make adjustments as well; we know because DiaGrammar is written in Red and uses the draw dialect heavily. It's an important step forward, but it comes at a cost: GDI+ is now a legacy graphics backend and won't see regular updates. Time marches on, and we need to look forward. As if @qtxie wasn't busy enough with that, he and @dockimbel also pushed Full I/O forward in a big way. It hasn't been merged into the main branch yet, but we expect that to happen soon. @rebolek has been banging on it, and has a full working HTTP protocol ready to go, which is great. TLS/SSL support gets an A+ rating, which is also a testament to the design and implementation. It's important to note that the new I/O system is a low level interface; the higher level API is still being designed. At the highest level, these details will all be hidden from users. You'll continue to use read, write, save, and load exactly as you do today, unless you need async I/O.

Another big "feature" came from @vazub: NetBSD support. The core team has to focus on what stands to help the project overall, with regard to users and visibility. Community support for lesser known platforms is key. If you're on one of those platforms, be (or find) a champion. We'll help all we can, but that's what Open Source is for. Thanks for this contribution @vazub!

We also have some new Python primers up, thanks to @GalenIvanov. Start at Coming-to-Red-from-Python. Information like this is enormously important. Red is quite different from other languages, and learning any new language can be hard. We're used to a given set of functionality and behaviors, which sometimes makes the syntax the easiest part to learn. Just knowing what things are called is a learning curve. Red doesn't use the same names, because we (and Carl when he designed Rebol) took a more holistic view. That's a hard sell though. We feel the pain. A user who found Red posted a video as they tried to do some basic things. We learned a lot from watching it. Where other languages require you to import a networking library, it's already built into Red. When they were looking for request or http.get, and expecting strings to be used for URLs, they couldn't find answers. In Red you just read http://.... It's obvious to us, but not to the rest of the world. So these new primers are very exciting. We have reference docs, and Red by Example, but still haven't written a User's Guide for Red. We'll get there though.
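
For example, a trivial session (any reachable URL works the same way):

    data: read http://www.red-lang.org      ;-- url! is a datatype; no imports needed
    write %page.html data                   ;-- file! is a datatype too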

Why do things take so long?

Even with that many fixes and features logged, and huge amounts of R&D, it can still feel like progress is slow. The world moves fast, and software projects are often judged by their velocity. We even judge ourselves that way, and have to be reminded to stay the course, our course, rather than imitating others. Red's flexibility also comes into play. Where other languages may limit how you can express solutions, we don't. It's so flexible that people can do crazy things or perform advanced tricks which end up being logged as bugs and wishes. Sometimes we say No (a lot of times in fact), but we also try to keep an open mind. We have to ask "Should that be allowed?", "Why would you want to do that (even though I never have)?", and "What are the long term consequences?" We have to acknowledge that Red is a data format first, and we never want to break that. It has to evolve, but not breaking the format is fundamental. And while code is expected to change, once people depend on a function or library it causes them pain if we break compatibility. We don't want to do that, though sometimes we will for the greater good and the long view. There are technical bandages we can patch over things, but it's a big issue that doesn't have a single solution. Not just for us, but for all software development. We'll talk more about this in the future as well.

I'll note some internal projects related to our "slow and steady" process:

  • Composite is a simple function that does for strings what compose does for blocks. It's a basic interpolator. But the design has taken many turns: not just the possible notations, but whether it should be a mezzanine function, a macro, or both. Each has pros and cons (side note: we don't often think about "cons" being an abbreviation for "consequences"). This simple design discussion is stalled again, because another option would be a new literal form for interpolated strings. That's what other languages do, but is it a good fit for Red? We belabor the point of how tight lexical space is already, so we have to weigh that against the value of a concise notation. (A rough sketch of the interpolation idea appears after this list.)
  • Non-native GUI. Red's native GUI system was chosen in response to Rebol's choice to go non-native. Unfortunately it's another case of needing both. Being cross platform is great for Red users, but Hell for us. Throw in mobile and it's even worse. Don't even talk about running in the browser. But every platform has native widget limitations. Once you move beyond static text, editable fields, buttons, and simple lists, you're in the realm of "never the twain shall meet". How do you define and interact with grids and tables or collapsible trees? Red already has its own rich-text widget, so you don't have to embed (even if you could) an entire web browser and then write in HTML and CSS. To address all this, with much research and extensive use case outlines, @hiiamboris has spent a lot of time and effort on Red Spaces. Show me native widgets that can do editable spiral text, put any layout inside a rotator, or define recursive UIs. I didn't think so. Oh, and the wiggling you see in the GIFs there isn't a mistake or artifact; those are tests to show that any piece of the UI can be animated.
  • Other projects include format, split, HOFs, and modules, each with a great deal of design work and thought put into them. As an example, look at Boris' HOF analysis. They are large and important pieces, based on historical and contemporary research, but not something we will just drop into Red, though we could. A simple map function is a no-brainer, and could have been there day one. But that's not how we work. It's not a contest to see how many features we can add, or how fast; but how we can move software forward, make things easier, and push the state of the art. Not just in technical features (the engineering part), but in the design of a language and its ecosystem.
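
To make the composite idea concrete, here is a rough interpolation sketch written for this post (a toy, not the design under discussion; it ignores nested parens and error handling):

    composite*: func [
        "Evaluate (expression) groups embedded in a string template"
        tmpl [string!]
        /local out s e c
    ][
        out: make string! length? tmpl
        parse tmpl [
            any [
                s: #"(" thru #")" e:
                (append out form do copy/part next s back e)
                | set c skip (append out c)
            ]
        ]
        out
    ]

    probe composite* "1 + 2 = (1 + 2)"      ;== "1 + 2 = 3"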

Not Everyone Has These Problems

An important aspect of Red is being self-contained. We talk about this a lot. Yes, we're considering LLVM as a target, but that has big costs, not just benefits. Using our own compiler for everything also has costs, like slowing the move to 64-bit, which is an issue for Mac users now. Workarounds like VMs and Docker containers are just that: workarounds. We want things to be easy for you, but that doesn't mean they're easy for us. Here's an example.

Boris found a bug related to printing time values in Red/System. @dockimbel finally tracked it down, and posted this investigative report:

@hiiamboris It was a (R/S) compiler issue after all. ;-) size? a was the guilty part. The compiler was wrongly generating code for loading a even though size? is statically evaluated by the compiler and replaced by a static integer value. Given that a was a float type, its value was pushed onto the x87 FPU stack, but never popped. That stack has a 7-slot limit. Running the loop 5 times was enough to leave only 2 slots free. When the big float expression is encountered in the dtoa library, it requires 3 free slots on the FPU stack, which fails and results in producing a NaN value, which wreaks havoc in the rest of the code.

The fix in the compiler was trivial (fetch-expression/final vs fetch-expression), but getting to that point was not. Understanding machine architectures at the lowest levels isn't for everyone, but even though our compiler code will be rewritten in the future, it's small and maintainable today. If we relied on GCC, Clang, or other compilers, hitting a bug could mean hitting a wall. So while there are costs to using our own compiler, there are also costs to depending on others. Robert Heinlein popularized TANSTAAFL, but the concept is not science fiction. As a side note: just as we moved from GDI+ to D2D, x87 was an early choice for float support, meant to cover older platforms, and we plan to switch to SSE.
If compilers are your thing, or you like system level programming, join our community and get to know Red/System. See how our toolchain works, and consider joining us.

The Big Picture

I just read 101 Things I Learned in Architecture School (which I heard about via Kevlin Henney, though it may not have been from that specific talk), and what struck me most about it is how we've commandeered the word "architecture" for software but completely removed the human aspect. An architect does so much more than we do. Software architects are really structural engineers. If a single developer builds a complete app, they have to do the UI. They engineer a building and then slap on whatever sheathing is at hand, cutting doors and windows without concern for their location. And the app is viewed in isolation, as if it's the only thing a user has on their system, without consideration for its site, context, or relationships. What makes real architecture hard (and why the author notes that many architects hit their stride when they are older) is that you have to know so much. So many considerations, disciplines, and constraints are involved, and you have to unify them. It's both creative and scientific. What makes great architecture great is that it makes your experience better. Maybe even wonderful. If we only think about the mechanical aspects, our software will never be beautiful.

We haven't articulated this view of what we do, I think because we didn't realize it. At least I didn't. We talk about the whole being greater than the sum of its parts, and about not just making everything libraries, so it feels more natural and less mechanical; how a REPL and a single exe make it easier to get started, and how not having to use many tools is better. But we haven't explicitly said "Here's how it's laid out, and why. Here's how it's put together; these are the critical elements. Here's what it looks like from a distance, and when you enter its space." Implicitly we do that every day, through the work, but we don't talk about it. Or only once a year.

August 20, 2020

Red/System: New Features

In the past months, many new features were added to Red/System, the low-level dialect embedded in Red. Here is a summary, in case you missed them.

Subroutines

During the work on the low-level parts of the new Red lexer, the need arose for intra-function factoring abilities, to keep the lexer code as DRY as possible. Subroutines were introduced to solve that. They act like the GOSUB directive from BASIC. They are defined as a separate block of code inside a function's body and are called like regular functions (but without any arguments). So they are much lighter and faster than real function calls, and require just one slot of stack space to store the return address.

The declaration syntax is straightforward:

    <name>: [<body>]

    <name> : subroutine's name (local variable).
    <body> : subroutine's code (regular R/S code).

To define a subroutine, you declare a local variable with the subroutine! datatype, then set that variable to a block of code. You can then invoke the subroutine by calling its name from anywhere in the function body (but after the subroutine's own definition).

Here is a first example: a fictitious function processing I/O events:

    process: func [buf [byte-ptr!] event [integer!] return: [integer!]
        /local
            log do-error [subroutine!]
            e            [c-string!]
    ][
        log:      [print-line [">>" tab e "<<"]]
        do-error: [print-line ["** Error:" e] return 1]

        switch event [
            EVT_OPEN  [e: "OPEN"  log unless connect buf [do-error]]
            EVT_READ  [e: "READ"  log unless receive buf [do-error]]
            EVT_WRITE [e: "WRITE" log unless send buf [do-error]]
            EVT_CLOSE [e: "CLOSE" log unless close buf [do-error]]
            default   [e: "<unknown>" do-error]
        ]
        0
    ]

This second example is more complete. It shows how subroutines can be combined and how values can be returned from a subroutine:

    #enum modes! [
    	CONV_UPPER
    	CONV_LOWER
    	CONV_INVERT
    ]

    convert: func [mode [modes!] text [c-string!] return: [c-string!]
        /local
            lower? upper? alpha? do-conv [subroutine!]
            delta [integer!]
            s     [c-string!]
            c     [byte!]
    ][
        lower?:  [all [#"a" <= c c <= #"z"]]
        upper?:  [all [#"A" <= c c <= #"Z"]]
        alpha?:  [any [lower? upper?]]
        do-conv: [s/1: s/1 + delta]
        delta:   0
        s:       text

        while [s/1 <> null-byte][
            c: s/1
            if alpha? [
                switch mode [
                    CONV_UPPER  [if lower? [delta: -32 do-conv]]
                    CONV_LOWER  [if upper? [delta: 32 do-conv]]
                    CONV_INVERT [delta: either upper? [32][-32] do-conv]
                    default     [assert false]
                ]
            ]
            s: s + 1
        ]
        text
    ]
    
    probe convert CONV_UPPER "Hello World!"
    probe convert CONV_LOWER "There ARE 123 Dogs."
    probe convert CONV_INVERT "This SHOULD be INVERTED!"

will output:

    HELLO WORLD!
    there are 123 dogs.
    tHIS should BE inverted!

Support for getting a subroutine's address and dispatching dynamically on it is planned for the future (something akin to a computed GOTO). More examples of subroutines can be found in the new lexer code, such as in the load-date function.

New system intrinsics


Several new extensions to the system path have been added.

Lock-free atomic intrinsics


A simple low-level OS threads wrapper API has been added internally to the Red runtime, as preliminary work toward supporting parts of I/O concurrency and parallel processing in the future. To complement it, a set of atomic intrinsics was added, enabling the implementation of lock-free and wait-free algorithms in a multithreaded execution context.

The new atomic intrinsics are all documented here. Here is a quick overview:
  • system/atomic/fence: generates a read/write data memory barrier.
  • system/atomic/load: thread-safe atomic read from a given memory location.
  • system/atomic/store: thread-safe atomic write to a given memory location.
  • system/atomic/cas: thread-safe atomic compare&swap to a given memory location.
  • system/atomic/<math-op>: thread-safe atomic math or bitwise operation to a given memory location (add, sub, or, xor, and).
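
As a sketch of how these intrinsics compose, here is a minimal spinlock, assuming (per the reference docs) the argument order (pointer, old value, new value) for cas and a logic! result indicating whether the swap succeeded; lock points to an integer! slot where 0 means free and 1 means taken:

    acquire: func [lock [int-ptr!]][
        until [system/atomic/cas lock 0 1]      ;-- spin until the 0 -> 1 swap succeeds
    ]

    release: func [lock [int-ptr!]][
        system/atomic/store lock 0              ;-- atomic write back to "free"
    ]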

Other new intrinsics
  • system/stack/allocate/zero: allocates a storage space on stack and zero-fill it.
  • system/stack/push-all: saves all registers to stack.
  • system/stack/pop-all: restores all registers from stack.
  • system/fpu/status: retrieves the FPU exception bits status as a 32-bit integer.

Improved literal arrays

The main change is the removal of the hidden size stored in the /0 index slot. The size of a literal array can now only be retrieved using the size? keyword, which is resolved at compile time (whereas the /0 index access was resolved at run-time).

A notable addition is support for binary arrays. Those arrays can be used to store byte-oriented tables or to embed arbitrary binary data into the source code. For example:

    table: #{0042FA0100CAFE00AA}
    probe size? table                      ;-- outputs 9
    probe table/2                          ;-- outputs "B"
    probe as integer! table/2              ;-- outputs 66

The new Red lexer code uses them extensively.

Variables and arguments grouping

It is now possible to group the type declaration for local variables and function arguments. For example:

    foo: func [
        src dst    [byte-ptr!]
        mode delta [integer!]
        return:    [integer!]
        /local
            p q buf  [byte-ptr!]
            s1 s2 s3 [c-string!]
    ]
 

Note that the compiler supports those features through code expansion at compile time, so error reports can show each argument or variable with its own type declaration.

Integer division handling

Low-level integer division has notorious shortcomings, each edge case being handled differently depending on the hardware platform. The Intel IA-32 architecture tends to handle those cases in a slightly safer way, while the ARM architecture silently produces erroneous results, typically for the following two cases:

  • division by zero
  • division overflow (-2147483648 / -1)

An IA-32 CPU will generate an exception, while ARM CPUs return invalid results (respectively 0 and -2147483648). This makes it difficult to produce code that behaves the same on both architectures when integer divisions are used. In order to reduce this gap, the R/S compiler now generates extra code to detect those cases on ARM targets and raise a runtime exception. These extra checks for ARM are produced only in debug compilation mode. In release mode, priority is given to performance, so no runtime exception will occur in such cases on ARM (as the overhead is significant). So be sure to check your code thoroughly on ARM platforms in debug mode before releasing it. This is not a perfect solution, but at least it makes it possible to detect those cases through testing in debug mode.
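
If you need the same guarantees in release mode, an explicit guard in user code behaves identically on both architectures. A minimal sketch (error reporting is up to you; 80000000h is the hexadecimal form of -2147483648):

    safe-div: func [a [integer!] b [integer!] return: [integer!]][
        if b = 0 [print-line "** division by zero" return 0]
        if all [b = -1 a = 80000000h][
            print-line "** division overflow"
            return 0
        ]
        a / b
    ]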

Others

Here is a list of other changes and fixes in no particular order:

  • Cross-referenced aliased fields in structs defined in the same context are now allowed. Example:

        a!: alias struct! [next [b!] prev [b!]]
        b!: alias struct! [a [a!] c [integer!]]
  • -0.0 special float literal is now supported.
  • +1.#INF is now also supported as a valid literal, in addition to 1.#INF, for positive infinity.
  • Context-aware get-words resolution.
  • New #inline directive to inline assembled binary code.
  • Dropped support for the % and // operators on float types: they relied on FPU features that vary across platforms, so the results were not reliable. Use the fmod function instead from now on (see the short example after this list).
  • Added --show-func-map compilation option: when used, it will output a map of R/S function addresses/names, to ease low-level debugging.
  • FIX: issue #4102: ASSERT false doesn't work.
  • FIX: issue #4038: cast integer to float32 in math expression gives wrong result.
  • FIX: byte! to integer! conversion not happening in some cases. Example: i: as-integer (p/1 - #"0")
  • FIX: compiler state not fully cleaned up after premature termination. This affects multiple compilation jobs done in the same Rebol2 session, resulting in weird compilation errors.
  • FIX: issue #4414: round-trip pointer casting returns an incorrect result in some cases.
  • FIX: literal arrays containing true/false words could corrupt the array. Example: a: ["hello" true "world" false]
  • FIX: improved error report on bad declare argument.
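
For example, where you previously wrote a float modulo with //, you would now write:

    print-line fmod 10.5 3.0        ;-- outputs 1.5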

August 3, 2020

A New Fast and Flexible Lexer

A programming language lexer is the part in charge of converting textual code representation into a structured memory representation. In Red, that is accomplished by the load function, which calls the lower-level transcode native. Until now, Red relied on a lexer written entirely in the Parse dialect. However, the parsing rules were constructed to be easily maintained, not for performance. Rewriting those rules for speed would have been possible, but rewriting the lexer entirely in Red/System would give the ultimate performance. It might not matter for most user scripts, but given that Red is also a data format, we need a solution for fast (near-instant) loading of huge quantities of Red values stored in files or transferred through the network.
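
To situate the pieces, an illustrative console session:

    >> load "1 + 2 [a b]"
    == [1 + 2 [a b]]
    >> transcode to-binary "1 + 2"
    == [1 + 2]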

The new lexer main features are:
  • High performance, typically 50 to 200 times faster than the older one.
  • New scanning features: identify values and their datatypes without loading them.
  • Instrumentation: customize the lexer's behavior at will using an event-oriented API.

The reference documentation is available there. This new lexer is available in Red's auto-builds since June.

Performance


Vastly increased performance is the main driver for this new lexer. Here is a little benchmark to let you appreciate just how fast it is.

The benchmarking tasks are:
  • 100 x compiler.r: loads 100 times compiler.r source file from memory (~126KB, so about ~12MB in total).
  • 1M short integers: loads a string of 1 million `1` separated by a space.
  • 1M long integers: loads a string of 1 million `123456789` separated by a space.
  • 1M dates: loads a string of 1 million `26/12/2019/10:18:25` separated by a space.
  • 1M characters: loads a string of 1 million `#"A"` separated by a space.
  • 1M escaped characters: loads a string of 1 million `#"^(1234)"` separated by a space.
  • 1M words: loads a string of 1 million `random "abcdefghijk"` separated by a space.
  • 100K words: loads a string of 100 thousand `random "abcdefghijk"` separated by a space.

And the results are (on a Core i7-4790K):
    Loading Task              v0.6.4 (sec)    Current (sec)    Gain factor
    ----------------------------------------------------------------------
    100 x compiler.r             41.871           0.463             90
    1M short integers            14.295           0.071            201
    1M long integers             18.105           0.159            114
    1M dates                     29.319           0.389             75
    1M characters                14.865           0.092            162
    1M escaped characters        14.909           0.120            124
    1M words                        n/a           1.216            n/a
    100K words                   23.183           0.070            331

Notes

- Only transcode is used in the loading tasks (system/lexer/transcode in 0.6.4).

- The "1M words" task fails on 0.6.4 as the symbol table expansion time is exponential due to some hashtable bugs. That also explains the big gap for the "100K words" task. Those issues are fixed in the current version and the symbol table further optimized for speed. Though, the execution time increase between 100K and 1M words tests in new lexer is not linear which may be explained by a high number of collisions in the internal hashtable due to limited input variability.

- The 0.6.4 lexer can only process strings as input, while the new lexer internally processes only UTF-8 binary inputs. The input strings were converted to the lexer's native format in order to compare speeds more accurately. Providing a string instead of a binary series as input to the new lexer incurs on average a ~10% speed penalty.

Scanning


It is now possible to only scan tokens instead of loading them. Basically, that means identifying a token's length and type without loading it (so without requiring extra memory and processing time). This is achieved by using the new scan native.
    >> scan "123"
    == integer!
    >> scan "w:"
    == set-word!
    >> scan "user@domain.com"
    == email!
    >> scan "123a"
    == error!

It is possible to achieve even higher scanning speed by giving up a bit of accuracy. That is the purpose of the scan/fast refinement: it trades type recognition accuracy for maximum performance. You can find the list of "guessed" types in the table there.

    >> scan/fast "123"
    == integer!
    >> scan/fast "a:"
    == word!
    >> scan/fast "a/b"
    == path!

Scanning applies to the first token in the input series. When iterative application is needed in order to scan all tokens from a given input, the /next refinement can be used. It returns the input series past the current token, allowing you to get the precise token size in the input string. It can be combined with /fast if required. For example:
    src: "hello 123 world 456.789"
    until [
        probe first src: scan/next src
        empty? src: src/2
    ]
Outputs:
    word!
    integer!
    word!
    float!

Matching by datatype in Parse


The new lexer also enables matching by datatype directly from the Parse dialect. Note that this feature is limited to binary input only.
    >> parse to-binary "Hello 2020 World!" [word! integer! word!]
    == true
    >> parse to-binary "My IP is 192.168.0.1" [3 word! copy ip tuple!]
    == true
    >> ip
    == #{203139322E3136382E302E31}
    >> load ip
    == 192.168.0.1
Notice that the whitespace in front of tokens is skipped automatically in this matching mode.

Instrumentation


Lexers in the Red and Rebol world used to be black boxes. This is no longer the case with Red's new lexer and its tracing capabilities. It is now possible to provide a callback function that will be called upon lexer events triggered while parsing tokens. This gives users deeper control, for example allowing them to:
  • Trace the behavior of the lexer for debugging or statistical purposes.
  • Catch errors and resume loading by skipping invalid data.
  • Transform the input on the fly (to remove or alter non-loadable parts).
  • Extend the lexer with new lexical forms.
  • Process serialized Red data without having to fully load the input.
  • Extract line comments that would be lost otherwise.

The lexer's tracing mode is activated by using the /trace refinement on transcode. The syntax is:
    transcode/trace <input> <callback>

    <input>    : series to load (binary! string!).
    <callback> : a callback function to process lexer events (function!).
That function is called on specific events generated by the lexer: prescan, scan, load, open, close, error. The callback function and events specification can be found there.

A default tracing callback is provided in system/lexer/tracer:
    >> transcode/trace "hello 123" :system/lexer/tracer
    
    prescan word 1x6 1 " 123"
    scan word 1x6 1 " 123"
    load word hello 1 " 123"
    prescan integer 7x10 1 ""
    scan integer 7x10 1 ""
    load integer 123 1 ""
    == [hello 123]
That tracing function simply prints the lexer event info. If a syntax error occurs, it cancels the error and resumes on the next character after the error position.

Several more sophisticated examples can be found on our red/code repository.


Implementation notes


This new lexer has been specifically prototyped and designed for performance. It relies on a token-oriented pipelined approach consisting of 3 stages: prescanning, scanning and loading.

Prescanning is achieved using only a tight loop and a state machine (FSM). The loop reads UTF-8 encoded input characters one byte at a time. Each byte is identified as part of a lexical class. The lexical class is then used to transition from one state to another in the FSM, using a big transition table. Once a terminal state (state names with a T_ prefix) or the input's end is reached, the loop exits, leading to the next stage. The result of the prescanning stage is to locate a token's begin/end positions and give a pretty accurate guess about the token's datatype. It can also detect some syntax errors if the FSM cannot reach a proper datatype terminal state. This approach provides the fastest possible speed for token detection, but it cannot be fully accurate, nor can it deeply validate the token content for some complex types (e.g. dates).
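
As a toy model of that table-driven loop (the classes, states, and table below are invented and far smaller than the real ones), here is an integer recognizer with two lexical classes and three states, written in high-level Red for readability:

    class-of: func [c [char!]][either all [c >= #"0" c <= #"9"][0][1]]

    ;-- rows: state 1 (start), state 2 (in-number)
    ;-- columns: class 0 (digit), class 1 (other); state 3 is the terminal "exit"
    transitions: [2 3  2 3]

    scan-integer?: func [s [string!] /local state cls idx][
        state: 1
        foreach c s [
            cls: class-of c
            idx: ((state - 1) * 2) + cls + 1
            state: pick transitions idx
            if state = 3 [break]                ;-- terminal state: token ends here
        ]
        state = 2                               ;-- the whole input was one number
    ]

    probe scan-integer? "123"       ;== true
    probe scan-integer? "12a"       ;== false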

Adding more states would provide greater accuracy and cover more syntactic forms, but at the cost of growing the transition table a lot, due to the need to duplicate many states. Currently the table weighs 2440 bytes, which is already quite big to keep entirely in the CPU data cache (usually 8, 16 or 32KB per core; the lexical table uses 1024 bytes, and there are two other minor tables used in the tight loop). The data cache also needs to handle the parsed input data and part of the native stack, so the available space is limited.

The tight loop code is also optimized to keep branch mispredictions as low as possible. It currently relies on only two branchings. The loop code could also be further reduced, for example by pre-multiplying the state values to avoid the multiplication when calculating the table entry offset. However, we need to wait for a fully optimizing code generation backend before trying to extract more performance from that loop code, or we might take wrong directions.

The scanning stage happens when a token has been identified. It consists of calling, if needed, a scanner function to deep-check the token for errors and determine the datatype more accurately. The loading stage then follows (unless only scanning was requested by the user), calling, if needed, a loader function that constructs the Red value out of the token. In the case of any-block series, the scanners actually do the series construction upon reaching the ending delimiter (which requires special handling for paths), so no loader is needed there. Conversely, loaders can be invoked in validating mode only (not constructing the value), in order to avoid code duplication when complex code is required for decoding/validating the token (e.g. date!, time!, strings with UTF-8 decoding, ...).

For the record, there was an attempt at creating specific FSMs for parsing date! and time! literal forms, to reduce the number of rules that need to be handled by pure code. The results were not conclusive: the amount of code required for special case handling was still significant, and the performance of the FSM parsing loop was below that of the current pure-code version. This approach can be reexamined once we get the fully optimizing backend.

The FSM states, lexical classes and transitions are documented in lexer-states.txt file. A simple syntax is used to describe the transitions and possible branching from one state to others. The FSM has three possible entry points: S_START, S_PATH and S_M_STRING. Parsing path items requires specific states even for common types. For curly-braced strings, it is necessary to exit the FSM on each occurrence of open/close curly braces in order to count the nested ones and accurately determine where it ends. In both those path and string cases, the FSM needs to be re-entered in a different state than S_START.

In order to build the FSM transition table, there is a workflow that goes from that lexer-states.txt file to the final transition table data in binary. It basically looks like this:
    FSM graph -> Excel file -> CSV file -> binary table
The more detailed steps are:
  1. Manually edit changes in the lexer-states.txt file.
  2. Port those changes into the lexer.xlsx file by properly setting the transition values.
  3. Save that Excel table in CSV format as lexer.csv.
  4. Run the generate-lexer-table.red script from the Red repo root folder. The lexer-transitions.reds file is regenerated.
The lexer code relies on several other tables for specific type handling, like path-ending detection, floating point number syntax validation, and binary series and escaped character decoding. Those tables are either manually written (not expected to ever change) or generated using this script.

Various other points worth mentioning:

- The lexer works natively with UTF-8 encoded binary buffers as input. If a string! is provided as input, there is an overhead for internally converting that string to binary before passing it to the lexer. A unique internal buffer is used for those conversions, with support for recursive calls.

- The lexer uses a single accumulative cells buffer for storing loaded values, with an inlined any-block stack.

- The lexer and lexer callbacks are fully recursive and GC-compliant. Currently callbacks can be function! only; this could be extended in the future to also support routines, for much faster processing.

March 20, 2020

GTK, fast lexer, money, deep testing, and our first commercial product

It's been a busy start to the year for the team. Work has continued on many fronts, and we have some new team members helping us keep the momentum going. We announced in January that GTK and the Fast Lexer were close, and they are even closer now. The hard part about making announcements is that some of this work is unpredictable and changes in scope after we announce it. Or the world steps in and a pandemic throws a wrench into your plans.

@bitbegin has done an enormous amount of work on the GTK branch. You can check out the GTK branch to see some of what goes into supporting a new GUI system. The more features we include in Red, the more that have to be ported and maintained. Unfortunately, most operating systems and UI systems have large, complicated sets of APIs and interactions.

Because GUI systems are so complex, and Red not only has to handle them, but also adds its own reactive framework, there are more places for bugs to hide. And users are involved, which is the worst part. They do nothing but cause problems. To that end, in addition to his normal deep diving and bug hunting, @hiiamboris has been working on an automated view test system, which is no small feat. @9214 has joined him on the hunt, and we will squash a good number of bugs for our next release.

The fast lexer was in near-final testing when we decided it was worth delaying its merge in order to incorporate some new lexical forms we had planned to include. Then we looked at some old tickets related to the modulo and division operators, and a couple of lexing questions came up related to tag!. Suddenly the fast lexer work was back in code mode. New lexical forms usually mean new datatypes, and that's the case here.

New Datatypes

One we've expected for some time, and thanks to @9214 it's now a reality: money! is coming. There is a branch for it, but no need to comment at this time. A few features, like round, are still to be completed, but the bulk of the work is done. @BeardPower did some great experimental work that we thought might be used for money, based on Douglas Crockford's Dec64 design; but until Red is fully 64-bit it would only be Dec32, and the limited range was determined not to be enough. That work won't go to waste though; it's just waiting for its moment in the sun. The current version of money! is a BCD implementation, but that shouldn't concern anyone outside the core team. What you care about is that it can be used for accurate financial calculations that don't suffer from floating point errors. It will also support an optional currency identifier, e.g. USD or EUR, and automatic group separators when formed.
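
The classic demonstration of why a dedicated type matters:

    >> 0.1 + 0.2
    == 0.30000000000000004
    >> 0.1 + 0.2 = 0.3
    == false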

Another new type, called ref!, is still being designed. The basic concept is simple: @reference is a form most people are familiar with today, just as hashtags are (though we use the historical name issue!, because the new name hadn't become part of our global lexicon back then). Issues, in Rebol 2, were a string type, but Red made them a word type instead. That has benefits (mainly efficiency), but also costs (symbol table space and lexical limitations). For instance, in R2 you could write #abc:123, but not in Red. Life is compromise. Ref! will be a string type, making it quite flexible. While we most often think of a ref as referring to a person, it can also refer to, for example, a location in a file. You can do that with strings too, of course, but that's the beauty of rich datatypes: by using a ref!, you can build rich dialects and make your intent clear in the data itself.

Finally, Red is going to add a new raw-string! form. It's a combination of the raw string literals some languages support and heredoc. The goal is to make it easier to include content that would otherwise require escape sequences, which sometimes clutter inline data and lead to errors. A lot of time and effort went into the (sometimes heated) discussion around the need, use cases, and syntax. Right now this is just a new literal form, rather than a whole new datatype. They will still be strings when loaded, until we see how they are used by others and whether they deserve to be a separate type.
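
The form that eventually shipped with the new lexer is %{...}% (shown here for illustration; see the current documentation for the final details). Everything between the delimiters is taken literally, with no ^-escape processing:

    >> print "a^/b"         ;-- ^/ is the newline escape in regular strings
    a
    b
    >> print %{a^/b}%       ;-- raw form: the characters stay literal
    a^/b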

We don't add new datatypes lightly, and design choices have to keep the big picture in mind as well. Balancing the value of new types against their added complexity in the language is hard work, but satisfying if it makes everyone's life better. To that end, when these new types and features become available, we need your input on how they work, where you use them, and what gaps need to be filled.

Both the fast lexer and money datatype leverage some new features in Red/System, which we'll talk about in a future post.

All Aboard!

We're also very excited to announce the imminent release of our first commercial product. We alluded to it at the beginning of the year, and it's almost ready to leave the station. There will be plenty of train-related puns, because it's a Railroad syntax diagram generator. Here's what it looks like:
More details will come soon, but here's a short list of some of its features:

  • Live coded diagramming. You write your grammar and the diagram appears in real time.
  • Red parse support (of course), but also support for foreign grammars like ABNF, McKeeman, and YACC. That's right, you can write using those grammars and get the same diagram. 
  • When writing parse rules, you can use block level parsing, rather than character level.
  • Test inputs. Not only can you test a single input, but also entire directories of test files.
  • Test specific rules. Not only can you test from your top-level rule, but any rule in your grammar. You can also find where a particular rule matches part of your input.
  • Custom diagram styles, including options for a cleaner, more abstract view, different charset rendering options, and whether actions (parens) are displayed.
  • The ability to generate inputs that your grammar will recognize. This opens up many other use cases, including educational ones. Here's one of my favorites, that @toomasv created:

Look at the sentences in the input area. Those were created by clicking the generate button. It's like parsing in reverse. So if you're not sure what inputs your grammar might recognize, or want to see examples, this lets you view things in a whole new way.

Stay Tuned

It's an exciting time to be a Reducer, and we're rolling with changes just like everyone else right now. The best way to stay healthy during this pandemic is to stay away from other people and take the Red pill.

Until next time.

January 1, 2020

Happy New Year!

Hello and happy new year, friends of Red! We have some exciting projects we’ve been working on that will be available this year, including a new product. Let’s talk a little about what the team has been working on behind the scenes.

(TL;DR: A cool new product with Red in 2020...plus, a robust preliminary draft of Parse documentation can now be previewed...CLI library...fast-lexer to merge soon...GTK on the horizon...and a new native OS calendar widget!)


December 4, 2019

November 2019 in Review

Welcome to December, friends of Red. It's an intent and focused time of year as we wind down 2019, and the core team is making important moves (sometimes literally!) to set us up for an ambitious 2020. But first, here are just a few things that happened in November.
  • First and foremost, it's always great when community members help compile resources for use by others, and we'd like to acknowledge @rebolek for his excellent compendium of historic automatic builds: https://rebolek.com/builds/ (they weren't available for a hot minute, but now they're back). They can be useful if you're in need of a previous version for a specific project. Of course, you can always go here for Red's daily automated builds. But seeing as how we've a goal of being a self-sustaining, self-selecting group of do-ers, this spirit of providing collective resources is perfectly aligned with the Red-Lang we always want to be. 
  • From the community, to outreach: thanks to community member @loziniak, Red is now on Stackshare, so be sure to follow us there and chime in. As a repo with 4.1k GitHub stars (and infinite possibilities), Red has a lot to offer the wider community of developers and engineers, and Stackshare is a great place to help compare and contrast us with other languages. 
  • Now, a challenge! In the coming new year we'll be needing beta testers willing to lend their expertise in refining a new product built with Red. You read that right! If you think you'd like to be one of the contributors to spearhead a move into our next phase, we want YOU! Drop a line to @greggirwin to get in on the ground floor.
  • An appreciation to @hiiamboris for his deeply thought out proposal regarding "series evolution," a framework for standardizing and testing the functions we use in Red for manipulating series. Design is hard, and we have a number of initiatives in the works taking a lot of brain power right now.
  • Over 60 commits were made to Red's GTK branch in November, making it almost ready for "prime time." The branch is the product of a lot of work by a fair handful of the core team, and heavy lifter @bitbegin says that merging it to the master branch is possibly the next step.
  • And, to close, a bit of a tease: Watch this space for some snazzy website changes coming soon.
All the warmest wishes for the upcoming holidays, from the Red family to yours. <3

-Lucinda.

November 20, 2019

Editorial: A Brief Essay on Lexical Ambiguity by G. Irwin

The original commentary was posted in Red's Gitter channel, here, by Gregg Irwin, one of our core team members, in response to various requests for the ability to create new datatypes in Red.

As a writer, I've always been drawn to Red because of its flexibility; but, of course, "the [lexicon] devil is in the details," as the idiom goes (okay, I edited that idiom a little, but it was too cool a link to pass up). It means the more specific we try to be, the more challenges and limitations we encounter, and we can lose some of the amazing versatility of the language. On the other hand, precision and refinement--the "exact right word at the exact right time"--can powerfully enhance a language's utility. The dynamic tension between what he calls "generality and specificity, human friendliness and artifice" in the text below can be an energetic ebb and flow that serves to strengthen our language, to make it more robust.

Two quotes from community members provide some context:
_____________________________

> The real problem is not number of datatypes, but the lexical syntax of the new ones. -@Oldes

> ...However if something like utype! is added, nothing prevents you from (ab)using system/lexer/pre-load and reinventing whole syntax. -@Rebolek


"I don't support abusing system/lexer/pre-load, and (in the long view) there will almost certainly be special cases where a new lexical form makes sense. We can't see the future, so we can't rule it out. But, and this is key, how much value does each new one add?

I believe that each new lexical form adds less value, and there is a point of diminishing returns. This is not just a lexical problem for Red, but for humans. We have limited capacity to remember rules, and a constrained hierarchy helps enormously here. Think more like linguists, and less like programmers or mathematicians.

In language we have words and numbers. Numbers can be represented as words, with their notation being a handy shortcut for use in the domain of mathematics. And while we classify nouns, verbs, and adjectives by their use, they are all words, and don't have syntax specific to their particular part of speech. That's important because a single word may be used in more than one context, for more than one purpose.

This is interesting, as a tangent, because human language can be ambiguous, though some synthetic languages try to eliminate that (e.g. Lojban). The funny thing is that it's almost impossible to write poetry or tell jokes in Lojban. Nobody speaks Lojban. This ties to programming because, while we all know the strengths and value of strict typing, and even more extreme features and designs meant to promote correctness, dynamic languages are used more at higher levels [such as poetry, songwriting and humor, where even the sounds used in one single word can be employed to evoke specific emotive responses in the listener--the effects of devices like assonance, consonance, and loose associations we make with even single letters, in the way a repeated letter R throughout a line of poetry or literature can subtly impart a sense of momentum and intensity to the text...possibly because it evokes a growl... -Ed.]. Why is that? Humans.

When Carl designed Rebol, it had a goal, and a place in time. He had to choose just how far to go. Even what to call things like email!, which are very specific to a particular type of technology. This is what gives Redbol langs so much of their power. They were designed as a data format, meant for exchanging information. That's the core. What are the "things" we need to exchange information about with other humans, not just other programmers?

Do I want new types? I'm pushing for at least one: Ref! with an @some-name-here syntax. It's not username! or filename+line-number!, or specific in any way. It's very general, as lexical types should be; their use and meaning being context-specific (the R in Redbol, which stands for "relative"). I also think ~ could be a leading numeric sigil to denote approximation. It came mainly from wanting a syntax for floats, to make it clear that they are imprecise; but it's tricky, because it could also be much richer, and has to take variables into account. ~.1 is easy, but what about x = ~n+/-5%? Units are also high value, but they are just a combination of words and numbers. (Still maybe worth a lexical form.)

When we look at what Red should support, and the best way to let users fulfill application and purpose-specific needs, we can learn from the past, and also see that there is no single right answer. Structs, Maps, Objects, data structures and functions versus OOP, strict vs dynamic.

As Forth was all about "Build a vocabulary and write your program in that," think about what constitutes a vocabulary; a lexicon. It's a balance, in Red, between generality and specificity, human friendliness and artifice. So when we ask for things, myself and Nenad included, we should first try to answer our need with what is in Red today, and see where our proposed solution falls on the line of diminishing returns. To this end, we can and should abuse system/lexer/pre-load for experimentation."