August 3, 2020

A New Fast and Flexible Lexer

A programming language lexer is the part in charge of converting textual code representation into a structured memory representation. In Red, it is accomplished by the load function, which calls the lower-level transcode native. Until now, Red was relying on a lexer entirely written using the Parse dialect. Though, the parsing rules were constructed to be easily maintained and not for performance. Rewriting those rules to speed them up could have been possible, but rewriting the lexer entirely in Red/System would give the ultimate performance. It might not matter for most user scripts, but given that Red is also a data format, we need a solution for fast (near-instant) loading of huge quantities of Red values stored in files or transferred through the network.

The new lexer main features are:
  • High performance, typically 50 to 200 times faster than the older one.
  • New scanning features: identify values and their datatypes without loading them.
  • Instrumentation: customize the lexer's behavior at will using an event-oriented API.

The reference documentation is available there. This new lexer is available in Red's auto-builds since June.

Performance


Vastly increased performance is the main driver for this new lexer. Here is a little benchmark to let you appreciate how far it gets.

The benchmarking tasks are:
  • 100 x compiler.r: loads 100 times compiler.r source file from memory (~126KB, so about ~12MB in total).
  • 1M short integers: loads a string of 1 million `1` separated by a space.
  • 1M long integers: loads a string of 1 million `123456789` separated by a space.
  • 1M dates: loads a string of 1 million `26/12/2019/10:18:25` separated by a space.
  • 1M characters: loads a string of 1 million `#"A"` separated by a space.
  • 1M escaped characters: loads a string of 1 million `#"^(1234)"` separated by a space.
  • 1M words: loads a string of 1 million `random "abcdefghijk"` separated by a space.
  • 100K words: loads a string of 100 thousands `random "abcdefghijk"` separated by a space.

And the results are (on a Core i7-4790K):
    Loading Task             v0.6.4 (sec)    Current (sec)    Gain factor
    ---------------------------------------------------------------------
			
    100 x compiler.r	      41.871            0.463	           90
    1M short integers	      14.295            0.071	          201
    1M long integers	      18.105            0.159	          114
    1M dates	              29.319	        0.389	           75
    1M characters             14.865            0.092             162
    1M escaped characters     14.909	        0.120             124
    1M words	                 n/a	        1.216	          n/a
    100K words	              23.183	        0.070	          331
Notes

- Only transcode is used in the loading tasks (system/lexer/transcode in 0.6.4).

- The "1M words" task fails on 0.6.4 as the symbol table expansion time is exponential due to some hashtable bugs. That also explains the big gap for the "100K words" task. Those issues are fixed in the current version and the symbol table further optimized for speed. Though, the execution time increase between 100K and 1M words tests in new lexer is not linear which may be explained by a high number of collisions in the internal hashtable due to limited input variability.

- The 0.6.4's lexer can only process strings as input, while the new lexer only processes internally only UTF-8 binary inputs. The input strings were converted to the lexer's native format in order to more accurately compare their speed. Providing a string instead of a binary series as input to the new lexer incurs on average a ~10% speed penalty.

Scanning


It is now possible to only scan tokens instead of loading them. Basically, that means identifying a token's length and type without loading it (so without requiring extra memory and processing time). This is achieved by using the new scan native.
    >> scan "123"
    == integer!
    >> scan "w:"
    == set-word!
    >> scan "user@domain.com"
    == email!
    >> scan "123a"
    == error!

It is possible to achieve even higher scanning speed by giving up a bit on accuracy. That is the purpose of the scan/fast refinement. It trades maximum performance for type recognition accuracy. You can find the list of "guessed" types in the table there.

    >> scan/fast "123"
    == integer!
    >> scan/fast "a:"
    == word!
    >> scan/fast "a/b"
    == path!

Scanning applies to the first token in the input series. When an iterative application is needed in order to scan all tokens from a given input, the /next refinement can be used for that. It will return the input series past the current token allowing to get the precise token size in the input string. It can be used in combination with /fast if required. For example:
    src: "hello 123 world 456.789"
    until [
        probe first src: scan/next src
        empty? src: src/2
    ]
Outputs:
    word!
    integer!
    word!
    float!

Matching by datatype in Parse


The new lexer enables also matching by datatype directly from Parse dialect. Though, this feature is limited to binary input only.
    >> parse to-binary "Hello 2020 World!" [word! integer! word!]
    == true
    >> parse to-binary "My IP is 192.168.0.1" [3 word! copy ip tuple!]
    == true
    >> ip
    == #{203139322E3136382E302E31}
    >> load ip
    == 192.168.0.1
Notice that the whitespaces in front of tokens are skipped automatically in this matching mode.

Instrumentation


Lexers in Red and Rebol world used to be black boxes, this is no longer the case with Red's new lexer and its tracing capabilities. It is now possible to provide a callback function that will be called upon lexer events triggered while parsing tokens. It gives deeper control to users, for example allowing to:
  • Trace the behavior of the lexer for debugging or statistical purposes.
  • Catch errors and resume loading by skipping invalid data.
  • On-the-fly input transformation (to remove/alter some non-loadable parts).
  • Extend the lexer with new lexical forms.
  • Process serialized Red data without having to fully load the input.
  • Extract line comments that would be lost otherwise.

Lexer's tracing mode is activated by using the /trace refinement on transcode. The syntax is:
    transcode/trace <input> <callback>

    <input>    : series to load (binary! string!).
    <callback> : a callback function to process lexer events (function!).
That function is called on specific events generated by the lexer: prescan, scan, load, open, close, error. The callback function and events specification can be found there

A default tracing callback is provided in system/lexer/tracer:
    >> transcode/trace "hello 123" :system/lexer/tracer
    
    prescan word 1x6 1 " 123"
    scan word 1x6 1 " 123"
    load word hello 1 " 123"
    prescan integer 7x10 1 ""
    scan integer 7x10 1 ""
    load integer 123 1 ""
    == [hello 123]
That tracing function will simply print the lexer event info. If a syntax error occurs, it will cancel it and resume on the next character after the error position.

Several more sophisticated examples can be found on our red/code repository.


Implementation notes


This new lexer has been specifically prototyped and designed for performance. It relies on a token-oriented pipelined approach consisting of 3 stages: prescanning, scanning and loading.

Prescanning is achieved using only a tight loop and a state machine (FSM). The loop reads UTF-8 encoded input characters one byte at a time. Each byte is identified as part of a lexical class. The lexical class is then used to transition from one state to another in the FSM, using a big transition table. Once a terminal state (state names with a `T_` prefix) or input's end is reached, the loop exits, leading to the next stage. The result of the prescanning stage is to locate a token begin/end positions and give a pretty accurate guess about the token's datatype. It can also detect some syntax errors if the FSM cannot reach a proper datatype terminal state. This approach provides the fastest possible speed for tokens detection, but it cannot be fully accurate, nor can it validate deeply the token content for some complex types (e.g. dates). 

Adding more states would provide greater accuracy and cover more syntatic forms, but at the cost of growing the transition table a lot due to the need to duplicate many state. Currently the table weights 2440 bytes, which is already quite big to be kept entirely in the CPU data cache (usually 8, 16 or 32KB per core, the lexical table uses 1024 bytes and there two other minor tables used in the tight loop). The data cache also needs to handle the parsed input data and part of the native stack, so the available space is limited.

The tight loop code is also optimized for keeping branch mispredictions as low as possible. It currently only relies on two branchings. The loop code could be also further reduced by, for example, pre-multiplying the state values to avoid the multiplication when calculating the table entry offset. Though, we need to wait for a fully optimizing code generation backend before trying to extract more performance from that loop code, or we might be taking wrong directions.

Scanning stage happens when a token has been identified. It consists in eventually calling a scanner function to deep-check the token for errors and more accurately determine the datatype. Loading stage then follows (unless only scanning was requested by the user). It will eventually call a loader function that will construct the Red value out of the token. In case of any-block series, the scanners will actually do the series construction on reaching the ending delimiter (which requires special handling for paths), so no loader is needed there. Conversely, loaders can be invoked in validating mode only (not constructing the value), in order to avoid code duplication when complex code is required for decoding/validating the token (e.g. date!, time!, strings with UTF-8 decoding,...).

For the record, there was an attempt at creating specific FSM for date! and time! literal forms parsing, to reduce the amount of rules that need to be handled by pure code. The results were not conclusive, as the amount of code required for special case handling was still significant and the performance of the FSM parsing loop was below the current pure code version. This approach can be reexamined once we get the fully optimizing backend.

The FSM states, lexical classes and transitions are documented in lexer-states.txt file. A simple syntax is used to describe the transitions and possible branching from one state to others. The FSM has three possible entry points: S_START, S_PATH and S_M_STRING. Parsing path items requires specific states even for common types. For curly-braced strings, it is necessary to exit the FSM on each occurrence of open/close curly braces in order to count the nested ones and accurately determine where it ends. In both those path and string cases, the FSM needs to be re-entered in a different state than S_START.

In order to build the FSM transition table, there is a workflow that goes from that lexer-states.txt file to the final transition table data in binary. It basically looks like this:
    FSM graph -> Excel file -> CSV file -> binary table
The more detailed steps are:
  1. Manually edit changes in the lexer-states.txt file.
  2. Port those changes into the lexer.xlsx file by properly setting the transition values.
  3. Save that Excel table in CSV format as lexer.csv.
  4. Run the generate-lexer-table.red script from Red repo root folder. The lexer-transitions.reds file is regenerated.
The lexer code relies on several other tables for specific types handling like path ending detection, floating point numbers syntax validation, binary series and escaped characters decoding. Those tables are either manually written (not planned to be ever changed) or generated using this script.

Various other points worth mentioning:

- The lexer works natively with UTF-8 encoded binary buffers as input. If a string! is provided as input, there is an overhead for converting internally such string to binary before passing it to the lexer. A unique internal buffer is used for those conversions with support for recursive calls.

- The lexer uses a single accumulative cells buffer for storing loaded values, with an inlined any-block stack.

- The lexer and lexer callbacks are fully recursive and GC-compliant. Currently callbacks can be function! only, this can be extended in the future to support routines also for much faster processing.

March 20, 2020

GTK, fast lexer, money, deep testing, and our first commercial product

It's been a busy start to the year for the team. Work has continued on many fronts, and we have some new team members helping us to keep the momentum going. We announced in January that GTK and the Fast Lexer were close, and they are even closer now. The hard part about making announcements is that some of this work is unpredictable and changes in scope after we do. Or the world steps in and a pandemic throws a wrench into your plans.

@bitbegin has done an enormous amount of work on the GTK branch. You can check out the GTK branch to see some of what goes into supporting a new GUI system. The more features we include in Red, the more that have to be ported and maintained. Unfortunately, most operating systems and UI systems have large, complicated sets of APIs and interactions.

Because GUI systems are so complex, and Red not only has to handle them, but also adds its own reactive framework, there are more places for bugs to hide. And users are involved, which is the worst part. They do nothing but cause problems. To that end, in addition to his normal deep diving and bug hunting, @hiiamboris has been working on an automated view test system, which is no small feat. @9214 has joined him on the hunt, and we will squash a good number of bugs for our next release.

The fast lexer was in near-final testing when we decided that it was worth delaying its merge in order to incorporate some new lexical forms that we had planned to include. Then we looked at some old tickets related to modulo and division operators, and a couple lexing questions came up related to tag!. Suddenly the fast lexer work was back in code mode. New lexical forms usually means new datatypes, and that's the case here.

New Datatypes

One we've expected for some time, and thanks to @9214 it's now a reality. Money! is coming. There is a branch for it, but no need to comment at this time. There are a few features, like round still to be completed, but the bulk of the work is done. @BeardPower did some great experimental work that we thought might be used for money, based on Douglas Crockford's Dec64 design, but until Red is fully 64-bit it was only Dec32, and the limited range was determined not to be enough. That work won't go to waste though, it's just waiting for its moment in the sun. The current version of money! is a BCD implementation, but that shouldn't concern anyone outside the core team. What you care about is that it can be used for accurate financial calculations that don't suffer from floating point errors. It will also support an optional currency identifier, e.g. USD or EUR, and automatic group separators when formed.

Another new type is called ref! and is still being designed. The basic concept is simple: @reference is a form most people are familiar with today, just as hashtags are known (though we use the historical name issue!, because the new name hadn't become part of our global lexicon back then). Issues, in Rebol 2, were a string type, but Red made them a word type instead. That has benefits (mainly efficiency), but also costs (symbol table space and lexical limitations). For instance, in R2 you could say #abc:123, but not so in Red. Life is compromise. Ref! will be a string type, making it quite flexible. While we most often think of them as referring to a person, they can refer to, for example, a location in a file. You can do that with strings too, of course, but that's the beauty of rich datatypes. By using a ref!, you can build rich dialects and make your intent clear in the data itself.

Finally, Red is going to add a new raw-string! form. It's a combination of the raw string literals some languages support, and heredoc. The goal is to make it easier to include content that would otherwise require escape sequences that sometimes clutter inline data and lead to errors. A lot of time and effort went into the (sometimes heated) discussion around the need, use cases, and syntax. Right now this is just a new literal form, rather than a whole new datatype. They will still be strings when loaded, until we see how they are used by others and if they deserve to be a separate type.

We don't add new datatypes lightly, and design choices have to keep the big picture in mind as well. Balancing the value of new types against their added complexity in the language is hard work, but satisfying if it makes everyone's life better. To that end, when these new types and features become available, we need your input on how they work, where you use them, and what gaps need to be filled.

Both the fast lexer and money datatype leverage some new features in Red/System, which we'll talk about in a future post.

All Aboard!

We're also very excited to announce the imminent release of our first commercial product. We alluded to it at the beginning of the year, and it's almost ready to leave the station. There will be plenty of train-related puns, because it's a Railroad syntax diagram generator. Here's what it looks like:
More details will come soon, but here's a short list of some of it's features:

  • Live coded diagramming. You write your grammar and the diagram appears in real time.
  • Red parse support (of course), but also support for foreign grammars like ABNF, McKeeman, and YACC. That's right, you can write using those grammars and get the same diagram. 
  • When writing parse rules, you can use block level parsing, rather than character level.
  • Test inputs. Not only can you test a single input, but also entire directories of test files.
  • Test specific rules. Not only can you test from your top-level rule, but any rule in your grammar. You can also find where a particular rule matches part of your input.
  • Custom diagram styles, including options for a cleaner, more abstract view, different charset rendering options, and whether actions (parens) are displayed.
  • The ability to generate inputs that your grammar will recognize. This opens up many other use cases, including educational ones. Here's one of my favorites, that @toomasv created:

Look at the sentences in the input area. Those were created by clicking the generate button. It's like parsing in reverse. So if you're not sure what inputs your grammar might recognize, or want to see examples, this lets you view things in a whole new way.

Stay Tuned

It's an exciting time to be a Reducer, and we're rolling with changes just like everyone else right now. The best way to stay healthy during this pandemic is to stay away from other people and take the Red pill.

Until next time.

January 1, 2020

Happy New Year!

Hello and happy new year, friends of Red! We have some exciting projects we’ve been working on that will be available this year, including a new product. Let’s talk a little about what the team has been working on behind the scenes.

(TL;DR: A cool new product with Red in 2020...plus, a robust preliminary draft of Parse documentation can now be previewed...CLI library...fast-lexer to merge soon...GTK on the horizon...and a new native OS calendar widget!)


Fork me on GitHub