August 3, 2020

A New Fast and Flexible Lexer

A programming language lexer is the part in charge of converting textual code representation into a structured memory representation. In Red, it is accomplished by the load function, which calls the lower-level transcode native. Until now, Red was relying on a lexer entirely written using the Parse dialect. Though, the parsing rules were constructed to be easily maintained and not for performance. Rewriting those rules to speed them up could have been possible, but rewriting the lexer entirely in Red/System would give the ultimate performance. It might not matter for most user scripts, but given that Red is also a data format, we need a solution for fast (near-instant) loading of huge quantities of Red values stored in files or transferred through the network.

The new lexer main features are:
  • High performance, typically 50 to 200 times faster than the older one.
  • New scanning features: identify values and their datatypes without loading them.
  • Instrumentation: customize the lexer's behavior at will using an event-oriented API.

The reference documentation is available there. This new lexer is available in Red's auto-builds since June.

Performance


Vastly increased performance is the main driver for this new lexer. Here is a little benchmark to let you appreciate how far it gets.

The benchmarking tasks are:
  • 100 x compiler.r: loads 100 times compiler.r source file from memory (~126KB, so about ~12MB in total).
  • 1M short integers: loads a string of 1 million `1` separated by a space.
  • 1M long integers: loads a string of 1 million `123456789` separated by a space.
  • 1M dates: loads a string of 1 million `26/12/2019/10:18:25` separated by a space.
  • 1M characters: loads a string of 1 million `#"A"` separated by a space.
  • 1M escaped characters: loads a string of 1 million `#"^(1234)"` separated by a space.
  • 1M words: loads a string of 1 million `random "abcdefghijk"` separated by a space.
  • 100K words: loads a string of 100 thousands `random "abcdefghijk"` separated by a space.

And the results are (on a Core i7-4790K):
    Loading Task             v0.6.4 (sec)    Current (sec)    Gain factor
    ---------------------------------------------------------------------
			
    100 x compiler.r	      41.871            0.463	           90
    1M short integers	      14.295            0.071	          201
    1M long integers	      18.105            0.159	          114
    1M dates	              29.319	        0.389	           75
    1M characters             14.865            0.092             162
    1M escaped characters     14.909	        0.120             124
    1M words	                 n/a	        1.216	          n/a
    100K words	              23.183	        0.070	          331
Notes

- Only transcode is used in the loading tasks (system/lexer/transcode in 0.6.4).

- The "1M words" task fails on 0.6.4 as the symbol table expansion time is exponential due to some hashtable bugs. That also explains the big gap for the "100K words" task. Those issues are fixed in the current version and the symbol table further optimized for speed. Though, the execution time increase between 100K and 1M words tests in new lexer is not linear which may be explained by a high number of collisions in the internal hashtable due to limited input variability.

- The 0.6.4's lexer can only process strings as input, while the new lexer only processes internally only UTF-8 binary inputs. The input strings were converted to the lexer's native format in order to more accurately compare their speed. Providing a string instead of a binary series as input to the new lexer incurs on average a ~10% speed penalty.

Scanning


It is now possible to only scan tokens instead of loading them. Basically, that means identifying a token's length and type without loading it (so without requiring extra memory and processing time). This is achieved by using the new scan native.
    >> scan "123"
    == integer!
    >> scan "w:"
    == set-word!
    >> scan "user@domain.com"
    == email!
    >> scan "123a"
    == error!

It is possible to achieve even higher scanning speed by giving up a bit on accuracy. That is the purpose of the scan/fast refinement. It trades maximum performance for type recognition accuracy. You can find the list of "guessed" types in the table there.

    >> scan/fast "123"
    == integer!
    >> scan/fast "a:"
    == word!
    >> scan/fast "a/b"
    == path!

Scanning applies to the first token in the input series. When an iterative application is needed in order to scan all tokens from a given input, the /next refinement can be used for that. It will return the input series past the current token allowing to get the precise token size in the input string. It can be used in combination with /fast if required. For example:
    src: "hello 123 world 456.789"
    until [
        probe first src: scan/next src
        empty? src: src/2
    ]
Outputs:
    word!
    integer!
    word!
    float!

Matching by datatype in Parse


The new lexer enables also matching by datatype directly from Parse dialect. Though, this feature is limited to binary input only.
    >> parse to-binary "Hello 2020 World!" [word! integer! word!]
    == true
    >> parse to-binary "My IP is 192.168.0.1" [3 word! copy ip tuple!]
    == true
    >> ip
    == #{203139322E3136382E302E31}
    >> load ip
    == 192.168.0.1
Notice that the whitespaces in front of tokens are skipped automatically in this matching mode.

Instrumentation


Lexers in Red and Rebol world used to be black boxes, this is no longer the case with Red's new lexer and its tracing capabilities. It is now possible to provide a callback function that will be called upon lexer events triggered while parsing tokens. It gives deeper control to users, for example allowing to:
  • Trace the behavior of the lexer for debugging or statistical purposes.
  • Catch errors and resume loading by skipping invalid data.
  • On-the-fly input transformation (to remove/alter some non-loadable parts).
  • Extend the lexer with new lexical forms.
  • Process serialized Red data without having to fully load the input.
  • Extract line comments that would be lost otherwise.

Lexer's tracing mode is activated by using the /trace refinement on transcode. The syntax is:
    transcode/trace <input> <callback>

    <input>    : series to load (binary! string!).
    <callback> : a callback function to process lexer events (function!).
That function is called on specific events generated by the lexer: prescan, scan, load, open, close, error. The callback function and events specification can be found there

A default tracing callback is provided in system/lexer/tracer:
    >> transcode/trace "hello 123" :system/lexer/tracer
    
    prescan word 1x6 1 " 123"
    scan word 1x6 1 " 123"
    load word hello 1 " 123"
    prescan integer 7x10 1 ""
    scan integer 7x10 1 ""
    load integer 123 1 ""
    == [hello 123]
That tracing function will simply print the lexer event info. If a syntax error occurs, it will cancel it and resume on the next character after the error position.

Several more sophisticated examples can be found on our red/code repository.


Implementation notes


This new lexer has been specifically prototyped and designed for performance. It relies on a token-oriented pipelined approach consisting of 3 stages: prescanning, scanning and loading.

Prescanning is achieved using only a tight loop and a state machine (FSM). The loop reads UTF-8 encoded input characters one byte at a time. Each byte is identified as part of a lexical class. The lexical class is then used to transition from one state to another in the FSM, using a big transition table. Once a terminal state (state names with a `T_` prefix) or input's end is reached, the loop exits, leading to the next stage. The result of the prescanning stage is to locate a token begin/end positions and give a pretty accurate guess about the token's datatype. It can also detect some syntax errors if the FSM cannot reach a proper datatype terminal state. This approach provides the fastest possible speed for tokens detection, but it cannot be fully accurate, nor can it validate deeply the token content for some complex types (e.g. dates). 

Adding more states would provide greater accuracy and cover more syntatic forms, but at the cost of growing the transition table a lot due to the need to duplicate many state. Currently the table weights 2440 bytes, which is already quite big to be kept entirely in the CPU data cache (usually 8, 16 or 32KB per core, the lexical table uses 1024 bytes and there two other minor tables used in the tight loop). The data cache also needs to handle the parsed input data and part of the native stack, so the available space is limited.

The tight loop code is also optimized for keeping branch mispredictions as low as possible. It currently only relies on two branchings. The loop code could be also further reduced by, for example, pre-multiplying the state values to avoid the multiplication when calculating the table entry offset. Though, we need to wait for a fully optimizing code generation backend before trying to extract more performance from that loop code, or we might be taking wrong directions.

Scanning stage happens when a token has been identified. It consists in eventually calling a scanner function to deep-check the token for errors and more accurately determine the datatype. Loading stage then follows (unless only scanning was requested by the user). It will eventually call a loader function that will construct the Red value out of the token. In case of any-block series, the scanners will actually do the series construction on reaching the ending delimiter (which requires special handling for paths), so no loader is needed there. Conversely, loaders can be invoked in validating mode only (not constructing the value), in order to avoid code duplication when complex code is required for decoding/validating the token (e.g. date!, time!, strings with UTF-8 decoding,...).

For the record, there was an attempt at creating specific FSM for date! and time! literal forms parsing, to reduce the amount of rules that need to be handled by pure code. The results were not conclusive, as the amount of code required for special case handling was still significant and the performance of the FSM parsing loop was below the current pure code version. This approach can be reexamined once we get the fully optimizing backend.

The FSM states, lexical classes and transitions are documented in lexer-states.txt file. A simple syntax is used to describe the transitions and possible branching from one state to others. The FSM has three possible entry points: S_START, S_PATH and S_M_STRING. Parsing path items requires specific states even for common types. For curly-braced strings, it is necessary to exit the FSM on each occurrence of open/close curly braces in order to count the nested ones and accurately determine where it ends. In both those path and string cases, the FSM needs to be re-entered in a different state than S_START.

In order to build the FSM transition table, there is a workflow that goes from that lexer-states.txt file to the final transition table data in binary. It basically looks like this:
    FSM graph -> Excel file -> CSV file -> binary table
The more detailed steps are:
  1. Manually edit changes in the lexer-states.txt file.
  2. Port those changes into the lexer.xlsx file by properly setting the transition values.
  3. Save that Excel table in CSV format as lexer.csv.
  4. Run the generate-lexer-table.red script from Red repo root folder. The lexer-transitions.reds file is regenerated.
The lexer code relies on several other tables for specific types handling like path ending detection, floating point numbers syntax validation, binary series and escaped characters decoding. Those tables are either manually written (not planned to be ever changed) or generated using this script.

Various other points worth mentioning:

- The lexer works natively with UTF-8 encoded binary buffers as input. If a string! is provided as input, there is an overhead for converting internally such string to binary before passing it to the lexer. A unique internal buffer is used for those conversions with support for recursive calls.

- The lexer uses a single accumulative cells buffer for storing loaded values, with an inlined any-block stack.

- The lexer and lexer callbacks are fully recursive and GC-compliant. Currently callbacks can be function! only, this can be extended in the future to support routines also for much faster processing.

2 comments:

Fork me on GitHub