A programming language lexer is the part in charge of converting the textual
  representation of code into a structured memory representation. In Red, this
  is accomplished by the
  load function, which calls
  the lower-level transcode
  native. Until now, Red relied on a lexer written entirely in the
  Parse dialect. However, those parsing rules were designed to be easily
  maintained, not for performance. Rewriting the rules to speed them up would
  have been possible, but rewriting the lexer entirely in Red/System gives the
  best possible performance. That might not matter for most user scripts, but
  given that Red is also a data format, we need a solution for fast
  (near-instant) loading of huge quantities of Red values stored in files or
  transferred over the network.
The new lexer's main features are:
- High performance, typically 50 to 200 times faster than the older one.
- New scanning features: identify values and their datatypes without loading them.
- Instrumentation: customize the lexer's behavior at will using an event-oriented API.

The reference documentation is available there. This new lexer has been available in Red's auto-builds since June.
Performance
  Vastly increased performance is the main driver for this new lexer. Here is a
  little benchmark to let you appreciate how big the gain is.
The benchmarking tasks are:
- 100 x compiler.r: loads the compiler.r source file from memory 100 times (~126KB, so about ~12MB in total).
- 1M short integers: loads a string of 1 million `1` separated by a space.
- 1M long integers: loads a string of 1 million `123456789` separated by a space.
- 1M dates: loads a string of 1 million `26/12/2019/10:18:25` separated by a space.
- 1M characters: loads a string of 1 million `#"A"` separated by a space.
- 1M escaped characters: loads a string of 1 million `#"^(1234)"` separated by a space.
- 1M words: loads a string of 1 million words, each generated by `random "abcdefghijk"`, separated by a space.
- 100K words: loads a string of 100 thousand words, each generated by `random "abcdefghijk"`, separated by a space.
 
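  For reference, a task like "1M long integers" can be timed with a minimal
  snippet such as the one below (an illustration only, not the actual benchmark
  script; `now/time/precise` is assumed to be available for timing):

    src: make string! 10'000'000
    loop 1'000'000 [append src "123456789 "]   ; build the "1M long integers" input
    bin: to binary! src                        ; lexer's native UTF-8 binary format
    t0: now/time/precise
    transcode bin
    print now/time/precise - t0                ; elapsed loading time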
And the results are (on a Core i7-4790K):
    Loading Task             v0.6.4 (sec)    Current (sec)    Gain factor
    ---------------------------------------------------------------------
    100 x compiler.r            41.871           0.463             90
    1M short integers           14.295           0.071            201
    1M long integers            18.105           0.159            114
    1M dates                    29.319           0.389             75
    1M characters               14.865           0.092            162
    1M escaped characters       14.909           0.120            124
    1M words                       n/a           1.216            n/a
    100K words                  23.183           0.070            331
Notes:
- Only transcode is used in the loading tasks (system/lexer/transcode in 0.6.4).
- The "1M words" task fails on 0.6.4, as the symbol table expansion time is
  exponential due to some hashtable bugs. That also explains the big gap in the
  "100K words" task. Those issues are fixed in the current version, and the
  symbol table has been further optimized for speed. However, the execution time
  increase between the 100K and 1M words tests in the new lexer is not linear,
  which may be explained by a high number of collisions in the internal
  hashtable, given the limited input variability.
- The 0.6.4 lexer can only process strings as input, while the new lexer
  internally processes only UTF-8 binary inputs. The input strings were
  converted to the lexer's native format in order to compare their speeds more
  accurately. Providing a string instead of a binary series as input to the new
  lexer incurs a ~10% speed penalty on average.
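  In practice, that means the conversion overhead can be avoided by passing a
  binary series directly (a small sketch):

    data: "1 2 3 [a b c]"
    transcode to binary! data    ; lexer's native format, no internal conversion
    transcode data               ; also accepted, ~10% slower on average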
Scanning
  It is now possible to only scan tokens instead of loading them. Basically,
  that means identifying a token's length and type without loading it (so
  without requiring extra memory and processing time). This is achieved by using
  the new scan native.
    >> scan "123"
    == integer!
    >> scan "w:"
    == set-word!
    >> scan "user@domain.com"
    == email!
    >> scan "123a"
    == error!
  It is possible to achieve even higher scanning speed by giving up a bit of
  accuracy. That is the purpose of the scan/fast
  refinement: it trades some type-recognition accuracy for maximum performance.
  You can find the list of "guessed" types in the table
  there.
    >> scan/fast "123"
    == integer!
    >> scan/fast "a:"
    == word!
    >> scan/fast "a/b"
    == path!
  Scanning applies to the first token in the input series. When an iterative
  application is needed in order to scan all tokens from a given input, the
  /next refinement can be
  used. It returns the input series past the current token, making it possible
  to determine the precise token size in the input string. It can be used in
  combination with /fast if
  required. For example:
    src: "hello 123 world 456.789"
    until [
        probe first src: scan/next src
        empty? src: src/2
    ]
  Outputs:
    word!
    integer!
    word!
    float!
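  Both refinements can be combined as well; for example (a sketch, where the
  returned block shape follows the scan/next example above):

    >> scan/next/fast "hello 123"
    == [word! " 123"]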
Matching by datatype in Parse
  The new lexer also enables matching by datatype directly from the Parse
  dialect. However, this feature is limited to binary input only.
    >> parse to-binary "Hello 2020 World!" [word! integer! word!]
    == true
    >> parse to-binary "My IP is 192.168.0.1" [3 word! copy ip tuple!]
    == true
    >> ip
    == #{203139322E3136382E302E31}
    >> load ip
    == 192.168.0.1
  Notice that the whitespace in front of tokens is skipped automatically in this
  matching mode.
Instrumentation
  Lexers in the Red and Rebol worlds used to be black boxes; this is no longer
  the case with Red's new lexer and its tracing capabilities. It is now possible
  to provide a callback function that will be called on lexer events triggered
  while parsing tokens. It gives users deeper control, for example allowing
  them to:
- Trace the behavior of the lexer for debugging or statistical purposes.
- Catch errors and resume loading by skipping invalid data.
- Transform the input on the fly (to remove/alter some non-loadable parts).
- Extend the lexer with new lexical forms.
- Process serialized Red data without having to fully load the input.
- Extract line comments that would otherwise be lost.

  The lexer's tracing mode is activated by using the
  /trace refinement on transcode. The syntax is:
    transcode/trace <input> <callback>
    <input>    : series to load (binary! string!).
    <callback> : a callback function to process lexer events (function!).
  That function is called on specific events generated by the lexer:
  prescan, scan, load, open, close, error. The
  callback function and events specification can be found there. 
  A default tracing callback is provided in
  system/lexer/tracer:
    >> transcode/trace "hello 123" :system/lexer/tracer
    
    prescan word 1x6 1 " 123"
    scan word 1x6 1 " 123"
    load word hello 1 " 123"
    prescan integer 7x10 1 ""
    scan integer 7x10 1 ""
    load integer 123 1 ""
    == [hello 123]
  That tracing function simply prints the lexer event info. If a syntax error
  occurs, it cancels the error and resumes on the next character after the
  error position.
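  As an illustration, here is a minimal custom callback that silences syntax
  errors entirely (the argument list follows the events specification from the
  reference documentation; returning false from the error event is assumed to
  skip the error and resume, mirroring the default tracer's behavior):

    skip-errors: function [
        event [word!]                ; one of: prescan scan load open close error
        input [string! binary!]
        type  [datatype! word!]
        line  [integer!]
        token
        return: [logic!]
    ][
        event <> 'error              ; return FALSE only on errors => skip them
    ]
    transcode/trace "hello 123a world" :skip-errors   ; loads the valid tokens only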
  Several more sophisticated examples can be found on our
  red/code repository.
Implementation notes
  This new lexer has been specifically prototyped and designed for performance.
  It relies on a token-oriented pipelined approach consisting of 3 stages:
  prescanning, scanning and loading.
  Prescanning is achieved using only a
  tight loop
  and a state machine (FSM). The loop reads the UTF-8 encoded input one byte at
  a time. Each byte is mapped to a lexical class, and that lexical class is then
  used to transition from one state to another in the
  FSM, using a big
  transition table. Once a terminal state (state names
  with a `T_` prefix) or the input's end is reached, the loop exits, leading to
  the next stage. The result of the prescanning stage is to locate a token's
  beginning and ending positions and to give a fairly accurate guess about the
  token's datatype. It can also detect some syntax errors, if the FSM cannot
  reach a proper datatype terminal state. This approach provides the fastest
  possible speed for token detection, but it cannot be fully accurate, nor can
  it deeply validate the token content for some complex types (e.g. dates).
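  Conceptually, each loop iteration boils down to two table lookups. Here is a
  toy Red rendition of that per-byte step (all names and tables below are
  illustrative and only handle integer tokens; the real loop is Red/System code
  over flat byte tables):

    CLASSES_NB: 3
    C_BLANK: 1  C_DIGIT: 2  C_LETTER: 3        ; toy lexical classes
    S_START: 0  S_INTEGER: 1  T_INTEGER: 2     ; T_ prefix marks a terminal state
    transitions: [                             ; rows: states, columns: classes
        0 1 0                                  ; S_START
        2 1 2                                  ; S_INTEGER
    ]
    classify: func [b][
        case [
            b = #" "                  [C_BLANK]
            all [b >= #"0" b <= #"9"] [C_DIGIT]
            true                      [C_LETTER]
        ]
    ]
    state: S_START
    foreach b "123 " [
        state: transitions/(state * CLASSES_NB + classify b)
        if state >= T_INTEGER [print "integer token found" break]
    ]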
  Adding more states would provide greater accuracy and cover more syntactic
  forms, but at the cost of growing the transition table a lot, due to the need
  to duplicate many states. Currently the table weighs 2440 bytes, which is
  already quite big to be kept entirely in the CPU data cache (usually 8, 16 or
  32KB per core; the lexical table uses 1024 bytes, and there are two other
  minor tables used in the tight loop). The data cache also needs to handle the
  parsed input data and part of the native stack, so the available space is
  limited.
  The tight loop code is also optimized to keep branch mispredictions as low as
  possible. It currently relies on only two branchings. The loop code could be
  further reduced by, for example, pre-multiplying the state values to avoid the
  multiplication when calculating the table entry offset. However, we need to
  wait for a fully optimizing code generation backend before trying to extract
  more performance from that loop code, or we might be taking wrong directions.
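  To make the pre-multiplication idea concrete (continuing the toy example
  above; hypothetical code, not the actual loop):

    ; current: one multiplication per input byte
    state: transitions/(state * CLASSES_NB + classify b)

    ; pre-multiplied variant: table entries hold states already scaled by
    ; CLASSES_NB, so the per-byte path needs only an addition
    state: premultiplied-transitions/(state + classify b)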
  The scanning stage happens once a token has been identified. It consists in
  optionally calling a scanner function to deep-check the token for errors and
  determine the datatype more accurately. The loading stage then follows (unless
  only scanning was requested by the user). It optionally calls a loader
  function that constructs the Red value out of the token. In the case of
  any-block series, the scanners actually do the series construction upon
  reaching the ending delimiter (which requires
  special handling
  for paths), so no loader is needed there. Conversely, loaders can be invoked
  in validating mode only (not constructing the value), in order to avoid code
  duplication when complex code is required for decoding/validating the
  token (e.g. date!, time!, strings with UTF-8 decoding, ...).
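  The scanner/loader split can be illustrated in miniature like this (a toy Red
  sketch with hypothetical names, not the actual internals):

    digit: charset "0123456789"
    scan-integer: func [s][parse s [some digit]]     ; validate only, no allocation
    load-integer: func [s][to integer! s]            ; construct the Red value
    token: "123"
    if scan-integer token [probe load-integer token] ; == 123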
  For the record, there was an
  attempt
  at creating a specific FSM for parsing date! and time! literal forms, to
  reduce the number of rules that need to be handled by pure code. The results
  were not conclusive, as the amount of code required for special-case handling
  was still significant and the performance of the FSM parsing loop was below
  the current pure-code version. This approach can be reexamined once we get the
  fully optimizing backend.
  The FSM states, lexical classes and transitions are documented in the
  lexer-states.txt
  file. A simple syntax is used to describe the transitions and possible
  branchings from one state to others. The FSM has three possible entry points:
  S_START, S_PATH and
  S_M_STRING. Parsing path items requires specific
  states, even for common types. For curly-braced strings, it is necessary to
  exit the FSM on each occurrence of open/close curly braces, in order to count
  the nested ones and accurately determine where the string ends. In both the
  path and string cases, the FSM needs to be re-entered in a different state
  than S_START.
  In order to build the FSM transition table, there is a workflow that goes from that lexer-states.txt file to the final transition
  table data in binary. It basically looks like this:
FSM graph -> Excel file -> CSV file -> binary table
The more detailed steps are:
- Manually edit changes in the lexer-states.txt file.
- Port those changes into the lexer.xlsx file by properly setting the transition values.
- Save that Excel table in CSV format as lexer.csv.
- Run the generate-lexer-table.red script from the Red repo root folder. The lexer-transitions.reds file is regenerated.
 
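  The last step can be run from a Red console started in the repo root; for
  example (the script's exact location in the repo is assumed here, adjust the
  path if needed):

    do %generate-lexer-table.red    ; regenerates lexer-transitions.reds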
  The lexer code relies on several other tables for specific type handling, like
  path-ending detection, floating-point number syntax validation, binary series
  and escaped-character decoding. Those tables are either manually written (not
  planned to ever be changed) or generated using
  this script.
Various other points worth mentioning:
  - The lexer works natively with UTF-8 encoded binary buffers as input. If a
  string! is provided as input, there is an overhead for converting such a
  string internally to binary before passing it to the lexer. A unique internal
  buffer is used for those conversions, with support for recursive calls.
  - The lexer uses a single accumulative cells buffer for storing loaded values,
  with an inlined any-block stack.
  - The lexer and lexer callbacks are fully recursive and GC-compliant.
  Currently, callbacks can be function! only; this may be extended in the future
  to also support routines, for much faster processing.
