October 30, 2019

A Deeper Dive Into the Fast-Lexer Changes

What made the fast-lexer branch a priority?


Several things. It started when @dockimbel looked into ticket #3606, which was impossible to fix currently, and we didn't want to give up on the auto-syncing between /text and /data facets. So he had to consider bigger options, including how to make the lexer instrumentable. It was not easy, because the current lexer is not re-entrant, so having the lexer emit events to a callback function could have caused serious problems.

Digging through all Red's repos showed that the current lexer code was duplicated twice, beyond the basic lexing needed by load: once in the console code, once in the VSCode plugin, each time for syntax coloring purposes, and each one lagging behind the original implementation. Not good.

@Dockimbel then considered changing the current lexer to make it instrumentable, but the changes were significant and would have made the parse rules much more complex. At the same time, @qtxie did some benchmarking, and the result showed Red's lexer was ~240 times slower than Rebol's. This is not due to parse, but  rather because the high-level rules were optimized for readability, not performance.

The lexer also caused delays in the VSCode plugin, because of its (lack of) performance. The high level code has served Red well, and was a showcase for parse, but loading larger data is also being used by community members, and data sizes will just keep growing. With some projects we have on the horizon, the lexer's performance became a higher priority.

As planned since the beginning (the lexer used to be R/S-only during the pre-Unicode era), @dockimbel decided the best option was to not postpone the conversion of the lexer to pure R/S code any longer, by porting R3's C-based lexer to R/S. After studying Rebol's lexer in detail, he realized that the code was quite complex in some places (mostly the prescanner), and would lead to less than optimal R/S code that would be hard to maintain.

Evaluating the state of the art in fast parsers for programming languages, he found inspiration in some unpublished papers. He then started prototyping the next lexer in R/S, and realized that it could be several times faster than Rebol's, with the additional benefit of much smaller and simpler code. Then he embarked on the full implementation. Knowing he and @qtxie would not have the opportunity to work on that for probably a year with all the big tasks ahead on the roadmap, he committed to it full time.

Red's new R/S lexer is half the size of Rebol's, far simpler, with more maintainable code, and it performs at similar speeds (sometimes a bit faster, sometimes a bit slower). That is a fantastic result, because it means that with an optimizing backend (Red/Pro), our lexer will be 4-8 times faster than R3's. It should then be possible to load gigabytes of Red data in memory in just a few
seconds (using the future 64-bit version). 😉

An additional benefit was brought by @qtxie, who added a hashtable for symbol lookup in Red contexts. That sped up word loading tremendously, and should have a measurable improvement on Red's start up time; especially on slow platforms like Raspberry Pi.

@Dockimbel is almost done with the lexer itself, just date! and time! to add, and it should be possible to replace the old one with the new one after thorough testing and debugging. Then, we'll add the hooks for a user-provided callback, allowing us to instrument the lexer in ways Redbolers could only dream about until now. One application of that will be the ability to implement "predictive loading," which will tell you the type and size of a Red value in a string, without loading it, and at extremely high speed (~370MB/s currently, 1-2GB/s with /Pro). Such a feature will allow us to finally address the #3606 issue with a very clean and efficient solution, while keeping the facet's auto-syncing feature.

4 comments:

Fork me on GitHub