October 30, 2019

A Deeper Dive Into the Fast-Lexer Changes

What made the fast-lexer branch a priority?


Several things. It started when @dockimbel looked into ticket #3606, which was impossible to fix currently, and we didn't want to give up on the auto-syncing between /text and /data facets. So he had to consider bigger options, including how to make the lexer instrumentable. It was not easy, because the current lexer is not re-entrant, so having the lexer emit events to a callback function could have caused serious problems.

Digging through all Red's repos showed that the current lexer code was duplicated twice, beyond the basic lexing needed by load: once in the console code, once in the VSCode plugin, each time for syntax coloring purposes, and each one lagging behind the original implementation. Not good.

@Dockimbel then considered changing the current lexer to make it instrumentable, but the changes were significant and would have made the parse rules much more complex. At the same time, @qtxie did some benchmarking, and the result showed Red's lexer was ~240 times slower than Rebol's. This is not due to parse, but  rather because the high-level rules were optimized for readability, not performance.

The lexer also caused delays in the VSCode plugin, because of its (lack of) performance. The high level code has served Red well, and was a showcase for parse, but loading larger data is also being used by community members, and data sizes will just keep growing. With some projects we have on the horizon, the lexer's performance became a higher priority.

As planned since the beginning (the lexer used to be R/S-only during the pre-Unicode era), @dockimbel decided the best option was to not postpone the conversion of the lexer to pure R/S code any longer, by porting R3's C-based lexer to R/S. After studying Rebol's lexer in detail, he realized that the code was quite complex in some places (mostly the prescanner), and would lead to less than optimal R/S code that would be hard to maintain.

Evaluating the state of the art in fast parsers for programming languages, he found inspiration in some unpublished papers. He then started prototyping the next lexer in R/S, and realized that it could be several times faster than Rebol's, with the additional benefit of much smaller and simpler code. Then he embarked on the full implementation. Knowing he and @qtxie would not have the opportunity to work on that for probably a year with all the big tasks ahead on the roadmap, he committed to it full time.

Red's new R/S lexer is half the size of Rebol's, far simpler, with more maintainable code, and it performs at similar speeds (sometimes a bit faster, sometimes a bit slower). That is a fantastic result, because it means that with an optimizing backend (Red/Pro), our lexer will be 4-8 times faster than R3's. It should then be possible to load gigabytes of Red data in memory in just a few
seconds (using the future 64-bit version). 😉

An additional benefit was brought by @qtxie, who added a hashtable for symbol lookup in Red contexts. That sped up word loading tremendously, and should have a measurable improvement on Red's start up time; especially on slow platforms like Raspberry Pi.

@Dockimbel is almost done with the lexer itself, just date! and time! to add, and it should be possible to replace the old one with the new one after thorough testing and debugging. Then, we'll add the hooks for a user-provided callback, allowing us to instrument the lexer in ways Redbolers could only dream about until now. One application of that will be the ability to implement "predictive loading," which will tell you the type and size of a Red value in a string, without loading it, and at extremely high speed (~370MB/s currently, 1-2GB/s with /Pro). Such a feature will allow us to finally address the #3606 issue with a very clean and efficient solution, while keeping the facet's auto-syncing feature.

October 25, 2019

October 2019 In Review

Over the last few weeks the Red Lang core team drilled down to make some truly great progress on Red's fast-lexer branch--while we also gained valuable support from the contributions of Red doers and makers as they consolidate a world of useful information and resources.


Fast-Lexer Benchmarks


In the fast-lexer branch of Red, you can see lots of new work from Red creator @dockimbel (Nenad Rakocevic) and core teammate @qxtie. Among other fixes and optimizations, they substituted a hashtable for what had previously been a large array in context!

The numbers so far: Loading 100'000 words (5 to 15 characters, 1MB file): Red (master): 19000ms.  Red (fast-lexer): 150ms. Nenad's observations on further testing:
"FYI, we just [ran] some simple benchmarks on the new low-level lexer for Red using 1M 10-digit integers. The new lexer completes the loading about 100 times faster than the current high-level one. Loading 1M 10-digit integers in one block: Red: 175ms; R2: 136ms; R3: 113ms. 
"We use a faster method than Rebol, relying on several lookup tables and a big FSM with pre-calculated transition table (while Rebol relies on a lot of code for scanning, with many branches, so bad for modern CPU with branch predictions). With an optimizing backend, Red's LOAD should in theory run 2-3 times faster than Rebol's one. (Though, we still need to optimize the symbol table loading in order to reach peak performance).  Given that Rebol relies on optimized C code while Red relies on sub-optimal code from R/S compiler, that speaks volume about the efficiency of our own approach. So, Red/Pro should give us a much faster LOAD.
"The lexer is not finished yet, but the hard part is done. We still need to figure out an efficient way to load keywords, like escaped character names (`^(line), ^(page), ...) and month nouns in dates."
This is a huge accomplishment, and it's shaping up to make future goals even more impressive. The fast-lexer branch is a work in progress, but stay tuned: Nenad has more to say about why it's been prioritized just now, which we will have in an upcoming post.


Red's MVPs Contribute New Resource Material & Tools


If you're new to Red, sometimes the flexibility of the language can leave you uncertain about which aggregate structure to use. In red/red's wiki on github, @9214 contributes a useful guide for those seeking to tease apart the differences. For example, map! works better with data that can be atomized, or framed as a conventional associative array, while hash! lends itself to data that will be queried at a high volume and which will require fewer updates. Learn further linguistic nuances, including object! and block!, as well as a useful comparison table of their algorithmic complexity, here@Rebolek, meanwhile, has furnished us with loads of useful information, diving deeper into code statistics. His value datatype distribution, here, his unofficial Red build archive here, and his rebolek/red-tools repo containing various tools--line parsers, codecs, APIs and documentation among them--are greatly appreciated. The tools repo has a number of new features you can check out here.


About Those Ports...


Wondering about port!Here's the latest. We've got port! in the master branch already, but low-level input/output networking abilities aren't complete yet, so we need to focus on this, and your feedback can always help. "We have a working async TCP and TLS ports implementation (both client and server-side)," explains Nenad, "but they still require more work to cover all the target platforms." Here, he goes on to explain the prerequisites for our team to complete this process; your thoughts and code contributions are welcomed.


Games and Experiments


It's a fun one to end this update on: Red community member @GalenIvanov's "Island Alleys," a game of unspooling Hamiltonian paths! A path of this type only allows its line, which inscribes a closed loop, to cross through a vertex within a graph once, a process which can lend itself to neural network-related interpretations. And @planetsizedcpu offers a wintry little spin on this repo. Enjoy, and thanks to all!
Fork me on GitHub