July 29, 2022

New Red binaries

For many years, we have offered pre-built binaries for the Red toolchain as a more convenient way to use Red. They are not strictly needed, as Red can be run from its sources, the toolchain being run by a Rebol2 interpreter. As the Red REPL and toolchain are not run by the same engine, the console (REPL) used to be compiled on the first run of the `red` executable (when no arguments were provided or a Red script was passed). This resulted in a significant delay on the first use of the console (both for the GUI and CLI versions).

We have now decided to change that by providing separate pre-built binaries for the consoles and toolchain. This is a temporary split until Red gets self-hosted, at which point we can recombine everything into a single binary.

Another change is the temporary dropping of semantic versioning until version 1.0 and related "stable" releases, as it seems to be too confusing to some users (Red still being in the alpha stage). This will also remove a tendency among some users to care more about version increments than about feature availability and the work being done overall. We will now offer pre-built binaries only for the latest commit, though older binaries will still be available if they can be of any help to anyone.

So the pre-built binaries now are:
  • Red GUI : Red interpreter + View + GUI console
  • Red CLI : Red interpreter + CLI console
  • Red Toolchain : Encapper for Red + Red/System compiler

We are also considering ways to merge the GUI and CLI consoles into a single binary which can work even if no GUI API is available, falling back on CLI mode. We will also have the console(s) act as a front-end for the toolchain, even downloading it for you in the background when needed. Though for that we need a proper asynchronous `call` function implementation. More news about this soon.

In the meantime, enjoy running Red consoles almost instantly from just a click on the Download page!

July 14, 2022

The Road To 1.0

You cannot have missed that in the last months (and even the last years), our overall progress has slowed down drastically. One of the main reasons is that we have spread our limited resources chasing different objectives while making little progress on the core language. That is not satisfying at all and would most likely bring us to a dead end as we exhaust our funding. We have spent the last weeks discussing how to change that. This is our updated action plan.

From now on, our only focus will be to finish the core language and bring it to the much-awaited version 1.0. We need to reach that point in order to kickstart a broader adoption and provide us and our users a stable and robust foundation upon which we can build commercial products and services necessary for sustainability.

Given the complexities involved in completing the language and bringing an implementation that can run on modern 64-bit platforms, we have devised a two-stage plan.

Upgrade the current 32-bit Red implementation

👉 Language specification

It is now time to write down the language specification, in order to clean up some semantic rules and address all possible edge cases, which will help fulfill our goals of implementation robustness and stability. The process of writing down the complete language specs will result in dropping some features that we currently have that end up being problematic or inconsistent. OTOH, we might add some new features that will need to be implemented for 1.0.

👉 Modules

We need a proper module system in order to be scalable. We also need to have a proper package management system which will be tied to a central repo where we can gather third-party libraries. That would also enable modular/incremental compilation (or encapping) which will be most probably supported in the self-hosted toolchain.

👉 Concurrency

We need a proper model for concurrent execution in order to leverage multicore architectures. We will define one and make a prototype implementation in the 32-bit version.

👉 Toolchain

Before starting to work on the new toolchain, we will make some changes to the existing version in order to prepare for the transition. The biggest change is the dropping of the Red compiler, which will only act as a (smart) encapper. Routines and #system directives will still be supported, but probably with some restrictions. The Red preprocessor might also see some changes. This change means that Red will only have one execution model instead of the two it has currently. The Red compiler has become more of a burden than a help. The speed gains are not that significant in real code (even if they can be in some micro-benchmarks), but the impossibility for the compiler to support the exact same semantics as the interpreter is a bigger problem. This move not only will bring more stability by eliminating some edge case issues but also will reduce the toolchain by almost 25% in size, which will help reduce the number of features to support in the new toolchain.

👉 Runtime library

Some improvements are long overdue in the Red runtime library. Among them:

  • unified Red evaluation stack.
  • unified node! management.
  • improved processing of path calls with refinements.
  • improved object! semantics.

All those changes are meant to simplify, reduce the runtime library code and address some systemic issues (e.g. stack management issues and GC node leaks).

👉 Documentation

We need proper, exhaustive, user-oriented documentation for the Red core language. This is one of the mandatory tasks that needs to be completed and done well for wider adoption.

Self-hosted Red for 64-bit version

👉 Toolchain

In order to go 64-bit, we have to entirely drop our current toolchain code based on Rebol2 and rewrite it with a newer architecture in Red itself. The current toolchain code was disposable and was not meant to live this long, so this was a move we had to make for 1.0 anyway.

So the new toolchain will feature:

  • a new compilation pipeline with a plugin model.
  • an IR layer.
  • one or more optimizing layers.
  • modular/incremental compilation support.
  • x64, AArch64 and WASM backends.
  • linker support for 64-bit executable file formats for the big 3 OS.
  • support for linking third-party static libraries.
32-bit backends will not be supported in 1.0, though they could be added back in the future.

👉 Runtime library

The current Red runtime library written in R/S will be kept and some adjustments will be needed in order to be fully compatible with a 64-bit environment (like updating all imported OS API to their 64-bit versions). 

The View engine will not be part of that upgrade for 1.0, but will be done in a 1.1 version. Priority is given to Red/Core for the 1.0.


Here are the main milestones:

  • v0.7   : Full I/O with async support.
  • v1.0b : (beta) completed self-hosted Red with 64-bit support.
  • v1.0r  : (release) first official stable and complete Red/Core language release.
  • v1.1   : View 64-bit release.
  • v1.2   : Android backend and toolchain release.
  • v1.3   : Red/C3 release.
  • v1.4   : Web backend for View release.
  • v2.0   : Red JIT-compiler release.
  • v3.0   : Red/...

The 0.7 should be the last version for the 32-bit Red version and current toolchain and we will be working on that first.

For reaching the 1.0-beta milestone, we target 12 months of intensive work, so that will bring us to Q3 2023. That's an ambitious goal but necessary to reach for the sake of Red's future.

The currently planned beta period for 1.0 is 2-3 months. We want a polished, rock-solid, production-ready 1.0 release.

For the 1.1, we will probably make some (needed) improvements to View engine architecture and backends.

For Red/C3, as the Ethereum network is transitioning to 2.0 and a new EVM, we need the WASM backend in order to support it.

Version 1.4 will bring a proper web runtime environment to the WASM backend, including GUI support.

The 2.0 will be focused on bringing a proper JIT-compiler to the Red runtime, which should radically improve code execution of critical parts without having to drop to R/S.

Version 3.0 is already planned, but I will announce that once 1.0 is released. ;-)

One major platform is missing from the above plan: iOS. Given how closed that platform is, we will need to come up with a specific plan on how to support it, as we won't be able to cross-compile for it (you would need a Mac computer), nor probably generate iOS apps without relying on Xcode at some point (not even mentioning dynamic code restrictions on the AppStore). These are layers of complexity that Red is trying to fight against in the first place... So for now, that platform is not among our priorities.

To finish, let me borrow some words from someone who succeeded more than anyone else in our industry:

Expect me to say "no" even more so from now on, as we get laser-focused on our primary goal.

Cheers and let's go!

December 31, 2021

2021 Winding Down

Another quarter, another blog post. Seems almost rushed after the previous drought. 

To set the stage, I'll start with a bit of a rant about complexity. If you just want the meat of what's happening in the Red world, feel free to skip the introduction. 

Complexity Considerations: Part 1

I liked what the InfoWorld article "Complexity is Killing Software Developers" said, something we all know, about difficult domains (voice and image recognition, etc.) being available as APIs. This lets us tackle things we couldn't in some cases. Though I imagine @dockimbel or others also used Dragon Dictate's libraries back in the 90s. What we have now is massive data to train systems like that. Those work well, allowing us to add features we otherwise couldn't with a small team.

The problem I see is that the trend has become for everything to be outsourced, including simple features like logging, and those libraries have exploded. There must be graphs available to show the change. Moderately complex domains, UIs for example, have risen in number and lead to what @hiiamboris says about Brownian Movement. It's a random collection of things, not designed to work together, without a coherent vision. A quote from the above article says it this way:

"Complexity is less the issue than inconsistency in an environment."

It used to be that you could take a FORTRAN, COBOL, Lisp, VB, Pascal/Delphi, Access/PowerBuilder, dBase/Clipper/Paradox, or even a Java developer, drop them into a project, and they could work from a solid core, learning the team's custom bits and any commercial tools as they went. With JS leading the way, but not alone in this, a programmer can only rely on a much smaller core, relative to how many libraries are used.

Because those libraries, and the choices to use a particular combination of them were not designed to work together, there is no guarantee (or perhaps hope) of consistency to leverage. It's worse if you came from a history of other tools that were based on different principles or priorities, because you have to unlearn, breaking the patterns in your mind. Or you convince people to use what you did before, even if there is overlap with tools already in use.

Things are changing now, and will even more. New service-based companies are coming, and a drive to APIs rather than libraries. So we not only have risks like LeftPad, but also companies going out of business under you. The modern trend means it's no longer dependent on an author or team committed to a project long term, but to what investors want, and what changes are made to gain adoption at all costs. As a service-based company you can't hold dearly to design principles if the investors tell you to pivot. Because it's no longer about your vision, but their return. If it is a solo FOSS author or small team, what is their incentive to maintain a project for free, while others profit from it? Success can be your worst enemy, and we need a more equitable solution than what we have now. The software business model has changed dramatically, and will likely continue to do so.

Here is what I personally see as the crux of the problem: the goal of scaling. FOSS projects and companies are only considered successful if they have millions (or, indirectly, billions) of users. Companies that want to be sustainable, providing long term, moderate profits don't make headlines, but they make the world go 'round. They are not the next big social media disruption where end users are the product, to be bought and sold. It is a popular business model and profit is the goal. It's nothing personal.

This has led us to the thinking that every project needs to be designed for millions of users at the very least. Sub-second telemetry for all the data collected, another explosion, giving rise to data analytics for everyone; not just Business Intelligence (BI) for large companies. I won't argue against having data. I love data and learning from it. But I do believe there is a point of diminishing returns which is often ignored. Rather, in this case, there is a cost of entry that small projects wouldn't otherwise need to pay.

What do you do, as an "architect" (see the previous blog post about my thoughts on software architecture) or developer on a team? Your small team (we all know small teams are best, plenty of research and history there) simply can't design and build every piece to support these scaling demands, while the sword of Damocles hangs over you in the form of potential pivots (dramatic changes in goals).

As an industry, we are being inexorably forced to make these choices. Either you're a leader and make your own Faustian bargain, or you're in the general mass of developers being whipped and driven to the gates of Hell.

Only you, dear reader, can decide the turns this tragic story will take, and what you forgive in this telling perchance I should exaggerate.

Complexity Considerations: Part 2

Complexity doesn't come only in the form McCabe is famous for, the decision points in a piece of code, but in how many pieces there are and how often they change, either by choice or necessity. Temporal Complexity, if you will. This concept is unrelated to algorithmic time complexity. Rebol2, for all the faults we can point out, still works to this day (except in cases where the world changed out from under it, e.g., in protocols). It was self-contained, and relied only on what the OS (Operating System) provided. As long as OSs don't break a core set of functionality that tools rely on, things keep working. R2 had a full GUI system (non-native, which insulated it from changes there), and I can only smile when I run code that is 20 years old and it works flawlessly. If that sounds silly, remember that technology, in most cases, is not the goal. It is a means to an end. A lot of very old code is still in production, keeping businesses running.

We talk about needing to keep up with changes, but some things don't change very much, if at all. Other things change rapidly, but for no good reason, and without being an improvement. If a change is just a lateral move there is no value in it, unless it is to align us on a different, and better, path in the future. I started programming with QuickBASIC, but also used other tools as I quickly learned my tool of choice came with a stigma attached, and I wanted to be a serious, "real" programmer. What became clear was that QB was a great tool, with a few companies providing terrific ASM libraries, and had a wonderful IDE to boot. It was simpler, not only as a language, but because every 12-18 months (the release cycle way back when) my new C compiler would break something in my code. But QB, and later BASIC/PDS and then VB very rarely broke working code. Temporal complexity.

Even then there were more complex options. The cool kids used Zortech C++ and there were various cross-platform GUI toolkits. But those advanced tools were often misapplied to simple projects. We still do that today. Much of that is human nature, and the nature of programmers. If it's easy we are no longer special. We may not mean to, but we make things harder than they need to be. Some of us are even elitist about what we do, to our own detriment. If you don't need to be cross platform, why do you have multiple machines or VMs each with a different compiler setup? If you need a GUI, why are you using a language that was not designed with them in mind? If you need easy deployment, which is simpler: a single EXE with no dependencies, or a containerization approach with all that entails? How many technologies do you need in your web stack? Are you the victim of peer pressure, where you feel your site has to be shiny and "responsive", or use the latest framework?

A big argument for using others' work is performance. They've taken time, and may be experts, to optimize Thing X far beyond what you could ever do. That JIT compiler, an incredible virtual DOM, such clever CSS tricks, the key-value DB with no limits, and yet...and yet our software is slower and more bloated than ever. How can that be? Is it possible we're overbuilding? Is software sprawl just something we accept now?

Earlier I mentioned that a hodge-podge assembly of parts that have no standards, norms, or even aesthetic sense applied does not make our lives easier. Lego blocks, the originals anyway, are limited, but consistent in how they can be used. We misapply that analogy, because the things we build are far from consistent or designed to interact. Even in the realm of UX and A/B testing on subsets of users that companies apply today. I love the idea of data-driven HCI to guide us to a more evidence-oriented approach. This includes languages. But when a site or service moves fast and changes their interface based on their own A/B testing, they don't account for the others doing the same. Temporal complexity.

As a user, every app or site I access may change out from under me in the flash of refresh or automatic update I didn't ask for. Maybe it's better, an actual improvement, if you only use that one site. But if all your tools constantly change out from under you, it's like someone sneaking into your office and rearranging it every night while you sleep. Maybe this is the developer's revenge, for the pain we inflict on ourselves by constantly changing our own tools. If we suffer, why shouldn't our users? For those who truly have empathy for their users and don't want to drive them mad, or away, perhaps the lesson is to have empathy for ourselves, for our own tribe. I don't want to see my friends and colleagues burn out, when it was probably the enjoyment and passion that solving problems with software can bring which led them here to begin with.

Every moving part in your system is a potential point of failure. Reduce the moving parts and reliability increases. Whether it's the OS you run on (we now have more of those than ever, between Linux distros and mobile platforms always trying to outdo each other), extra packages or commercial tools, FOSS libraries, environments, [?]aaS, or platform components like containers and cluster management, every single piece is a point of failure. And if any of them break your code, or your system, even in the name of improvements or bug fixes, you may find yourself running just to stay in the same place. Many of those pieces are touted as the solution to reliability problems, but a lot of them just push problems around, or target problems you don't have. Don't solve problems you don't have. That adds complexity, and now you really have a problem.

Less Philosophy, More Red

Interpreter Events

Having a debugger in Red has been a request of many users for a long time, even since the Rebol era. We have tackled this feature from a larger perspective, considering general instrumentation of the interpreter (note: not the compiler), extending it with an event system and user-provided event handlers, similar to how parse and lexer tracing operate today. This approach allows us to build more than just a debugger, though it was a lot of work to design and we expect it will be refined once people start using it in earnest. It's a brave new world, with a lot of tooling possibilities.

It's important to note that this is not magic. Because it operates as the interpreter evaluates values and expressions, including functions, it can't see into the future. In order to get a complete trace, you have to evaluate everything. That means we'll see tools which silently collect data, like a profiler does, which can later be viewed and analyzed, perhaps up to the point where an error occurred. This is an important aspect, and plays once again into the power of Red as data. Your event handlers can easily collect data into any structure or model you like. And because event handlers can filter events, you can tailor them for specific needs. It should even be possible to build interpreter level DTrace-like tools in the future. We also hope to build higher level observability and monitoring tools, based on eventing systems, in the future, but those are long term projects.

Event generation is not active by default. It is enabled by using do/trace and providing an event handler function. For example, here's a simple logging function:
  logger: function [
      event  [word!]                      ;-- Event name
      code   [any-block! none!]           ;-- Currently evaluated block
      offset [integer!]                   ;-- Offset in evaluated block
      value  [any-type!]                  ;-- Value currently processed
      ref    [any-type!]                  ;-- Reference of current call
      frame  [pair!]                      ;-- Stack frame start/top positions
  ][
      ;-- Print the event name padded to 8 columns, then a short form of
      ;-- the call reference (for functions) or of the value itself
      print [
          pad uppercase form event 8
          mold/part/flat either any-function? :value [:ref][:value] 20
      ]
  ]
Given this code:
  do/trace [print 1 + 2] :logger
It will output:
  INIT    none                    ;-- Initializing tracing mode
  ENTER   none                    ;-- Entering block to evaluate
  FETCH   print                   ;-- Fetching and evaluating `print` value
  OPEN    print                   ;-- Results in opening a new call stack frame
  FETCH   +                       ;-- Fetching and evaluating `+` infix operator
  OPEN    +                       ;-- Results in opening a new call stack frame
  FETCH   1                       ;-- Fetching left operand `1`
  PUSH    1                       ;-- Pushing integer! value `1` on stack
  FETCH   2                       ;-- Fetching and evaluating right operand
  PUSH    2                       ;-- Pushing integer! value `2`
  CALL    +                       ;-- Calling `+` operator
  RETURN  3                       ;-- Returning the resulting value
  CALL    print                   ;-- Calling `print`
  3                               ;-- Outputting 3
  RETURN  unset                   ;-- Returning the resulting value
  EXIT    none                    ;-- Exiting evaluated block
  END     none                    ;-- Ending tracing mode
Several tools are now provided in the Red runtime library, built on top of this event system:
  • An interactive debugger console, with many capabilities (step by step evaluation, a flexible breakpoint system, and call stack visualisation).
  • A simple profiler that we will improve over time (especially on the accuracy aspects).
  • A simple tracer. The current evaluation steps are quite low-level, but @hiiamboris has already built an extended version, operating at the expression level that will soon be integrated into the master branch.
Full docs are here.
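
As a rough illustration of the "silently collect data" idea, here is a minimal sketch of a handler that tallies events instead of printing them. It reuses the handler signature from the logger above; the names `counts` and `counter` are made up for this example, and the filtering is done by hand inside the handler rather than through any dedicated filtering feature.

```red
Red []

counts: make map! []                    ;-- event name -> number of times seen (hypothetical name)

counter: function [
    event  [word!]                      ;-- Event name
    code   [any-block! none!]           ;-- Currently evaluated block
    offset [integer!]                   ;-- Offset in evaluated block
    value  [any-type!]                  ;-- Value currently processed
    ref    [any-type!]                  ;-- Reference of current call
    frame  [pair!]                      ;-- Stack frame start/top positions
][
    ;-- Manual filter: ignore everything except call-related events
    if find [open call return] event [
        put counts event 1 + any [select counts event 0]
    ]
]

do/trace [print 1 + 2] :counter         ;-- still prints 3, but logs nothing
probe counts                            ;-- inspect the collected tallies afterwards
```

Going by the trace shown above for the same expression, each of the three tallies should come out as 2. The collected map can then be analyzed after the fact, which is the pattern a profiler would follow.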


Format

Boy, I really thought this was going to be easy, or at least not too hard. I couldn't have been more wrong. When I did my format experiments, I imagined at least some of the code would be useful, requiring polish and more work of course, providing a foundation to work from. It turns out that I missed a key aspect, and my approach was just one of many possible. @hiiamboris and @giesse both weighed in, and we chatted about specific parts. Then it sat idle for a while, and I asked Boris to take it over to get it into production. He identified the key missing piece, which would have limited its usefulness until we eventually had to address it. Better now than later. He also made a strong case for a different approach to the core masked-number and I told him to run with it. That led to a lot of design chat about one aspect, which is as yet undecided. It's not a fight to the death, but there has definitely been some sparring. :^)

The missing piece I've alluded to is Localization (L10N). As an American who has never had to develop software requiring Internationalization (I18N), I've been blissfully ignorant of all the aspects that come into play when Globalization (G11N) becomes part of the process. We have talked about how to implement L10N in Red, and have system/locale for months, weekdays, and currency codes. The first two we inherited from Rebol's design, the latter was added when @9214 designed the currency! datatype. Thinking of locale data in a system catalog of some kind is easy enough, but how to actually apply it (and not apply it when necessary) is a different story entirely. And I mean entirely. Format forced us to start down this path, and is a guinea pig feature that will guide plans for all future L10N work. But keep my complexity rant in mind. While we want to make it as easy as possible for Reducers to write globally aware apps, if you don't need it, don't do it. We don't yet know if we can make it so magical that you can write your app ignoring that for the most part, and then flip a switch, or simply include local data, and have it work. Don't get your hopes up. There's a lot that can go wrong with that approach.

We agreed to start with masked numbers but, in order to do that, L10N R&D had to be done. This led to broad and deep dives into unicode.org and other resources. While they cover far more than we need, and are overly complex in many cases (or just don't match our aesthetic sense for Red), the data they have there is enormously valuable, and we deeply appreciate it being available. We just draw the line around a smaller scope than they do, and no committees are involved where people fight to get their own bits included. Well, we do that too, to some extent. What Boris managed to do was identify the key elements needed for our work, and then write tools (using Red of course) to extract and reformat the data for use in Red. I can't stress enough how much work this was. Truly a heroic and mostly thankless effort most people will never know about.

In order to test masked number formatting, and give others an easy way to play, Boris created a Playground App and I can't tell you how important that was. You see, a particular piece of behavior came up while I was playing with it and got unexpected results. Unexpected to me, but Boris confirmed it was by design. I will just say here that it's about a significant digits mode, and let you play with the app from there. Named formats will be available, but everything will likely boil down to wrappers around masks, which should cover almost any need.

Next up is date formatting. This time I knew locales would play a role because some IETF RFCs specify that date elements be in English. So you may have localized dates for some things, but if you use RFC2822 dates or HTTP cookie dates, they must not be affected by any locale settings. Dates will use masks at the core, like numbers, because masks are an easy to understand WYSIWYG format. Well, easy if the masks make sense. If you look at printf and some other mask syntax, it can be quite obscure. By trying to cram things into a limited syntax, people end up using whatever low ASCII letters might be left over for some elements. We hope to avoid that. 

We have two main choices. The first is what Boris termed the stuttering format, e.g. MMDDYYYY/HHMMSS. Think in terms of "progressing in a hesitant or irregular way" rather than stuttering in terms of human speech. I prefer to call this a symbolic format, where the letters map to date elements. This, of course, isn't perfect: is MM month or minute? Context is required. We don't want to be case sensitive, or use other letters randomly to avoid that conflict. So there's an alternate approach: a literal mask, e.g. 1-Jan-2022. We're not the first to consider it, and it is in use elsewhere, but it's not a perfect solution either. Do masks have to be written in English terms, or can they use any locale? How do you disambiguate numbers (does 01-01 mean MM-DD or DD-MM, and how do you write that without the separator to get MMDD)? Does it make code more or less readable, given that Red already has a literal date form and this would add what look like literal dates as strings in code?
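
To make the readability concern concrete, here is a small sketch that puts a literal date next to a string that merely looks like one. The string `m` stands in for a hypothetical literal mask; no formatting function is called, since that interface is still being designed.

```red
Red []

d: 1-Jan-2022                   ;-- literal date! value, recognized by the lexer
m: "1-Jan-2022"                 ;-- a string that only looks like a date (hypothetical literal mask)

print type? d                   ;-- date!
print type? m                   ;-- string!
print [d/year d/month d/day]    ;-- 2022 1 1
print d = load m                ;-- true: loading the string yields the same date
```

Side by side in code, the two differ only by the quotes, which is exactly the kind of visual overlap the last question above is about.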

Play with the app, give us feedback, and stay tuned. We think this will be a crucial feature for a lot of users, and we want to make it the best it can be.


Split

Like format, split seems a relatively simple subject at a glance. And if you limit it to basic functionality, it is. That's what other languages do, though some add a few extra features. See this table for examples. Wolfram appears quite broad in scope, because there are multiple variants for each named function. Something else common to all other languages is that they split only strings and sometimes byte arrays. In Red we have blocks, and while `parse` is great for string parsing, where it really shines is when applied to blocks to build dialects. We knew split should be block-aware for more leverage. I (Gregg) helped design the version in R3, and used DiaGrammar to design a new dialected interface that aimed to extend the functionality. Wanting to do more evidence-based language design, I also prototyped a small practice/playground app, thinking we'd put it out and see what kind of feedback we could get.
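
For readers who have not split a block before, the following sketch contrasts string splitting with a hand-rolled block version. `split-block` is a hypothetical helper written for this example only; it is not the dialected or refinement-based design under discussion, just the simplest case of cutting a block at a delimiter value.

```red
Red []

;-- String splitting with the existing `split` function
probe split "a,b,c" ","                         ;-- ["a" "b" "c"]

;-- Hypothetical helper: cut a block at every occurrence of a delimiter value
split-block: function [data [block!] delim][
    result: copy []
    part:   copy []
    foreach item data [
        either :item = :delim [
            append/only result part             ;-- close the current part
            part: copy []                       ;-- and start a new one
        ][
            append/only part :item
        ]
    ]
    append/only result part                     ;-- last part (may be empty)
    result
]

probe split-block [a b sep c d sep e] 'sep      ;-- [[a b] [c d] [e]]
```

Even this tiny version raises design questions (empty parts, multiple delimiters, splitting by size or by rule), which is where the interface discussion starts.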

Toomas stepped up and suggested an alternative, refinement-based, interface. He did a number of versions of that, and then we had to decide what to do next. There was a great deal of design discussion, still going on, about behavior details. Once you start adding options, it's easy for things to become confusing for the user. We need to strike a balance between ease of use and flexibility. Split is meant to handle the most common cases, and those with the most leverage, not every case. And while a refinement-based interface seems natural for Reducers, we also know how readable parse, draw, and VID dialects are. There are pros and cons to each, but we don't want dual interfaces, which will be confusing. If a function is dialected, any refinements should work in support of that dialect. So the test app was reimagined by @GalenIvanov to compare the two approaches.

Here's a screenshot of the test app, which we'll release to the community in January.

We learned by doing this that it's hard to compare them side by side, without having the user write full calls directly. That defeats some of the purpose, and the DRY principle, so we'll put this one out, then revise it based on feedback.

Markup Codec

Who knew that parsing HTML and XML would be the easy part? Well, many Reducers would. What they, and we on the team, might not have guessed, is just how hard it is to decide on a data format for the output. Red gives us many options, and XML gives us many headaches. The two formats, while closely related, also have some critical differences. Fortunately, once @rebolek set things up so we could play, and made the emitter modular, we could look at real examples and dive even deeper. What we discovered is that there is no perfect solution. No elegant model to fit all uses and cases. Key to many insights was @dander's input, as he works with XML a lot. Turns out, an infinitely extensible format is infinitely challenging to nail down.

Should we emphasize path access? Being data driven, people probably shouldn't hardcode their field names, but working with known data makes it a clear access model. Should attributes come before or after the text/content for a tag? As we learned, attributes aren't always small, so the locality argument isn't won either way there. Is it better to provide an interface to the structure and tell people to always use that, or to create a bland and obvious data structure that is possible to access in many ways? Will these things all complicate HOF access, which we know we want to leverage? How much do we need to care about efficiency? We don't want to be wasteful without purpose, but if we're too miserly, users may pay the price because it's harder to use. If we make more things implicit, do we paint ourselves into a corner somehow?

What we settled on was a modular approach, so there will be more than one standard emitter. What is yet undecided is how other emitters might be supported. They will likely be quite custom, as the standard versions will cover most needs. But is it worth making the system extensible? Once you have a result, it's easy to post-process into your preferred format. For now that's our recommended approach.

CLI Module

If you don't follow our channels on Gitter, you may not know about Boris' CLI module. It's very slick, very Reddish, and will become a standard part of Red in the near future. You won't believe how easy it is to create rich command line interfaces for your Red apps with this feature. Huge thanks to @hiiamboris for all his innovation and work on it.

IPv6 Datatype

It hasn't been merged to the mainline yet, but it's fully operational. You can see the code here, and some lexer tests here. You may be impressed that it's only a couple hundred lines of code, not counting the lexer changes, and think it was easy. It wasn't. As usual, there was a lot of design chat and compromise involved. For example, the name is not 100% finalized because, technically, the datatype itself is more generally applicable, being simply a vector of numbers internally. You can think of it like a tuple! on steroids. Fewer slots (8 vs 12), but each slot can hold a larger value (tuple! slots are limited to byte values).

Just as tuple! is a general name, used both for IPv4 addresses and colors, but also useful for other things, IPv6! could be used for things like GUIDs or extended time values. But the lexical form for GUID/UUID values is quite different, even ignoring the shortcut forms in the IPv6 specification. As you probably know, lexical space is tight in Red, and the colon is an important character in other places, and URL lexical forms were impacted, so this is a deep change and commitment, in that regard. Why do it then?

Because IPv6 networking support was already in place in Red, and IPv6 is the future. How often people will write literal URLs like http://[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]:80/index.html we can't say. But we do know that addresses often end up in config files as data and that modern, dynamic systems generate addresses dynamically. They will appear in log files, messages, and more. As with the value of other lexical forms in Red, it's an important one that is part of our modern networking vocabulary.

Getting Near

@dockimbel created a new branch here, which will interest almost every Reducer. It's not ready yet, but expect it to be available in January. For those who used R2, you may recall that errors gave you a Near field, to hint at where the error occurred. Red will get this feature when the new branch is merged. For example, in Red today you get this:

    >> 1 / 0
    *** Math Error: attempt to divide by zero
    *** Where: /
    *** Stack:  

Where in R2 you got this:

    >> 1 / 0
    ** Math Error: Attempt to divide by zero
    ** Near: 1 / 0

A little extra information goes a long way. We're eager to see all the virtual smiles this feature brings.

The Daily Grind

We closed roughly 120 tickets in 2021, that's 10 per month. We also merged almost 50 PRs. These numbers don't sound large, but when you consider how much time and effort may go into the deep ones, along with all the other work done, it's steady progress. We'd love for both tickets and pending PRs to be at zero, but that's not practical for a project like Red. The deep core team must have uninterrupted time for design and bigger, more complex tasks.


Q4 2021 (retrospective)

  • We hoped to have `format` and `split` deployed, but they have been pushed back to Jan-2022.
  • `CLI` module approved, needs to be merged, then refined as necessary.
  • `Markup Codec` took longer than expected due to extensive design chat on formats.
  • Interpreter instrumentation, with PoC debugger and profiler. Took longer than expected, but are out now.
  • Async I/O, out but some extra bits didn't make it in. One unplanned addition was `IPv6!` as a datatype. It's experimental, and subject to change.
  • @galenivanov did some great work on his animation dialect, but @toomasv's `diagram` dialect took a back seat and will move to Q1 2022.
  • Audio has 3 working back ends and a basic port implementation. Next up is higher level design, device and format enumeration, and device control. A `port!` may not be the way to go for all this, but it was step one.
  • Animation has more great examples all the time. Like this and this. @GalenIvanov is doing great work, and we are planning to make his dialect a standard addition to Red.


I'm not going to list items in any particular order, because our plans often change. This way you have things to look forward to, but still with an element of surprise.

  • `Table` module, `node!` datatype and other REP reviews
  • Full HTTP/S protocol and basic web server framework
  • New DiaGrammar release
  • Animation dialect
  • New release process
  • New web sites updated and live
  • Red/C3 (Including ETH 2.0 client protocol)
  • Red Language Specification (Principles, Core Language, Evaluation Rules, Datatype Specs (including literal forms), Action/Native specs, Modules spec)
  • 64-bit support (LLVM was a possibility, but we learned from Zig that LLVM breaking changes can be quite painful for small teams to keep up with. We may be better off continuing to roll our own, though it's a big task.)
  • Android update
  • Red Spaces cross-platform GUI
  • Module and package system design
  • RAPIDE (Rapid API Development Environment)

RAPIDE, from Redlake Technologies

If you've used Postman or Insomnia, you know what the most popular tools in the API IDE space look like today. If you haven't used them, but use APIs, they're worth a look. For all that those tools do, and there are a few other players in the space, there is a lot they don't do. We think we can add a lot of value in the API arena, thanks to Red's superpowers and how important data-centric thinking is. For example, testing a group or series of APIs together seems like it could be greatly improved. Also, how APIs are found, and collaboration possibilities.

While we haven't set a release date, the plan is to start work on RAPIDE in Q2 2022, after we wrap up some infrastructure pieces it will rely on. 

In conclusion

Happy New Year to all, and may 2022 see us all healthy, happy, and writing more Red. :^)

August 4, 2021

Long Time No Blog

It's been almost a year since our last blog post. Sorry about that. It's one of those things that falls off our radar without a person dedicated to it, and we run lean so don't have anyone filling that role right now. We know it's important, even if we have many other channels where people can get information. So here we are.

Last year was a tough year all around, even for us. We were already a remote-only team, but the effect the pandemic had on the world, particularly travel, hit us too. We had some team changes, and also split our focus into product development alongside core Red Language development. This is necessary for sustainability, because people don't pay for programming languages, and they don't pay for Open Source software. There's no need to comment on the exceptions to these cases, because they are exceptions. The commercial goal, starting out, is to focus on our core strengths and knowledge, building developer-centric tools. Our first product, DiaGrammar for Windows, was released in December 2020, and we've issued a number of updates to it since then. Our thanks to Toomasv for his ingenuity and dedication in creating DiaGrammar. We are a team, but he really accepted ownership of the project and took it from an idea to a great product. Truly, there is nothing else like it on the market. 

We learned a lot from the process of creating a product, and will apply that experience moving forward. An important lesson is that the product itself is only half the work. As technologists, we're used to writing the code and maybe writing some docs to go with it. We don't think about outreach, marketing, payments, support, upgrade processes for users, web site issues, announcements, and more. The first time you do something is the hardest, and we're excited to improve and learn more as we update DiaGrammar and work on our next product. We'll probably announce what it will be in Q4. One thing we can say right now is that the work on DiaGrammar led to a huge amount of work on a more general diagramming subsystem for Red. It's really exciting, and we'll talk more about that in a future blog post.

So what have we been doing?

Since our last blog post we've logged over 400 fixes and 100 features into Red itself. Some of these are small, but important, others are headline-worthy; some are deep voodoo and some visible to every Reducer (what we call Red users). For example, most people use the console (the REPL), so the fixes and improvements there are easy to see. A prime example is that the GUI console, but not the CLI console, didn't show output if the UI couldn't process events. This could happen if you printed output in a tight loop. The results would only show up at the termination of the loop, when the system could breathe again. That's been addressed, but wasn't easy and still isn't perfect. Red is still single threaded, so there's no separate UI thread (pros and cons there). We make these tradeoffs every day, and need feedback from users and real world scenarios to help find the right balance. Less obvious are things like improvements to parse, which not everyone uses. Or how fmod works across platforms, and edge cases for lexical forms (e.g. is -1.#NaN valid?). The latter is particularly important, because Red is a data language first.

JSON is widely used, but people may not notice that the JSON decoder is 20x faster now, unless they're dealing with extremely large JSON datasets. JSON is so widely used that we felt the time spent, and the tradeoffs made, were worth it. It also nicely shows one of Red's strengths. Profiling showed that the codec spent a lot of time in its unescape function. @hiiamboris rewrote that as a Red/System routine, tweaked it, and got a massive speedup. No external compiler needed, no need to use C, and the code is inlined so it's all in context. Should your JSON be malformed, you'll also get nicer error information now. As always, Red gives you options. Use high level Red as much as possible, for the most concise and flexible code, but drop into Red/System when it makes sense.

Some features cross the boundary of what's visible. A huge amount of work went into D2D support on Windows. D2D is Direct2D, the hardware-accelerated backend for vector graphics on Windows. For users, nothing should change as all the details are hidden. But the rendering behavior is not exactly the same. We try to work around that, but sometimes users have to make adjustments as well; we know because DiaGrammar is written in Red and uses the draw dialect heavily. It's an important step forward, but comes at a cost. GDI+ is now a legacy graphics back end, and won't see regular updates. Time marches on and we need to look forward. As if @qtxie wasn't busy enough with that, he and @dockimbel also pushed Full I/O forward in a big way. It hasn't been merged into the main branch yet, but we expect that to happen soon. @rebolek has been banging on it, and has a full working HTTP protocol ready to go, which is great. TLS/SSL support gets an A+ rating, which is also a testament to the design and implementation. It's important to note that the new I/O system is a low level interface. The higher level API is still being designed. At the highest level, these details will all be hidden from users. You'll continue to use read, write, save, load exactly as you do today, unless you need async I/O. 

Another big "feature" came from @vazub: NetBSD support. The core team has to focus on what stands to help the project overall, with regard to users and visibility. Community support for lesser known platforms is key. If you're on one of those platforms, be (or find) a champion. We'll help all we can, but that's what Open Source is for. Thanks for this contribution @vazub!

We also have some new Python primers up, thanks to @GalenIvanov. Start at Coming-to-Red-from-Python. Information like this is enormously important. Red is quite different from other languages, and learning any new language can be hard. We're used to a set of functionality and behaviors, which sometimes makes the syntax the easiest part to learn. Just knowing what things are called is a learning curve. Red doesn't use the same names, because we (and Carl when he designed Rebol) took a more holistic view. That's a hard sell though. We feel the pain. A user who found Red posted a video as they tried to do some basic things. We learned a lot from watching it. Where other languages required you to import a networking library, it's already built into Red. When they were looking for request or http.get, and expecting strings to be used for URLs, they couldn't find answers. In Red you just read http://.... It's obvious to us, but not to the rest of the world. So these new primers are very exciting. We have reference docs, and Red by Example, but still haven't written a User's Guide for Red. We'll get there though. 

Why do things take so long?

Even with that many fixes and features logged, and huge amounts of R&D, it can still feel like progress is slow. The world moves fast, and software projects are often judged by their velocity. We even judge ourselves that way, and have to be reminded to stay the course, our course, rather than imitating others. Red's flexibility also comes into play. Where other languages may limit how you can express solutions, we don't. It's so flexible that people can do crazy things or perform advanced tricks which end up being logged as bugs and wishes. Sometimes we say No (a lot of times in fact), but we also try to keep an open mind. We have to ask "Should that be allowed?", "Why would you want to do that (even though I never have)?", and "What are the long term consequences?" We have to acknowledge that Red is a data format first, and we never want to break that. It has to evolve, but not breaking the format is fundamental. And while code is expected to change, once people depend on a function or library it causes them pain if we break compatibility. We don't want to do that, though sometimes we will for the greater good and the long view. There are technical bandages we can patch over things, but it's a big issue that doesn't have a single solution. Not just for us, but for all software development. We'll talk more about this in the future as well.

I'll note some internal projects related to our "slow and steady" process:

  • Composite is a simple function that does for strings what compose does for blocks. It's a basic interpolator. But the design has taken many turns. Not just in the possible notations, but whether it should be a mezzanine function, a macro, or both. Each has pros and cons (Side note: we don't often think about "cons" being an abbreviation for "consequences"). This simple design and discussion is stalled again, because another option would be a new literal form for interpolated strings. That's what other languages do, but is it a good fit for Red? We belabor the point of how tight lexical space is already, so have to weigh that against the value of a concise notation.
  • Non-native GUI. Red's native GUI system was chosen in response to Rebol's choice to go non-native. Unfortunately it's another case of needing both. Being cross platform is great for Red users, but Hell for us. Throw in mobile and it's even worse. Don't even talk about running in the browser. But every platform has native widget limitations. Once you move beyond static text, editable fields, buttons, and simple lists, you're in the realm of "never the twain shall meet". How do you define and interact with grids and tables or collapsible trees? Red already has its own rich-text widget, so you don't have to embed (even if you could) an entire web browser and then write in HTML and CSS. To address all this, with much research and extensive use case outlines, @hiiamboris has spent a lot of time and effort on Red Spaces. Show me native widgets that can do editable spiral text, put any layout inside a rotator, or define recursive UIs. I didn't think so. Oh, and the wiggling you see in the GIFs there is not a mistake or an artifact; it's a test to show that any piece of the UI can be animated.
  • Other projects include format, split, HOFs, and modules, each with a great deal of design work and thought put into them. As an example, look at Boris' HOF analysis. They are large and important pieces, based on historical and contemporary research, but not something we will just drop into Red, though we could. A simple map function is a no-brainer, and could have been there day one. But that's not how we work. It's not a contest to see how many features we can add, or how fast; but how we can move software forward, make things easier, and push the state of the art. Not just in technical features (the engineering part), but in the design of a language and its ecosystem.

Not Everyone Has These Problems

An important aspect of Red is being self-contained. We talk about this a lot. Yes, we're considering LLVM as a target, but that has a big cost, not just benefits. Using our own compiler for everything also has costs, like slowing the move to 64-bit which is an issue for Mac users now. Workarounds like VMs and Docker containers are just that. We want things to be easy for you, but that doesn't mean they're easy for us. Here's an example.

Boris found a bug related to printing time values in Red/System. @dockimbel finally tracked it down, and posted this investigative report:

@hiiamboris It was a (R/S) compiler issue afterall. ;-) size? a was the guilty part. The compiler was wrongly generating code for loading a even though size? is statically evaluated by the compiler and replaced by a static integer value. Given that a was a float type, its value was pushed onto the x87 FPU stack, but never popped. That stack has a 7 slots limit. Running the loop 5 times was enough to leave only 2 slots free. When the big float expression is encountered in dtoa library, it requires 3 free slots on the FPU stack, which fails and results in producing a NaN value, which wreaks havoc in the rest of the code.
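The slot arithmetic in that report can be replayed with a toy model. This is only a sketch in Python, not Red/System and not real x87 behavior: it assumes the 7-slot limit quoted in the report, and the class and method names are made up for illustration.

```python
class ToyFpuStack:
    """Toy model of a fixed-depth FPU register stack (not real x87 semantics)."""

    def __init__(self, slots=7):      # 7 is the limit quoted in the report
        self.slots = slots
        self.used = 0

    def push(self):
        """Push one value; nothing in this model ever pops it again."""
        if self.used == self.slots:
            raise OverflowError("FPU stack full")
        self.used += 1

    def free(self):
        return self.slots - self.used

    def fits(self, needed):
        """True if an expression needing `needed` free slots can be evaluated."""
        return self.free() >= needed


fpu = ToyFpuStack()
for _ in range(5):                    # each loop iteration leaks one pushed value
    fpu.push()

free_after_loop = fpu.free()          # 7 - 5 = 2 slots left
big_expr_ok = fpu.fits(3)             # the dtoa expression needs 3 free slots
```

With two slots free and three needed, `big_expr_ok` is false, which corresponds to the NaN described in the report.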

The fix in the compiler was trivial (fetch-expression/final vs fetch-expression) but getting to that point was not. Understanding machine architectures at the lowest levels isn't for everyone, but even though our compiler code will be rewritten in the future, it's small and maintainable today. If we rely on GCC, Clang, or other compilers, hitting a bug may mean hitting a wall. So while there are costs to using our own compiler, there are also costs to depending on others. Robert Heinlein popularized TANSTAAFL, but the concept is not science fiction. As a side note, just as we moved from GDI+ to D2D, x87 for float support was an early choice meant to support older platforms and we are planning to switch to SSE.

If compilers are your thing, or you like system level programming, join our community and get to know Red/System. See how our toolchain works, and consider joining us.

The Big Picture

I just read 101 Things I Learned in Architecture School (which I heard about via Kevlin Henney, though it may not have been that specific talk), and what struck me the most about it is how we've commandeered the word "architecture" for software but completely removed the human aspect. An architect does so much more than we do. Software architects are really structural engineers. If a single developer builds a complete app, they have to do the UI. They engineer a building and then slap on whatever sheathing is at hand, cutting doors and windows without concern for their location. And the app is viewed in isolation, as if it's the only thing a user has on their system, without consideration for its site, context, or relationships. What makes real architecture hard (and why the author notes that many architects hit their stride when they are older), is that you have to know so much. So many considerations, disciplines, and constraints are involved, and you have to unify them. It's both creative and scientific. What makes great architecture great is that it makes your experience better. Maybe even wonderful. If we only think about the mechanical aspects, our software will never be beautiful.

We haven't articulated this view for what we do, I think because we didn't realize it. At least I didn't. We talk about the whole being greater than the sum of its parts, and not just making everything libraries so it feels more natural and less mechanical. How a REPL and single exe make it easier to get started, and not having to use many tools is better. But we haven't explicitly said "Here's how it's laid out, and why. Here's how it's put together; these are the critical elements. Here's what it looks like from a distance, and when you enter its space." Implicitly we do that every day, through the work, but we don't talk about it. Or only once a year.

August 20, 2020

Red/System: New Features

In the past months, many new features were added to Red/System, the low-level dialect embedded in Red. Here is a summary in case you missed them.


During the work on the low-level parts of the new Red lexer, the need arose for intra-function factorization abilities to keep the lexer code as DRY as possible. Subroutines were introduced to solve that. They act like the GOSUB directive from the BASIC language. They are defined as a separate block of code inside a function's body and are called like regular functions (but without any arguments). So they are much lighter and faster than real function calls and require just one slot of stack space to store the return address.

The declaration syntax is straightforward:

    <name>: [<body>]

    <name> : subroutine's name (local variable).
    <body> : subroutine's code (regular R/S code).

To define a subroutine, you need to declare a local variable with the subroutine! datatype, then set that variable to a block of code. You can then invoke the subroutine by calling its name from anywhere in the function body (but after the subroutine's own definition).

Here is a first example of a hypothetical function processing I/O events:

    process: func [buf [byte-ptr!] event [integer!] return: [integer!]
        /local
            log do-error [subroutine!]
            e            [c-string!]        ;-- declared here so both subroutines can use it
    ][
        log:      [print-line [">>" tab e "<<"]]
        do-error: [print-line ["** Error:" e] return 1]

        ;-- EVT_* constants and connect/receive/send/close are assumed to be defined elsewhere
        switch event [
            EVT_OPEN  [e: "OPEN"  log unless connect buf [do-error]]
            EVT_READ  [e: "READ"  log unless receive buf [do-error]]
            EVT_WRITE [e: "WRITE" log unless send buf [do-error]]
            EVT_CLOSE [e: "CLOSE" log unless close buf [do-error]]
            default   [e: "<unknown>" do-error]
        ]
        0
    ]

This second example is more complete. It shows how subroutines can be combined and how values can be returned from a subroutine:

    #enum modes! [
        CONV_UPPER
        CONV_LOWER
        CONV_INVERT
    ]

    convert: func [mode [modes!] text [c-string!] return: [c-string!]
        /local
            lower? upper? alpha? do-conv [subroutine!]
            delta [integer!]
            s     [c-string!]
            c     [byte!]
    ][
        lower?:  [all [#"a" <= c c <= #"z"]]    ;-- subroutines returning a logic value
        upper?:  [all [#"A" <= c c <= #"Z"]]
        alpha?:  [any [lower? upper?]]          ;-- subroutines can call other subroutines
        do-conv: [s/1: s/1 + delta]
        delta:   0
        s:       text

        while [s/1 <> null-byte][
            c: s/1
            if alpha? [
                switch mode [
                    CONV_UPPER  [if lower? [delta: -32 do-conv]]
                    CONV_LOWER  [if upper? [delta: 32 do-conv]]
                    CONV_INVERT [delta: either upper? [32][-32] do-conv]
                    default     [assert false]
                ]
            ]
            s: s + 1
        ]
        text                                    ;-- return the converted string
    ]

    probe convert CONV_UPPER "Hello World!"
    probe convert CONV_LOWER "There ARE 123 Dogs."
    probe convert CONV_INVERT "This SHOULD be INVERTED!"

will output:

    HELLO WORLD!
    there are 123 dogs.
    tHIS should BE inverted!

Support for getting a subroutine address and dispatching dynamically on it is planned to be added in the future (something akin to computed GOTO). More examples of subroutines can be found in the new lexer code, like in the load-date function.

New system intrinsics

Several new extensions to the system path have been added.

Lock-free atomic intrinsics

A simple low-level OS threads wrapper API has been added internally to the Red runtime as preliminary work on supporting parts of IO concurrency and parallel processing in the future. In order to complement it, a set of atomic intrinsics were added to enable the implementation of lock-free and wait-free algorithms in a multithreaded execution context.

The new atomic intrinsics are all documented here. Here is a quick overview:
  • system/atomic/fence: generates a read/write data memory barrier.
  • system/atomic/load: thread-safe atomic read from a given memory location.
  • system/atomic/store: thread-safe atomic write to a given memory location.
  • system/atomic/cas: thread-safe atomic compare&swap to a given memory location.
  • system/atomic/<math-op>: thread-safe atomic math or bitwise operation to a given memory location (add, sub, or, xor, and).
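To show what compare&swap is for, here is the classic retry-loop pattern it enables. This is a Python emulation, not Red/System code, and it does not reflect the actual signatures of the `system/atomic` intrinsics: the `AtomicInt` class and `atomic_add` helper are hypothetical, and a lock stands in for the single CPU instruction a real implementation would use.

```python
import threading


class AtomicInt:
    """Emulated atomic integer cell. A lock stands in for the hardware
    compare-and-swap instruction, so this class itself is NOT lock-free."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        """Store `new` only if the cell still holds `expected`; report success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def atomic_add(cell, n):
    """CAS retry loop: re-read and retry until no other thread interfered."""
    while True:
        old = cell.load()
        if cell.cas(old, old + n):
            return old + n


counter = AtomicInt()


def worker():
    for _ in range(1000):
        atomic_add(counter, 1)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = counter.load()   # 4 threads x 1000 increments
```

The point of the pattern is that a thread never blocks while holding shared state: if another thread changed the value between the read and the swap, the swap fails and the loop simply tries again.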

Other new intrinsics
  • system/stack/allocate/zero: allocates a storage space on stack and zero-fills it.
  • system/stack/push-all: saves all registers to stack.
  • system/stack/pop-all: restores all registers from stack.
  • system/fpu/status: retrieves the FPU exception bits status as a 32-bit integer.

Improved literal arrays

The main change is the removal of the hidden size inside the /0 index slot. The size of a literal array can now only be retrieved using the size? keyword, which is resolved at compile time (rather than run-time for /0 index access).

A notable addition is the support for binary arrays. Those arrays can be used to store byte-oriented tables or embed arbitrary binary data into the source code. For example:

    table: #{0042FA0100CAFE00AA}
    probe size? table                      ;-- outputs 9
    probe table/2                          ;-- outputs "B"
    probe as integer! table/2              ;-- outputs 66

The new Red lexer code uses them extensively.

Variables and arguments grouping

It is now possible to group the type declaration for local variables and function arguments. For example:

    foo: func [
        src dst    [byte-ptr!]
        mode delta [integer!]
        return:    [integer!]
        /local
            p q buf  [byte-ptr!]
            s1 s2 s3 [c-string!]
    ][
        0                                  ;-- body omitted, returns a dummy value
    ]

Note that the compiler supports those features through code expansion at compile time, so that error reports could show each argument or variable having its own type declaration.

Integer division handling

Integer division at the low level has notorious shortcomings, with each edge case handled differently depending on the hardware platform. The Intel IA-32 architecture tends to handle those cases in a slightly safer way, while the ARM architecture silently produces erroneous results, typically in the following two cases:

  • division by zero
  • division overflow (-2147483648 / -1)

An IA-32 CPU will generate an exception, while ARM CPUs will return invalid results (0 and -2147483648 respectively). This makes it difficult to write code that behaves the same on both architectures when integer divisions are used. To reduce this gap, the R/S compiler now generates extra code for ARM targets to detect those cases and raise a runtime exception. These extra checks are produced only in debug compilation mode. In release mode, priority is given to performance, so no runtime exception will occur in such cases on ARM (the overhead is significant). So be sure to test your code thoroughly on ARM in debug mode before releasing it. This is not a perfect solution, but it at least makes it possible to detect those cases through testing in debug mode.
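The two checks are simple to state. Here is a minimal sketch of the logic in Python (not Red/System, and not the code the compiler actually emits); the function name and the choice of exceptions are assumptions made for illustration, and the operands are assumed to already fit in 32 bits.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def checked_div32(a, b):
    """Divide two 32-bit signed integers, raising on the two edge cases
    instead of returning a silently wrong result."""
    if b == 0:
        raise ZeroDivisionError("division by zero")
    if a == INT32_MIN and b == -1:
        # The true result, 2147483648, does not fit in a signed 32-bit integer.
        raise OverflowError("division overflow")
    # Truncate toward zero, as x86 and ARM signed division instructions do
    # (Python's // operator floors instead, hence the manual sign handling).
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q
```

Every other pair of 32-bit operands produces a result that fits in 32 bits, which is why only these two cases need a guard.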


Here is a list of other changes and fixes in no particular order:

  • Cross-referenced aliased fields in structs defined in same context are now allowed. Example:

        a!: alias struct! [next [b!] prev [b!]]
        b!: alias struct! [a [a!] c [integer!]]
  • -0.0 special float literal is now supported.
  • +1.#INF is also now supported as a valid literal in addition to 1.#INF for positive infinity.
  • Context-aware get-words resolution.
  • New #inline directive to inline assembled binary code.
  • Dropped support for % and // operators on float types. They relied on the FPU's relative support, so the results were not reliable across platforms. Use the fmod function instead from now on.
  • Added --show-func-map compilation option: when used, it will output a map of R/S function addresses/names, to ease low-level debugging.
  • FIX: issue #4102: ASSERT false doesn't work.
  • FIX: issue #4038: cast integer to float32 in math expression gives wrong result.
  • FIX: byte! to integer! conversion not happening in some cases. Example: i: as-integer (p/1 - #"0")
  • FIX: compiler state not fully cleaned up after premature termination. This affects multiple compilation jobs done in the same Rebol2 session, resulting in weird compilation errors.
  • FIX: issue #4414: round-trip pointer casting returns an incorrect result in some cases.
  • FIX: literal arrays containing true/false words could corrupt the array. Example: a: ["hello" true "world" false]
  • FIX: improved error report on bad declare argument.

August 3, 2020

A New Fast and Flexible Lexer

A programming language lexer is the part in charge of converting textual code representation into a structured memory representation. In Red, it is accomplished by the load function, which calls the lower-level transcode native. Until now, Red relied on a lexer written entirely in the Parse dialect. However, those parsing rules were written for maintainability, not performance. Rewriting the rules to speed them up would have been possible, but rewriting the lexer entirely in Red/System gives the ultimate performance. It might not matter for most user scripts, but given that Red is also a data format, we need a solution for fast (near-instant) loading of huge quantities of Red values stored in files or transferred through the network.

The new lexer main features are:
  • High performance, typically 50 to 200 times faster than the older one.
  • New scanning features: identify values and their datatypes without loading them.
  • Instrumentation: customize the lexer's behavior at will using an event-oriented API.

The reference documentation is available there. This new lexer is available in Red's auto-builds since June.


Vastly increased performance is the main driver for this new lexer. Here is a little benchmark to let you appreciate how far it gets.

The benchmarking tasks are:
  • 100 x compiler.r: loads 100 times compiler.r source file from memory (~126KB, so about ~12MB in total).
  • 1M short integers: loads a string of 1 million `1` separated by a space.
  • 1M long integers: loads a string of 1 million `123456789` separated by a space.
  • 1M dates: loads a string of 1 million `26/12/2019/10:18:25` separated by a space.
  • 1M characters: loads a string of 1 million `#"A"` separated by a space.
  • 1M escaped characters: loads a string of 1 million `#"^(1234)"` separated by a space.
  • 1M words: loads a string of 1 million `random "abcdefghijk"` separated by a space.
  • 100K words: loads a string of 100 thousand `random "abcdefghijk"` separated by a space.

And the results are (on a Core i7-4790K):

    Loading Task             v0.6.4 (sec)    Current (sec)    Gain factor
    100 x compiler.r               41.871            0.463             90
    1M short integers              14.295            0.071            201
    1M long integers               18.105            0.159            114
    1M dates                       29.319            0.389             75
    1M characters                  14.865            0.092            162
    1M escaped characters          14.909            0.120            124
    1M words                          n/a            1.216            n/a
    100K words                     23.183            0.070            331

- Only transcode is used in the loading tasks (system/lexer/transcode in 0.6.4).

- The "1M words" task fails on 0.6.4 because the symbol table expansion time is exponential, due to some hashtable bugs. That also explains the big gap for the "100K words" task. Those issues are fixed in the current version and the symbol table has been further optimized for speed. However, the increase in execution time between the 100K and 1M words tests in the new lexer is not linear, which may be explained by a high number of collisions in the internal hashtable due to limited input variability.

- The 0.6.4 lexer can only process strings as input, while the new lexer internally processes only UTF-8 binary inputs. The input strings were converted to each lexer's native format in order to compare their speed more accurately. Providing a string instead of a binary series as input to the new lexer incurs a ~10% speed penalty on average.
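
As a minimal illustration of the last note, the sketch below feeds the same source to transcode once as a string! and once as a UTF-8 binary!. The variable names are only illustrative, and the snippet shows the two input forms rather than measuring the speed difference.

```red
Red []

src: "hello 123 world"
bin: to-binary src                 ; UTF-8 encoded binary!, the lexer's native input form

probe transcode src                ; string! input: converted internally first
probe transcode bin                ; binary! input: no conversion needed
probe equal? transcode src transcode bin
```

Both calls should produce the same block of loaded values, so the last line is expected to print true.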


It is now possible to only scan tokens instead of loading them. Basically, that means identifying a token's length and type without loading it (so without requiring extra memory and processing time). This is achieved by using the new scan native.
    >> scan "123"
    == integer!
    >> scan "w:"
    == set-word!
    >> scan "user@domain.com"
    == email!
    >> scan "123a"
    == error!

It is possible to achieve even higher scanning speed by giving up a bit of accuracy. That is the purpose of the scan/fast refinement. It trades type recognition accuracy for maximum performance. You can find the list of "guessed" types in the table there.

    >> scan/fast "123"
    == integer!
    >> scan/fast "a:"
    == word!
    >> scan/fast "a/b"
    == path!

Scanning applies to the first token in the input series. To scan all the tokens of a given input iteratively, use the /next refinement. It also returns the input series past the current token, which lets you get the precise token size in the input string. It can be combined with /fast if required. For example:
    src: "hello 123 world 456.789"
    until [
        probe first src: scan/next src    ; first item of the result: the token's datatype
        empty? src: src/2                 ; second item: the input past the token
    ]
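
The same loop can be wrapped into a small helper that collects the datatypes instead of printing them. This is only a sketch: `token-types` is a hypothetical name, and it assumes a non-empty input without trailing whitespace, like the example above.

```red
Red []

; Hypothetical helper: returns the datatype of every token found in src.
token-types: function [src [string!]][
    types: copy []
    until [
        res: scan/next src          ; block holding the datatype and the rest of the input
        append types res/1
        empty? src: res/2           ; stop once the remaining input is empty
    ]
    types
]

probe token-types "hello 123 world 456.789"
```

For that input, the collected block should hold word!, integer!, word! and float!, in that order.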

Matching by datatype in Parse

The new lexer also enables matching by datatype directly from the Parse dialect. This feature is limited to binary input only, though.
    >> parse to-binary "Hello 2020 World!" [word! integer! word!]
    == true
    >> parse to-binary "My IP is 192.168.0.1" [3 word! copy ip tuple!]
    == true
    >> ip
    == #{203139322E3136382E302E31}
    >> load ip
    == 192.168.0.1
Notice that whitespace in front of tokens is skipped automatically in this matching mode.
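
Datatype matching can be combined with regular Parse constructs such as `some`, `copy` and paren actions. The sketch below, with made-up input and variable names, extracts every integer that follows a label. It assumes, as in the `ip` example, that the copied slice is a binary! that load can convert.

```red
Red []

sizes: copy []
ok?: parse to-binary "width 640 height 480" [
    some [
        word!                                      ; the label, matched by datatype
        copy n integer! (append sizes load n)      ; n is a binary! slice of the input
    ]
]

probe ok?
probe sizes
```

If the whole input matches, `ok?` should be true and `sizes` should contain 640 and 480.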


Lexers in the Red and Rebol world used to be black boxes. This is no longer the case with Red's new lexer and its tracing capabilities. It is now possible to provide a callback function that will be called upon lexer events triggered while parsing tokens. It gives deeper control to users, for example allowing them to:
  • Trace the behavior of the lexer for debugging or statistical purposes.
  • Catch errors and resume loading by skipping invalid data.
  • Transform the input on the fly (to remove/alter some non-loadable parts).
  • Extend the lexer with new lexical forms.
  • Process serialized Red data without having to fully load the input.
  • Extract line comments that would be lost otherwise.

The lexer's tracing mode is activated by using the /trace refinement on transcode. The syntax is:
    transcode/trace <input> <callback>

    <input>    : series to load (binary! string!).
    <callback> : a callback function to process lexer events (function!).
That function is called on specific events generated by the lexer: prescan, scan, load, open, close, error. The callback function and events specification can be found there.

A default tracing callback is provided in system/lexer/tracer:
    >> transcode/trace "hello 123" :system/lexer/tracer
    prescan word 1x6 1 " 123"
    scan word 1x6 1 " 123"
    load word hello 1 " 123"
    prescan integer 7x10 1 ""
    scan integer 7x10 1 ""
    load integer 123 1 ""
    == [hello 123]
That tracing function will simply print the lexer event info. If a syntax error occurs, it will cancel it and resume on the next character after the error position.
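
A custom callback can be used the same way. The sketch below records the name of each event instead of printing it. The argument list is written as recalled from the lexer reference documentation and should be treated as an assumption to check there; `events` and `recorder` are hypothetical names.

```red
Red []

events: copy []

; Assumed callback spec (five arguments, logic! return value);
; verify it against the reference documentation before relying on it.
recorder: func [
    event  [word!]                    ; prescan, scan, load, open, close or error
    input  [binary! string!]          ; input series at the current position
    type   [datatype! word! none!]    ; type of the token being processed
    line   [integer!]                 ; current line number
    token                             ; token position or loaded value
    return: [logic!]
][
    append events event
    true                              ; always let the lexer proceed normally
]

probe transcode/trace "hello 123" :recorder
probe events
```

Going by the trace printed by the default tracer, `events` should end up holding prescan, scan and load twice in a row, once per token.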

Several more sophisticated examples can be found in our red/code repository.

Implementation notes

This new lexer has been specifically prototyped and designed for performance. It relies on a token-oriented pipelined approach consisting of 3 stages: prescanning, scanning and loading.

Prescanning is achieved using only a tight loop and a finite-state machine (FSM). The loop reads UTF-8 encoded input characters one byte at a time. Each byte is identified as part of a lexical class. The lexical class is then used to transition from one state to another in the FSM, using a big transition table. Once a terminal state (state names with a `T_` prefix) or the end of the input is reached, the loop exits, leading to the next stage. The result of the prescanning stage is the token's begin/end positions and a pretty accurate guess about the token's datatype. It can also detect some syntax errors if the FSM cannot reach a proper datatype terminal state. This approach provides the fastest possible speed for token detection, but it cannot be fully accurate, nor can it deeply validate the token content for some complex types (e.g. dates).
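
The prescanning idea can be sketched in plain Red. This is a toy model only, not the actual implementation: the real lexer is written in Red/System, works on bytes and uses a far larger table. All names below (`classify`, `transitions`, `prescan`, the states and classes) are hypothetical, and only integers and alphanumeric words are recognized.

```red
Red []

; Map a character to its lexical class.
classify: func [ch [char!]][
    case [
        all [ch >= #"0" ch <= #"9"] ['digit]
        any [
            all [ch >= #"a" ch <= #"z"]
            all [ch >= #"A" ch <= #"Z"]
        ] ['alpha]
        ch = #" " ['blank]
        true ['other]
    ]
]

; Transition table: current state -> [lexical class, next state, ...]
transitions: [
    start [digit int   alpha word     blank start      other T_error]
    int   [digit int   alpha T_error  blank T_integer  other T_error]
    word  [digit word  alpha word     blank T_word     other T_error]
]

; Returns [terminal-state token-text] for the first token, or none.
prescan: function [src [string!]][
    state: 'start
    begin: none
    pos: src
    while [not tail? pos][
        next-state: select select transitions state classify first pos
        if all [state = 'start next-state <> 'start][begin: pos]   ; token starts here
        state: next-state
        if find [T_integer T_word T_error] state [break]           ; terminal state reached
        pos: next pos
    ]
    switch state [                      ; the end of the input acts as a delimiter
        int  [state: 'T_integer]
        word [state: 'T_word]
    ]
    if state = 'start [return none]     ; nothing but blanks
    reduce [state copy/part begin pos]
]

probe prescan "123 abc"
probe prescan "  hello"
probe prescan "12a"
```

The three calls should report a `T_integer` token "123", a `T_word` token "hello" and a `T_error` after "12", mirroring how `scan "123a"` reports error! earlier in this post.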

Adding more states would provide greater accuracy and cover more syntactic forms, but at the cost of growing the transition table a lot, due to the need to duplicate many states. Currently the table weighs 2440 bytes, which is already quite big to be kept entirely in the CPU data cache (usually 8, 16 or 32KB per core). In addition, the lexical table uses 1024 bytes and there are two other minor tables used in the tight loop. The data cache also needs to handle the parsed input data and part of the native stack, so the available space is limited.

The tight loop code is also optimized to keep branch mispredictions as low as possible. It currently relies on only two branches. The loop code could be reduced further by, for example, pre-multiplying the state values to avoid the multiplication when calculating the table entry offset. However, we need to wait for a fully optimizing code generation backend before trying to extract more performance from that loop code, or we might head in the wrong direction.
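
The pre-multiplication idea can be illustrated on a tiny flat table. The table contents, `class-count` and the function names are made up for the example, and plain Red is used here instead of Red/System.

```red
Red []

class-count: 3                      ; hypothetical number of lexical classes

; Flat transition table, one row per state (states and classes are 0-based).
table: [
    1 2 0      ; state 0
    1 1 2      ; state 1
    2 0 2      ; state 2
]

; Plain lookup: one multiplication per input byte.
next-state: func [state [integer!] class [integer!]][
    pick table state * class-count + class + 1
]

; Pre-multiplied variant: every entry already holds state * class-count,
; so the lookup only needs additions.
pre-table: collect [foreach s table [keep s * class-count]]
next-offset: func [offset [integer!] class [integer!]][
    pick pre-table offset + class + 1
]

probe next-state 1 2                    ; state reached from state 1 on class 2
probe next-offset 1 * class-count 2     ; same transition, as a row offset
```

The first call returns the state number 2, the second one returns 6, which is the same state already multiplied by `class-count` and ready to be used as the next row offset.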

The scanning stage happens once a token has been identified. It consists in calling, when needed, a scanner function to deep-check the token for errors and determine the datatype more accurately. The loading stage then follows (unless only scanning was requested by the user). It calls, when needed, a loader function that constructs the Red value out of the token. For any-block series, the scanners actually do the series construction on reaching the ending delimiter (which requires special handling for paths), so no loader is needed there. Conversely, loaders can be invoked in validating mode only (not constructing the value), in order to avoid code duplication when complex code is required for decoding/validating the token (e.g. date!, time!, strings with UTF-8 decoding,...).

For the record, there was an attempt at creating specific FSMs for parsing date! and time! literal forms, to reduce the number of rules that need to be handled by pure code. The results were not conclusive, as the amount of code required for special case handling was still significant and the performance of the FSM parsing loop was below that of the current pure code version. This approach can be reexamined once we get the fully optimizing backend.

The FSM states, lexical classes and transitions are documented in the lexer-states.txt file. A simple syntax is used to describe the transitions and possible branching from one state to others. The FSM has three possible entry points: S_START, S_PATH and S_M_STRING. Parsing path items requires specific states even for common types. For curly-braced strings, it is necessary to exit the FSM on each occurrence of an opening or closing curly brace, in order to count the nested ones and accurately determine where the string ends. In both the path and string cases, the FSM needs to be re-entered in a different state than S_START.
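
The brace counting itself is simple, as the following plain Red sketch shows. `closing-brace` is a hypothetical helper, not part of the lexer, and it ignores escaped braces for brevity.

```red
Red []

; Returns the index of the brace closing the string that starts at the
; head of src, or none if the braces are unbalanced.
closing-brace: function [src [string!]][
    depth: 0
    forall src [
        switch first src [
            #"{" [depth: depth + 1]
            #"}" [
                depth: depth - 1
                if zero? depth [return index? src]
            ]
        ]
    ]
    none
]

probe closing-brace "{a {nested} string} tail"
probe closing-brace "{unbalanced {string}"
```

The first call should return 19, the position of the outermost closing brace, and the second one should return none.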

In order to build the FSM transition table, there is a workflow that goes from that lexer-states.txt file to the final transition table data in binary. It basically looks like this:
    FSM graph -> Excel file -> CSV file -> binary table
The more detailed steps are:
  1. Manually edit changes in the lexer-states.txt file.
  2. Port those changes into the lexer.xlsx file by properly setting the transition values.
  3. Save that Excel table in CSV format as lexer.csv.
  4. Run the generate-lexer-table.red script from the Red repo root folder. The lexer-transitions.reds file is regenerated.
The lexer code relies on several other tables for specific types handling like path ending detection, floating point numbers syntax validation, binary series and escaped characters decoding. Those tables are either manually written (not planned to be ever changed) or generated using this script.
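
The last step of that workflow boils down to turning rows of numbers into a flat binary table. The toy sketch below shows the principle only: it is not the generate-lexer-table.red script, and the CSV layout (comma-separated, no header row, one row per state) is an assumption.

```red
Red []

; Hypothetical 2-state x 3-class transition matrix in CSV form.
csv: {0,1,2^/1,1,2}

table: make binary! 6
foreach row split csv newline [
    foreach cell split row "," [
        append table to integer! cell      ; each transition becomes one byte
    ]
]

probe table
```

Each cell ends up as one byte of the binary table, in row order.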

Various other points worth mentioning:

- The lexer works natively with UTF-8 encoded binary buffers as input. If a string! is provided as input, there is an overhead for internally converting that string to binary before passing it to the lexer. A single internal buffer is used for those conversions, with support for recursive calls.

- The lexer uses a single accumulative cells buffer for storing loaded values, with an inlined any-block stack.

- The lexer and lexer callbacks are fully recursive and GC-compliant. Currently, callbacks can be function! values only. This could be extended in the future to also support routines, for much faster processing.
