Edit: go straight to stocklund’s response on the Rust Internals forum for the final word on this topic.
Computed gotos or tail calls may give a worthwhile advantage on older or low-power architectures when implementing an FSM or a VM dispatch loop, and there is a lot of such hardware around, ARM processors being ubiquitous. The performance improvement over a single match statement can be up to 20%.
On Haswell and later wide-issue Intel CPUs, it is claimed that branch predictor performance reduces the advantage of distributed dispatch points over a single switch, and this experiment confirms it. On such hardware, a single Rust match expression is almost indistinguishable in performance from computed gotos or tail calls.
At this time there is no portable way to produce computed gotos or tail call optimization in compiled machine code from Rust. This experiment investigates what is possible, even if non-portable or unsafe.
The results are tabulated and graphed in this Google Sheet. The project code itself is hosted on GitHub.
Read on for an explanation!
See the Wikipedia page for an overview of the higher level topic “Threaded Code.”
Computed gotos are an occasionally requested feature of Rust for implementing threaded interpreters and finite state machines. A Google search will turn up numerous discussions on interpreted-language mailing lists about converting to computed-goto dispatch. GCC and clang both support computed gotos as an extension to the C language, and since Rust is a systems language in the same space, it does not seem unreasonable to wish for similar support.
An alternative to explicit computed gotos is to exploit tail call optimization, where a call in tail position compiles to a jump instruction into the next state or instruction function.
Rust provides neither guaranteed tail calls nor computed gotos.
When computed gotos and optimized tail calls are unavailable, the standard fallback is a switch/match statement. It must be noted that a switch/match typically compiles to a jump table with a single indirect jump (effectively one computed goto), but it cannot be used to jump to arbitrary points in a function as the full computed-gotos feature can.
For a single switch/match, the most-cited paper on the topic describes a worst-case branch misprediction rate of 100% under VM dispatch conditions, at least for now-old CPU implementations.
I thought I’d conduct some experiments to get first-hand experience of the performance advantages of computed gotos, and to find out what is possible in Rust.
The experiment consists of three tests executed across four dispatch methods, each implementing the same virtual machine instruction set, in turn run on five different CPUs.
These CPUs are:
| CPU | System | OS | Architecture/code-name |
|---|---|---|---|
| ARM Cortex-A57 | Qualcomm MSM8992 in my Nexus 5x | Android 7.0 | ARM aarch64 |
| Intel Atom N450 | my old HP netbook from 2009 | Ubuntu 16.04 | Intel Pineview |
| Intel Core2 Duo T8300 | my old Dell D830 from 2008 | Ubuntu 16.04 | Intel Penryn |
| Intel Xeon E5-2666 | an EC2 c4.large | Ubuntu 16.04 | Intel Haswell |
| AMD A4-6210 | my HP-21 from 2014 | Windows 10 | AMD Beema |
The next two subsections, The Virtual Machine and The Three Tests, describe the minimal VM instruction set and memory model, and the three sets of opcode sequences that exercise branch prediction in different ways.
Following those are the explanations of the four different dispatch methods: Single Match Dispatch, Single Match Unrolled Loop Dispatch, Tail Call Dispatch and Computed Goto Dispatch.
The VM is implemented in vm.rs. Since dispatch performance is the focus of the experiment, the features supported by the VM are far below what would be required to implement a useful programming language.
The instruction set allows for a handful of arithmetic operations, comparisons, branches and a pseudorandom number generator.
Instructions support three types: Integer, Boolean and None. Because of the simplicity of the instruction set, the None value is also used as an error type. It is not used in the tests.
The memory model is a 256-slot register set. No stack, no heap.
“Bytecode” instructions are fixed 32 bits wide with the low 8 bits forming the operator and the higher sets of 8 or 16 bits forming register or literal operands.
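To make the encoding concrete, here is one way such a word could be decoded. The exact operand layout below is illustrative, not copied from vm.rs:

```rust
// A sketch of decoding one 32-bit instruction word as described above.
// The low 8 bits are the operator; the operand layout (two 8-bit
// registers or one 16-bit literal) is an assumption for illustration.
fn decode(instr: u32) -> (u8, u8, u8, u16) {
    let op = (instr & 0xff) as u8;          // bits 0..8: operator
    let r1 = ((instr >> 8) & 0xff) as u8;   // bits 8..16: first register
    let r2 = ((instr >> 16) & 0xff) as u8;  // bits 16..24: second register
    let lit = (instr >> 16) as u16;         // bits 16..32: 16-bit literal
    (op, r1, r2, lit)                       // lit overlaps r2, chosen per op
}

fn main() {
    // e.g. an "incr r5"-style word: operator 1 in the low byte, register 5 next
    let (op, r1, _r2, _lit) = decode(0x0000_0501);
    assert_eq!((op, r1), (1, 5));
}
```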
The code in vm.rs is deliberately dispatch-method agnostic: no default dispatch is provided, merely the memory model and instruction function definitions. This separation of concerns should cost no overhead, in keeping with Rust’s zero-cost abstractions.
All three tests are written in fixture.rs as hand-coded bytecode sequences for the virtual machine.
The Nested Loop test consists of one loop inside another. The instruction sequence is very short and utterly predictable. This test should establish the performance ceiling for each CPU, since it should be able to predict every indirect branch.
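Expressed as ordinary Rust rather than bytecode, the fixture has roughly this shape (illustrative only; the real test is hand-written bytecode in fixture.rs):

```rust
// Illustrative shape of the Nested Loop fixture: a tight, fully
// predictable inner body inside an outer loop.
fn nested_loop(outer: i64, inner: i64) -> i64 {
    let mut counter = 0;
    for _ in 0..outer {
        for _ in 0..inner {
            counter += 1; // short, utterly predictable inner body
        }
    }
    counter
}

fn main() {
    assert_eq!(nested_loop(1_000, 100), 100_000);
}
```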
The Longer Repetitive test is only slightly less predictable than Nested Loop, but the instruction sequence is somewhat longer: it is essentially Nested Loop unrolled a handful of times, with NOP instructions added in different patterns in each unrolled instance. This test should sit somewhere in between Nested Loop and Unpredictable: while it is predictable, it requires more than basic indirect branch prediction.
The core of the Unpredictable test is a pseudorandom number generator. On the roll of the pseudo-dice, various sections of code in the loop are skipped or included, which should make the overall instruction sequence essentially unpredictable to any branch predictor. This test should demonstrate the performance floor for each CPU, with frequent pipeline flushes. Direct comparison of this test with the other two is complicated by the random number generator itself, which may add overhead the other two tests do not have.
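To illustrate the idea, a sketch might pair a small xorshift generator with a pseudo-coin-flip per iteration. The fixture’s actual PRNG and code sections live in fixture.rs and may differ:

```rust
// Illustrative shape of the Unpredictable fixture (this xorshift PRNG is
// an assumption). A PRNG step decides per iteration which section of the
// loop runs, so no branch predictor can learn the dispatch sequence.
fn xorshift(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

fn unpredictable(iterations: u32) -> i64 {
    let mut state = 0x2545_F491_4F6C_DD1D_u64; // any non-zero seed works
    let mut counter = 0i64;
    for _ in 0..iterations {
        if xorshift(&mut state) & 1 == 0 {
            counter += 1; // section A, taken on a pseudo-coin-flip
        } else {
            counter -= 1; // section B
        }
    }
    counter
}

fn main() {
    println!("{}", unpredictable(1_000_000));
}
```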
switch.rs compiles to a jump table implementation. For these inline examples I’ll pull the x86_64 assembly; the aarch64 assembly is comparable in instruction type and count, while the x86 assembly relies on the stack a bit more due to the lack of registers.
LLVM views the VM instruction code and dispatch loop as a whole, allocating registers efficiently across the entire function.
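A minimal sketch of the technique, using a hypothetical three-opcode instruction set rather than the project’s actual code, looks like this:

```rust
// A minimal single-match dispatch loop (a sketch, not the project's
// vm.rs or switch.rs). The match on the opcode byte typically compiles
// to one jump table reached through a single indirect jump.
struct Vm {
    code: Vec<u32>,
    pc: usize,
    regs: [i64; 256],
}

fn run(vm: &mut Vm) {
    loop {
        let instr = vm.code[vm.pc];
        vm.pc += 1;
        let a = ((instr >> 8) & 0xff) as usize; // register operand
        match (instr & 0xff) as u8 {
            0 => return,                        // halt
            1 => vm.regs[a] += 1,               // incr <reg>
            2 => {
                if vm.regs[a] < 10 {
                    vm.pc = 0;                  // loop back to the start
                }
            }
            _ => return, // unknown opcode: halt in this sketch
        }
    }
}

fn main() {
    // incr r0; loop back while r0 < 10; halt
    let mut vm = Vm { code: vec![0x0001, 0x0002, 0x0000], pc: 0, regs: [0; 256] };
    run(&mut vm);
    assert_eq!(vm.regs[0], 10);
}
```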
unrollswitch.rs compiles to a series of jump tables. It is identical to Single Match dispatch except that the loop is unrolled a handful of times. In addition, when a VM branch instruction is executed and the branch is taken, control flow jumps to the top of the loop; my idea here was that under tight bytecode-loop conditions this could effectively unroll the bytecode loop too. The huge disadvantage is that the VM instruction code is duplicated once per unrolled iteration. This cannot be good for the instruction cache hit rate, certainly not for an interpreter with a large operator count.
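A sketch of the idea, reusing the Vm struct and hypothetical opcodes from the single-match sketch above; whether each match expansion really keeps its own jump table is up to LLVM:

```rust
// Each expansion of step! carries its own copy of the match, which may
// give each dispatch site its own jump table and branch history. A taken
// VM branch restarts at the top of the unrolled loop via `continue`.
macro_rules! step {
    ($vm:ident) => {{
        let instr = $vm.code[$vm.pc];
        $vm.pc += 1;
        let a = ((instr >> 8) & 0xff) as usize;
        match (instr & 0xff) as u8 {
            0 => return,
            1 => $vm.regs[a] += 1,
            2 => {
                if $vm.regs[a] < 10 {
                    $vm.pc = 0;
                    continue; // taken branch: jump to the top of the loop
                }
            }
            _ => return,
        }
    }};
}

fn run_unrolled(vm: &mut Vm) {
    loop {
        step!(vm); // four copies of the dispatch body per loop iteration
        step!(vm);
        step!(vm);
        step!(vm);
    }
}
```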
LLVM, as called by rustc, produces tail-call-optimized assembly for x86_64, ARM and aarch64, but only for release builds. x86 builds hit the stack limit and could not be included in the results.
threaded.rs compiles to a single jump table shared by all the VM instruction functions.
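In Rust source terms, the technique has roughly this shape (again a sketch with the hypothetical Vm and opcodes from the earlier examples, not the real threaded.rs):

```rust
// Tail-call dispatch: each handler ends with a call in tail position,
// which release-mode LLVM often compiles to a jump. Rust itself makes no
// TCO guarantee, so deep programs can overflow the stack in debug builds.
type Handler = fn(&mut Vm, u32);

// One entry per opcode; an opcode outside 0..=2 would panic on the index.
static HANDLERS: [Handler; 3] = [op_halt, op_incr, op_loop];

fn dispatch(vm: &mut Vm) {
    let instr = vm.code[vm.pc];
    vm.pc += 1;
    HANDLERS[(instr & 0xff) as usize](vm, instr); // tail call into the handler
}

fn op_halt(_vm: &mut Vm, _instr: u32) {}

fn op_incr(vm: &mut Vm, instr: u32) {
    let a = ((instr >> 8) & 0xff) as usize;
    vm.regs[a] += 1;
    dispatch(vm) // tail call back into the dispatcher
}

fn op_loop(vm: &mut Vm, instr: u32) {
    let a = ((instr >> 8) & 0xff) as usize;
    if vm.regs[a] < 10 {
        vm.pc = 0;
    }
    dispatch(vm)
}
```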
As suggested in this forum discussion, we should get six registers for parameter passing on x86_64. We’re using four, keeping opcode, PC and counter off the stack, which is at least consistent with switch.rs and the other implementations. There’s a good chance we could do better, but I’m not sure how to go about it.
What is notable is the overhead of pushing and popping rax on and off the stack, and that LLVM treats each function as a separate unit with calling-convention constraints.
For this experiment, I wanted to see if I could create a computed goto environment close to what is possible in clang and gcc. In order to do that I would have to resort to inline assembly and, sadly, nightly rustc.
In my first attempt I used inline assembly to populate a jump table with label addresses and insert jmp instructions after each VM instruction block. This produced segmentation faults.
After studying the assembly output from rustc for a while, I realized that LLVM could not know that the jmp instructions would affect control flow: it was allocating registers throughout the function on the assumption that control would fall all the way through to the end of the function in sequence. Register allocation varied throughout the function, and my jmp instructions disrupted that allocation.
The fix for this in threadedasm.rs is to introduce constraints. Each VM instruction block must be prefixed and postfixed with register constraints, pinning variables to specific registers to keep the allocation consistent no matter where in the function a jmp instruction lands.
The optimized assembly output is the most compact of any of the dispatch methods, and overall this code outperforms the other methods.
Result data is tabulated and charted in this Google Sheets document.
With apologies for the quality of the embedded chart images due to Google Sheets limitations, the chart that best illustrates the data is ImprovementOverSwitch. Do check out the link above to interact with the spreadsheet and charts directly.
This chart illustrates the ratio of VM instructions per second of each other dispatch method against switch.rs, normalizing the performance of switch.rs for each test to 1.0.
The unrollswitch.rs figures are shown in shades of blue, threaded.rs in yellow and threadedasm.rs in shades of green. threadedasm.rs performs best overall, with unrollswitch.rs also doing well, though I assume that is largely because the virtual machine is very small and fits into I-cache.

Again, go to the spreadsheet to see this chart directly for a better view. The next chart illustrates the absolute performance of each method and test in cycles per VM instruction. Color coding remains the same as in the earlier chart.
Compare the threadedasm.rs results across the three tests, though: the Intel CPUs show identical performance patterns, stepping up in cycle count from Nested Loop to Longer Repetitive to Unpredictable, whereas the ARM and AMD results show Longer Repetitive performing similarly to or worse than Unpredictable.

Tail call dispatch comes with function-call instruction overhead that varies by architecture. It is also possibly hindered by LLVM’s inability to holistically optimize all the interpreter instruction functions together. These combine to add a few instructions of overhead compared to the inline-assembly, single-function threadedasm.rs code.
In addition, Rust and LLVM do not apply tail-call optimization for 32-bit x86 or in debug builds, making this a non-option as long as Rust does not explicitly support TCO.
If the FSM or VM is particularly small, unrolling the dispatch loop may be an option as it does give a performance increase under Unpredictable circumstances.
With respect to computed gotos for threaded dispatch, in my opinion it should be possible to encapsulate the inline assembly in macros that could be imported from a crate. Because inline assembly is required, this cannot currently be done in stable Rust. Compiler support beyond inline assembly and possibly procedural macros should not be required.
It seems there may be some work involved before inline assembly can be stabilized.
If targeting modern high-performance Intel architectures, the dispatch method may make little difference. On any other architecture, however, optimizing the dispatch method may pay off.