
In this article, we'll talk about the implementation details and behavior of various JIT compilers, as well as their optimization strategies. We'll go into a fair amount of detail, but we'll also omit many important concepts, so this article won't give you enough information to draw reasonable conclusions in any comparison of implementations and languages.

For a basic understanding of JIT compilers, read this article.

Small note:

I will often describe an optimization and claim that it probably exists in some other compiler as well. I don't always check whether a given optimization is present in another JIT (sometimes it's ambiguous), but if I know for sure, I will point it out. I will also provide code examples to show where an optimization could be applied, but this is not exact, because other optimizations may take priority. There may also be some broad simplifications, but no more than in most similar posts.


Milestones


  • Metatracing in Pypy
  • How GraalVM languages support C extensions
  • Deoptimization
  • Inlining and OSR
  • Sea of Nodes
  • Tiered JIT compilation

(Meta)tracing


LuaJIT uses what is called tracing. Pypy performs metatracing, that is, it uses a framework to generate tracing interpreters and JIT compilers. Pypy and LuaJIT are not the reference implementations of Python and Lua, but standalone projects. I would characterize LuaJIT as shockingly fast, and it describes itself as one of the fastest dynamic language implementations - and I totally believe that.

To decide when to start tracing, the interpreter loop looks for hot loops (the concept of hot code is universal across JIT compilers!). The compiler then "traces" the loop, recording the operations it executes in order to compile well-optimized machine code. In LuaJIT, compilation is based on traces, with an instruction-like intermediate representation that is unique to LuaJIT.
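To make the idea of hot loops and traces concrete, here is a toy tracing interpreter in Python. This is only a sketch with made-up names and thresholds: real tracing JITs like LuaJIT and Pypy work on bytecode or their own IR and emit machine code, while here "compiling" a trace just means building a closure that replays the loop body without the dispatch overhead.

HOT_THRESHOLD = 50   # how many times a backward jump must be taken before tracing

def compile_trace(program, start, end):
    body = program[start:end]              # the straight-line loop body
    var, limit, _ = program[end][1:]       # the loop condition becomes the guard
    def trace(env):
        while env[var] < limit:            # guard holds: stay on the fast path
            for op, *args in body:
                if op == "set":
                    env[args[0]] = args[1]
                elif op == "add":
                    env[args[0]] += args[1]
        return end + 1                     # guard failed: resume after the loop
    return trace

def run(program):
    env = {}       # variable name -> value
    counts = {}    # loop-header pc -> how many times the backward jump was taken
    traces = {}    # loop-header pc -> "compiled" trace (a Python closure)
    pc = 0
    while pc < len(program):
        if pc in traces:                   # hot loop already compiled: run the trace
            pc = traces[pc](env)
            continue
        op, *args = program[pc]
        if op == "set":
            env[args[0]] = args[1]
            pc += 1
        elif op == "add":
            env[args[0]] += args[1]
            pc += 1
        elif op == "jump_if_lt":           # the backward jump that closes the loop
            var, limit, target = args
            if env[var] < limit:
                counts[target] = counts.get(target, 0) + 1
                if counts[target] > HOT_THRESHOLD and target not in traces:
                    traces[target] = compile_trace(program, target, pc)
                pc = target
            else:
                pc += 1
        else:
            raise ValueError("unknown op: %r" % op)
    return env

# x = 0; while x < 100000: x += 1
prog = [("set", "x", 0), ("add", "x", 1), ("jump_if_lt", "x", 100000, 1)]
print(run(prog))   # {'x': 100000}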

How tracing is implemented in Pypy


Pypy starts tracing a function after 1619 executions and compiles it after another 1039 executions, that is, it takes about 3000 executions of a function before it starts running faster. These values were carefully chosen by the Pypy team, and in general in the world of compilers many constants are carefully chosen.
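Under Pypy these thresholds can be adjusted from the running program through the pypyjit module. A hedged sketch (this only works on Pypy, not CPython, and the parameter names are the ones I believe the Pypy docs use):

# Runs only on Pypy: the pypyjit module is not available in CPython.
import pypyjit

# Lower the thresholds so the JIT kicks in after far fewer executions.
# Parameter names assumed from the Pypy documentation; the defaults are
# roughly the constants mentioned above.
pypyjit.set_param("threshold=200")
pypyjit.set_param("function_threshold=300")
pypyjit.set_param("trace_eagerness=50")

def hot(n):
    total = 0
    for i in range(n):
        total += i
    return total

for _ in range(1000):
    hot(1000)   # with lowered thresholds this gets traced and compiled sooner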

Dynamic languages make optimization difficult. The code below could be statically eliminated in a stricter language, because False will always be falsy. However, in Python 2, this cannot be guaranteed until runtime.

if False: print("FALSE") 

For any sane program, this condition will always be false. Unfortunately, in Python 2 the value of False can be reassigned, and if the expression were in a loop, it could be redefined somewhere else. To handle this, Pypy emits a "guard". If a guard fails, the JIT falls back to the interpretation loop. Pypy then uses another constant (200), called trace eagerness, to decide whether to compile the rest of the new path before the end of the loop. That subpath is called a bridge.
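Here is a minimal sketch of the guard/bridge idea in plain Python. Nothing here is Pypy's actual machinery; the names, the counter, and the way the "compiled" paths are modeled are all made up for illustration.

import random

TRACE_EAGERNESS = 200   # mirrors Pypy's constant, but the rest is made up
guard_failures = 0
bridge = None

def interpreter_path(flag):
    return "interpreted"        # slow generic path

def compiled_bridge(flag):
    return "compiled-bridge"    # fast path compiled later for the flag=True case

def compiled_trace(flag):
    global guard_failures, bridge
    if flag is False:               # the guard: the assumption the trace was built on
        return "compiled"           # fast path
    if bridge is not None:
        return bridge(flag)         # the failing path now has its own compiled code
    guard_failures += 1
    if guard_failures > TRACE_EAGERNESS:
        bridge = compiled_bridge    # failed often enough: attach a bridge
    return interpreter_path(flag)   # otherwise fall back to the interpreter

for _ in range(10000):
    compiled_trace(random.randint(1, 100) < 20)
print("bridge compiled:", bridge is not None)   # True: ~20% of calls fail the guard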

In addition, Pypy exposes all of these constants as arguments that can be tuned at runtime, along with configuration for unrolling (loop expansion) and inlining! On top of that, it provides hooks so we can observe what happened once compilation is complete.

import pypyjit

def print_compiler_info(i):
    print(i.type)

pypyjit.set_compile_hook(print_compiler_info)

for i in range(10000):
    if False:
        pass

print(pypyjit.get_stats_snapshot().counters)

Above, I wrote a pure Python program with a compile hook that prints the type of compilation applied. The code also prints some data at the end, which shows the number of guards. For this program, I got one loop compilation and 66 guards. When I replaced the `if False` block with just a pass in the for loop, only 59 guards were left.

for i in range(10000):
    pass  # removing the `if False` saved 7 guards!

By adding these two lines to the for loop, I got two compilations, one of which was of the "bridge" type!

if random.randint(1, 100) < 20:
    False = True

Wait, you were talking about metatracing!


The idea of metatracing can be described as "write an interpreter and get a compiler for free!" or "turn your interpreter into a JIT compiler!" Writing a compiler is difficult, and if you can get one for free, the idea is cool. Pypy "contains" an interpreter and a compiler, but it does not explicitly implement a traditional compiler.

Pypy ships with RPython (a tool built for Pypy). It is a framework for writing interpreters. RPython is a kind of Python and is statically typed; the interpreter has to be written in it. The language is not designed for general typed Python programming, because it does not contain standard libraries or packages. Any RPython program is a valid Python program. RPython code is transpiled to C and then compiled. Thus, the metacompiler exists as a compiled C program.

The prefix "meta" in the word metatraces means that tracing is done when the interpreter is running, not the program. It behaves more or less like any other interpreter, but it can track its operations and is designed to optimize traces by updating its path. With further tracing, the interpreter path becomes more optimized. A well-optimized interpreter follows a specific path. And the machine code used in this path, obtained by compiling RPython, can be used in the final compilation.

In short, the "compiler" in Pypy compiles your interpreter, which is why Pypy is sometimes called the meta-compiler. It does not compile so much the program you are executing as the path of the optimized interpreter.

The concept of metatracing may seem confusing, so for illustration I wrote a very poor program that only understands `a=0` and `a++`.

# interpreter written with RPython
for line in code:
    if line == "a=0":
        alloc(a, 0)
    elif line == "a++":
        # notice how in Python the type is unknown, but after being interpreted by RPython, the type is known
        guard(a, "is_int")
        guard(a, "> 0")
        int_add(a, 1)

If I run this hot loop:

a=0
a++
a++

The traces might look like this:

# Trace from numerous logs of the hot loop
a = alloc(0)  # guards can go away
a = int_add(a, 1)
a = int_add(a, 2)

# optimize trace to be compiled
a = alloc(2)  # the section of code that executes this trace _is_ the compiled code

But the compiler is not a special separate module; it is built into the interpreter. So the interpretation loop will look like this:

for line in code:
    if traces.is_compiled(line):
        run_compiled(traces.compiled(line))
        continue
    elif traces.is_optimized(line):
        compile(traces.optimized(line))
        continue
    elif line == "a=0":
        # ...

Introduction to the JVM


I spent four months writing Graal-based TruffleRuby and fell in love with it.

Hotspot (so named because it looks for hot spots) is a virtual machine that ships with standard Java installations. It contains several compilers that implement tiered compilation. Hotspot's 250,000-line codebase is open source and has three garbage collectors. The virtual machine does an excellent job of JIT compilation; in some benchmarks it performs no worse than C++ implementations (there are plenty of contentious arguments about this, google them). While Hotspot does not trace, it uses a similar approach: it interprets, profiles, and then compiles. This approach doesn't have its own name; it is closest to a method JIT (optimization per method), or to a tiered JIT.

The strategies used in Hotspot have inspired many authors of subsequent JIT compilers, language virtual machine frameworks, and especially JavaScript engines. Hotspot also spawned a wave of JVM languages like Scala, Kotlin, JRuby, and Jython. JRuby and Jython are fun Ruby and Python implementations that compile source code into JVM bytecode, which Hotspot then executes. All of these projects have been relatively successful at speeding up Python and Ruby (Ruby more than Python) without implementing all the tooling, as is the case with Pypy. Hotspot is also unique in that it is a JIT for less dynamic languages (although technically it is a JIT for JVM bytecode, not Java).


GraalVM is a Java VM with a substantial portion of its code written in Java. It can execute any JVM language (Java, Scala, Kotlin, etc.). It also supports Native Image, to work with AOT-compiled code via the Substrate VM. A significant proportion of Twitter's Scala services run on Graal, which speaks to the quality of the virtual machine - in some ways it is better than the JVM, even though it is written in Java.

And that is not all! GraalVM also provides Truffle, a framework for implementing languages by building AST (Abstract Syntax Tree) interpreters. With Truffle, there is no explicit step at which JVM bytecode is generated, as there is with a regular JVM language. Rather, Truffle simply uses the interpreter and talks to Graal to generate machine code directly, with profiling and so-called partial evaluation. Partial evaluation is beyond the scope of this article; in a nutshell, the method follows metatracing's "write an interpreter, get a compiler for free!" philosophy, but approaches it differently.

TruffleJS is a Truffle implementation of JavaScript that outperforms the V8 engine on some benchmarks, which is really impressive, because V8 has been in development for many years, Google has invested a lot of money and effort in it, and awesome specialists have worked on it. TruffleJS is not "better" than V8 (or any other JS engine) by most criteria, but it is a sign of hope for Graal.

Surprisingly great JIT compilation strategies


Interpreting C


JIT implementations often have a problem supporting C extensions. Standard interpreters such as Lua, Python, Ruby, and PHP have a C API that lets users build packages in C, greatly speeding up execution. Many packages are written in C, numpy for example, as are some standard library functions. All of these C extensions are critical to making interpreted languages fast.

C extensions are difficult to support for a number of reasons. The most obvious one is that the API is designed with the internal implementation in mind. Moreover, it is easier to support C extensions when the interpreter itself is written in C, which is why JRuby cannot support C extensions, although it does have an API for Java extensions. Pypy recently released beta support for C extensions, although I'm not sure how well it works, given Hyrum's law. LuaJIT supports C extensions, including additional features in its C extensions (LuaJIT is awesome!)

Graal solves this problem with Sulong, an engine that executes LLVM bitcode on GraalVM by turning it into a Truffle language. LLVM is a set of tools, and all we need to know here is that C can be compiled to LLVM bitcode (Julia has an LLVM backend too!). It's strange, but the solution is to take a good compiled language with over forty years of history and interpret it! Of course, it doesn't run nearly as fast as properly compiled C, but it gains several benefits.


LLVM bitcode is already quite low-level, so applying a JIT to that intermediate representation is not as inefficient as applying one to C itself. Part of the cost is recovered by the fact that the bitcode can be optimized together with the rest of the Ruby program, whereas a separately compiled C program cannot be. All those memory optimizations, inlining, dead code elimination, and so on can be applied across the C and Ruby code, instead of calling a C binary from Ruby code. In some respects, TruffleRuby C extensions are faster than CRuby C extensions.

For this system to work, it helps to know that Truffle is completely language-agnostic and that the overhead of switching between C, Java, or any other language within Graal is minimal.

Graal's ability to work with Sulong is part of its polyglot capabilities, which allow a high degree of language interchangeability. This is not only good for the compiler, but it also proves that you can easily use multiple languages in one "application".

Going back to interpreted code is faster


We know that JITs contain an interpreter and a compiler, and that they move from the interpreter to the compiler to speed things up. Pypy creates bridges for the return path; from the point of view of Graal and Hotspot, this is deoptimization. These are not completely different concepts, but by deoptimization we mean a return to the interpreter as a deliberate optimization, not as a remedy for the inevitabilities of a dynamic language. Hotspot and Graal make heavy use of deoptimization, especially Graal, because developers have tight control over compilation and need even more control over it for the sake of optimizations (compared to, say, Pypy). Deoptimization is also used in JS engines, which I will talk about a lot, since JavaScript in Chrome and Node.js depends on it.

For deoptimization to pay off, it is important to make sure the switch between compiler and interpreter is as fast as possible. With the most naive implementation, the interpreter would have to "catch up" with the compiler in order to perform a deoptimization. Additional complications arise when deoptimizing asynchronous threads. Graal creates a set of frames and maps them against the generated code in order to return to the interpreter. Safepoints are used to make a Java thread pause and say, "Hello, garbage collector, do I need to stop?" It all sounds rather crude, but it works fast enough that deoptimization is a good strategy.
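Here is a minimal Python sketch of why the switch has to carry state with it: the compiled fast path keeps enough information to rebuild an interpreter-level frame at the point where its assumption breaks, and then resumes in the generic interpreter. The names are hypothetical, and real VMs do this mapping on machine stack frames rather than dictionaries.

# A sketch of deoptimization: when the compiled fast path hits a case it
# wasn't specialized for, it materializes an interpreter frame from the
# values it has "in registers" and resumes interpretation at the same point.

def interpret_sum(frame):
    # generic path: works for ints, floats, strings, whatever
    total, items, index = frame["total"], frame["items"], frame["index"]
    for x in items[index:]:
        total = total + x
    return total

def compiled_sum(items):
    total = 0
    for index, x in enumerate(items):
        if not isinstance(x, int):          # guard: this code was compiled for ints only
            # deoptimize: rebuild an interpreter frame and bail out
            frame = {"total": total, "items": items, "index": index}
            return interpret_sum(frame)
        total += x                           # fast, int-specialized path
    return total

print(compiled_sum([1, 2, 3]))               # stays on the fast path -> 6
print(compiled_sum([1, 2, 3.5]))             # deoptimizes mid-loop -> 6.5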


As in the Pypy bridging example, monkey patching of functions is also handled with deoptimization. It's more elegant, because the deoptimization code is added not when a guard fails, but when the monkey patching is applied.

A great example of JIT deoptimization is conversion overflow (an unofficial term). It refers to the situation where a value is internally represented/allocated as one type (for example, int64), but needs to be converted to a BigInteger. TruffleRuby handles this with deoptimization, just like V8.

For example, when you use integers in Ruby, you get int64 values (Ruby calls the types Fixnum and Bignum, but I will use int64 and BigInteger). When performing an operation on an int64, you need to check whether the value overflows. Checking is one thing, but compiling the code that handles the overflow is expensive, especially given how frequent numeric operations are.

Without even looking at the compiled instructions, you can see how this deoptimization reduces the amount of code.

int a, b;
int sum = a + b;
if (overflowed) {
    long bigSum = a + b;
    return bigSum;
} else {
    return sum;
}

int a, b;
int sum = a + b;
if (overflowed) {
    Deoptimize!
}

In TruffleRuby, only the first run of a particular operation is deoptimized, so we don't waste resources on it every time an operation overflows.

WET code is fast code. Inlining and OSR


function foo(a, b) {
    return a + b;
}

for (var i = 0; i < 1000000; i++) {
    foo(i, i + 1);
}

foo(1, 2);

Even trivial code like this triggers deoptimization in V8! With options like --trace-deopt and --trace-opt you can gather a lot of information about the JIT and also modify its behavior. Graal has some very useful tools too, but I will use V8, since many people have it installed.

The deoptimization is triggered by the last line (foo(1, 2)), which is puzzling, because this exact call was made in the loop! We get the message "Insufficient type feedback for call" (a complete list of deoptimization reasons is here, and it includes an amusing "no reason"). The output comes with an input frame showing the literals 1 and 2.

So why deoptimize? V8 should be smart enough to infer the types: when i is an integer, the literals being passed are integers too.

To investigate, let's replace the last line with foo(i, i + 1). Deoptimization still happens, only this time the message is different: "Insufficient type feedback for binary operation". WHY?! After all, it is exactly the same operation that is performed in the loop, with the same variables!

The answer, my friend, lies in on-stack replacement (OSR). Inlining is a powerful compiler optimization (not just in JITs), in which functions stop being functions and their contents are pasted into the call site. JIT compilers can inline to increase speed by changing the code at runtime (compiled languages can only inline statically).

// partial output from printing inlining details

[compiling method 0x04a0439f3751 <JSFunction (sfi=0x4a06ab56121)> using TurboFan OSR]
0x04a06ab561e9 <SharedFunctionInfo foo>: IsInlineable? true
Inlining small function(s) at call site #49:JSCall

So V8 compiles foo, determines that it can be inlined, and inlines it with OSR. However, the engine only does this for the code inside the loop, because that's the hot path, and the last line hasn't been run in the interpreter yet at inlining time. Therefore, V8 still does not have enough type feedback about the function foo, because it is not the function that runs in the loop, but its inlined version. If I pass --no-turbo-inlining, there is no deoptimization, no matter whether I pass literals or variables. However, without inlining, even a measly million iterations is noticeably slower. JIT compilers really do embody the "no solutions, only trade-offs" principle. Deoptimizations are expensive, but nothing compared to the cost of method lookup, so inlining is preferred here.

Inlining is incredibly effective! I ran the above code with a couple of extra zeros, and with inlining turned off, it ran four times slower.
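To see what the transformation does, here is a hedged source-level sketch in Python. A JIT performs this on compiled code at runtime (and can undo it via OSR) rather than rewriting your source; on CPython the difference comes purely from skipping the call overhead.

# A before/after sketch of inlining, shown as a source-level transformation
# for illustration only.
import timeit

def foo(a, b):
    return a + b

def not_inlined(n):
    total = 0
    for i in range(n):
        total = foo(i, i + 1)      # every iteration pays the call overhead
    return total

def inlined(n):
    total = 0
    for i in range(n):
        total = i + (i + 1)        # body of foo pasted into the call site
    return total

print(timeit.timeit(lambda: not_inlined(1_000_000), number=10))
print(timeit.timeit(lambda: inlined(1_000_000), number=10))   # noticeably faster on CPython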


Although this article is about JITs, inlining is effective in compiled languages as well. All LLVM languages make active use of inlining, because LLVM will inline, although Julia inlines without LLVM - it's in its nature. JITs can inline using runtime heuristics and are able to switch between inlined and non-inlined modes using OSR.

A note on JIT and LLVM


LLVM provides a bunch of compilation-related tooling. Julia works with LLVM (note that it is a large toolbox and every language uses it differently), just like Rust, Swift, and Crystal. Suffice it to say that it is a large and wonderful project that also supports JITs, although LLVM has no significant built-in dynamic JIT. JavaScriptCore's fourth compilation tier used an LLVM backend for a while, but it was replaced less than two years ago. The toolkit has not been a great fit for dynamic JITs, mainly because it was not designed to work in a dynamic environment. Pypy tried it 5 or 6 times but never stuck with it; JSC was the one that actually went with LLVM. With LLVM, allocation sinking and code motion were limited. It was also impossible to use powerful JIT features like range inference (it's like type casting, but with a known range of values). More importantly, LLVM spends a lot of resources on compilation.

What if, instead of an instruction-based intermediate representation, we have a large graph that modifies itself?


We talked about LLVM bitcode and Python/Ruby/Java bytecode as intermediate representations. They all look like some kind of language in the form of instructions. Hotspot, Graal, and V8 use the "Sea of Nodes" intermediate representation (pioneered in Hotspot), which is essentially a lower-level AST. This is an effective representation, because a significant part of profiling is based on the notion of a certain path being rarely taken (or taken in some particular pattern). Note that these compiler ASTs are distinct from parser ASTs.

Usually I take a "try it at home!" position, but I cannot read whole graphs, not only due to a lack of knowledge, but also because of the limited computational capabilities of my brain (compiler options can help strip out behavior I am not interested in).

[Figure: a graph rendered in Oracle's Ideal Graph Visualizer]

For V8, we'll use the d8 tool with the --print-ast flag. For Graal, it's --vm.Dgraal.Dump=Truffle:1. The output is just text (formatted so that you can reconstruct a graph). I don't know how the V8 developers generate visual graphs, but Oracle has the "Ideal Graph Visualizer", which is used in the previous illustration. I didn't have the strength to reinstall IGV, so I took the graphs from Chris Seaton, generated with Seafoam, whose source is currently closed.

Okay, let's take a look at the JavaScript AST!

function accumulate(n, a) {
    var x = 0;
    for (var i = 0; i < n; i++) {
        x += a;
    }
    return x;
}

accumulate(1, 1)

I ran this code through d8 with --print-ast, although we are only interested in the accumulate function. Notice that I only called it once, which means I don't have to wait for compilation to get the AST.

This is what the AST looks like (I removed some unimportant lines):

FUNC at 19
. NAME "accumulate"
. PARAMS
. . VAR (0x7ff5358156f0) (mode=VAR, assigned=false) "n"
. . VAR (0x7ff535815798) (mode=VAR, assigned=false) "a"
. DECLS
. . VARIABLE (0x7ff5358156f0) (mode=VAR, assigned=false) "n"
. . VARIABLE (0x7ff535815798) (mode=VAR, assigned=false) "a"
. . VARIABLE (0x7ff535815840) (mode=VAR, assigned=true) "x"
. . VARIABLE (0x7ff535815930) (mode=VAR, assigned=true) "i"
. BLOCK NOCOMPLETIONS at -1
. . EXPRESSION STATEMENT at 38
. . . INIT at 38
. . . . VAR PROXY local[0] (0x7ff535815840) (mode=VAR, assigned=true) "x"
. . . . LITERAL 0
. FOR at 43
. . INIT at -1
. . . BLOCK NOCOMPLETIONS at -1
. . . . EXPRESSION STATEMENT at 56
. . . . . INIT at 56
. . . . . . VAR PROXY local[1] (0x7ff535815930) (mode=VAR, assigned=true) "i"
. . . . . . LITERAL 0
. . COND at 61
. . . LT at 61
. . . . VAR PROXY local[1] (0x7ff535815930) (mode=VAR, assigned=true) "i"
. . . . VAR PROXY parameter[0] (0x7ff5358156f0) (mode=VAR, assigned=false) "n"
. . BODY at -1
. . . BLOCK at -1
. . . . EXPRESSION STATEMENT at 77
. . . . . ASSIGN_ADD at 79
. . . . . . VAR PROXY local[0] (0x7ff535815840) (mode=VAR, assigned=true) "x"
. . . . . . VAR PROXY parameter[1] (0x7ff535815798) (mode=VAR, assigned=false) "a"
. . NEXT at 67
. . . EXPRESSION STATEMENT at 67
. . . . POST INC at 67
. . . . . VAR PROXY local[1] (0x7ff535815930) (mode=VAR, assigned=true) "i"
. RETURN at 91
. . VAR PROXY local[0] (0x7ff535815840) (mode=VAR, assigned=true) "x"

It is difficult to parse, but it looks somewhat like a parser-level AST (though that is not true for all programs). For comparison, I also generated a parser AST with Acorn.js.

The notable difference is in the variable definitions. In the parser AST, there is no explicit definition of the parameters, and the declaration of the loop variable is tucked away in the ForStatement node. In the compiler-level AST, all declarations are grouped together, with addresses and metadata.

The compiler AST also uses these wacky VAR PROXY expressions. The parser's AST cannot determine the correspondence between names and variables (by address) because of variable hoisting, eval, and other things. So the compiler AST uses VAR PROXY variables, which are later connected to the actual variables.

// This chunk is the declarations and the assignment of `x=0`
. DECLS
. . VARIABLE (0x7ff5358156f0) (mode=VAR, assigned=false) "n"
. . VARIABLE (0x7ff535815798) (mode=VAR, assigned=false) "a"
. . VARIABLE (0x7ff535815840) (mode=VAR, assigned=true) "x"
. . VARIABLE (0x7ff535815930) (mode=VAR, assigned=true) "i"
. BLOCK NOCOMPLETIONS at -1
. . EXPRESSION STATEMENT at 38
. . . INIT at 38
. . . . VAR PROXY local[0] (0x7ff535815840) (mode=VAR, assigned=true) "x"
. . . . LITERAL 0

And this is what the AST of the same program looks like when generated by Graal!

[Figure: Graal graph of the accumulate function]

It looks much simpler. Red indicates control flow, blue indicates data flow, arrows indicate directions. Note that while this graph is simpler than the AST from V8, this does not mean that Graal is better at simplifying the program. It's just generated based on Java, which is much less dynamic. The same Graal graph generated from Ruby will be closer to the first version.

Funnily enough, the AST in Graal will change depending on how the code executes. This graph was generated with OSR and inlining disabled, with the function being called repeatedly with random parameters so that nothing gets optimized away. And the dump will give you a whole bunch of graphs! Graal uses a specializing AST to optimize programs (V8 does similar optimizations, but not at the AST level). When you dump graphs in Graal, you get more than ten of them, at different levels of optimization. When nodes are rewritten, they replace themselves (specialize) with other nodes.

The above graph is an excellent example of specialization in a dynamically typed language (the picture is taken from One VM to Rule Them All, 2013). The reason this process exists is closely related to how partial evaluation works - it's all about specialization.

Hooray, the JIT compiled the code! Let's compile it again! And again!


Above I mentioned tiering; let's talk about it. The idea is simple: if we are not yet ready to create fully optimized code, but interpretation is still expensive, we can compile early with a quick pass and then do the final compilation once we are ready to generate more optimized code.

Hotspot is a tiered JIT with two compilers, C1 and C2. C1 compiles quickly to get the code running, then full profiling is done to get the code compiled with C2. This helps solve many warm-up problems: unoptimized compiled code is still faster than interpretation. Also, C1 and C2 do not compile all of the code. If a function looks simple enough, C2 will most likely not help and will not even be run (and we also save time on profiling!). If C1 is busy compiling, profiling can continue, the C1 job can be cancelled, and compilation with C2 can start instead.


JavaScriptCore has even more tiers! In fact, there are three JITs. The JSC interpreter does light profiling, then moves to the Baseline JIT, then to the DFG (Data Flow Graph) JIT, and finally to the FTL (Faster Than Light) JIT. With this many tiers, deoptimization is no longer limited to going from the compiler back to the interpreter: deoptimization can go from the DFG down to the Baseline JIT (this is not the case for Hotspot's C2 -> C1). All deoptimizations and transitions to the next tier are done using OSR (on-stack replacement).

The Baseline JIT kicks in after about 100 executions, and the DFG JIT after about 1000 (with some exceptions). This means the JIT gets compiled code much sooner than, say, Pypy (which takes about 3000 executions). Tiering lets the JIT try to match the time code spends running against the time spent optimizing it. There are a bunch of tricks about which kinds of optimization (inlining, type casting, etc.) to perform at each tier, which is why this strategy works so well.
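Here is a conceptual sketch of tiering as a counter-driven promotion scheme. The tiers, thresholds, and class are made up for illustration and only loosely mirror the 100/1000-style thresholds above; in a real VM each tier would produce a different compiled code object.

# A toy model of tiered compilation: per-function call counters promote a
# function from the interpreter to a cheap "baseline" tier, then to an
# "optimizing" tier.

BASELINE_AT = 100
OPTIMIZED_AT = 1000

class TieredFunction:
    def __init__(self, source_fn):
        self.source_fn = source_fn
        self.calls = 0
        self.tier = "interpreter"

    def __call__(self, *args):
        self.calls += 1
        if self.tier == "interpreter" and self.calls >= BASELINE_AT:
            self.tier = "baseline"        # quick, unoptimized compile
        elif self.tier == "baseline" and self.calls >= OPTIMIZED_AT:
            self.tier = "optimized"       # expensive compile, guided by profiling
        # In a real VM each tier would run different code; here we always run
        # the same Python function and only track which tier we'd be in.
        return self.source_fn(*args)

@TieredFunction
def accumulate(n, a):
    x = 0
    for _ in range(n):
        x += a
    return x

for _ in range(1500):
    accumulate(10, 1)
print(accumulate.tier)   # "optimized"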

Useful Resources


...

Source