Do bytecode compilers compile syntax directly or intermediate assembly language?

I am going to write a very simple VM and bytecode compiler. Is it typical for a bytecode compiler to read the syntax and attempt to create the bytecode directly or is there an intermediate stage to make the process easier to understand?

If I do compiler -> bytecode – it should be a less tedious process, but maybe more difficult for me to understand while writing the program.

If I do compiler -> intermediate language -> bytecode – it should be simpler, but will take much more time because I have to write three programs.

Which is better?

If I do compiler -> intermediate language -> bytecode – it should be simpler, but will take much more time because I have to write three programs.

You don’t need to write three programs, just three modules in the same compiler program (these modules share data, so you want them in the same process and virtual address space). You could, but probably should not, organize your compiler as three separate programs running in different processes and communicating by some IPC such as sockets or pipes: that approach is inconvenient and harder (unless your ambition is to explore the design of a concurrent compiler running on several cores or computers in parallel; AFAIK there are very few such compilers (I can’t name any), and I would be interested in references). Most compilers (and all the ones I know) are sequential, single-threaded programs.

However, you should define and care about internal intermediate representations and how your compiler transforms them. You really should read the latest purple Dragon Book and Queinnec’s Lisp In Small Pieces (focused on interpreting & compiling Lisp dialects such as Scheme, but you can adapt most of the techniques there to your programming language). Remember that a programming language is a specification written in some report; do not confuse it with an implementation (e.g. some compiler or interpreter).

Most compilers have three parts or layers (with the middle one being the biggest; in GCC it is larger than the front-end & back-end combined!):

  • the “front-end” or parser takes some source text and builds an internal representation called an abstract syntax tree (AST). This is an easy and well-understood part; you’ll just use standard parsing techniques (e.g. a parser generator like ANTLR or flex+bison). Some compilers (e.g. GCC) have several front-ends because they can compile several programming languages (e.g. C, C++, Fortran, Ada, … for GCC), and they basically (sort of) define a common AST type for all the languages they accept (in GCC, it is called GENERIC)

  • the “middle-end” transforms internal representations (IR), starting from the AST produced by the front-end. In the simplest cases, you just transform one AST into another to perform very simple transformations (e.g. constant folding). In other compilers, you want some A-normal form as an IR (because precise garbage collection techniques mandate some A-NF), or you use CPS. Most C compilers transform for into a while-like IR (because the C standard mandates such an equivalence). In more serious optimizing compilers, you have several different internal representations (some are tree-like, like ASTs; others could be DAGs or even cyclic graphs) and many dozens or even hundreds of passes transforming the various internal representations. In many compilers, some SSA form is very useful at some point. Type inference can be seen as an internal-representation transformation. In GCC, an important family of internal representations is GIMPLE (the same IR for all source languages accepted by GCC, and on which most optimization passes operate).

  • the “back-end” (or code generator) transforms some low-level IR from the middle-end into a linear representation, such as bytecode, machine code, or assembly. BTW, GCC has one back-end per target processor (starting from some GIMPLE/SSA form). In some cases, a back-end contains target-specific optimization transformations. In GCC, register allocation and instruction scheduling happen in the back-end, and RTL is a back-end-specific (so target-specific) representation used there.
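To make the three layers concrete, here is a minimal sketch for arithmetic expressions. Everything here is invented for illustration (the function names `parse` and `emit`, and the tuple opcode encoding); it borrows Python’s own `ast` module as a stand-in front-end so no parser needs to be written:

```python
import ast  # reuse Python's parser as a stand-in "front-end"

def parse(src):
    """Front-end: source text -> AST (here, Python's own ast nodes)."""
    return ast.parse(src, mode="eval").body

def emit(node, code):
    """Back-end: walk the AST and emit a linear stack bytecode."""
    if isinstance(node, ast.Constant):
        code.append(("PUSH", node.value))
    elif isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        emit(node.left, code)   # operands are pushed left to right,
        emit(node.right, code)  # then the operator pops both
        code.append(("ADD", None))
    else:
        raise NotImplementedError(type(node).__name__)
    return code

print(emit(parse("1 + 2 + 3"), []))
# [('PUSH', 1), ('PUSH', 2), ('ADD', None), ('PUSH', 3), ('ADD', None)]
```

A middle-end would sit between `parse` and `emit`, rewriting the AST before code is emitted; here the “pipeline” is just function composition in one process, which is exactly the three-modules-in-one-program organization described above.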

AFAIK, many bytecode compilers (notably ocamlc & javac, but also Clang producing LLVM bytecode) have a similar structure; they do have some middle-end (which performs some simple “optimizations”).

If you are a user of gcc, you should try once to compile some simple (but not too simple; you want loops and calls in it) source file simple.c as gcc -fdump-tree-all -c -Wall -O simple.c and you’ll see many dump files simple.c.* showing textual, incomplete renderings of the internal representations involved. You can look into these dump files with a pager (less) or some editor. You’ll be frightened and surprised by the number and variety of compiler passes and internal representations.

If you want more pictures about that related to GCC, look also into the slides I wrote in the documentation part of GCC-MELT (which has a lot of external references too).

Even a very simple bytecode compiler should have some middle-end (at least for constant folding; also for type inference if your language has some; and probably for some kind of “normalization” (e.g. A-NF could be useful if you have a “stack” bytecode à la JVM, Neko, or Lua), depending on the bytecode form) and should define some internal representation(s) (probably at least an AST) before transforming it into bytecode (in your “back-end”).
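A middle-end pass can be as small as a recursive tree-to-tree function. As a sketch (the tuple-based AST shape `("add", left, right)` and the name `fold` are made up for this answer), constant folding might look like:

```python
def fold(node):
    """Middle-end pass: fold ('add', l, r) subtrees whose children are constants."""
    if isinstance(node, tuple) and node[0] == "add":
        l, r = fold(node[1]), fold(node[2])   # fold children first (bottom-up)
        if isinstance(l, int) and isinstance(r, int):
            return l + r                      # replace the subtree by its value
        return ("add", l, r)                  # rebuild if not fully constant
    return node                               # leaves (constants, variables) unchanged

print(fold(("add", ("add", 2, 3), ("add", "x", 1))))
# ('add', 5, ('add', 'x', 1))
```

Note that the pass takes an IR and returns an IR of the same kind; chaining many such functions is how a multi-pass middle-end is organized.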

BTW, you might perhaps focus only on coding your compiler and target some existing bytecode (e.g. JVM, Neko, Lua, Parrot, OCaml, LLVM, Guile2, …) instead of designing your own. You could also use some existing JIT library, e.g. GCCJIT, GNU lightning, libLLVM, asmjit, libjit, etc., since JIT libraries are conceptually similar to a kind of “bytecode” (they implicitly offer a kind of “back-end” representation through their API). You could compile to a subset of C, i.e. view (some subset of) C as a sort of bytecode (or compile to some existing programming language with a good enough implementation). You probably should care about garbage collection (the GC handbook is a useful book to read) and should probably use, or take inspiration from, some existing GC (e.g. Boehm’s GC, Ravenbrook’s MPS, my unmaintained Qish, or those inside many existing bytecode VMs, sometimes called P-code machines). Knowing a tiny bit about calling conventions & ABIs should help. You might need libffi.

I do recommend studying the source code of some existing small compiler & bytecode free software implementation before starting your own.

If I do compiler -> intermediate language -> bytecode – it should be simpler

Indeed, but don’t think or speak of an intermediate language, only of intermediate internal representations; you generally don’t need these to be a “language”, and you probably don’t want them to be easily “serializable” (there are good reasons to serialize the IR of a compiler, e.g. -flto in GCC, but that is not trivial and is a transformation in itself).
And the internal representations are often not “trees” but much more complex data structures (probably with circularities, e.g. related to cross-references and/or symbol tables, which are part of the internal representations).

Notice that bytecode compilers are still compilers, and most of the knowledge is common with compilers generating machine code (since a bytecode is some kind of simple, abstract, idealized instruction set). The main difference is that a bytecode compiler usually won’t do any kind of back-end optimization (such as those required by or related to register allocation & instruction scheduling).
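The VM side of such a bytecode is correspondingly simple: a dispatch loop over the linear instruction sequence. As a sketch (the two-opcode `PUSH`/`ADD` instruction set and the name `run` are invented for this answer, not from any real VM):

```python
def run(code):
    """Minimal stack VM: interpret a list of (opcode, argument) pairs."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)              # push an immediate operand
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)            # pop two operands, push the sum
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return stack[-1]                       # result is left on top of the stack

print(run([("PUSH", 1), ("PUSH", 2), ("ADD", None), ("PUSH", 3), ("ADD", None)]))
# 6
```

A real VM would add a program counter and jump opcodes for control flow, calls, and so on, but the dispatch-loop shape stays the same.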

(I did mention several free software implementations above, actually most of the above-mentioned code is free software)
