LLVM Project Blog

LLVM Project News and Details from the Trenches

Tuesday, March 13, 2018

DragonFFI: FFI/JIT for the C language using Clang/LLVM


Introduction

A foreign function interface is "a mechanism by which a program written in one programming language can call routines or make use of services written in another".
In the case of DragonFFI, we expose a library that allows calling C functions and using C structures from any language. Basically, we want to be able to do this, let's say in Python:
import pydffi
CU = pydffi.FFI().cdef("int puts(const char* s);");
CU.funcs.puts("hello world!")
or, in a more advanced way, for instance to use libarchive directly from Python:
import pydffi
pydffi.dlopen("/path/to/libarchive.so")
CU = pydffi.FFI().cdef("#include <archive.h>")
a = CU.funcs.archive_read_new()
assert a
...
This blog post presents related work and its drawbacks, shows how Clang/LLVM is used to circumvent these drawbacks, describes the inner workings of DragonFFI, and closes with further ideas.
The code of the project is available on GitHub: https://github.com/aguinet/dragonffi. Python 2/3 wheels are available for Linux/OSX x86/x64. Python 3.6 wheels are available for Windows x64. On all these architectures, just use:
$ pip install pydffi
and play with it :)

See below for more information.

Related work

libffi is the reference library that provides an FFI for the C language. cffi is a Python binding around this library that also uses PyCParser to be able to easily declare interfaces and types. Both these libraries have limitations, among them:
  • libffi does not support the Microsoft x64 ABI under Linux x64. It isn't that trivial to add a new ABI (hand-written ABI, getting the ABI right, ...), while a lot of effort has already been put into compilers to get these ABIs right.
  • PyCParser only supports a very limited subset of C (no includes, function attributes, ...).
Moreover, in 2014, Jordan Rose and John McCall from Apple gave a talk at the LLVM developers' meeting in San José about how Clang can be used for C interoperability. This talk also shows various ABI issues, and was a source of inspiration for DragonFFI in its early days.

Somewhat related, Sean Callanan, who worked on lldb, gave a talk in 2017 at the LLVM developers' meeting in San José on how we could use parts of Clang/LLVM to implement some kind of eval() for C++. What can be learned from this talk is that debuggers like lldb must also be able to call arbitrary C functions, and use debug information among other things to do it (which we also do, see below :)).

DragonFFI is based on Clang/LLVM, and thanks to that it is able to get around these issues:
  • it uses Clang to parse header files, allowing direct usage of a C library's headers without adaptation;
  • it supports as many calling conventions and function attributes as Clang/LLVM do;
  • as a bonus, Clang and LLVM allow on-the-fly compilation of C functions, without relying on the presence of a compiler on the system (you still need the headers of the system's libc though, or the MSVCRT headers under Windows);
  • and this is a good way to have fun with Clang and LLVM! :)
Let's dive in!

Creating an FFI library for C

Supporting C ABIs

A C function is always compiled for a given C ABI. The C ABI isn't defined by the official C standard and is system- and architecture-dependent. Lots of things are defined by these ABIs, and they can be quite error-prone to implement.

To see how ABIs can become complex, let's compile this C code:

#include <stdio.h>

typedef struct {
  short a;
  int b;
} A;

void print_A(A s) {
  printf("%d %d\n", s.a, s.b);
}

Compiled for Linux x64, it gives this LLVM IR:

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

@.str = private unnamed_addr constant [7 x i8] c"%d %d\0A\00", align 1

define void @print_A(i64) local_unnamed_addr {
  %2 = trunc i64 %0 to i32
  %3 = lshr i64 %0, 32
  %4 = trunc i64 %3 to i32
  %5 = shl i32 %2, 16
  %6 = ashr exact i32 %5, 16
  %7 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([7 x i8], [7 x i8]* @.str, i64 0, i64 0), i32 %6, i32 %4)
  ret void
}

What happens here is what is called structure coercion. To optimize some function calls, some ABIs pass structure values through registers. For instance, an llvm::ArrayRef object, which is basically a structure with a pointer and a size (see https://github.com/llvm-mirror/llvm/blob/release_60/include/llvm/ADT/ArrayRef.h#L51), is passed through registers (though this optimization isn't guaranteed by any standard).
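
To see what this means in practice, here is (roughly) what a caller would have to do by hand to pass an A object through this ABI without compiler support. This is an illustrative sketch, not DragonFFI code:

#include <cstdint>
#include <cstring>

struct A { short a; int b; }; // same layout as the C struct above

// Coerce A into the single 64-bit register value that print_A expects on
// Linux x64: bits 0-15 hold a, bits 32-63 hold b (bits 16-31 are padding).
uint64_t coerceA(A s) {
  uint64_t Reg = 0;
  std::memcpy(&Reg, &s, sizeof(A));
  return Reg;
}

And this is just one structure on one ABI; every combination of field types, sizes and target has its own rules.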

It is important to understand that ABIs are complex things to implement, and we don't want to redo all this work ourselves, particularly when Clang/LLVM already know how to do it.

Finding the right type abstraction

We want to list every type that is used in a parsed C file. To achieve that goal, various pieces of information are needed, among which:
  • the function types, and their calling convention
  • for structures: field offsets and names
  • for union/enums: field names (and values)
On one hand, we have seen in the previous section that the LLVM IR is too Low Level (as in Low Level Virtual Machine) for this. On the other hand, Clang's AST is too high level. Indeed, let's print the Clang AST of the code above:
[...]
|-RecordDecl 0x5561d7f9fc20 <a.c:1:9, line:4:1> line:1:9 struct definition
| |-FieldDecl 0x5561d7ff4750 <line:2:3, col:9> col:9 referenced a 'short'
| `-FieldDecl 0x5561d7ff47b0 <line:3:3, col:7> col:7 referenced b 'int'
We can see that there is no information about the structure layout (padding, ...). There's also no information about the size of standard C types. As all of this depends on the backend used, it is not surprising that this information is not present in the AST.

The right abstraction appears to be the LLVM metadata produced by Clang to emit DWARF or PDB structures. It provides structure field offsets and names, various basic type descriptions, and function calling conventions. Exactly what we need! For the example above, this gives (at the LLVM IR level, with some inline comments):

target triple = "x86_64-pc-linux-gnu"
%struct.A = type { i16, i32 }
@.str = private unnamed_addr constant [7 x i8] c"%d %d\0A\00", align 1

define void @print_A(i64) local_unnamed_addr !dbg !7 {
  %2 = trunc i64 %0 to i32
  %3 = lshr i64 %0, 32
  %4 = trunc i64 %3 to i32
  tail call void @llvm.dbg.value(metadata i32 %4, i64 0, metadata !18, metadata !19), !dbg !20
  tail call void @llvm.dbg.declare(metadata %struct.A* undef, metadata !18, metadata !21), !dbg !20
  %5 = shl i32 %2, 16, !dbg !22
  %6 = ashr exact i32 %5, 16, !dbg !22
  %7 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([...] @.str, i64 0, i64 0), i32 %6, i32 %4), !dbg !23
  ret void, !dbg !24
}

[...]
; DISubprogram defines (in our case) a C function, with its full type
!7 = distinct !DISubprogram(name: "print_A", scope: !1, file: !1, line: 6, type: !8, [...], variables: !17)
; This defines the type of our subprogram
!8 = !DISubroutineType(types: !9)
; We have the "original" types used for print_A, with the first one being the
; return type (null => void), and the other ones the arguments (in !10)
!9 = !{null, !10}
!10 = !DIDerivedType(tag: DW_TAG_typedef, name: "A", file: !1, line: 4, baseType: !11)
; This defines our structure, with its various fields
!11 = distinct !DICompositeType(tag: DW_TAG_structure_type, file: !1, line: 1, size: 64, elements: !12)
!12 = !{!13, !15}
; We have here the size and name of the member "a". Offset is 0 (default value)
!13 = !DIDerivedType(tag: DW_TAG_member, name: "a", scope: !11, file: !1, line: 2, baseType: !14, size: 16)
!14 = !DIBasicType(name: "short", size: 16, encoding: DW_ATE_signed)
; We have here the size, offset and name of the member "b"
!15 = !DIDerivedType(tag: DW_TAG_member, name: "b", scope: !11, file: !1, line: 3, baseType: !16, size: 32, offset: 32)
!16 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
[...]

Internals

DragonFFI first parses the debug information included by Clang in the LLVM IR it produces, and creates a custom type system to represent the various function types, structures, enumerations and typedefs of the parsed C file. This custom type system has been created for two reasons:
  • create a type system that gathers only the necessary information from the metadata tree (we don't need all of the debug information); a rough sketch of such a type system follows this list
  • make the public headers of the DragonFFI library free from any LLVM headers (so that the whole set of LLVM headers isn't needed to use the library)
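As a rough sketch (illustrative names only, not DragonFFI's actual headers), such an LLVM-free type system could look like this:

#include <cstdint>
#include <string>
#include <vector>

struct Type { virtual ~Type() = default; };

// Basic C types: name and size, as found in DIBasicType
struct BasicType : Type {
  std::string Name;  // e.g. "int"
  uint64_t SizeBits; // e.g. 32
};

// A structure field: name, offset and type, as found in DW_TAG_member
struct Field {
  std::string Name;
  uint64_t OffsetBits;
  const Type *Ty;
};

struct StructType : Type {
  uint64_t SizeBits;
  std::vector<Field> Fields;
};

// A function type: return type, parameters and calling convention
struct FunctionType : Type {
  const Type *RetTy;
  std::vector<const Type *> Params;
  int CallingConv; // e.g. a DW_CC_* value
};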
Once we've got this type system, the DragonFFI API for calling C functions is the following:

DFFI FFI([...]);
// This will declare puts as a function that returns int and takes a const
// char* as an argument. We could also create this function type by hand.
CompilationUnit CU = FFI.cdef("int puts(const char* s);", [...]);
NativeFunc F = CU.getFunction("puts");
const char* s = "hello world!";
void* Args[] = {&s};
int Ret;
F.call(&Ret, Args);

So, basically, a pointer to the returned data and an array of void* are given to DragonFFI. Each void* value is a pointer to the data that must be passed to the underlying function. So the last missing piece of the puzzle is the code that takes this array of void* (and the pointer to the returned data) and calls puts; that is, a function like this:

void call_puts(void* Ret, void** Args) {
  *((int*)Ret) = puts((const char*) Args[0]);
}

We call these "function wrappers" (how original! :)). One advantage of this signature is that it is generic, so it can be used throughout the implementation of DragonFFI. Supposing we manage to compile this function at run time, we can then call it trivially, as in the following:

typedef void(*puts_call_ty)(void*, void**);
puts_call_ty Wrapper = /* pointer to the compiled wrapper function */;
Wrapper(Ret, Args);

Generating and compiling a function like this is something Clang/LLVM is able to do. For the record, this is also what libffi mainly does, by generating the necessary assembly by hand. We limit the number of these wrappers in DragonFFI by generating one per distinct function type. So, the wrapper that would actually be generated for puts is this one:

void __dffi_wrapper_0(int32_t( __attribute__((cdecl)) *__FPtr)(char *), int32_t *__Ret, void** __Args) {
  *__Ret = (__FPtr)(*((char **)__Args[0]));
}

For now, all the necessary wrappers are generated when the DFFI::cdef or DFFI::compile APIs are used. The only case where they are generated on-the-fly (when calling CompilationUnit::getFunction) is for variadic arguments. One possible evolution is to let the user choose whether this should happen on-the-fly for every declared function.

Issues with Clang

There is one major issue with Clang that we need to hack around in order to have the DFFI::cdef functionality: unused declarations aren't emitted by Clang (even when using -g -femit-all-decls).

Here is an example, produced from the following C code:

typedef struct {
  short a;
  int b;
} A;

void print_A(A s);
$ clang -S -emit-llvm -g -femit-all-decls -o - a.c |grep print_A |wc -l
0

The produced LLVM IR does not contain a function named print_A! The hack we temporarily use parses the Clang AST and generates temporary functions that look like this:

void __dffi_force_decl_print_A(A s) { }

This forces LLVM to generate an empty function named __dffi_force_decl_print_A with the right arguments (and the associated debug information).

This is why DragonFFI proposes another API, DFFI::compile. This API does not force declared-only functions to be present in the LLVM IR, and will only expose functions that end up naturally in the LLVM IR after optimizations.
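
For illustration, using DFFI::compile could look like this (a sketch in the style of the snippet above, with signatures abridged; the exact API may differ slightly):

DFFI FFI([...]);
// compile() receives a full definition, so add2 naturally ends up in the
// LLVM IR and no forced declaration is needed.
CompilationUnit CU = FFI.compile("int add2(int a, int b) { return a+b; }", [...]);
NativeFunc F = CU.getFunction("add2");
int a = 1, b = 2, Ret;
void* Args[] = {&a, &b};
F.call(&Ret, Args);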

If someone has a better idea to handle this, please let us know!

Python bindings

Python bindings were the first ones to be written, simply because Python is the "high level" language I know best.  Python provides its own set of challenges, but we will save that for another blog post.  These Python bindings are built using pybind11, and provide their own set of C types. Lots of examples of what can be achieved can be found here and here.

Project status

DragonFFI currently supports Linux, OSX and Windows OSes, running on Intel 32- and 64-bit CPUs. Travis is used for continuous integration, and every change is validated on all these platforms before being integrated.

The project will go from alpha to beta quality when version 0.3 is out (which will bring Travis and Appveyor CI integration and support for variadic functions). The project will be considered stable once these things happen:
  • user and developer documentation exists!
  • another foreign language is supported (JS? Ruby?)
  • the DragonFFI main library API is considered stable
  • a non-negligible set of tests has been added
  • all the things in the TODO file have been done :)

Various ideas for the future

Here are various interesting ideas we have for the future. We don't know yet when they will be implemented, but we think some of them could be quite nice to have.

Parse embedded DWARF information

As the entry point of DragonFFI is DWARF information, we could imagine parsing this debug information from shared libraries that embed it (or provide it in a separate file). The main advantage is that all the information necessary for doing the FFI right is in one file; the header files are no longer required. The main drawback is that debug information tends to take a lot of space (for instance, DWARF information takes 1.8MB for libarchive 3.3.2 compiled in release mode, for an original binary code size of 735KB), which brings us to the next idea.

Lightweight debug info?

The DWARF standard allows defining lots of information, and we don't need all of it in our case. We could imagine embedding only the necessary DWARF objects, that is, just the types needed to call the exported functions of a shared library. One experiment along these lines is available here: https://github.com/aguinet/llvm-lightdwarf. This is an LLVM optimisation pass that is inserted at the end of the optimisation pipeline and parses the metadata, keeping only what is relevant for DragonFFI. More precisely, it only keeps the DWARF metadata related to exported and visible functions, with the associated types. It also keeps debug information for global variables, even though these aren't supported yet in DragonFFI. It also does some unconventional things, like replacing every file and directory name with "_" to save space. "Fun" fact: to do this, it borrows some code from the LLVM bitcode "obfuscator" included in recent Apple clang versions, which is used to anonymize some information from the LLVM bitcode that is shipped with tvOS/iOS applications (see http://lists.llvm.org/pipermail/llvm-dev/2016-February/095588.html for more information).

Enough talking, let's see some preliminary results (on Linux x64):
  • on libarchive 3.3.2, DWARF goes from 1.8MB to 536KB, for an original binary code size of 735KB
  • on zlib 1.2.11, DWARF goes from 162KB to 61KB, for an original binary code size of 99KB
The instructions to reproduce this are available in the README of the LLVM pass repository.
We can conclude that defining this "light" DWARF format could be a nice idea. Another option would be to define a new binary format, which would be even more space-efficient, but there are drawbacks to going this way:
  • debug information is well supported on every platform nowadays: tools exist to parse it, embed/extract it from binaries, and so on
  • we already have DWARF and PDB: https://xkcd.com/927/
Nevertheless, it could still be a nice experiment to try, figuring out the space saved and seeing whether it is worth it!

As a final note, these two ideas would also benefit libffi, as we could process these formats and create libffi types!

JIT code from the final language (like Python) to native function code

One advantage of embedding a full working C compiler is that we could JIT the code from the final language glue to the final C function call, and thus limit the performance impact of this glue code.
Indeed, when a call is issued from Python, the following things happen:
  • arguments are converted from Python to C according to the function type
  • the function pointer and its wrapper are fetched from DragonFFI
  • the final call is made
This whole process basically involves a loop over the types of the arguments of the called function, containing a big switch statement. This loop generates the array of void* values that represents the C arguments, which is then passed to the wrapper (see the sketch below). We could JIT a specialised version of this loop for the function type, inline the already-compiled wrapper, apply classical optimisations on top of the resulting IR, and get straightforward conversion code specialised for the given function type, directly from Python to C.
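
To make this concrete, here is a minimal sketch of such a generic conversion loop, written against the CPython API (the type-kind enum and wrapper signature are illustrative, not the actual pydffi internals; error checking omitted):

#include <Python.h>
#include <cstdint>
#include <vector>

enum class CKind { Int32, CharPtr };

union CValue { int32_t I32; const char *Ptr; };

void genericCall(void (*Wrapper)(void *, void **),   // compiled wrapper
                 const std::vector<CKind> &ArgKinds, // from the function type
                 const std::vector<PyObject *> &PyArgs, void *Ret) {
  std::vector<CValue> Storage(ArgKinds.size()); // owns the converted C values
  std::vector<void *> Args(ArgKinds.size());
  for (size_t I = 0; I < ArgKinds.size(); ++I) {
    switch (ArgKinds[I]) { // the "big switch" on C types
    case CKind::Int32:
      Storage[I].I32 = (int32_t)PyLong_AsLong(PyArgs[I]);
      break;
    case CKind::CharPtr:
      Storage[I].Ptr = PyUnicode_AsUTF8(PyArgs[I]);
      break;
    }
    Args[I] = &Storage[I];
  }
  Wrapper(Ret, Args.data()); // the generic wrapper described above
}

A JIT-specialised version of this loop for puts would boil down to a single string conversion followed by a direct call into the inlined wrapper.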

One idea we are exploring is combining easy::jit (hello fellow Quarkslab teammates!) with LLPE to achieve this goal.

Reducing DragonFFI library size

The DragonFFI shared library embeds statically compiled versions of LLVM and Clang. The size of the final shared library is about 55MB (stripped, under Linux x64). This is really, really huge compared, for instance, to the 39KB of libffi (also stripped, Linux x64)!

Here are some ideas to try and reduce this footprint:
  • compile DragonFFI, Clang and LLVM using (Thin) LTO, with hidden visibility for both Clang and LLVM. This could have the effect of removing code from Clang/LLVM that isn't used by DragonFFI.
  • make DragonFFI more modular:
    - one core module that only has the parts of CodeGen that deal with ABIs. If the types and function prototypes are defined "by hand" (without DFFI::cdef), that's more or less the only part that is needed (along with LLVM itself)
    - one optional module that includes the full Clang compiler (to provide the DFFI::cdef and DFFI::compile APIs)
Even with all of this, it seems really hard to match the 39KB of libffi, even if we remove the cdef/compile APIs from DragonFFI. As always, pick the right tool for your needs :)

Conclusion

Writing the first working version of DragonFFI has been a fun experiment that made me discover new parts of Clang/LLVM :) The current goal is to try and achieve a first stable version (see above), and to experiment with the various ideas cited above.

It's a really long road, so feel free to come on #dragonffi on FreeNode for any questions/suggestions you might have, (inclusive) or if you want to contribute!

Acknowledgments

Thanks to Serge «sans paille» Guelton for the discussions around the Python bindings, and for helping me find the name of the project :) (one of the most difficult tasks). Thanks also to him, Fernand Lone-Sang and Kévin Szkudlapski for their reviews of this blog post!

Thursday, March 8, 2018

International Women's Day: Celebrating all the women in the LLVM Community!


Today is International Women's Day! To all the women in the LLVM community, thank you for all your contributions!

The LLVM Foundation values diversity within the LLVM community and the field of compilers and tools. Our Women in Compilers and Tools program began in 2015 with a birds of a feather discussion during the US LLVM Developers' Meeting and we have been expanding it over the years.

In 2017, we were a sponsor of the Grace Hopper Conference. With the help of community members Anna Zaks and David Blaikie, the LLVM Foundation had a booth at the career fair to introduce women to LLVM and encourage them to become contributors. It was very exciting to learn that many women knew of LLVM, were using it in their classes or research, using it in their career, or were interested in learning more. We hopefully encouraged more women to get involved with LLVM, compilers, and open source.

The LLVM Foundation was also a sponsor of the Programming Language Mentoring Workshop at SPLASH 2017. Our sponsorship went towards the travel costs for many women and other minorities to attend this workshop. The workshop focused on encouraging and preparing students to enter research careers in the field of programming languages, compilers, and related fields and to provide first hand perspectives on graduate school.

We hosted our first Women in Compilers & Tools reception before the 2017 US LLVM Developers' Meeting. Anna Zaks and Alice Chan participated in a panel discussion about the challenges and experiences that they have encountered in their careers and within the open source community. The event was attended by 60 members of the LLVM community.

In 2018, we look forward to another year of expanding our program. The LLVM Foundation will again sponsor the Grace Hopper Conference and we are looking for LLVM community members to help out at the career booth (more details to come). We will be having two Women in Compilers and Tools events. The first will have a reception and panel discussion before the 2018 EuroLLVM Developers' Meeting. Get your tickets here. The second will be before the 2018 US LLVM Developers' Meeting and details will be announced in the coming months.

The LLVM Foundation thanks the LLVM community and its sponsors for supporting this work. If you want to participate in the discussion or receive notifications on events, please join the Women in Compilers and Tools mailing list.

Question for the LLVM Foundation? Email us at llvm-foundation@lists.llvm.org.



Monday, March 5, 2018

Clang is now used to build Chrome for Windows

As of Chrome 64, Chrome for Windows is compiled with Clang. We now use Clang to build Chrome for all platforms it runs on: macOS, iOS, Linux, Chrome OS, Android, and Windows. Windows is the platform with the second most Chrome users after Android according to statcounter, which made this switch particularly exciting.

Clang is the first-ever open-source C++ compiler that’s ABI-compatible with Microsoft Visual C++ (MSVC) – meaning you can build some parts of your program (for example, system libraries) with the MSVC compiler (“cl.exe”), other parts with Clang, and when linked together (either by MSVC’s linker, “link.exe”, or LLD, the LLVM project’s linker – see below) the parts will form a working program.

Note that Clang is not a replacement for Visual Studio, but an addition to it. We still use Microsoft’s headers and libraries to build Chrome, we still use some SDK binaries like midl.exe and mc.exe, and many Chrome/Win developers still use the Visual Studio IDE (for both development and for debugging).

This post discusses numbers, motivation, benefits and drawbacks of using Clang instead of MSVC, how to try out Clang for Windows yourself, project history, and next steps. For more information on the technical side you can look at the slides of our 2015 LLVM conference talk, and the slides linked from there.

Numbers

This is what most people ask about first, so let’s talk about it first. We think the other sections are more interesting though.

Build time

Building Chrome locally with Clang is about 15% slower than with MSVC. (We’ve heard that Windows Defender can make Clang builds a lot slower on some machines, so if you’re seeing larger slowdowns, make sure to whitelist Clang in Windows Defender.) However, the way Clang emits debug info is more parallelizable and builds with a distributed build service (e.g. Goma) are hence faster.

Binary size

Chrome installer size gets smaller for 64-bit builds and slightly larger for 32-bit builds using Clang. The same difference shows in uncompressed code size for regular builds as well (see the tracking bug for Clang binary size for many numbers). However, compared to MSVC builds using link-time code generation (LTCG) and profile-guided optimization (PGO) Clang generates larger code in 64-bit for targets that use /O2 but smaller code for targets that use /Os. The installer size comparison suggests Clang's output compresses better.

Some raw numbers for versions 64.0.3278.2 (MSVC PGO) and 64.0.3278.0 (Clang). mini_installer.exe is Chrome’s installer that users download, containing the LZMA-compressed code. chrome_child.dll is one of the two main dlls; it contains Blink and V8, and generally has many targets that are built with /O2. chrome.dll is the other main dll, containing the browser process code, mostly built with /Os.



                   mini_installer.exe   chrome.dll          chrome_child.dll    chrome.exe
32-bit win-pgo     45.46 MB             36.47 MB            53.76 MB            1.38 MB
32-bit win-clang   45.65 MB (+0.04%)    42.56 MB (+16.7%)   62.38 MB (+16%)     1.45 MB (+5.1%)
64-bit win-pgo     49.4 MB              53.3 MB             65.6 MB             1.6 MB
64-bit win-clang   46.27 MB (-6.33%)    50.6 MB (-5.1%)     72.71 MB (+10.8%)   1.57 MB (-1.2%)

Performance

We conducted extensive A/B testing of performance. Performance telemetry numbers are about the same for MSVC-built and clang-built Chrome – some metrics get better, some get worse, but all of them are within 5% of each other. The official MSVC builds used LTCG and PGO, while the Clang builds currently use neither of these. This leaves potential for improvement that we look forward to exploring. The PGO builds took a very long time to build due to the need for collecting profiles and then building again, and as a result, the configuration was not enabled on our performance-measurement buildbots. Now that we use Clang, the perf bots again track the configuration that we ship.

Startup performance was worse in Clang-built Chrome until we started using a link-order file – a form of “PGO light”.

Stability

We A/B-tested stability as well and found no difference between the two build configurations.

Motivation

There were many motivating reasons for this project, the overarching theme being the benefits of using the same compiler across all of Chrome’s platforms, as well as the ability to change the compiler and deploy those changes to all our developers and buildbots quickly. Here’s a non-exhaustive list of examples.
  • Chrome is heavily using technology that’s based on compiler instrumentation (ASan, CFI, ClusterFuzz—uses ASan). Clang supports this instrumentation already, but we can’t add it to MSVC. We previously used after-the-fact binary instrumentation to mitigate this a bit, but having the toolchain write the right bits in the first place is cleaner and faster.
  • Clang enables us to write compiler plugins that add Chromium-specific warnings and to write tooling for large-scale refactoring. Chromium’s code search can now learn to index Windows code.
  • Chromium is open-source, so it’s nice if it’s built with an open-source toolchain.
  • Chrome runs on 6+ platforms, and most developers are only familiar with 1-3 platforms. If your patch doesn’t compile on a platform you’re unfamiliar with, due to a compiler error that you can’t reproduce locally, it’ll take you a while to fix. On the other hand, if all platforms use the same compiler, then if it builds on your machine it’s probably going to build on all platforms.
  • Using the same compiler also means that compiler-specific micro-optimizations will help on all platforms (assuming that the same -O flags are used on all platforms – not yet the case in Chrome, and only on the same ISAs – x86 vs ARM will stay different).
  • Using the same compiler enables cross-compiling – developers who feel most at home on a Linux box can now work on Windows-specific code, from their Linux box (without needing to run Wine).
  • We can continuously build Chrome trunk with Clang trunk to find compiler regressions quickly. This allows us to update Clang every week or two. Landing a major MSVC update in Chrome usually took a year or more, with several rounds of reporting internal compiler bugs and miscompiles. The issue here isn’t that MSVC is more buggy than Clang – it isn’t, all software is buggy – but that we can continuously improve Clang due to Clang being open-source.
  • C++ receives major new revisions every few years. When C++11 was released, we were still using six different compilers, and enabling C++11 was difficult. With fewer compilers, this gets much easier.
  • We can prioritize compiler features that are important to us.

Of course, not all – or even most – of these reasons will apply to other projects.

Benefits and drawbacks of using Clang instead of Visual C++

Benefits of using Clang, if you want to try for your project:
  • Clang supports 64-bit inline assembly. For example, in Chrome we built libyuv (a video format conversion library) with Clang long before we built all of Chrome with it. libyuv had highly-tuned 64-bit inline assembly with performance not reachable with intrinsics, and we could just use that code on Windows.
  • If your project runs on multiple platforms, you can use one compiler everywhere. Building your project with several compilers is generally considered good for code health, but in Chrome we found that Clang’s diagnostics found most problems and we were mostly battling compiler bugs (and if another compiler has a great new diagnostic, we can add that to Clang).
  • Likewise, if your project is Windows-only, you can get a second compiler’s opinion on your code, and Clang’s warnings might find bugs.
  • You can use Address Sanitizer to find memory bugs.
  • If you don’t use LTCG and PGO, it’s possible that Clang might create faster code.
  • Clang’s diagnostics and fix-it hints.
There are also drawbacks:
  • Clang doesn’t support C++/CX or #import “foo.dll”.
  • MSVC offers paid support, Clang only gives you the code and the ability to write patches yourself (although the community is very active and helpful!).
  • MSVC has better documentation.
  • Advanced debugging features such as Edit & Continue don’t work when using Clang.

How to use

If you want to give Clang for Windows a try, there are two approaches:
  1. You could use clang-cl, a compiler driver that tries to be command-line flag compatible with cl.exe (just like Clang tries to be command-line flag compatible with gcc). The Clang user manual describes how you can tell popular Windows build systems how to call clang-cl instead of cl.exe. We used this approach in Chrome to keep the Clang/Win build working alongside the MSVC build for years, with minimal maintenance cost. You can keep using link.exe, all your current compile flags, the MSVC debugger or windbg, ETW, etc. clang-cl even writes warning messages in a format that’s compatible with cl.exe so that you can click on build error messages in Visual Studio to jump to the right file and line. Everything should just work.
  2. Alternatively, if you have a cross-platform project and want to use gcc-style flags for your Windows build, you can pass a Windows triple (e.g. --target=x86_64-windows-msvc) to regular Clang, and it will produce MSVC-ABI-compatible output. Starting in Clang 7.0.0, due Fall 2018, Clang will also default to CodeView debug info with this triple.
Since Clang’s output is ABI-compatible with MSVC, you can build parts of your project with clang and other parts with MSVC. You can also pass /fallback to clang-cl to make it call cl.exe on files it can’t yet compile (this should be rare; it never happens in the Chrome build).

clang-cl accepts Microsoft language extensions needed to parse system headers but tries to emit -Wmicrosoft-foo warnings when it does so (warnings are ignored for system headers). You can choose to fix your code, or pass -Wno-microsoft-foo to Clang.

link.exe can produce regular PDB files from the CodeView information that Clang writes.

Project History

We switched chrome/mac and chrome/linux to Clang a while ago. But on Windows, Clang was still missing support for parsing many Microsoft language extensions, and it didn’t have any Microsoft C++ ABI-compatible codegen at all. In 2013, we spun up a team to improve Clang’s Windows support, consisting half of Chrome engineers with a compiler background and half of other toolchain people. In mid-2014, Clang could self-host on Windows. In February 2015, we had the first fallback-free build of 64-bit Chrome, in July 2015 the first fallback-free build of 32-bit Chrome (32-bit SEH was difficult). In Oct 2015, we shipped a first clang-built Chrome to the Canary channel. Since then, we’ve worked on improving the size of Clang’s output, improved Clang’s debug information (some of it behind -instcombine-lower-dbg-declare=0 for now), and A/B-tested stability and telemetry performance metrics.

We use versions of Clang that are pinned to a recent upstream revision that we update every one to three weeks, without any local patches. All our work is done in upstream LLVM.

Mid-2015, Microsoft announced that they were building on top of our work of making Clang able to parse all the Microsoft SDK headers with clang/c2, which used the Clang frontend for parsing code, but cl.exe’s codegen to generate code. Development on clang/c2 was halted again in mid-2017; it is conceivable that this was related to our improvements to MSVC-ABI-compatible Clang codegen quality. We’re thankful to Microsoft for publishing documentation on the PDB file format, answering many of our questions, fixing Clang compatibility issues in their SDKs, and for giving us publicity on their blog! Again, Clang is not a replacement for MSVC, but a complement to it.

Opera for Windows is also compiled with Clang starting in version 51.

Firefox is also looking at using clang-cl for building Firefox for Windows.

Next Steps

Just as clang-cl is a cl.exe-compatible interface for Clang, lld-link is a link.exe-compatible interface for lld, the LLVM linker. Our next step is to use lld-link as an alternative to link.exe for linking Chrome for Windows. This has many of the same advantages as clang-cl (open-source, easy to update, …). Also, using clang-cl together with lld-link allows using LLVM-bitcode-based LTO (which in turn enables using CFI) and using PE/COFF extensions to speed up linking. A prerequisite for using lld-link was its ability to write PDB files.
We’re also considering using libc++ instead of the MSVC STL – this allows us to instrument the standard library, which is again useful for CFI and Address Sanitizer.

In Closing

Thanks to the whole LLVM community for helping to create the first new production C++ compiler for Windows in over a decade, and the first-ever open-source C++ compiler that’s ABI-compatible with MSVC!

Thursday, March 1, 2018

EuroLLVM'18 developers' meeting program

The LLVM Foundation is excited to announce the program for the EuroLLVM'18 developers' meeting (April 16 - 17 in Bristol/UK)!

Keynotes

Tutorials

Talks

BoFs

Student Research Competition

Lightning Talks

Posters

If you are interested in any of these talks, you should register to attend EuroLLVM'18. Tickets are limited!

More information about EuroLLVM'18 is available here.

Wednesday, February 14, 2018

LLVM accepted to 2018 Google Summer of Code!

We are excited to announce the LLVM project has been accepted to 2018 Google Summer of Code!

What is Google Summer of Code?

Google Summer of Code (GSoC) is a global program focused on introducing students to open source software development. Students work on a 3 month programming project with an open source organization during their break from university. There are several benefits to this program for both the students and LLVM:

  • Inspire students to get involved with open source, compilers and LLVM
  • Give students exposure to real-world software development while getting paid a stipend
  • Allow students to do paid work related to their academic pursuits versus getting an unrelated summer job
  • Bring new developers into the LLVM project
  • Some LLVM bugs get fixed or new features get added

Students - Apply now! 

Ok, so you can't apply right now as the official application to GSoC does not open until March 12, 2018, but you must begin discussing your project on the LLVM mailing lists well before that date. There are many open projects listed on our webpage. Once you have selected a project, you will discuss it on the appropriate mailing list.

If you have an idea for a project that is not listed, you can always propose it on the list as well and seek out a mentor.

Key Dates to Remember

We have listed a few key dates here, but always consult the official GSoC timeline to confirm.

  • March 12 (16:00 UTC) - Applications open
  • March 27 (16:00 UTC) - Deadline to file your application
  • April 23 (16:00 UTC) - Accepted student proposals are announced
  • May 14 - Coding begins


LLVM Developers - Consider being mentor!

This program is not a success without our mentors. Thank you to all who have already volunteered! If you have never mentored a GSoC project but are curious, it is not too late to volunteer! You can either select an open project without a mentor or propose your own. Make sure to get it listed on the webpage so that students can see it as an option.

If mentoring just isn't an option for you at this time, consider helping the project out by spreading the word about GSoC.

Questions?

If you have questions about the program for the organizers, please email gsoc@lists.llvm.org. Project specific questions should be sent to the appropriate developer mailing list instead.

Monday, January 8, 2018

Improving Link Time on Windows with clang-cl and lld

One of our goals in bringing clang and lld to Windows has always been to improve developer experience, and what is it that developers want the most?  Faster build times!  Recently, our focus has been on improving link time because it's the step that's the hardest to parallelize so we can't fall back on the time honored tradition of throwing more cores at it.

Of the various steps involved in linking, generating the debug info (which, on Windows, is a PDB file) is by far the slowest since it involves merging O(# of linker inputs) sequences of type records, most of which are duplicates anyway.  For example, if two cpp files both include <string>, then both of those object files will have hundreds of duplicate type records that need to be de-duplicated during the link step.  This means you have to compute O(M x N) hash values (for M object files with N type records each), even though only a small fraction of those ultimately contribute to the final PDB.

Several strategies have been invented to deal with this over the years and try to make linking faster.  Many years ago, Microsoft introduced the notion of a Type Server (enabled via /Zi compiler option in MSVC), which moves some of the work into the compiler (to take advantage of parallelism).  More recently we have been given the /DEBUG:FASTLINK linker option which attempts to solve the problem by not merging types at all in the linker.  However, each of these strategies has its own set of disadvantages, and neither can be considered perfect for all use cases.

In this blog post, we'll first go over some technical background about CodeView so that we can understand the problem, followed by a summary of existing attempts to speed up type merging.  Then, we'll describe a novel extension to the PE/COFF file format which speeds up linking by offloading part of the work required to de-duplicate types to the compiler and using a new algorithm which uniquely identifies type records even across input files, and discuss the various tradeoffs of each approach.  Finally, we'll present some benchmarks and discuss how you can try this out in clang-cl and lld today.


Background

Consider a simple structure in C++, defined like this in a header file:

     struct Node {
       Node *Next = nullptr;
       Node *Prev = nullptr;
       int Value = 0;
     };

Since each compilation happens independently of every other compilation, the compiler cannot assume any other translation unit will ever emit the records necessary to describe this type.  As a result, to guarantee that the type makes it into the final PDB, every compiler instance that encounters this definition must emit type information for this type.  So the record will be serialized by the compiler into a series of records that looks roughly like this:

0x1004 | LF_STRUCTURE [size = 40] `Node`
         unique name: `.?AUNode@@`
         vtable: <none>
         base list: <none>
         field list: <none>
         options: forward ref | has unique name
0x1005 | LF_POINTER [size = 12]
         referent = 0x1004
         mode = pointer
         opts = None
         kind = ptr32
0x1006 | LF_FIELDLIST [size = 52]
         - LF_MEMBER
           name = `Next`
           Type = 0x1005
           Offset = 0
           attrs = public
         - LF_MEMBER
           name = `Prev`
           Type = 0x1005
           Offset = 4
           attrs = public
         - LF_MEMBER
           name = `Value`
           Type = 0x0074 (int)
           Offset = 8
           attrs = public
0x1007 | LF_STRUCTURE [size = 40] `Node`
         unique name: `.?AUNode@@`
         vtable: <none>
         base list: <none>
         field list: 0x1006
         options: has unique name
The values on the left correspond to the type's index in the type sequence and depend on what types have already been encountered; other types can then refer to them. For example, referent = 0x1004 means that this record is a pointer to whatever type is at index 0x1004.

As a result of this design, another compilation unit which includes the same header file will need to emit this exact same type, with the only difference being the indices (since the other compilation may encounter other types before this one, causing the ordering to be different).

In short, type indices only make sense within the context of a single type sequence (i.e. compiland), but since the linker needs to see across all object files, it has to have some way of identifying whether a type from object file A is isomorphic to a different type from object file B, even if its type indices might be different numerically from any previously seen type. 

This algorithm, henceforth referred to as type merging, is the primary consumer of CPU cycles during linking (measured in LLD, and estimated in MSVC linker by comparing /DEBUG:FULL vs /DEBUG:FASTLINK times), and as such it is the portion of the linking process which this blog post presents a new solution to.

Existing Solutions

It’s worthwhile to discuss some of the existing attempts to reduce the cost associated with type merging so that we can compare and contrast their various pros and cons.

Type Servers (/Zi)


The /Zi compiler option was one of the first attempts to address type merging speed, and it dates back many years.  The idea behind type servers is to offload the work of de-duplication from the linking phase to the compilation phase.  Most build systems already support parallel compilation, and even if they don't, cl.exe supports it natively via the /MP compiler switch, so there is no roadblock to anyone taking advantage of parallel compilation.



To implement type servers, each compilation process communicates via IPC with a single process (mspdbsrv.exe) whose job is to de-duplicate type records on the fly.  When a record is isomorphic to an existing record, the type server communicates back the previously saved index; when it is new, it sends back a new index.  This allows type de-duplication to happen mostly in parallel, adding some overhead to each compilation (since there is contention over a global lock) in return for significantly reduced link times, since types will already have been merged.


Type servers bring with them some disadvantages though, so we enumerate them here:
  1. Type servers add significant context switching and global lock contention to the compilation phase, reducing parallelism and degrading overall system performance while a build is in process.  While some performance is reclaimed from the linker, some is sacrificed due to the use of a global system lock.  It’s still a net win, but as it is not free, it leaves open the possibility that we may be able to achieve better parallelism using a different approach.
  2. The type server process itself (mspdbsrv.exe) introduces a single point of failure.  When it crashes (we see C1033 several times per day on Chrome, for example, which seems to indicate an mspdbsrv.exe crash) it could trigger a full rebuild if the type server PDB file is left in a corrupt state.
  3. mspdbsrv is incompatible with distributed builds, which is a show-stopper for large applications that can take several hours to build on normal workstations.  Type servers operate only via local IPC.  While multi-processing works well for small applications, many large products have build farms that distribute compilations among tens or hundreds of physical machines.  Type servers are incompatible with this scenario.

Fastlink PDBs

Fastlink PDBs are a relatively recent introduction, and the approach used by this solution is to eliminate type merging entirely.  To support this, special metadata is set in the PDB file to indicate to the tool that this is a fastlink PDB, and when the tool (e.g. debugger) encounters this metadata, it will fetch all type information from the original object file, rather than from the PDB.  As before, there are several disadvantages to this approach, enumerated here:
  1. The pdbcopy utility is almost unusable with fastlink PDBs for performance reasons.
  2. Since type merging doesn’t happen, indexing of type information also doesn’t happen (since the expensive part of building an index -- the hashing -- comes for free when you were hashing the record anyway).  This leads to degradation in the debugger user experience, since waits which previously happened only at build time now happen at debug-time.
  3. Fastlink PDBs are not portable.  The PDB references the object files by path, so if you copy the PDB and object files to a different machine (or even a different path on the same machine) for archival purposes, they can no longer be debugged.  This is a deal-breaker for using it on production builds.
  4. Symbols can’t be enumerated in a Fastlink PDB.  This is most obvious if you attempt to use DIA SDK on a Fastlink PDB, where it will simply refuse to do anything at all.  This means that the only externally supported way of querying debug info for users is impossible against a Fastlink PDB.  Beyond that, however, it also means that even Microsoft’s own tools which need to enumerate symbols cannot use any standard API for doing so.  For example, WinDbg doesn’t fully support Fastlink PDBs, and many workflows are broken by the use of them, even using supported Microsoft tools.
  5. It has several serious stability issues which make it unusable on large projects  [ref].  This is probably related to point 4 above, namely the fact that every tool that wants to be able to work with a Fastlink PDB needs to use different code than the SDK that has been tested and battle-hardened through years of development.
  6. When compiling with clang-cl and linking with /debug:fastlink the compiler has to be instructed to emit additional debug information, making .obj files about 29% larger.

Clang's Solution - The COFF .debug$H section

This new approach tries to combine the ideas behind type servers and fastlink PDBs.  Like type servers, it attempts to offload the work of de-duplication to the compilation phase so that it can be done in parallel.  However, it does so using an algorithm with the property that the resulting hash can be used to identify a type record even across type streams.  Specifically, if two records have the same hash, they are the same record even if they are from different object files.  If you can take it on faith that such an algorithm exists (which will be henceforth referred to as a global hash), then the amount of work that the linker needs to perform is greatly reduced.  And the work that it does still have to do can be done much quicker.  Perhaps most importantly, it produces a byte-for-byte identical PDB to when the option is not used, meaning all of the issues surrounding Fastlink PDBs and compatibility are gone.

Previously, the linker would do something that looks roughly like this:

     HashTable<Type> HashedTypes;
     vector<Type> MergedTypes;
     for (ObjectFile &Obj : Objects) {
       for (Type &T : Obj.types()) {
         remapAllTypeIndices(MergedTypes, T);

         if (!HashedTypes.try_insert(T))
           continue;
         MergedTypes.push_back(T);
       }
     }
The important observations here are:
  1. remapAllTypeIndices is called unconditionally for every type in every object file.
  2. A hash of the type is computed unconditionally for every type
  3. At least one full record comparison is done for every type.  In practice it turns out to be much more, because hash buckets are computed modulo table size, so there will actually be 1 full record comparison for every probe.
Given a global hash function as described above, the algorithm can be re-written like this:
      HashMap<SHA1, int> HashToIndex;
      vector<Type> OrderedTypes;
      for (ObjectFile &Obj : Objects) {
        auto Hashes = Obj.DebugHSectionHashes;
        for (int I = 0; I < Obj.NumTypes; ++I) {
          int NextIndex = OrderedTypes.size();
          if (!HashToIndex.try_emplace(Hashes[I], NextIndex))
            continue;
          Type &T = Obj.types()[I];
          remapAllTypeIndices(T);
          OrderedTypes.push_back(T);
        }
      }

While this appears very similar, its performance characteristics are quite different.
  1. remapAllTypeIndices is only called when the record is actually new.  Which, as we discussed earlier, is a small fraction of the time over many linker inputs.
  2. A hash of the type is never computed by the linker.  It is simply there in the object file (the exception to this is mixed linker inputs, discussed below, but those are a small fraction of input files).
  3. Full record comparisons never happen.  Since we are using a strong hash function with negligible chance of false collisions, and since the hash of a record provides equality semantics across streams, the hash is as good as the record itself.

Combining all of these points, we get an algorithm that is extremely cache friendly.  Amortized over all input files, most records during type merging are cache hits (i.e. duplicate records).  With this algorithm when we get a cache hit, the only two data structures that are accessed are:
  1. An array of contiguous hash values.
  2. An array of contiguous hash buckets.
Since we never do full equality comparison (which would blow out the L1 and sometimes even L2 cache due to the average size of a type record being larger than a cache line) the algorithm here is very fast.

We’ve deferred discussion of how to create such a hash up until now, but it is actually fairly straightforward.  We use what is known as a “tree hash” or “Merkle tree”.  The idea is to pass bytes from a type record directly to the hash function up until the point we get to a type index.  Then, instead of passing the numeric value of the type index to the hash function, we pass the previously computed hash of the record that is being referenced.

Such a hash is very fast to compute in the compiler because the compiler must already hash types anyway, so the incremental cost of emitting it to the .debug$H section is negligible.  For example, when a type is encountered in a translation unit, before that type can be added to the object file’s .debug$T section, it must first be verified that the type has not already been added.  And since this happens naturally in the order in which types are encountered, all that has to be done is to save these hash values in an array indexed by type index, and subsequent hash operations will have O(1) access to all of the information needed to compute this Merkle hash.
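
As a self-contained sketch of this tree hash (using a toy 64-bit FNV-1a hash where the real implementation uses SHA1, and an illustrative record layout rather than LLVM's actual types):

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct TypeRecord {
  std::vector<uint8_t> Bytes;                    // raw record bytes
  std::vector<std::pair<size_t, uint32_t>> Refs; // (byte offset, referenced index)
};

struct Fnv1a {
  uint64_t H = 1469598103934665603ULL;
  void update(const void *Data, size_t Len) {
    auto *P = static_cast<const uint8_t *>(Data);
    for (size_t I = 0; I < Len; ++I) { H ^= P[I]; H *= 1099511628211ULL; }
  }
};

// Hash the record's bytes, but wherever a 4-byte type index is embedded,
// feed in the previously computed hash of the referenced record instead of
// the numeric index. Indices are assumed normalized to positions in PrevHashes.
uint64_t treeHash(const TypeRecord &R, const std::vector<uint64_t> &PrevHashes) {
  Fnv1a H;
  size_t Pos = 0;
  for (const auto &[Off, Index] : R.Refs) {
    H.update(R.Bytes.data() + Pos, Off - Pos);      // bytes before the index
    H.update(&PrevHashes[Index], sizeof(uint64_t)); // hash stands in for index
    Pos = Off + 4;                                  // skip the index itself
  }
  H.update(R.Bytes.data() + Pos, R.Bytes.size() - Pos);
  return H.H;
}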
  

Mixed Input Files and Compiler/Linker Compatibility

A linker must be prepared to deal with a mixed set of input files.  For example, while a particular compiler may choose to always emit .debug$H sections, a linker must be prepared to link objects that for whatever reason do not have this section.  To handle this, the linker can examine all inputs up front and manually compute hashes for inputs with missing .debug$H sections.  In practice this proves to be a small fraction and the penalty for doing this serially is negligible, although it should be noted that in theory this can also be done as a parallel pre-processing step if some use cases show that this has non-negligible cost.
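
In the pseudocode style of the earlier snippets, such a pre-processing step might look like this (hasDebugHSection and computeTreeHashes are illustrative names):

     for (ObjectFile &Obj : Objects)
       if (!Obj.hasDebugHSection())
         Obj.DebugHSectionHashes = computeTreeHashes(Obj.types()); // serial today, parallelizable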

Similarly, the emission of this section in an object file has no impact on linkers which have not been taught to use it.  Since it is a purely additive (and optional) inclusion into the object file, any linker which does not understand it will continue to work exactly as it does today.

The On-Disk Format

Clang uses the following on-disk format for the .debug$H section.

     0x0     : <Section Magic>  (4 bytes)
     0x4     : <Version>        (2 bytes)
     0x6     : <Hash Algorithm> (2 bytes)
     0x8     : <Hash Value>     (N bytes)
     0x8 + N : <Hash Value>     (N bytes)
                    …

Here, “Section Magic” is an arbitrarily chosen 4-byte number whose purpose is to provide some level of certainty that what we’re seeing is a real .debug$H section, and not some section that someone created that accidentally happened to be called that.   Our current implementation uses the value 0x133C9C5, which represents the date of the initial prototype implementation.  But this can be any reasonable value here, as long as it never changes.

“Version” is reserved for future use, so that the format of the section can theoretically change.

“Hash Algorithm” is a value that indicates what algorithm was used to generate the hashes that follow.  As such, the value of N above is also a function of what hash algorithm is used.  Currently, the only proposed value for Hash Algorithm is SHA1 = 0, which would imply N = 20 when Hash Algorithm = 0.  Should it prove useful to have truncated 8-byte SHA1 hashes, we could define SHA1_8 = 1, for example.
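
For reference, a reader-side view of this header could be declared like this (field widths taken from the layout above; the struct and field names are ours, not part of the format):

#include <cstdint>

struct DebugHHeader {
  uint32_t Magic;         // 0x0: section magic, e.g. 0x133C9C5
  uint16_t Version;       // 0x4: reserved for future use
  uint16_t HashAlgorithm; // 0x6: 0 = SHA1, implying N = 20
  // followed by one N-byte hash value per type record
};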

Limitations and Pitfalls

The biggest limitation of this format is that it increases object file size.  Experiments locally on fairly large projects show an average aggregate object file size increase of ~15% compared to /DEBUG:FULL (which, for clang-cl, actually makes .debug$H object files smaller than those needed to support /DEBUG:FASTLINK).

There is another, less obvious potential pitfall as well.  The worst case scenario is when no input files have a .debug$H section present, but this limitation is the same in principle even if only a subset of files have a .debug$H section.  Since the linker must agree on a single hash function for all object files, there is the question of what to do when not all object files agree on hash function, or when not all object files contain a .debug$H section.  If the code is not written carefully, you could get into a situation where, for example, no input files contain a .debug$H section so the linker decides to synthesize one on the fly for every input file.  Since SHA1 (for example) is quite slow, this could cause a huge performance penalty.

This limitation can be coded around with some care, however.  For example, tree hashes can be computed up-front in parallel as a pre-processing step.  Alternatively, a hash function could be chosen based on some heuristic estimate of what would likely lead to the fastest link (based on the percentage of inputs that had a .debug$H section, for example).  There are other possibilities as well.  The important thing is to just be aware of this potential pitfall, and if your links become very slow, you'll know that the first thing you should check is "do all my object files have .debug$H sections?"

Finally, since a hash is considered to be identical to the original record, we must consider the possibility of collisions.  That said, this does not appear to be a serious concern in practice.  A single PDB can have a theoretical maximum of 2^32 type records anyway (due to a type index being 4 bytes).  The following table shows the expected number of type records needed for a collision to exist as a function of hash size.
Hash Size (Bytes)   Average # of records needed for a collision
4                   82,137
6                   21,027,121
8                   5,382,943,231
12                  3.53 x 10^14
16                  2.31 x 10^19
20                  1.52 x 10^24
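
These figures match the standard birthday-bound approximation for the expected number of uniformly random H-bit hashes drawn before the first collision (our note, not part of the original table):

\[
  E[n] \;\approx\; \sqrt{\frac{\pi}{2}\, 2^{H}},
  \qquad \text{e.g. } H = 32 \text{ (4-byte hash)}:\ \sqrt{\tfrac{\pi}{2}\, 2^{32}} \approx 82{,}137.
\]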
Given that this is strictly for debug information and not generated code, it’s worth thinking about the severity of a collision.  We feel that an 8-byte hash is probably acceptable for real world use.

Benchmarks

Here we will give some benchmarks on large real world applications (specifically, Chrome and clang).  The times presented are only for the linker.  gn args for each build of chromium are specified at the end.


Toolchain   Mode              blink_core.dll   content.dll   chrome.dll   clang.exe
MSVC        /DEBUG:FULL       553.11s          205.45s       507.17s      62.45s
MSVC        /DEBUG:FASTLINK   116.77s          56.05s        67.80s       29.37s
lld-link    /DEBUG:FULL       121.17s          42.10s        42.31s       24.14s
lld-link    /DEBUG:GHASH      88.71s           33.30s        34.76s       17.99s


The numbers here indicate a reduction in link time of up to 30% by enabling /DEBUG:GHASH in lld.

It's worth mentioning that lld does not yet have support for incremental linking so we could not compare the cost of an incremental link with /DEBUG:GHASH versus MSVC.  We still expect incremental linking using MSVC under optimal conditions (e.g. change whitespace in a header file) to produce much faster links than lld is currently able to do.

There are several possible avenues for further optimization though, so we will finish up by discussing them.

Further Improvements

There are several ways to improve the times further, which have yet to be explored.

  1. Use a smaller or faster hash.  We use a 20-byte SHA1 hash.  This is not a multiple of cache line size, and in any case the probability of collision is astronomically small even in the largest PDBs, considering that the theoretical limit of a PDB is just under 2^32 possible unique types (due to the 4-byte size of a type index).  SHA1 is also notoriously slow.  It might be interesting to try, for example, a Blake2 set to output an 8 byte hash.  This should give sufficiently low probability of a collision while improving cache performance.  The on-disk format is designed with this flexibility in mind, as different hash algorithms can be specified in the header.
  2. Hashes for compilands with missing .debug$H sections can be computed in parallel before linking.  Currently when we encounter an object file without a .debug$H section, we must synthesize one in the linker.  Our prototype algorithm does this serially for each input.
  3. Symbol records from .debug$S sections can be merged in parallel.  Currently in lld, we first merge type records into the TPI stream, then we iterate symbol records and remap types in each symbol record to correspond to the new type indices.  If we merge types from all modules up front, the symbol records (with the exception of global symbols) can be merged in parallel, since they get written to independent streams.

Try it out!

If you're already using clang-cl and lld on Windows today, you can try this out.  There are two flags needed to enable this, one for the compiler and one for the linker:
  1. To enable the emission of a .debug$H section by the compiler, you will need to pass the undocumented -mllvm -emit-codeview-ghash-section flag to clang-cl  (this flag should go away in the future, once this is considered stable and good enough to be turned on by default).
  2. To tell lld to use this information, you will need to pass /DEBUG:GHASH to lld.
Note that this feature is still considered highly experimental, so we're interested in your feedback (llvm-dev@ mailing list, direct email is ok too) and bug reports (bugs.llvm.org).