LLVM Project Blog

LLVM Project News and Details from the Trenches

Wednesday, February 14, 2018

LLVM accepted to 2018 Google Summer of Code!

We are excited to announce that the LLVM project has been accepted to the 2018 Google Summer of Code!

What is Google Summer of Code?

Google Summer of Code (GSoC) is a global program focused on introducing students to open source software development. Students work on a 3 month programming project with an open source organization during their break from university. There are several benefits to this program for both the students and LLVM:

  • Inspire students to get involved with open source, compilers and LLVM
  • Give students exposure to real-world software development while getting paid a stipend
  • Allow students to do paid work related to their academic pursuits versus getting an unrelated summer job
  • Bring new developers into the LLVM project
  • Some LLVM bugs get fixed or new features get added

Students - Apply now! 

Ok, so you can't apply right now as the official application to GSoC does not open until March 12, 2018, but you must begin discussing your project on the LLVM mailing lists well before that date. There are many open projects listed on our webpage. Once you have selected a project, you will discuss it on the appropriate mailing list.

If you have an idea for a project that is not listed, you can always propose it on the list as well and seek out a mentor.

Key Dates to Remember

We have listed a few key dates here, but always consult the official GSoC timeline to confirm.

  • March 12 (16:00 UTC) - Applications open
  • March 27 (16:00 UTC) - Deadline to file your application
  • April 23 (16:00 UTC) - Accepted student proposals are announced
  • May 14 - Coding begins


LLVM Developers - Consider being a mentor!

This program is not a success without our mentors. Thank you to all who have already volunteered! If you have never mentored a GSoC project but are curious, it is not too late to volunteer! You can either select an open project without a mentor or propose your own. Make sure to get it listed on the webpage so that students can see it as an option.

If mentoring just isn't an option for you at this time, consider helping the project out by spreading the word about GSoC.

Questions?

If you have questions about the program for the organizers, please email gsoc@lists.llvm.org. Project specific questions should be sent to the appropriate developer mailing list instead.

Monday, January 8, 2018

Improving Link Time on Windows with clang-cl and lld

One of our goals in bringing clang and lld to Windows has always been to improve the developer experience, and what is it that developers want the most?  Faster build times!  Recently, our focus has been on improving link time because it's the step that's the hardest to parallelize, so we can't fall back on the time-honored tradition of throwing more cores at it.

Of the various steps involved in linking, generating the debug info (which, on Windows, is a PDB file) is by far the slowest, since it involves merging O(# of linker inputs) sequences of type records, most of which are duplicates anyway.  For example, if two cpp files both include <string>, then both of those object files will have hundreds of duplicate type records that need to be de-duplicated during the link step.  This means you have to compute O(M x N) hash values (roughly M object files times N type records each), even though only a small fraction of those ultimately contribute to the final PDB.

Several strategies have been invented to deal with this over the years and try to make linking faster.  Many years ago, Microsoft introduced the notion of a Type Server (enabled via /Zi compiler option in MSVC), which moves some of the work into the compiler (to take advantage of parallelism).  More recently we have been given the /DEBUG:FASTLINK linker option which attempts to solve the problem by not merging types at all in the linker.  However, each of these strategies has its own set of disadvantages, and neither can be considered perfect for all use cases.

In this blog post, we'll first go over some technical background about CodeView so that we can understand the problem, followed by a summary of existing attempts to speed up type merging.  Then, we'll describe a novel extension to the PE/COFF file format which speeds up linking by offloading part of the work required to de-duplicate types to the compiler and using a new algorithm which uniquely identifies type records even across input files, and discuss the various tradeoffs of each approach.  Finally, we'll present some benchmarks and discuss how you can try this out in clang-cl and lld today.


Background

Consider a simple structure in C++, defined like this in a header file:

     struct Node {
       Node *Next = nullptr;
       Node *Prev = nullptr;
       int Value = 0;
     };

Since each compilation happens independently of every other compilation, the compiler cannot assume any other translation unit will ever emit the records necessary to describe this type.  As a result, to guarantee that the type makes it into the final PDB, every compiler instance that encounters this definition must emit type information for this type.  So the record will be serialized by the compiler into a series of records that looks roughly like this:

0x1004 | LF_STRUCTURE [size = 40] `Node`
         unique name: `.?AUNode@@`
         vtable: <none>
         base list: <none>
         field list: <none>
         options: forward ref | has unique name
0x1005 | LF_POINTER [size = 12]
         referent = 0x1004
         mode = pointer
         opts = None
         kind = ptr32
0x1006 | LF_FIELDLIST [size = 52]
         - LF_MEMBER
           name = `Next`
           Type = 0x1005
           Offset = 0
           attrs = public
         - LF_MEMBER
           name = `Prev`
           Type = 0x1005
           Offset = 4
           attrs = public
         - LF_MEMBER
           name = `Value`
           Type = 0x0074 (int)
           Offset = 8
           attrs = public
0x1007 | LF_STRUCTURE [size = 40] `Node`
         unique name: `.?AUNode@@`
         vtable: <none>
         base list: <none>
         field list: 0x1006
         options: has unique name

The values on the left correspond to each record's index in the type sequence and depend on what types have already been encountered, while other records can then refer to them by that index.  For example, referent = 0x1004 means that this record is a pointer to whatever type is at index 0x1004.

As a result of this design, another compilation unit which includes the same header file will need to emit this exact same type, with the only difference being the indices (since the other compilation may encounter other types before this one, causing the ordering to be different).

In short, type indices only make sense within the context of a single type sequence (i.e. compiland), but since the linker needs to see across all object files, it has to have some way of identifying whether a type from object file A is isomorphic to a different type from object file B, even if its type indices might be different numerically from any previously seen type. 

This algorithm, henceforth referred to as type merging, is the primary consumer of CPU cycles during linking (measured in LLD, and estimated in MSVC linker by comparing /DEBUG:FULL vs /DEBUG:FASTLINK times), and as such it is the portion of the linking process which this blog post presents a new solution to.

Existing Solutions

It’s worthwhile to discuss some of the existing attempts to reduce the cost associated with type merging so that we can compare and contrast their various pros and cons.

Type Servers (/Zi)


The /Zi compiler option was one of the first attempts to address type merging speed, and it dates back many years.  The idea behind type servers is to offload the work of de-duplication from the linking phase to the compilation phase.  Most build systems already support parallel compilation, and even if they don’t, cl.exe supports it natively via the /MP compiler switch, so there is no roadblock to anyone taking advantage of parallel compilation.



To implement type servers, each compilation process communicates via IPC with a single process (mspdbsrv.exe) whose job is to de-duplicate type records on the fly.  When a record is isomorphic to an existing record, the type server communicates back the previously saved index; when it is new, it sends back a new index.  This allows type de-duplication to happen mostly in parallel, adding some overhead to each compilation (since there is contention over a global lock) in return for significantly reduced link times, since types will already have been merged.


Type servers bring with them some disadvantages though, so we enumerate them here:
  1. Type servers add significant context switching and global lock contention to the compilation phase, reducing parallelism and degrading overall system performance while a build is in process.  While some performance is reclaimed from the linker, some is sacrificed due to the use of a global system lock.  It’s still a net win, but as it is not free, it leaves open the possibility that we may be able to achieve better parallelism using a different approach.
  2. The type server process itself (mspdbsrv.exe) introduces a single point of failure.  When it crashes (we see C1033 several times per day on Chrome, for example, which seems to indicate an mspdbsrv.exe crash) it could trigger a full rebuild if the type server PDB file is left in a corrupt state.
  3. mspdbsrv is incompatible with distributed builds, which is a show-stopper for large applications that can take several hours to build on normal workstations.  Type servers operate only via local IPC; while multi-processing works well for small applications, many large products use build farms that distribute compilations among tens or hundreds of physical machines, a scenario type servers simply cannot support.

Fastlink PDBs

Fastlink PDBs are a relatively recent introduction, and the approach used by this solution is to eliminate type merging entirely.  To support this, special metadata is set in the PDB file to indicate to the tool that this is a fastlink PDB, and when the tool (e.g. debugger) encounters this metadata, it will fetch all type information from the original object file, rather than from the PDB.  As before, there are several disadvantages to this approach, enumerated here:
  1. The pdbcopy utility is almost unusable with fastlink PDBs for performance reasons.
  2. Since type merging doesn’t happen, indexing of type information also doesn’t happen (since the expensive part of building an index -- the hashing -- comes for free when you were hashing the record anyway).  This leads to degradation in the debugger user experience, since waits which previously happened only at build time now happen at debug-time.
  3. Fastlink PDBs are not portable.  The PDB references the object files by path, so if you copy the PDB and object files to a different machine (or even a different path on the same machine) for archival purposes, they can no longer be debugged.  This is a deal-breaker for using it on production builds.
  4. Symbols can’t be enumerated in a Fastlink PDB.  This is most obvious if you attempt to use DIA SDK on a Fastlink PDB, where it will simply refuse to do anything at all.  This means that the only externally supported way of querying debug info for users is impossible against a Fastlink PDB.  Beyond that, however, it also means that even Microsoft’s own tools which need to enumerate symbols cannot use any standard API for doing so.  For example, WinDbg doesn’t fully support Fastlink PDBs, and many workflows are broken by the use of them, even using supported Microsoft tools.
  5. It has several serious stability issues which make it unusable on large projects  [ref].  This is probably related to point 4 above, namely the fact that every tool that wants to be able to work with a Fastlink PDB needs to use different code than the SDK that has been tested and battle-hardened through years of development.
  6. When compiling with clang-cl and linking with /debug:fastlink the compiler has to be instructed to emit additional debug information, making .obj files about 29% larger.

Clang's Solution - The COFF .debug$H section

This new approach tries to combine the ideas behind type servers and fastlink PDBs.  Like type servers, it attempts to offload the work of de-duplication to the compilation phase so that it can be done in parallel.  However, it does so using an algorithm with the property that the resulting hash can be used to identify a type record even across type streams.  Specifically, if two records have the same hash, they are the same record even if they are from different object files.  If you can take it on faith that such an algorithm exists (which will henceforth be referred to as a global hash), then the amount of work that the linker needs to perform is greatly reduced, and the work that it still has to do can be done much more quickly.  Perhaps most importantly, it produces a PDB that is byte-for-byte identical to the one produced when the option is not used, meaning all of the issues surrounding Fastlink PDBs and compatibility are gone.

Previously, the linker would do something that looks roughly like this:

     HashTable<Type> HashedTypes;
     vector<Type> MergedTypes;
     for (ObjectFile &Obj : Objects) {
       for (Type &T : Obj.types()) {
         remapAllTypeIndices(MergedTypes, T);

         if (!HashedTypes.try_insert(T))
           continue;
         MergedTypes.push_back(T);
       }
     }
The important observations here are:
  1. remapAllTypeIndices is called unconditionally for every type in every object file (a sketch of what this remapping step involves follows this list).
  2. A hash of the type is computed unconditionally for every type.
  3. At least one full record comparison is done for every type.  In practice it turns out to be much more, because hash buckets are computed modulo table size, so there will actually be 1 full record comparison for every probe.
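To make the cost of that first observation concrete, here is a minimal sketch of what a remapping step has to do.  The record model and helper names below are hypothetical stand-ins chosen for illustration, not lld's actual implementation.

     #include <cstdint>
     #include <vector>

     // Hypothetical model of a record: a real CodeView record is a serialized
     // byte blob; here we only model the embedded type indices it contains.
     struct Type {
       std::vector<uint32_t> TypeIndices;
       std::vector<uint32_t> &typeIndexFields() { return TypeIndices; }
     };

     // OldIndexToNew[I] holds the index in the merged output stream that was
     // assigned to the I'th record of a particular object file.
     using TypeIndexMapping = std::vector<uint32_t>;

     // Rewrite every type index inside a record so that it refers to the
     // merged type sequence instead of the per-object one.
     void remapAllTypeIndices(const TypeIndexMapping &OldIndexToNew, Type &T) {
       for (uint32_t &Index : T.typeIndexFields()) {
         // Indices below 0x1000 denote built-in types (e.g. 0x0074 = int) and
         // mean the same thing in every stream; only user-defined indices
         // need to be rewritten.
         if (Index >= 0x1000)
           Index = OldIndexToNew[Index - 0x1000];
       }
     }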
Given a global hash function as described above, the algorithm can be re-written like this:
      HashMap<SHA1, int> HashToIndex;
      vector<Type> OrderedTypes;
      for (ObjectFile &Obj : Objects) {
        auto Hashes = Obj.DebugHSectionHashes;
        for (int I = 0; I < Obj.NumTypes; ++I) {
          Type &T = Obj.types()[I];
          int NextIndex = OrderedTypes.size();
          if (!HashToIndex.try_emplace(Hashes[I], NextIndex))
            continue;
          remapAllTypeIndices(T);
          OrderedTypes.push_back(T);
        }
      }

While this appears very similar, its performance characteristics are quite different.
  1. remapAllTypeIndices is only called when the record is actually new.  Which, as we discussed earlier, is a small fraction of the time over many linker inputs.
  2. A hash of the type is never computed by the linker.  It is simply there in the object file (the exception to this is mixed linker inputs, discussed below, but those are a small fraction of input files).
  3. Full record comparisons never happen.  Since we are using a strong hash function with negligible chance of false collisions, and since the hash of a record provides equality semantics across streams, the hash is as good as the record itself.

Combining all of these points, we get an algorithm that is extremely cache friendly.  Amortized over all input files, most records during type merging are cache hits (i.e. duplicate records).  With this algorithm when we get a cache hit, the only two data structures that are accessed are:
  1. An array of contiguous hash values.
  2. An array of contiguous hash buckets.
Since we never do full equality comparison (which would blow out the L1 and sometimes even L2 cache due to the average size of a type record being larger than a cache line) the algorithm here is very fast.

We’ve deferred discussion of how to create such a hash up until now, but it is actually fairly straightforward.  We use what is known as a “tree hash” or “Merkle tree”.  The idea is to pass bytes from a type record directly to the hash function up until the point we get to a type index.  Then, instead of passing the numeric value of the type index to the hash function, we pass the previously computed hash of the record that is being referenced.
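As a rough illustration of this idea (and only that), here is a self-contained sketch.  The real implementation hashes the raw record bytes with SHA1; below, a simple FNV-style mix stands in for the cryptographic hash, and RecordPiece is a hypothetical decomposition of a record into type-index fields and everything else.

     #include <cstdint>
     #include <vector>

     // A hypothetical decomposition of a type record into pieces: either an
     // embedded type index, or a run of raw bytes.
     struct RecordPiece {
       bool IsTypeIndex = false;
       uint32_t Index = 0;             // valid when IsTypeIndex is true
       std::vector<uint8_t> Raw;       // raw bytes otherwise
     };

     using GlobalHash = uint64_t;

     // FNV-1a step; a stand-in for SHA1 so the sketch stays self-contained.
     static void mix(GlobalHash &H, uint8_t Byte) {
       H = (H ^ Byte) * 0x100000001b3ULL;
     }

     // PreviousHashes[I] is the already-computed global hash of the I'th
     // record in this object file's type stream.  Records only refer to
     // earlier records, so every referenced hash is available when needed.
     GlobalHash treeHashRecord(const std::vector<RecordPiece> &Record,
                               const std::vector<GlobalHash> &PreviousHashes) {
       GlobalHash H = 0xcbf29ce484222325ULL;
       for (const RecordPiece &P : Record) {
         if (P.IsTypeIndex && P.Index >= 0x1000) {
           // Hash the referenced record's hash, not its numeric index, which
           // differs from one object file to the next.
           GlobalHash Ref = PreviousHashes[P.Index - 0x1000];
           for (int I = 0; I < 8; ++I)
             mix(H, uint8_t(Ref >> (I * 8)));
         } else {
           for (uint8_t B : P.Raw)     // everything else is hashed as-is
             mix(H, B);
         }
       }
       return H;
     }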

Such a hash is very fast to compute in the compiler because the compiler must already hash types anyway, so the incremental cost to emit this to the .debug$H section is negligible.  For example, when a type is encountered in a translation unit, before you can add that type to the object file’s .debug$T section, it must first be verified that the type has not already been added.  And since this is happening naturally in the order in which types are encountered, all that has to be done is to save these hash values in an array indexed by type index, and subsequent hash operations will have O(1) access to all of the information needed to compute this merkle hash.
  

Mixed Input Files and Compiler/Linker Compatibility

A linker must be prepared to deal with a mixed set of input files.  For example, while a particular compiler may choose to always emit .debug$H sections, a linker must be prepared to link objects that for whatever reason do not have this section.  To handle this, the linker can examine all inputs up front and manually compute hashes for inputs with missing .debug$H sections.  In practice this proves to be a small fraction and the penalty for doing this serially is negligible, although it should be noted that in theory this can also be done as a parallel pre-processing step if some use cases show that this has non-negligible cost.
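In the same pseudocode style as the snippets above, that up-front normalization pass might look something like this (hasDebugHSection, computeTreeHashes, and setSyntheticHashes are illustrative names, not lld's real API):

     for (ObjectFile &Obj : Objects) {
       if (Obj.hasDebugHSection())
         continue;                 // compiler already provided global hashes
       // Fall back to computing tree hashes in the linker.  This is expected
       // to be rare; if it ever dominates, this loop could run as a parallel
       // pre-processing step instead of serially.
       Obj.setSyntheticHashes(computeTreeHashes(Obj.types()));
     }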

Similarly, the emission of this section in an object file has no impact on linkers which have not been taught to use it.  Since it is a purely additive (and optional) inclusion into the object file, any linker which does not understand it will continue to work exactly as it does today.

The On-Disk Format

Clang uses the following on-disk format for the .debug$H section.

     0x0     : <Section Magic>  (4 bytes)
     0x4     : <Version>        (2 bytes)
     0x6     : <Hash Algorithm> (2 bytes)
     0x8     : <Hash Value>     (N bytes)
     0x8 + N : <Hash Value>     (N bytes)
                    …

Here, “Section Magic” is an arbitrarily chosen 4-byte number whose purpose is to provide some level of certainty that what we’re seeing is a real .debug$H section, and not some section that someone created that accidentally happened to be called that.   Our current implementation uses the value 0x133C9C5, which represents the date of the initial prototype implementation.  But this can be any reasonable value here, as long as it never changes.

“Version” is reserved for future use, so that the format of the section can theoretically change.

“Hash Algorithm” is a value that indicates what algorithm was used to generate the hashes that follow.  As such, the value of N above is also a function of what hash algorithm is used.  Currently, the only proposed value for Hash Algorithm is SHA1 = 0, which would imply N = 20 when Hash Algorithm = 0.  Should it prove useful to have truncated 8-byte SHA1 hashes, we could define SHA1_8 = 1, for example.
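Modeled as a struct, the fixed-size portion of the section described above would look roughly like this (the struct name is illustrative; only the layout and the SHA1 = 0 value come from the description above):

     #include <cstdint>

     // Illustrative model of the fixed .debug$H header.  The section body
     // that follows is simply a sequence of hash values, each N bytes long,
     // where N is determined by HashAlgorithm.
     struct DebugHHeader {
       uint32_t Magic;          // 0x133C9C5 in the current implementation
       uint16_t Version;        // reserved for future format changes
       uint16_t HashAlgorithm;  // 0 = SHA1, implying N = 20
     };
     static_assert(sizeof(DebugHHeader) == 8, "8 bytes on disk");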

Limitations and Pitfalls

The biggest limitation of this format is that it increases object file size.  Experiments locally on fairly large projects show an average aggregate object file size increase of ~15% compared to /DEBUG:FULL (which, for clang-cl, actually makes .debug$H object files smaller than those needed to support /DEBUG:FASTLINK).

There is another, less obvious potential pitfall as well.  Since the linker must agree on a single hash function for all object files, there is the question of what to do when not all object files agree on the hash function, or when not all object files contain a .debug$H section.  If the code is not written carefully, you could get into a situation where, for example, no input files contain a .debug$H section, so the linker decides to synthesize one on the fly for every input file.  Since SHA1 (for example) is quite slow, this could cause a huge performance penalty.  The worst case is when no input files have a .debug$H section at all, but the same problem applies in principle whenever only a subset of files have one.

This limitation can be coded around with some care, however.  For example, tree hashes can be computed up-front in parallel as a pre-processing step.  Alternatively, a hash function could be chosen based on some heuristic estimate of what would likely lead to the fastest link (based on the percentage of inputs that had a .debug$H section, for example).  There are other possibilities as well.  The important thing is to just be aware of this potential pitfall, and if your links become very slow, you'll know that the first thing you should check is "do all my object files have .debug$H sections?"

Finally, since a hash is considered to be identical to the original record, we must consider the possibility of collisions.  That said, this does not appear to be a serious concern in practice.  A single PDB can have a theoretical maximum of 2^32 type records anyway (due to a type index being 4 bytes).  The following table shows the expected number of type records needed for a collision to exist as a function of hash size.
  Hash Size (Bytes)   Average # of records needed for a collision
  4                   82,137
  6                   21,027,121
  8                   5,382,943,231
  12                  3.53 x 10^14
  16                  2.31 x 10^19
  20                  1.52 x 10^24
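For reference, these figures follow the standard birthday-bound approximation: assuming an ideal b-byte hash, the expected number of records before the first collision is roughly

     \sqrt{\frac{\pi}{2} \cdot 2^{8b}}

which for b = 4 gives about 82,000 records, in line with the first row of the table above.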
Given that this is strictly for debug information and not generated code, it’s worth thinking about the severity of a collision.  We feel that an 8-byte hash is probably acceptable for real world use.

Benchmarks

Here we will give some benchmarks on large real world applications (specifically, Chrome and clang).  The times presented are only for the linker.  gn args for each build of chromium are specified at the end.


  Toolchain   Mode              blink_core.dll   content.dll   chrome.dll   clang.exe
  MSVC        /DEBUG:FULL       553.11s          205.45s       507.17s      62.45s
  MSVC        /DEBUG:FASTLINK   116.77s          56.05s        67.80s       29.37s
  lld-link    /DEBUG:FULL       121.17s          42.10s        42.31s       24.14s
  lld-link    /DEBUG:GHASH      88.71s           33.30s        34.76s       17.99s

The numbers here indicate a reduction in link time of up to 30% by enabling /DEBUG:GHASH in lld.

It's worth mentioning that lld does not yet have support for incremental linking so we could not compare the cost of an incremental link with /DEBUG:GHASH versus MSVC.  We still expect incremental linking using MSVC under optimal conditions (e.g. change whitespace in a header file) to produce much faster links than lld is currently able to do.

There are several possible avenues for further optimization though, so we will finish up by discussing them.

Further Improvements

There are several ways to improve the times further, which have yet to be explored.

  1. Use a smaller or faster hash.  We use a 20-byte SHA1 hash.  This is not a multiple of cache line size, and in any case the probability of collision is astronomically small even in the largest PDBs, considering that the theoretical limit of a PDB is just under 2^32 possible unique types (due to the 4-byte size of a type index).  SHA1 is also notoriously slow.  It might be interesting to try, for example, a Blake2 set to output an 8 byte hash.  This should give sufficiently low probability of a collision while improving cache performance.  The on-disk format is designed with this flexibility in mind, as different hash algorithms can be specified in the header.
  2. Hashes for compilands with missing .debug$H sections can be computed in parallel before linking.  Currently when we encounter an object file without a .debug$H section, we must synthesize one in the linker.  Our prototype algorithm does this serially for each input.
  3. Symbol records from .debug$S sections can be merged in parallel.  Currently in lld, we first merge type records into the TPI stream, then we iterate symbol records and remap types in each symbol record to correspond to the new type indices.  If we merge types from all modules up front, the symbol records (with the exception of global symbols) can be merged in parallel, since they get written to independent streams.

Try it out!

If you're already using clang-cl and lld on Windows today, you can try this out.  There are two flags needed to enable this, one for the compiler and one for the linker:
  1. To enable the emission of a .debug$H section by the compiler, you will need to pass the undocumented -mllvm -emit-codeview-ghash-section flag to clang-cl  (this flag should go away in the future, once this is considered stable and good enough to be turned on by default).
  2. To tell lld to use this information, you will need to pass /DEBUG:GHASH to lld.
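Putting the two together, a minimal end-to-end invocation might look roughly like this (hello.cpp and hello.obj are placeholder names, mirroring the examples used elsewhere on this blog):

     clang-cl -c -Z7 -mllvm -emit-codeview-ghash-section -o hello.obj hello.cpp
     lld-link /DEBUG:GHASH hello.obj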
Note that this feature is still considered highly experimental, so we're interested in your feedback (llvm-dev@ mailing list, direct email is ok too) and bug reports (bugs.llvm.org).  

Tuesday, September 19, 2017

Clang ♥ bash -- better auto completion is coming to bash



Compilers are complex pieces of software and have a multitude of command-line options to fine-tune parameters. Clang is no exception: it has 447 command-line options. It’s nearly impossible to memorize all these options and their correct spellings; that’s where shell completion can be very handy. When you type in the first few characters of a flag and hit tab, it will autocomplete the rest for you.

Background
However, such an autocompletion feature is not available yet, as there's no easy way to get a complete list of the options Clang supports. For example, bash doesn’t have any autocompletion support for Clang, and while some shells like zsh have a script for command-line autocompletion, they use hard-coded lists of command-line options that are not automatically updated when a new option is added to Clang. These shells also can’t autocomplete arguments which some flags take (-std=[tab] for instance).


This is the problem we were working to solve during this year’s Google Summer of Code. We’re adding a feature to Clang so that we can implement a complete, exact command-line option completion which is highly portable for any shell. To start with, we'll provide a completion script for bash which uses this feature.

Implementation
Clang now has a new command line option called --autocomplete. This flag receives the incomplete user input from the shell and then queries the internal data structures of the current Clang binary, and returns a list of possible completions. With this API, we can always get an accurate list of options and values any time, on any newer versions of Clang.

For the first implementation, we built bash autocompletion using this. You can find its source code here. There is also a sample of Qt text entry autocompletion that shows how to use this API from a UI application.


You can always complete one flag at a time. So if you want to use the API, you have to select the flag that the user is currently typing and pass it to the --autocomplete option of the selected clang binary. For example, all flags starting with `-tr` are displayed with their descriptions behind them (separated from the flag by a tab character).
The API also supports completing the values of flags. If you have a flag for which value completion is supported, you can provide an incomplete value after the flag, separated by a comma, to get completions for it. If you provide nothing after the comma, all possible values for that flag are listed.
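Based on the behavior described above, querying the API directly from a shell would look roughly like this (the exact output depends on your clang version, so treat it as a sketch):

     clang --autocomplete=-tr        # flags starting with -tr, each followed by its description
     clang --autocomplete=-std=,c    # values of -std= that start with "c"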

How to get it
This feature is available for use now with LLVM/clang 5.0 and we’ll also be adding this feature to the standard bash completion package. Make sure you have the latest clang version on your machine, and source this script. If you want to make the change permanent, just source it from your .bashrc and enjoy typing your clang invocations!

Monday, September 11, 2017

2017 US LLVM Developers' Meeting Program

The LLVM Foundation is excited to announce the selected proposals for the 2017 US LLVM Developers' Meeting!

Keynotes:


Talks:


BoFs:


Tutorials:


Lightning Talks:


Student Research Competition:


Posters:

If you are interested in any of these talks, you should register to attend the 2017 US LLVM Developers' Meeting! Tickets are limited, so register now!

Friday, August 18, 2017

LLVM on Windows now supports PDB Debug Info

For several years, we’ve been hard at work on making clang a world class toolchain for developing software on Windows.  We’ve written about this several times in the past, and we’ve had full ABI compatibility (minus bugs) for some time. One area where it has been notoriously hard to achieve compatibility is debug information, but over the past 2 years we’ve made significant leaps.  If you just want the TL;DR, then here you go: If you’re using clang on Windows, you can now get PDB debug information!


Background: CodeView vs. PDB
CodeView is a debug information format invented by Microsoft in the mid 1980s. For various reasons, other debuggers developed an independent format called DWARF, which eventually became standardized and is now widely supported by many compilers and programming languages.  CodeView, like DWARF, defines a set of records that describe mappings between source lines and code addresses, as well as types and symbols that your program uses.  The debugger then uses this information to let you set breakpoints by function name, display the value of a variable, etc.  But CodeView is only somewhat documented, with the most recent official documentation being at least 20 years old.  While some records still have the format described in that documentation, others have evolved, and entirely new records have been introduced that are not documented anywhere.


It’s important to understand though that CodeView is just a collection of records.  What happens when the user says “show me the value of Foo”?  The debugger has to find the record that describes Foo.  And now things start getting complicated.  What optimizations are enabled?  What version of the compiler was used?  (These could be important if there are certain ABI incompatibilities between different versions of the compiler, or as a hint when trying to reconstruct a backtrace in heavily optimized code, or if the stack has been smashed).  There are a billion other symbols in the program, how can we find the one named Foo without doing an exhaustive O(n) search?  How can we support incremental linking so that it doesn’t take a long time to re-generate debug info when only a small amount of code has actually changed?  How can we save space by de-duplicating strings that are used repeatedly?  Enter PDB.


PDB (Program Database) is, as you might have guessed from the name, a database.  It contains CodeView but it also contains many other things that allow indexing of the CodeView records in various ways.  This allows for fast lookups of types and symbols by name or address, the philosophical equivalent of “tables” for individual input files, and various other things that are mostly invisible to you as a user but largely responsible for making the debugging experience on Windows so great.  But there’s a problem: While CodeView is at least kind-of documented, PDB is completely undocumented.  And it’s highly non-trivial.


We’re Stuck (Or Are We?)
Several years ago, we decided that the path forward was to abandon any hope of emitting CodeView and PDB, and instead focus on two things:
  1. Make clang-cl emit DWARF debug information on Windows
  2. Port LLDB to Windows and teach it about the Windows ABI, which would be significantly easier than teaching Visual Studio and/or WinDbg to be able to interpret DWARF (assuming this is even possible at all, given that everything would have to be done strictly through the Visual Studio / WinDbg extensibility model)
In fact, I even wrote another blog post about this very topic a little over 2 years ago.  So I got it to work, and I eventually got parts of LLDB working on Windows for simple debugging scenarios.


Unfortunately, it was beginning to become clear that we really needed PDB.  Our goal has always been to create as little friction as possible for developers who are embedded in the Windows ecosystem.  Tools like Windows Performance Analyzer and vTune are very powerful and standard tools in engineers’ existing repertoires.  Organizations already have infrastructure in place to archive PDB files, and collect & analyze crash dumps.  Debugging with PDB is extremely responsive given that the debugger does not have to index symbols upon startup, since the indices are built into the file format.  And last but not least, tools such as WinDbg are already great for post-mortem debugging, and frankly many (perhaps even most) Windows developers will only give up the Visual Studio debugger when it is pried from their cold dead hands.


I got some odd stares (to put it lightly) when I suggested that we just ask Microsoft if they would help us out.  But ultimately we did, and… they agreed!  This came in the form of some code uploaded to the Microsoft Github repo which we were on our own to figure out.  Although they were only able to upload a subset of their PDB code (meaning we had to do a lot of guessing and exploration, and the code didn’t compile either since half of it was missing), it filled in enough blanks that we were able to do the rest.


After about a year and a half of studying this code, hacking away, studying the code some more, hacking away some more, etc, I’m proud to say that lld (the LLVM linker) can finally emit working PDBs.  All the basics like setting breakpoints by line, or by name, or viewing variables, or searching for symbols or types, everything works (minus bugs, of course).


For those of you who are interested in digging into the internals of a PDB, we also have been developing a tool for expressly this purpose.  It’s called llvm-pdbutil and is the spiritual counterpart to Microsoft’s own cvdump utility.  It can dump the internals of a PDB, convert a PDB to yaml and vice versa, find differences between two PDBs, and much more.  Brief documentation for llvm-pdbutil is here, and a detailed description of the PDB file format internals is here, consisting of everything we’ve learned over the past 2 years (still a work in progress, as I have to divide my time between writing the documentation and actually making PDBs work).


Bring on the Bugs!
So this is where you come in.  We’ve tested simple debugging scenarios with our PDBs, but we still consider this alpha in terms of debug info quality.  We’d love for you to try it out and report issues on our bug tracker.  To get you started, download the latest snapshot of clang for Windows.  Here are two simple ways to test out this new functionality:
  1. Have clang-cl invoke lld automatically
    1. clang-cl -fuse-ld=lld -Z7 -MTd hello.cpp
  2. Invoke clang-cl and lld separately.
    1. clang-cl -c -Z7 -MTd -o hello.obj hello.cpp
    2. lld-link -debug hello.obj
We look forward to the onslaught of bug reports!


We would like to extend a very sincere and deep thanks to Microsoft for their help in getting the code uploaded to the github repository, as we would never have gotten this far without it.


And to leave you with something to get you even more excited for the future, it's worth reiterating that all of this is done without a dependency on any Windows-specific API, DLL, or library.  It's 100% portable.  Do I hear cross-compilation?

Zach Turner (on behalf of the LLVM Windows Team)