LLVM Project News and Details from the Trenches

Wednesday, April 14, 2010

Extensible Metadata in LLVM IR

A common request by front-end authors is to be able to add some sort of metadata to LLVM IR. This metadata could be used to influence language-specific optimization passes (for example, Type-Based Alias Analysis in C), tag information for a custom code generator, or pass information through to link-time optimization. LLVM 2.7 provides first-class support for this, and has switched debug information over to use it (improving debug info!).

While the details of this feature can be found in the LLVM Language Reference manual, sometimes it is hard to distill the big picture from the low-level details. This post tries to fill the gap by explaining some history, motivation and example use cases for this new LLVM 2.7 feature.

This post was written by Devang Patel and myself.

Before we dive into how metadata works, it is useful to describe how debug information was represented in LLVM 2.6 and earlier:

Debug Information in LLVM 2.6

Debugging information communicates source location information, type information and variable information to the debugger. This information is not used during the execution of the program and does not result in executable code in the object file, but the code generator uses it to produce DWARF information. In this way, debug information is a sort of side channel from the front-end to the DWARF emitter in the code generator.

For lack of a better mechanism, in LLVM 2.6 and earlier, debugging information is encoded using global variables tagged with a special "llvm.metadata" section. For example, we would generate something like this to describe C code like "int my_data;":

@my_data = common global i32 0, align 4

@llvm.dbg.global_variable = internal constant %llvm.dbg.global_variable.type { 
  i32 458804, 
  {}* bitcast (%llvm.dbg.anchor.type* @llvm.dbg.global_variables to { }*), 
  {}* bitcast (%llvm.dbg.compile_unit.type* @llvm.dbg.compile_unit to { }*), 
  i8* getelementptr ([8 x i8]* @.str4, i32 0, i32 0), 
  i8* getelementptr ([8 x i8]* @.str4, i32 0, i32 0), 
  i8* null, 
  {}* bitcast (%llvm.dbg.compile_unit.type* @llvm.dbg.compile_unit to { }*), 
  i32 2, 
  {}* bitcast (%llvm.dbg.basictype.type* @llvm.dbg.basictype to { }*), 
  i1 false, 
  i1 true, 
  {}* bitcast (i32* @my_data to {}*)
}, section "llvm.metadata"

@.str4 = internal constant [8 x i8] c"my_data\00", section "llvm.metadata"

In this example, @my_data is the actual global variable generated for the C variable. This is what gets generated regardless of whether debug info is enabled or not.

Here, @.str4 and @llvm.dbg.global_variable were interpreted by the code generator as descriptors holding information about @my_data, with various fields indicating the line number on which the global is declared, the compile unit, etc. You can see a full description of these fields in the LLVM 2.6 Debug Info docs. At code generation time, the DWARF writer would walk this and convert it to DWARF information. The LLVM globals would not get emitted as normal code because they are in the magic llvm.metadata section.

While this did provide a basic level of functionality, it had a number of drawbacks. First, the @my_data global variable has an extra use in the IR, which may influence optimization of @my_data. For example, the dead global elimination pass wouldn't delete it even if it were otherwise dead, and the mod/ref analysis pass wouldn't analyze it because it appears that its address is taken. One major goal of debug information is that turning it on should not affect the executable code generated by the compiler. If it did, turning on debug info could hide the bug you're trying to track down!

A second drawback of this implementation is that it has lots of pointless bitcast constant expressions. These extra objects bloat memory footprint, take time to allocate, unique and optimize, etc. The bitcasts also negatively impact readability of LLVM intermediate code, and are completely unnecessary: the dwarf emitter doesn't care about the types, it is walking this information as a data structure, not emitting it to memory.

Motivation for LLVM IR Metadata

Based on our experiences with debug info, and a desire to implement new cool things, we designed and implemented a brave new world where metadata was actually a first-class part of LLVM IR. The design aims to solve the issues mentioned above:

  1. Optimizations shouldn't be affected by metadata unless they explicitly try to look at it.
  2. We want to reduce the memory footprint and cost of debug info.
  3. Metadata shouldn't have LLVM IR types.
  4. Ideally, the syntactic clutter should be reduced, improving the odds that someone can decode this stuff.

Another important design point is that we want to be able to add new forms of metadata without the optimizers having to be updated to support them. This is critical, because we want the metadata to be extensible by front-end authors to do whatever they want, and doing so shouldn't require hacking the optimizers.

Metadata in LLVM 2.7

Metadata support includes several different related IR extensions: a new 'metadata' type in LLVM IR, new MDString, MDNode, and NamedMDNode classes (all three of which derive from 'Value'), added support for referencing metadata from intrinsics, and support for attaching it to instructions. Metadata support is generally in the llvm/Metadata.h header. We'll walk through each of these constructs in turn:

The new 'metadata' type is the LLVM Type of each new IR object. This ensures that you can't use metadata as an operand to arbitrary instructions: for example, you can't write 'add i32 4, !"str"', since metadata is not a first-class type. The restrictions on metadata mean that it can only appear as an argument to an intrinsic, as an operand to another metadata node, at top level in a module (NamedMDNode), or attached to an instruction.

The new MDString class is used to represent string data in metadata, and it always has a metadata type. Since MDStrings are meant as metadata, not code, they are not null terminated in the .ll file. The MDString class allows C++ code walking the IR to access the arbitrary string data with a StringRef. In the .ll file, its syntax is something like:

!"foo"

The new MDNode class is a tuple that can reference arbitrary LLVM IR values in the program as well as other metadata. In the .ll file, MDNodes are numbered and the syntax for referring to one is "!123" where 123 is the number of the node being referenced. An MDNode is declared with something like:

!23 = metadata !{ i32 4, metadata !"foo", i32* @G, metadata !22 }

In this case, the MDNode has four operands: the first is a ConstantInt, the second is an MDString, the third is a global variable, and the fourth is another MDNode. MDNodes come in two flavors. One is a normal global MDNode, which references global variables, constants, etc. The other is a function-local MDNode, which can (potentially transitively) refer to instructions within a particular function. One important aspect of MDNodes is that they are not considered to be "uses" of a value: for example, they won't be found with use_iterator and aren't counted for predicates like Value::hasOneUse(). This prevents metadata from accidentally affecting code generation.
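To make the "not a use" point concrete, here is a minimal sketch in plain C++. It is a toy model and not the LLVM API: ToyValue, ToyMDNode and usesAfterTagging are hypothetical names, and the only assumption is that real operands register themselves in a use list while metadata operands do not.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model, not the LLVM API: a value keeps a list of its
// real users, while metadata nodes hold plain pointers that are never
// registered in that list.
struct ToyValue {
    std::vector<const void*> users;  // everything that uses this value as an operand

    void addUse(const void* user) { users.push_back(user); }
    bool hasOneUse() const { return users.size() == 1; }
    bool isDead() const { return users.empty(); }
};

struct ToyMDNode {
    std::vector<ToyValue*> operands;  // references, deliberately not uses

    // Note that this never calls ToyValue::addUse().
    void addOperand(ToyValue* v) { operands.push_back(v); }
};

// Returns the number of real uses after one instruction and one metadata
// node both refer to the same value.
inline std::size_t usesAfterTagging() {
    ToyValue global;
    int instruction = 0;  // stands in for an instruction that uses the value
    ToyMDNode descriptor;

    global.addUse(&instruction);
    descriptor.addOperand(&global);
    return global.users.size();
}
```

In this model, a value referenced only by a ToyMDNode still reports isDead(), which is the property that lets a pass like dead global elimination behave the same with or without the metadata.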

The new NamedMDNode class provides named access to metadata at a module level, and each NamedMDNode contains a list of MDNodes. This gives clients of metadata (e.g. debug info) the ability to find all the metadata of a particular form (e.g. global variable debug descriptors). The Module class maintains a list of NamedMDNode instances just like it does global variables, functions, and aliases. In the .ll file, a NamedMDNode looks like this:

!my_named_mdnode = !{ !1, !2, !4212 }

This defines a NamedMDNode with three referenced MDNodes.
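As a rough illustration of why named lists are handy for clients, the sketch below models a module that maps names to lists of nodes. ToyModule, ToyNode, addToNamed and getNamed are hypothetical names for this toy and are not the real Module interface.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for an MDNode; only an id so nodes can be told apart.
struct ToyNode {
    int id;
};

// Hypothetical stand-in for a module that owns named metadata lists.
struct ToyModule {
    std::map<std::string, std::vector<const ToyNode*>> named;

    // Append a node to the list with the given name, creating it if needed.
    void addToNamed(const std::string& name, const ToyNode* node) {
        named[name].push_back(node);
    }

    // Look up a named list; returns nullptr when no such name exists.
    const std::vector<const ToyNode*>* getNamed(const std::string& name) const {
        auto it = named.find(name);
        return it == named.end() ? nullptr : &it->second;
    }
};
```

A client such as a debug info emitter would ask for one well-known name and iterate the result, rather than scanning every node in the module.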

LLVM intrinsics may reference metadata as normal operands. More specifically, they can directly reference MDNode and MDString objects even though other calls and other operations cannot. In .ll files, this looks something like:

!0 = metadata !{i32 524544, ...

  %x = alloca i32
  call void @llvm.dbg.declare(metadata !{i32* %x}, metadata !0)

This passes the module-level !0 MDNode as the second argument and a function-local MDNode as the first argument (which, since it is an MDNode, does not count as a use of %x). In this case, the code generator uses this information to know that the metadata !0 is the variable descriptor for the alloca %x. Note that intrinsics themselves are not considered metadata, so they can affect code generation etc.

Finally, metadata can be attached to instructions. Instructions can have an arbitrary list of MDNodes attached to them with string tags. For example:

  store i32 0, i32* %P, !nontemporal !2, !frobnatz !123
  ret void, !dbg !9

The first case is a store with two instruction-level metadata records attached to it, one named 'nontemporal' (which is implemented in LLVM 2.7) and one named 'frobnatz' (which is a great new feature that might be in LLVM 2.8). The second is a return instruction with a debug location attached to it.
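One way to picture instruction-level attachments is as a small map from tag name to node. The sketch below is a toy model under that assumption; ToyInstruction, ToyMD, setMetadata, getMetadata and isNonTemporal are hypothetical here and say nothing about how LLVM stores attachments internally.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for an MDNode.
struct ToyMD {
    int id;
};

// Hypothetical instruction carrying tagged metadata attachments.
struct ToyInstruction {
    std::map<std::string, const ToyMD*> attachments;

    // Attach a node under a tag; passing nullptr removes the attachment.
    void setMetadata(const std::string& tag, const ToyMD* node) {
        if (node != nullptr)
            attachments[tag] = node;
        else
            attachments.erase(tag);
    }

    // Returns the attached node, or nullptr if the tag is absent.
    const ToyMD* getMetadata(const std::string& tag) const {
        auto it = attachments.find(tag);
        return it == attachments.end() ? nullptr : it->second;
    }
};

// A client that only understands one tag simply ignores all the others.
inline bool isNonTemporal(const ToyInstruction& inst) {
    return inst.getMetadata("nontemporal") != nullptr;
}
```

Because a lookup for an unknown tag just returns nullptr, a pass that has never heard of 'frobnatz' needs no changes when a front-end starts attaching it.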

Using Extensible Metadata for Debug Info

To contrast with the LLVM 2.6 debug info example above, in LLVM 2.7 we get something like this:

@my_data = common global i32 0, align 4

!llvm.dbg.gv = !{!0}

!0 = metadata !{
  i32 524340, i32 0, metadata !1, metadata !"my_data", metadata !"my_data",
  metadata !"", metadata !1, i32 2, metadata !3,
  i1 false, i1 true, i32* @my_data
}

This replaces the global variables with an MDNode and MDStrings. This shrinks the IR by eliminating the pointless bitcasts and the irrelevant IR types, and the reference to @my_data from !0 is no longer considered a "use". However, we still have lots of magic fields that are documented elsewhere.

If you'd like to see more examples of debug info, you can see what the frontend generates by using something like "clang foo.c -g -S -o - -emit-llvm | less".

What to use Metadata for

A subtle point that was touched on above is that we don't want the optimizers to have to know about metadata. While it is very feasible to make optimizations preserve specific metadata (e.g. loop strength reduction could do some sort of fancy thing to update debug info if it wanted), by default optimizations ignore and destroy it. For example, if an optimization deletes an instruction and there is a function-local MDNode referencing it, the reference in the MDNode will implicitly drop to null.

This has some important implications for what it is safe to use metadata for: it can only be used for "value add" information, that is, information that does not change the semantics of the program. To repeat this important point, use of metadata is only safe if the program retains its semantics when the metadata is silently dropped.

For example, it is trivially safe for debug information to use metadata (though the dwarf emitter has to be careful to tolerate null pointers!): if metadata is dropped, it just means that debug information quality is reduced, it doesn't invalidate the debug info itself. In our example above, if the global "my_data" is deleted by the optimizer, the reference will drop to null and the debug info emitter won't generate a location for my_data.
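The drop-to-null behavior and the null tolerance it demands from consumers can be modeled with weak references. This is only a sketch in plain C++: ToyGlobal, ToyDescriptor and emitLocations are hypothetical names, and std::weak_ptr merely stands in for whatever tracking mechanism the real implementation uses.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a global variable in the IR.
struct ToyGlobal {
    std::string name;
};

// Hypothetical debug descriptor: it refers to the global without keeping
// it alive, so deleting the global leaves a null reference behind.
struct ToyDescriptor {
    std::string sourceName;
    std::weak_ptr<ToyGlobal> variable;
};

// Toy "emitter": produce a location only for descriptors whose global
// still exists, and silently skip the ones whose reference went null.
inline std::vector<std::string> emitLocations(const std::vector<ToyDescriptor>& descs) {
    std::vector<std::string> out;
    for (const ToyDescriptor& d : descs) {
        if (std::shared_ptr<ToyGlobal> g = d.variable.lock())
            out.push_back(d.sourceName + " -> @" + g->name);
        // Otherwise the global was deleted: less debug info, but still valid.
    }
    return out;
}
```

If the only owner of the ToyGlobal releases it (the analogue of the optimizer deleting my_data), emitLocations returns an empty list instead of crashing, which mirrors the reduced-quality but still valid debug info described above.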

While this may sound limiting, there are lots of potential use cases for metadata; you just have to be careful how you structure it. Let's run through a few examples:

Current and Potential Clients of Metadata

LLVM 2.7 supports generating non-temporal loads and stores using the !nontemporal instruction-level modifier as documented in the LangRef manual. A non-temporal access is a normal access with a hint to the CPU that it can avoid pulling data into the cache, as the data won't be accessed again soon. This is safe because !nontemporal is an optimization hint: dropping it will result in a normal load or store being generated, which may have lower performance, but provides the same semantics as an actual non-temporal access.

A potential future use case is to support Type-Based Alias Analysis (TBAA). TBAA is an optimization that lets the compiler know that "float *P1" and "int *P2" can never alias (in GCC, this is enabled with -fstrict-aliasing). The trick is that it isn't safe to implement TBAA in terms of LLVM IR types: you really need to be able to encode and express a type-subset graph according to the complex source-level rules (e.g. in C, "char*" can alias anything).

An LLVM implementation of TBAA would encode the type-subset graph with MDNodes, and add type tags to load and store operations with a !tbaa instruction tag. A new AliasAnalysis implementation would look for these tags on accesses and walk the type subset graph to determine if the two accesses might alias each other. This use of metadata is also safe, because it is an optimization: if the type tag gets dropped, it is always safe to assume that the access aliases everything for TBAA purposes.
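To give a feel for the kind of query such an AliasAnalysis would answer, here is a deliberately simplified sketch. It assumes the type-subset graph is a tree with "char" at the root, and that two accesses may alias exactly when one tag is an ancestor of the other. ToyTypeTag, isAncestorOrSelf and mayAlias are hypothetical names, and nothing here is an actual LLVM design.

```cpp
#include <cassert>
#include <string>

// Hypothetical type tag: each tag points at its parent in a type tree.
struct ToyTypeTag {
    std::string name;
    const ToyTypeTag* parent;  // nullptr for the root of the tree
};

// True if 'ancestor' is 'tag' itself or lies on its chain of parents.
inline bool isAncestorOrSelf(const ToyTypeTag* ancestor, const ToyTypeTag* tag) {
    for (; tag != nullptr; tag = tag->parent)
        if (tag == ancestor)
            return true;
    return false;
}

// A nullptr tag models a missing or dropped attachment, so the answer
// must be the conservative one: the accesses may alias.
inline bool mayAlias(const ToyTypeTag* a, const ToyTypeTag* b) {
    if (a == nullptr || b == nullptr)
        return true;
    return isAncestorOrSelf(a, b) || isAncestorOrSelf(b, a);
}
```

With "int" and "float" both as children of "char", mayAlias says that int and float accesses don't alias, that either may alias char, and that anything may alias an access whose tag was dropped. The last case is what makes this use of metadata safe.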

More broadly, metadata is a great way for a front-end to communicate arbitrary information to custom language-specific optimization passes. TBAA is one example, but this could equally apply to things like devirtualization (through class hierarchy analysis), locking and exception handling optimizations, and even library-centric optimizations.

Since LLVM 2.7 is only the first release that supports metadata in its IR, we have yet to see how it will ultimately get used. If you end up using it in a novel or interesting way, please send me a link describing your use and I'll link to it from this post.

-Chris and Devang