ARM Advanced SIMD (NEON) Intrinsics and Types in LLVM

LLVM now supports all the intrinsic functions defined by ARM for the Advanced SIMD (aka "NEON") instruction set, but if you are migrating from GCC to LLVM, there are some implementation differences that you may encounter. LLVM follows ARM's specification of the standard NEON types more closely than GCC. It is also more strict about checking types of arguments to the NEON intrinsics.

This post describes the NEON-related differences between LLVM and GCC and gives a few examples of how to adapt your code to work with LLVM.

Background

NEON is a vector processing extension to the ARM architecture. It is included in most recent ARM processors such as the Cortex A8 and A9. Some of the NEON instructions perform operations that are not simple to specify in C or C++, so ARM has defined a standard set of intrinsic functions for those operations. For example, the vqadd_s16 intrinsic performs a saturating add of two 64-bit vectors with elements that are 16-bit signed integers. ARM has also defined a standard set of NEON vector types to be used with these intrinsics. For example, the arguments and return value of the vqadd_s16 intrinsic have a type of int16x4_t. These intrinsics and types are declared in the <arm_neon.h> header file, which is provided by the compiler.

There are at least two prior implementations of these NEON intrinsics. ARM's RealView Compilation Tools (RVCT) compiler provides the full set of them, and not surprisingly, RVCT adheres closely to ARM's specification. GCC also has an implementation of NEON intrinsics, but it differs in some ways from RVCT and ARM's specification (at least in the 4.2.1 version from which llvm-gcc is derived).

The current status of NEON intrinsics in LLVM is that llvm-gcc has full support for them, although there is undoubtedly room for further performance tuning. Clang does not yet support NEON. Patches are welcome!

Different Types

ARM defines the NEON vector types as opaque "containerized vectors". These types are defined in <arm_neon.h> as C structures. The user-visible type names are typedefs to these internal structures. For example, the type for a vector of 4 floats is defined as:

  typedef struct __simd128_float32_t float32x4_t;

The content of the internal structures is not specified, so the only thing you can portably do with values of these types is to pass them to NEON intrinsics.

GCC has its own syntax for specifying vector types. This syntax is not specific to NEON. A vector type is defined by adding a "__vector_size__" attribute with the total vector size in bytes. For example:

  typedef float float32x4_t __attribute__ ((__vector_size__ (16)));

Instead of using the opaque containerized vectors, GCC's implementation of the standard NEON types uses its own vector syntax.

So what about LLVM? We compromise and do both! The NEON types are defined as structures in <arm_neon.h>, following ARM's specification, but the contents of those structures are vectors defined with GCC's syntax. Each internal structure has a single element named "val" with a GCC vector type. The GCC vector types are defined with a "__neon_" prefix to the standard NEON type name. So, if you want to access the GCC vector type directly, you can do it with LLVM. That code won't be portable — it won't work with RVCT — but it may ease the transition from GCC.

What are the implications of this difference in the NEON types? The main thing is that the LLVM NEON types are aggregates, not scalars, so you can't do things like casting them to integer types. You also can't assign NEON variables to specific NEON registers using "asm" register attributes, since that is not supported for aggregates. See below for some related differences in the way you initialize vectors.

Stricter Type Checking

Arguments to LLVM's NEON intrinsics are subject to much stricter type checking than with GCC. GCC's vector types can be cast to other vector types as long as the total size remains the same and you don't mix integer and floating-point vectors. The arguments to GCC's NEON intrinsics get this same treatment. You can pass a uint8x8_t value for a int32x2_t argument and GCC will not even warn you. LLVM requires the argument types to match exactly. If your code is sloppy with vector types, you will have to clean it up to compile with LLVM.

If you really want to cast NEON vector types, the right way to do it is with the vreinterpret intrinsics. For example, vreinterpret_s32_u8 will perform the cast from uint8x8_t to int32x2_t that was mentioned above.

How to Initialize a Vector?

GCC's vector types can be directly assigned a brace-enclosed list of values corresponding to the vector elements. For example:

  int32x2_t vec = { 1, 2 };

initializes a vector with element values 1 and 2. This is convenient but not portable. In general, the best way to assign a vector value is to load it from memory with a NEON intrinsic. That is completely portable and often as fast or faster than the alternatives.

There are some special cases where you can do better. If all the vector elements have the same value, one of the vdup intrinsics would be a good solution. You can use the vcreate intrinsics to construct vectors from 64-bit values, and the vcombine intrinsics can put two of them together to form a 128-bit vector. However, moving values from general-purpose ARM registers to the NEON register file can be quite slow, so this may not be faster than a load. If the vector elements are floating-point values, then they are likely already in the right register file, and using vset_lane intrinsics to put them together into a vector may be faster. Generating the fastest code for these different cases is a work in progress, so you may need to experiment with different approaches to see which is fastest.