New passes in clang-tidy to detect (some) Trojan Source

Trojan Source

The original Trojan Source paper describes a family of attacks that rely on Unicode properties to make code look different from how a compiler processes it. For instance, the following code, taken from the paper:

#include <stdio.h>
#include <stdbool.h>

int main() {
  bool isAdmin = false;
  /*‮ } ⁦if (isAdmin)⁩ ⁦begin admins only */
  printf("You are an admin.\n");
  /* end admins only ‮ { ⁦*/
  return 0;
}

looks like there is a guard on isAdmin, while the compiler actually sees the first comment as the following sequence of code points:

/* <U+202E> } <U+2066> if (isAdmin) <U+2069> <U+2066> begin admins only */

This issue was submitted to the LLVM Security group before the official release, and while we agreed that it is more of a display issue than an actual compiler issue, we also agreed that having a clang-tidy check for each flaw described in the paper could not hurt.

Using clang-tidy

clang-tidy can run a set of extra passes over a codebase to detect coding convention issues, API misuse, security flaws, and so on. We have added three new checks:

Detecting misleading bidirectional characters

The new check misc-misleading-bidirectional parses each comment and string literal in the codebase and looks for unterminated bidirectional sequences, i.e. sequences that leak past the end of the comment or string literal and cause regular code to be displayed right-to-left instead of the usual left-to-right. In the case of the example above, we get a warning close to:

5:3: warning: comment contains misleading bidirectional Unicode characters [misc-misleading-bidirectional]
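The idea of an "unterminated bidirectional sequence" can be sketched as a small stack walk over the code points of a comment or string literal. This is an illustration under stated assumptions, not the code of the actual check: the function name leavesBidiContextOpen is hypothetical, the input is assumed to be already decoded into code points, and line breaks are assumed to reset all open contexts.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper (not clang-tidy's implementation): returns true when
// the text opens a bidirectional context that it never closes.
bool leavesBidiContextOpen(std::u32string_view text) {
  // Each stack entry records which terminator closes the open context.
  enum class Closer { PDF, PDI };
  std::vector<Closer> open;

  for (char32_t c : text) {
    if (c == 0x202A || c == 0x202B || c == 0x202D || c == 0x202E) {
      // LRE, RLE, LRO, RLO: embeddings and overrides, closed by PDF.
      open.push_back(Closer::PDF);
    } else if (c == 0x2066 || c == 0x2067 || c == 0x2068) {
      // LRI, RLI, FSI: isolates, closed by PDI.
      open.push_back(Closer::PDI);
    } else if (c == 0x202C) {
      // PDF closes the innermost embedding or override, if there is one.
      if (!open.empty() && open.back() == Closer::PDF)
        open.pop_back();
    } else if (c == 0x2069) {
      // PDI closes the nearest isolate and everything opened inside it.
      for (std::size_t i = open.size(); i-- > 0;) {
        if (open[i] == Closer::PDI) {
          open.resize(i);
          break;
        }
      }
    } else if (c == U'\n' || c == 0x2029) {
      // Assumption of this sketch: a line break or paragraph separator
      // ends every open context.
      open.clear();
    }
  }
  return !open.empty();
}
```

Applied to the body of the first comment in the example, the walk ends with two contexts still open (the initial override and the second isolate), which is the situation the warning reports.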

Detecting misleading identifiers

C++ allows some Unicode codepoints within identifiers, including codepoints that have a strong right-to-left direction, which can lead to misleading statements. For instance, in the following,

int א = ג;

Are we assigning to א or to ג? We are actually doing the latter, and that is confusing. The misc-misleading-identifier pass detects that situation and outputs a warning similar to

10:3: warning: identifier has right-to-left codepoints [misc-misleading-identifier]
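As a rough sketch of what "identifier has right-to-left codepoints" means, the function below flags an identifier that contains at least one Hebrew or basic Arabic letter. The function names are hypothetical and the two ranges are a deliberately small assumption chosen for illustration; a real check would need the full Unicode bidirectional class data.

```cpp
#include <string_view>

// Approximation for illustration only: Hebrew letters (U+05D0..U+05EA) and
// basic Arabic letters (U+0620..U+064A). Other right-to-left scripts are
// not covered by this sketch.
bool isRightToLeftLetter(char32_t c) {
  return (c >= 0x05D0 && c <= 0x05EA) || (c >= 0x0620 && c <= 0x064A);
}

// Hypothetical helper: true if any code point of the (already decoded)
// identifier is one of the right-to-left letters above.
bool hasRightToLeftCodepoint(std::u32string_view identifier) {
  for (char32_t c : identifier) {
    if (isRightToLeftLetter(c))
      return true;
  }
  return false;
}
```

With this sketch, the identifier א (U+05D0) is flagged while an ASCII identifier such as isAdmin is not.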

Detecting confusing identifiers

Who has never received spam that uses Unicode characters resembling ASCII characters to slip past some hypothetical anti-spam filter? C-like languages do not escape the trend, and it is perfectly valid, and confusing, to define

int foo;

at some point of the program, and

int 𝐟oo;

elsewhere. The misc-homoglyph checker detects such confusable identifiers (a.k.a. homoglyphs) based on the confusables.txt list maintained by the Unicode Consortium. In the case above, one would get a warning similar to

7:5: warning: 𝐟oo is confusable with foo [misc-homoglyph]
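One way to picture a confusables-based comparison is to map each identifier to a "skeleton" in which every look-alike character is replaced by a common representative, and then to compare skeletons. The sketch below assumes a tiny hand-written table with three entries; the names skeleton and areConfusable are hypothetical, and the real confusables.txt is far larger and can map one character to a sequence of several.

```cpp
#include <map>
#include <string>
#include <string_view>

// Hypothetical helper: replaces each code point by its look-alike
// representative, using a three-entry excerpt in the spirit of
// confusables.txt (an assumption of this sketch, not the full data).
std::u32string skeleton(std::u32string_view identifier) {
  static const std::map<char32_t, char32_t> table = {
      {0x1D41F, U'f'}, // MATHEMATICAL BOLD SMALL F
      {0x0430, U'a'},  // CYRILLIC SMALL LETTER A
      {0x043E, U'o'},  // CYRILLIC SMALL LETTER O
  };
  std::u32string out;
  for (char32_t c : identifier) {
    auto it = table.find(c);
    out.push_back(it == table.end() ? c : it->second);
  }
  return out;
}

// Two distinct identifiers are treated as confusable when their skeletons
// are identical.
bool areConfusable(std::u32string_view a, std::u32string_view b) {
  return a != b && skeleton(a) == skeleton(b);
}
```

Under these assumptions, foo and 𝐟oo share the skeleton foo and are reported as confusable, while foo compared with itself is not.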

Concluding Words

As described in this post, we chose to implement the Trojan Source countermeasures as several clang-tidy checks. Doing so, rather than implementing them as Clang warnings, is a trade-off that favors parse time.

Interested readers can discover the alternative GCC approach in this dedicated blog post!