Tuesday, July 12, 2011

Location information

One of the hardest parts about writing a compiler is producing correct location information. Well, it is very easy to get the precise location of a token (most of the time, in any case), but the hard part is assigning token locations to actual AST productions, since they are composed of several tokens. Take a simple function definition: void foo(int bar) {}. Nearly every token in that definition can be argued to be the right location of the function. Hence showing only line information is common for compilers: in most normal usage, you get the same information no matter which token you pick. However, there are times when the actual column number is crucial: if you need to correlate something to the original source code (e.g., for rewriting or for DXR), you want that column number to be spot on.

Now let's consider pathological cases. Suppose we define that function in a templated class. Do I want the location of the template or the location of the template instantiation? The former is often better, but there are times you want the latter. But you can also make the function in a macro: #define FUNCTION(name) void name(int bar) {} and FUNCTION(foo). Do I want the location within the macro definition or the location in the macro instantiation? Macros let you be really evil: #define FUNCTION(name) void start##name(int bar) {}. Now which location do I want? Oh yeah, and while we're on the topic, there is also the issue of autogenerated code (e.g., I want the location to be what it was in the original file). So which location do I use?

There are two orthogonal issues here: which token do I use in the "final" version of the text, and which version of the text do I grab it from. There are several versions of the text (think of the lexer not as a single unit but instead of a chain of steps that pass the information between them); we don't care about some of the intermediate representations (e.g., what the code looks like after replacing trigraphs). The lowest level of these is the position in the last form of text, just before being parsed for real. It's also important to track the position through repeated invocations of macros (most of the time). We also want the location in the text file as it is when fed to the compiler to begin with, as well as the position in the original file before it is fed to whatever output the C/C++ source file.

Clang gives us some of this information for any SourceLocation. The source of the actual token is the spelling location. On the other hand, the instantiation location is the location of the outermost macro invocation to generate the production. It is possible to get the tree of locations by repeatedly calling getImmediateSpellingLocation, which only looks down one level of invocation. To get the result after following #line directives, you need to use the presumed location, which looks at the instantiation location. It is possible to pass in the spelling location to getPresumedLoc to produce the correct results, though. Unfortunately, precise location information goes out the window if your macro uses # or ## (it goes into "scratch space" whose line numbers don't correlate well enough to the original source code for me to make sense of).

The other direction is one of the possible methods to get the source location in the class. A function declaration merely has 6 of these: the start and end of the declaration, the location of the name identifier, the location of the type specifier, the location after the template stuff, and the location of the right brace. I would go in more detail, but I haven't yet catalogued all of the ways to get this information to be sure that the method names actually mean what they claim to mean.

No comments: