Parsing String Interpolations with ANTLR4


ANTLR is probably the best-known parser generator out there, and for good reason. It’s a powerful ALL(*) (adaptive LL) parser generator that also generates listener and visitor classes, and it can target a multitude of programming languages. Combine this with ANTLR4’s special features, like lexer modes and lexer/parser actions, and building DSLs and compilers is a lot faster, because far fewer hours go into writing parser boilerplate.

Many languages include support for string interpolation, where expressions are embedded within string literals; ultimately, it is fancy sugar over string concatenation. Check out this example from Dart:

	var myStr = "Hello, $world! The time is ${new DateTime.now()}.";    

There are several tokens here, just from the opening " to the closing one:

  • " – Opens a string literal. Here, we change the context of the lexer, because tokens within a string literal don’t look like the rest of the program.
  • Hello, – A regular part of a string. It’s plain text.
  • $world – A special interpolation that inserts the value of a specific symbol at runtime, in this case world.
  • ! The time is – Another plaintext part.
  • ${ – In Dart strings, you can wrap expressions within ${} inside of a string literal. We change the lexer’s context again here, so that we can temporarily scan regular tokens, before going back to string mode after reaching }.
  • new DateTime.now() – These are multiple tokens. In our parser, we’d probably match these to a rule named expr, or something similar.
  • } – Ends the string interpolation. Note: By the time the scanner reaches this, we’re in our default, non-string mode, so the logic for handling } also has to exist in the default mode.
  • . – More plain text.
  • " – Ends the string literal. Returns the lexer mode to whatever it was previously.

If you’ve ever written a templating language, or another language with so-called “island grammars,” then you’re probably familiar with lexers having to change which sorts of patterns they recognize, depending on where they are in the program. For example, PHP files are scanned as HTML, but once <?php is encountered, text is scanned as PHP code.

In the same manner, ANTLR supports lexer modes. ANTLR lexers use a LIFO stack for handling mode changes, so they can be used even for nesting syntax from different island grammars within each other. For example, the following is valid Dart:

  var myFancierStr = "${"${"${"inception"}"}"}";  

By using the pushMode and popMode lexer commands, we can achieve the same thing:

  // SomeLexer.g4
  lexer grammar SomeLexer; // Modes are only allowed in lexer grammars.

  // We're in the default mode; define our program tokens
  WS: [ \n\r\t]+ -> skip;
  CURLY_R: '}' -> popMode; // When we see this, revert to the previous context.
  OPEN_STRING: '"' -> pushMode(STRING); // Switch context
  ID: [A-Za-z_][A-Za-z0-9_]*;

  // Define rules on how tokens are recognized within a string.
  // Note that complex escapes, like Unicode, are not illustrated here.
  mode STRING;
  ENTER_EXPR_INTERP: '${' -> pushMode(DEFAULT_MODE); // When we see this, start parsing program tokens.
  ID_INTERP: '$' [A-Za-z_][A-Za-z0-9_]*;
  ESCAPED_DOLLAR: '\\$';
  ESCAPED_QUOTE: '\\"';
  TEXT: ~('$'|'\n'|'"')+; // This doesn't cover escapes, FYI.
  CLOSE_STRING: '"' -> popMode; // Revert to the previous mode; our string is closed.

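Note: Even though " is the same symbol, there are separate rules for it in DEFAULT_MODE and in STRING mode. It has to be this way, because a lexer rule is only active in the mode where it is defined, and the quotation mark means opposite things in the two modes: it opens a string in the default mode, and closes one in STRING mode.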

In your parser, you’ll probably have a string rule; it won’t handle the string as a large blob, but as a set of parts:

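  parser grammar SomeParser;
  options { tokenVocab = SomeLexer; } // Use the tokens from SomeLexer.g4.

  string: OPEN_STRING stringPart* CLOSE_STRING;

  // expr is assumed to be defined elsewhere in this grammar.
  stringPart:
    TEXT #TextStringPart
    | ID_INTERP #IdInterpPart
    | ENTER_EXPR_INTERP expr CURLY_R #ExprInterpPart
  ;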

Note: ENTER_EXPR_INTERP and CURLY_R are defined in different lexer modes, but are combined in the same parser rule. If the rule didn’t tell the parser to expect a CURLY_R, the parser wouldn’t look for one; the } that ends the interpolation would then be an unexpected token, and every expression interpolation would produce a syntax error.

The other thing to look out for is that whenever the lexer encounters } in non-string mode, it will pop the lexer’s mode. You probably don’t always want this, especially in a C-style language, where { and } delimit blocks and always appear in pairs.

The fix is simple – when the lexer encounters a left brace, push the DEFAULT_MODE. This way, if you were already in the default mode, you’ll remain in it after scanning the next right brace.
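
As a rough sketch, the brace fix might look like this in the default-mode section of the lexer. The rule name CURLY_L is made up for illustration; the rest reuses the CURLY_R rule from the lexer grammar above.

```antlr
// Default-mode rules in SomeLexer.g4 (sketch; CURLY_L is a hypothetical name)

// Every ordinary '{' pushes DEFAULT_MODE onto the mode stack, so the
// '}' that closes it pops back into DEFAULT_MODE rather than into STRING.
CURLY_L: '{' -> pushMode(DEFAULT_MODE);

// Pops whichever push is on top of the stack: the one made by a plain '{',
// or the one made by '${' inside a string.
CURLY_R: '}' -> popMode;
```

With this in place, a block like { ... } inside a ${ } interpolation leaves the stack exactly as it found it, and only the final } returns the lexer to STRING mode. One caveat: a stray } with nothing on the mode stack still triggers popMode, and in the Java runtime, as far as I know, popping an empty mode stack throws an exception, so you may want to guard against unbalanced braces.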


Thanks for reading! Hope it helps someone out there.