Using Flex and a Bison GLR parser in the Facebook Hackercup

I thought the following problem statement sounded like an interesting parsing problem. What makes it interesting is that you need to generate a parser of a perhaps less-well known type, namely a GLR-parser (“generalized left-to-right parser”). Of course this isn’t as efficient as the intended solution, but maybe it will come in handy some day, in a situation that can’t be solved with a simple algorithm.

I’ll assume that you’re familiar with the basics of lex/flex and yacc/bison, as well as the general concepts around lexical analysis, grammars and parsers.

Here’s an excerpt of the problem statement:

Balanced Smileys

Your friend John uses a lot of emoticons when you talk to him on Messenger. In addition to being a person who likes to express himself through emoticons, he hates unbalanced parenthesis so much that it makes him go 🙁

Sometimes he puts emoticons within parentheses, and you find it hard to tell if a parenthesis really is a parenthesis or part of an emoticon.

A message has balanced parentheses if it consists of one of the following:

  1. An empty string “”
  2. One or more of the following characters: ‘a’ to ‘z’, ‘ ‘ (a space) or ‘:’ (a colon)
  3. An open parenthesis ‘(‘, followed by a message with balanced parentheses, followed by a close parenthesis ‘)’.
  4. A message with balanced parentheses followed by another message with balanced parentheses.
  5. A smiley face “:)” or a frowny face “:(“

Write a program that determines if there is a way to interpret his message while leaving the parentheses balanced.

[…]

Example input:

5
:((
i am sick today (:()
(:)
hacker cup: started :):)
)(

[Source: Facebook hackercup qualification round, problem 2]

Basically, we need to look at a string, pick out tokens, and then determine whether or not it conforms to the rules. Sounds easy, right?

First, we need to write some regular expressions to tokenise the strings. The rules in the problem statement tell you reasonably well what tokens you need – the only trick is to make the colon as well as the left (open) and right (close) brackets separate tokens. That way you can let the parser generated by Bison determine what should be a smiley and what not.

Here are my regular expression to token mappings for the lexical analyser:

[0-9]+   { return NUMBER; }
[a-z ]+  { return LETTERS; }
"("      { return OPEN_PAREN; }
")"      { return CLOSE_PAREN; }
[\r\n]+  { return EOL; }
":"      { return COLON; }
[ \t]+   { }
<<EOF>>  { return 0; }

[extract from lex.l]

As you can see, this part is pretty straightforward. I could’ve ignored the numbers, as they shouldn’t appear in the message anyway.

Now you need to define the allowed grammar. Here’s what I came up with. There’s probably some room for improvement, but it was a competition 🙂

line: messages EOL  {  }
      | EOL

/* atomics */
text : LETTERS
     | COLON

smile : COLON CLOSE_PAREN

frown : COLON OPEN_PAREN

messages: balanced_message
      | messages balanced_message

balanced_message: OPEN_PAREN messages CLOSE_PAREN
      | OPEN_PAREN CLOSE_PAREN /* empty */
      | smile
      | frown
      | text

[extract from grammar.y]

These grammar rules follow mostly from the rules in the problem statement:

  • The first statement says, that a ‘line’ can either consist of some messages, or an empty string (rule #1)
  • I used an intermediate rule called ‘text’ to group letters and colons (used in rule #2)
  • Next we define what smilies look like.
  • The ‘messages’ rule basically just says that a message can consist of one or more balanced messages. (rule #4).
  • The last rule defines what a balanced message actually looks like. It combines rules #2, #3 and #5.

Now, the twist in this problem is that we need to check “if there is a way to interpret his message while leaving the parentheses balanced”. With ‘a way’, what Facebook really means is ‘any way’. The problem with this is that a given input string may not parse uniquely. The following is a good example:

:( :)

Do you see two smilies, or do you see some text with some more text in brackets? (remember that a colon is classified as text)

Bison-generated parsers normally really don’t like this: imagine the parser walking through the double smiley example. With a normal LR(1) parser, the following would happen: after reading the first colon, we can either shift that colon onto the stack, or reduce it to text. We don’t know what the right solution is, so we look ahead 1 character, and find a ‘(‘. Now we have even more options, as we can reduce the ‘:(‘ to a ‘frown’ smiley, or reduce the colon to ‘text’ and shift the ‘(‘ onto the stack, hoping for a closing bracket.

With a GLR parser, Bison will keep track of all of these options. Here’s the idea:

A simple solution to this problem is to declare the parser to use the GLR algorithm. When the GLR parser reaches the critical state, it merely splits into two branches and pursues both syntax rules simultaneously. Sooner or later, one of them runs into a parsing error. […] . So one of the branches fails silently, and the other one continues normally, performing all the intermediate actions that were postponed during the split.

[Quote from the Bison manual page on the GLR-parser]

The nice thing is, that using the GLR parser merely requires you to specify the appropriate option, ‘%glr-parser’. It’s probably worthwhile saying that you of course can’t expect the same performance characteristics from a GLR parser as from a LR(1) parser (see the man page for details).

Anyway, with the GLR-parser option my Bison-generated parser is capable of actually parsing the text, and correctly determining whether or not the input conforms to the grammar. The only remaining problem is, that Bison doesn’t know which of two possible interpretations to choose, and returns an error, stating that it detected an ambiguity. For the problem at hand that’s not actually a problem, since we only care whether or not there exists a valid interpretation (i.e. one that parses correctly), and it doesn’t matter if there’s more than one valid possible interpretation. I “fixed” this by making the error function tell the rest of the code whether there was a real error, or an ambiguity error. If there was an ambiguity error, we could ignore it, and pass the validation test.

Now there’s just some input/output processing code left to write. If you turn on yydebug, here’s what you get for the above example:

% ./solver
1
:( :)
Ambiguity detected.
Option 1,
  messages -> 
    messages -> 
      balanced_message -> 
        text -> 
          COLON 
    balanced_message -> 
      OPEN_PAREN 
      messages -> 
        messages -> 
          balanced_message -> 
            text -> 
              LETTERS 
        balanced_message -> 
          text -> 
            COLON 
      CLOSE_PAREN 

Option 2,
  messages -> 
    messages -> 
      messages -> 
        balanced_message -> 
          frown -> 
            COLON 
            OPEN_PAREN 
      balanced_message -> 
        text -> 
          LETTERS 
    balanced_message -> 
      smile -> 
        COLON 
        CLOSE_PAREN 

Case #1: YES

Result

Overall, the whole solution was 148 lines (including comments & whitespace), and I liked the fact that it was a pretty straightforward implementation: the bulk of the intelligence sits in the syntax definition and the grammar rules, which follow from the problem statement.

If you’re interested in the full code, it’s on github! 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.