2 — Parsing

Friday, 10 January 2020

You might be wondering how information gets turned into a data representation.

In Fundamentals I, most of your programs use big-bang to turn information such as clock ticks, keystrokes, or mouse events into data (and the to-draw clause turns data into information). But this is all pretty shallow information. For programming languages, we need more.

Recall from Fundamentals I that batch programs always have the following shape:
#lang racket
; EFFECT read from STDIN, compute, write to STDOUT
(define (main)
  ...)
And interpreters are batch programs.

In a sense, the "parsing" problem was solved in the 1960s and 1970s with the complete treatment of context-free grammars and the automatic translation of such grammars into parsers, i.e., functions that map textually represented information to internal data. In another sense, it is still an open problem: recent research conferences still publish papers on generating good error messages with automatically generated parsers.

Here is one way of looking at this simple program:
  1. Information has a concrete representation, namely JSON.

  2. Reading transforms information into its data representation.

  3. Writing renders a data representation as information.

In the context of programming languages, step 2 is called parsing. Technically, the transformation consists of two steps:
  • lexing or tokenization, and

  • parsing proper
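
To make the first of these two steps concrete, here is a minimal sketch of a tokenizer. The infix input syntax and the name tokenize are assumptions made for illustration only; the languages in these notes rely on a JSON reader for this step.

```racket
#lang racket
;; String -> [Listof String]
;; split a string into number, operator, and parenthesis tokens
;; (hypothetical helper; whitespace and any other characters are skipped)
(define (tokenize str)
  (regexp-match* #px"[0-9]+|[+*()]" str))

(tokenize "12 + (3 * 4)") ; a list of seven tokens, starting with "12"
```

Parsing proper then works on the list of tokens rather than on individual characters.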

From the perspective of understanding the Principles Of Programming Languages, parsing is boring. So we cover its very essence and leave the details to Compilers, where they belong.

Why Parse?

Imagine an extension of the AE language that comes with commands, say, launchNuke, which, as its name says, launches a nuke and then returns 0. If the programmer writes the following,


an interpreter gets to launchNuke before it notices the ill-shaped expression.

Therefore we parse first.
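
The danger can be sketched with a small interpreter that works directly on unparsed input. Everything in this sketch is an assumption made for illustration: the representation of the command as the string "launchNuke", the names interpret-unparsed and effects, and the ill-shaped input itself. The "launch" is merely recorded in a list.

```racket
#lang racket
;; Hypothetical sketch: interpreting JSON data without parsing it first.
(define effects '())                 ; records effects instead of performing them
(define (launch-nuke!)
  (set! effects (cons 'launched effects))
  0)

;; JSexpr -> Integer
;; EFFECT may record a launch; raises an error on ill-shaped input
(define (interpret-unparsed js)
  (match js
    [(? exact-integer? n) n]
    ["launchNuke" (launch-nuke!)]
    [(list l "+" r) (+ (interpret-unparsed l) (interpret-unparsed r))]
    [(list l "*" r) (* (interpret-unparsed l) (interpret-unparsed r))]
    [_ (error 'interpret "ill-shaped expression: ~e" js)]))

;; the right operand is ill-shaped, but the left operand is evaluated first
(define bad '("launchNuke" "+" (1 "+")))
(define outcome
  (with-handlers ([exn:fail? (lambda (e) 'error)])
    (interpret-unparsed bad)))
```

After running this, outcome is 'error and effects contains 'launched: the error is discovered only after the effect has happened.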

What is Parsing

In essence, a parser is a program that determines whether a given piece of data represents a program in a subject language.

Here is the JSON subset we wish to use as our language of arithmetic expressions:

    AEJ = Integer || [AEJ,"+",AEJ] || [AEJ,"*",AEJ]

If our AEJ programmer writes correct programs, we represent them as the following form of data in Racket:
(struct node [op left right] #:transparent)
; AE  = Integer || (node + AE AE) || (node * AE AE)
In the context of programming languages, a data representation for programs is generally called an abstract syntax tree or just AST.

While we can leave the work of distinguishing JSON from other forms of information to a JSON reader, AEJ programmers may use all of JSON to make mistakes, so we need to know how JSON is represented in Racket:

; JSexpr   = Boolean | Number | String | [Listof JSexpr] | Hash

It is irrelevant how JSON information is translated into this JSON data representation; for our exercise in "essential parsing" we can use this representation.

So in this context, a parser has this signature:
; JSexpr -> Boolean
; determine whether this JSexpr represents an AE
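
A minimal sketch of a function with this signature might look as follows. The name aej? is made up for this sketch, and the code assumes that JSON arrays are represented as Racket lists.

```racket
#lang racket
;; JSexpr -> Boolean
;; determine whether this JSexpr represents an AE
;; (hypothetical name; not taken from the course material)
(define (aej? a-js)
  (match a-js
    [(? integer?) #true]
    [(list left (or "+" "*") right) (and (aej? left) (aej? right))]
    [_ #false]))

(aej? '(1 "+" (2 "*" 3)))  ; a well-formed AEJ
(aej? '(1 "+"))            ; an operand is missing
(aej? "ouch!")             ; a String is not an AEJ
```

Such a recognizer answers only yes or no; it says nothing about where an ill-formed input goes wrong.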

What is Parsing, Again

Your IDEs use the parser of the subject language to color the syntax, highlight violations of the grammar rules, and even for name completion. Returning a Boolean won’t do.

What we really want is a piece of data that is either a representation of a subject program or similar to one but contains a marker where the parser failed. Since parsing is a boring Compilers subject, we use plain Strings to indicate where errors occur because (1) they do not occur inside of AE expressions and (2) they can carry information that an IDE could use.

; AE  = Integer || (node + AE AE) || (node * AE AE)
; AE-error = Integer || (node + AE-error AE-error) || (node * AE-error AE-error)
;          || String

Okay so here is the real signature of a parser:

; JSexpr -> AE-error
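
Because errors are plain Strings embedded in the tree, a client such as an IDE can recover them with an ordinary structural traversal. The following is a minimal sketch; the name collect-errors is an assumption made for illustration.

```racket
#lang racket
(struct node [op left right] #:transparent)

;; AE-error -> [Listof String]
;; gather all error markers in the given tree, from left to right
;; (hypothetical helper; not part of the course material)
(define (collect-errors ae)
  (match ae
    [(? integer?) '()]
    [(? string? msg) (list msg)]
    [(node _ left right)
     (append (collect-errors left) (collect-errors right))]))

(define with-errors
  (node * (node + 1 "Hash is not an AEJ") "String is not an AEJ"))
(collect-errors with-errors)
```

An empty result means that the tree is a plain AE.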

How to Parse

Turns out, we need a new design recipe. The organization of a parser must be driven by the desired output, not the given input.


  #lang racket
  ;; external representation 
  #; {AEJ = Integer || [AEJ,"+",AEJ] || [AEJ,"*",AEJ]}
  ;; internal representation 
  (struct node [op left right] #:transparent)
  #; {AE  = Integer || (node + AE AE) || (node * AE AE)}
  ;; data examples
  (define js-ex1 42)
  (define ae-ex1 42)
  (define js-ex2 '(1 "+" 1))
  (define ae-ex2 (node + 1 1))
  ;; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  ;; maximal external representation (all of JSON)
  #; {JSexpr = Boolean || Number || String  || Hash || [Listof JSexpr]}
  (define js-ex3  "ouch!")
  (define js-ex4 (make-immutable-hash '((plus . js))))
  ;; internal representation with parse errors 
  #; {AE-error = Integer || (node O AE-error AE-error) || String}
  #; {O = + || *}
  ;; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  (module+ test (require rackunit))
  ;; parser
  ;; JSexpr -> AE-error 
  (module+ test
    (check-equal? (parse js-ex1) ae-ex1)
    (check-equal? (parse js-ex2) ae-ex2)
    (check-equal? (parse js-ex3) "String is not an AEJ")
    (check-equal? (parse js-ex4) "Hash is not an AEJ"))
  (define (parse a-js)
    (cond
      [(boolean? a-js) "Boolean is not an AEJ"]
      [(number? a-js)  (if (integer? a-js)
                           a-js
                           "this Number is not an AEJ")]
      [(string? a-js)  "String is not an AEJ"]
      [(hash? a-js) "Hash is not an AEJ"]
      [(list? a-js)
       (match a-js
         [(list left "+" right) (node + (parse left) (parse right))]
         [(list left "*" right) (node * (parse left) (parse right))]
         [_ "this Array is not an AEJ"])]))
  (module+ test
    (define js-ex5 (list (list 1 "+" js-ex4) "*" js-ex3))
    (define ae-ex5
      (node *
        (node + 1 "Hash is not an AEJ")
        "String is not an AEJ"))
    (check-equal? (parse js-ex5) ae-ex5))

Figure 5: An AE Parser

The exact shape of your parser depends on your JSON reader and the choice of internal JSON representation in your meta-language, but it will pretty much look like the above.
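
As one possible choice of reader, Racket's json library provides string->jsexpr. The sketch below only shows how it represents some JSON values; whether you use this library is up to you, and the names js-array and js-object are made up for this example.

```racket
#lang racket
(require json)

;; JSON arrays become lists, JSON strings become Strings,
;; JSON integers become exact integers, JSON objects become hash tables
(define js-array  (string->jsexpr "[1, \"+\", [2, \"*\", 3]]"))
(define js-object (string->jsexpr "{\"plus\": 1}"))
```

With this reader, js-array has the shape that the list? clause of a parser like the one in Figure 5 expects, while js-object lands in the hash? clause.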


We now consider parsing a closed subject—except for homework.

When we wish to study a new linguistic construct, we show the concrete (occasionally called "surface") syntax and the extension of the abstract syntax representation.

So suppose we wish to move from dealing with the rather boring Beginning Student Language to the oh so exciting Intermediate Student Language (ISL), specifically the concept of variables in programming languages. For now, we deal with locally defined variables, that is, variable definitions that have a delimited range of validity. In ISL we use local for these things:
(local ((define x 5))
  (+ x x))

Here is how we introduce this into our JSON language:

    VEJ = Integer | [VEJ,"+",VEJ] | [VEJ,"*",VEJ]

        | ["let",String,VEJ,VEJ] | String

Like in ISL we need two new concepts: a variable definition and a reference to variables. In JSON it is natural to represent variables with strings. Everything else should look strangely familiar. Using this JSON subset the ISL example is written as

    ["let", "x", 5, ["x", "+","x"]]

Stop! Make up an example with nested local variable declarations.

The abstract syntax representation also needs two extensions: one for declarations, which are complex, and one for variable references, which are just names:
(struct node [op left right] #:transparent)
(struct decl [variable value scope] #:transparent)
; VE  = Integer | (node + VE VE) | (node * VE VE)
;     | (decl String VE VE) | String
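
For instance, the JSON example from above would be represented as follows. This is a sketch; the names vej-ex and ve-ex are made up, and the JSON array is written as a Racket list.

```racket
#lang racket
(struct node [op left right] #:transparent)
(struct decl [variable value scope] #:transparent)

;; JSON: ["let", "x", 5, ["x", "+", "x"]]
(define vej-ex '("let" "x" 5 ("x" "+" "x")))
;; the corresponding VE: a declaration whose scope refers to x twice
(define ve-ex (decl "x" 5 (node + "x" "x")))
```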

Changing the parser is straightforward. Here is the essence:
; parser
; JSexpr -> VE-error
(define (parse a-js)
  (cond
    [(boolean? a-js)
     "Boolean is not an VEJ"]
    [(number? a-js)
     (if (integer? a-js) a-js "this Number is not an VEJ")]
    [(string? a-js) a-js] ; a String is a variable reference
    [(list? a-js)
     (match a-js
       [`(,left "+" ,right)
        (node + (parse left) (parse right))]
       [`(,left "*" ,right)
        (node * (parse left) (parse right))]
       [`("let" ,(? string? x) ,value ,scope)
        (decl x (parse value) (parse scope))]
       [_ "this Array is not an VEJ"])]
    [(hash? a-js) "Hash is not an VEJ"]))
Stop! Make up input-output examples, following the above recipe.