2 — Parsing
Friday, 10 January 2020
You might be wondering how information gets turned into a data representation.
In Fundamentals I, most of your programs use big-bang to turn information such as clock ticks, keystrokes, or mouse events into data (and the to-draw clause turns data into information). But this is all pretty shallow information. For programming languages, we need more.
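#lang racket
(require json) ; provides read-json and write-json

; EFFECT read from STDIN, compute, write to STDOUT
; (compute stands for the JSexpr -> JSexpr function of interest; it is not defined here)
(define (main)
  (write-json (compute (read-json))))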
Information has a concrete representation, namely JSON.
Reading transforms information into its data representation.
Writing renders a data representation as information.
lexing or tokenization, and
parsing proper
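As a small sketch of reading and writing, assuming Racket's json library: string->jsexpr plays the role of reading and jsexpr->string the role of writing. The names sample-text, sample-data, and rendered are made up for illustration.

```racket
#lang racket
(require json)

;; information: JSON text as it might arrive on STDIN (made-up sample)
(define sample-text "[1, \"+\", [2, \"*\", 3]]")

;; reading: JSON text -> JSexpr; a JSON array becomes a Racket list
(define sample-data (string->jsexpr sample-text))

;; writing: JSexpr -> JSON text
(define rendered (jsexpr->string sample-data))
```

Reading the rendered text again yields the same data, which is one way to check that nothing is lost on the round trip.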
Why Parse?
[launchNuke,"+",[1,"*","ouch!"]]
Therefore we parse first.
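One possible reading of the example, sketched under assumptions: a program that evaluates unchecked JSON directly may perform an effect before it discovers the ill-formed part. Here launchNuke is written as a JSON string, and launch-nuke!, effects, naive-eval, bad-program, and failed? are hypothetical names; the "effect" merely records that it ran.

```racket
#lang racket

;; hypothetical stand-in for an irreversible effect; it only records that it ran
(define effects '())
(define (launch-nuke!)
  (set! effects (cons "nuke launched" effects))
  0)

;; JSexpr -> Integer
;; evaluates unchecked JSON directly, left to right
(define (naive-eval js)
  (match js
    ["launchNuke" (launch-nuke!)]
    [(? integer?) js]
    [(list left "+" right) (+ (naive-eval left) (naive-eval right))]
    [(list left "*" right) (* (naive-eval left) (naive-eval right))]
    [_ (error 'naive-eval "not an expression: ~e" js)]))

;; the example from above, with launchNuke written as a JSON string
(define bad-program '("launchNuke" "+" (1 "*" "ouch!")))

;; Boolean: did evaluation signal an error?
(define failed?
  (with-handlers ([exn:fail? (lambda (e) #t)])
    (naive-eval bad-program)
    #f))
```

In this sketch the evaluation fails on "ouch!", but only after the effect has already been recorded in effects.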
What is Parsing
In essence, a parser is a program that determines whether a given piece of data represents a program in a subject language.
AEJ = Integer || [AEJ,"+",AEJ] || [AEJ,"*",AEJ]
(struct node [op left right] #:transparent)
; AE = Integer || (node + AE AE) || (node * AE AE)
; JSexpr = Boolean | Number | String | [Listof JSexpr] | Hash
; JSexpr -> Boolean
; determine whether this JSexpr represents an AE
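A minimal sketch of a function with this signature, assuming JSON arrays are represented as Racket lists; the name aej? is made up for illustration.

```racket
#lang racket

;; JSexpr -> Boolean
;; determine whether this JSexpr represents an AE
(define (aej? a-js)
  (match a-js
    [(? integer?) #t]
    [(list left (or "+" "*") right) (and (aej? left) (aej? right))]
    [_ #f]))
```

For example, (aej? '(1 "+" (2 "*" 3))) holds, while (aej? '(1 "+" "ouch!")) does not. The answer says nothing about where the problem is.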
What is Parsing, Again
Your IDEs use the parser of the subject language to color the syntax, highlight violations of the grammar rules, and even for name completion. Returning a Boolean won’t do.
What we really want is a piece of data that is either a representation of a subject program or similar to one but with a marker where the parser failed. Since parsing is a boring Compilers subject, we use plain Strings to indicate where errors occur because (1) they do not occur inside of AE expressions and (2) they can carry information that an IDE could use.
; AE = Integer || (node + AE AE) || (node * AE AE)
; AE-error = Integer || (node + AE-error AE-error) || (node * AE-error AE-error)
;         || String
; JSexpr -> AE-error
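To illustrate how a consumer such as an IDE might use the String markers, here is a sketch that gathers all error messages from an AE-error. The names collect-errors and bad-ae are hypothetical.

```racket
#lang racket

(struct node [op left right] #:transparent)

;; AE-error -> [Listof String]
;; gather the error markers in left-to-right order
(define (collect-errors ae)
  (match ae
    [(? integer?) '()]
    [(? string? msg) (list msg)]
    [(node _ left right)
     (append (collect-errors left) (collect-errors right))]))

;; a made-up AE-error with two markers
(define bad-ae
  (node * (node + 1 "Hash is not an AEJ") "String is not an AEJ"))
```

An empty result means the AE-error is in fact an AE.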
How to Parse
Turns out, we need a new design recipe. The organization of a parser must be driven by the desired output, not the given input.
Lectures/2/parser.rkt
#lang racket

;; external representation
#; {AEJ = Integer || [AEJ,"+",AEJ] || [AEJ,"*",AEJ]}

;; internal representation
(struct node [op left right] #:transparent)
#; {AE = Integer || (node + AE AE) || (node * AE AE)}

;; data examples
(define js-ex1 42)
(define ae-ex1 42)

(define js-ex2 '(1 "+" 1))
(define ae-ex2 (node + 1 1))

;; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;; maximal external representation (all of JSON)
#; {JSexpr = Boolean || Number || String || Hash || [Listof JSexpr]}

(define js-ex3 "ouch!")
(define js-ex4 (make-immutable-hash '((plus . js))))

;; internal representation with parse errors
#; {AE-error = Integer || (node O AE-error AE-error) || String}
#; {O = + || *}

;; - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(module+ test
  (require rackunit))

;; parser
;; JSexpr -> AE-error
(module+ test
  (check-equal? (parse js-ex1) ae-ex1)
  (check-equal? (parse js-ex2) ae-ex2)
  (check-equal? (parse js-ex3) "String is not an AEJ")
  (check-equal? (parse js-ex4) "Hash is not an AEJ"))

(define (parse a-js)
  (cond
    [(boolean? a-js) "Boolean is not an AEJ"]
    [(number? a-js)
     (if (integer? a-js) a-js "this Number is not an AEJ")]
    [(string? a-js) "String is not an AEJ"]
    [(hash? a-js) "Hash is not an AEJ"]
    [(list? a-js)
     (match a-js
       [(list left "+" right) (node + (parse left) (parse right))]
       [(list left "*" right) (node * (parse left) (parse right))]
       [_ "this Array is not an AEJ"])]))

(module+ test
  (define js-ex5 (list (list 1 "+" js-ex4) "*" js-ex3))
  (define ae-ex5
    (node * (node + 1 "Hash is not an AEJ") "String is not an AEJ"))
  (check-equal? (parse js-ex5) ae-ex5))
The exact shape of your parser depends on your JSON reader and the choice of internal JSON representation in your meta-language, but it will pretty much look like the above.
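As a sketch of how the reader and the parser fit together, assuming Racket's json library as the reader: the parse function below is a condensed copy of the one above, and parse-json-text is a hypothetical helper name.

```racket
#lang racket
(require json)

(struct node [op left right] #:transparent)

;; JSexpr -> AE-error
;; condensed copy of the parser above, written with match only
(define (parse a-js)
  (match a-js
    [(? boolean?) "Boolean is not an AEJ"]
    [(? number?) (if (integer? a-js) a-js "this Number is not an AEJ")]
    [(? string?) "String is not an AEJ"]
    [(? hash?) "Hash is not an AEJ"]
    [(list left "+" right) (node + (parse left) (parse right))]
    [(list left "*" right) (node * (parse left) (parse right))]
    [(? list?) "this Array is not an AEJ"]))

;; String -> AE-error
;; read JSON text, then parse the resulting JSexpr
(define (parse-json-text text)
  (parse (string->jsexpr text)))
```

With this reader, a JSON array arrives as a list and a JSON object as a hash, so (parse-json-text "[1, \"+\", 1]") produces (node + 1 1). A different reader or representation would change the predicates but not the overall shape.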
Variables
We now consider parsing a closed subject—
When we wish to study a new linguistic construct, we show the
concrete—
VEJ = Integer | [VEJ,"+",VEJ] | [VEJ,"*",VEJ]
    | ["let",String,VEJ,VEJ] | String
["let", "x", 5, ["x", "+","x"]]
(struct node [op left right] #:transparent)
(struct decl [variable value scope] #:transparent)
; VE = Integer | (node + VE VE) | (node * VE VE)
;    | (decl String VE VE) | String
; parser
; JSexpr -> VE-error
(define (parse a-js)
  (cond
    [(boolean? a-js) "Boolean is not a VEJ"]
    [(number? a-js)
     (if (integer? a-js) a-js "this Number is not a VEJ")]
    [(string? a-js) a-js]
    [(list? a-js)
     (match a-js
       [`(,left "+" ,right) (node + (parse left) (parse right))]
       [`(,left "*" ,right) (node * (parse left) (parse right))]
       [`("let" ,(? string? x) ,value ,scope)
        (decl x (parse value) (parse scope))]
       [_ "this Array is not a VEJ"])]
    [(hash? a-js) "Hash is not a VEJ"]))
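To see the parser at work on the let example, here is a condensed sketch. The name parse-ve is hypothetical, and for brevity this variant collapses all error cases into one made-up message.

```racket
#lang racket

(struct node [op left right] #:transparent)
(struct decl [variable value scope] #:transparent)

;; JSexpr -> VE-error
;; condensed variant of the parser above with a single generic error marker
(define (parse-ve a-js)
  (match a-js
    [(? integer?) a-js]
    [(? string?) a-js] ; a String now stands for a variable
    [(list left "+" right) (node + (parse-ve left) (parse-ve right))]
    [(list left "*" right) (node * (parse-ve left) (parse-ve right))]
    [(list "let" (? string? x) value scope)
     (decl x (parse-ve value) (parse-ve scope))]
    [_ "not a VEJ"]))

;; the example ["let", "x", 5, ["x", "+", "x"]] as a JSexpr
(define let-ex '("let" "x" 5 ("x" "+" "x")))
```

Here (parse-ve let-ex) produces (decl "x" 5 (node + "x" "x")): the declared variable, its value, and the scope in which the variable may be used.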