How to Describe The Syntax of a Programming Language

Syntax

What will be covered?

Compilers vs Interpreters
Sequence of building an executable program
Phases of Compilation

Lexical Analysis - Regular Expressions

Syntactic Analysis - Extended Backus-Naur Form (EBNF)

Recursive Descent Parser vs Shift Reduce Parser
What does it mean? Semantics

Compilers vs Interpreters

Translators:

Convert source code to source code

Original C++ --> C
Java --> C#

Compilers:

Translate source code written in a higher-level language into object code or machine language (an executable program)

C++, C, Java (both)

Virtual Machines

Pascal, Smalltalk, Java, .Net

Executable program + Input data ==> Output data

Interpreters:

Translates and executes source code

Lisp, Basic, Prolog, Java (both)

Source code + Input data ==> Output data

What are the trade offs?

User's view of a Classical Sequence for Running a Program

Edit
Compiler

usually creates object files

Linker (link editor)

creates executable file

Loader

Phases of Compilation

Front end vs back end
Symbol Table

Mapping of identifiers (name) to attributes
A hash table is the usual implementation data structure
Used by the various phases
Used to check static semantic errors
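A minimal sketch of a symbol table in Python, assuming a single flat scope. The names `SymbolTable`, `declare`, and `lookup` are hypothetical and chosen only for illustration; a real compiler would also track scopes and more attributes.

```python
class SymbolTable:
    """Maps identifier names to attributes, backed by a dict (a hash table)."""

    def __init__(self):
        self._entries = {}

    def declare(self, name, type_name):
        # Redeclaring a name in the same scope is a static semantic error.
        if name in self._entries:
            raise NameError(f"'{name}' is already declared")
        self._entries[name] = {"type": type_name}

    def lookup(self, name):
        # Using an undeclared name is a static semantic error.
        if name not in self._entries:
            raise NameError(f"'{name}' is not declared")
        return self._entries[name]


table = SymbolTable()
table.declare("count", "int")
print(table.lookup("count")["type"])
```

Both errors are detected without running the program, which is why they are called static.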

Variations

Hiding steps

To build an executable, a.out, in gcc

gcc mainProgram.c

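Use switches to stop after intermediate steps

gcc mainProgram.c -S (stops after compilation and writes the assembly file mainProgram.s)
as mainProgram.s -o mainProgram.o (assembles to an object file)
ld is the link editor (linker) command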

Integrated Development Environments
Delayed Linking

Link time vs runtime.

Specification of a Programming Language

Definition of

Language form: Formal Syntax

The rules governing the formation of statements in a programming language. (Dictionary.com)

Tokens (words of the program)
Context Free Grammar

Language meaning: Formal Semantics

The study of the relationships between various signs and symbols and what they represent. (Dictionary.com)

Lexical Analysis and Tokens

Tokens are basic building blocks of a program

keywords / reserved words -- examples
literals -- examples
variables

Regular expressions are used to specify tokens

Finite Grammars
Grammars are used to describe recursive patterns
flat data

Scanners

Lex is a widely used tool for generating scanners (http://dinosaur.compilertools.net/)
Selecting tokens

delimiters -- whitespace and look ahead - new tokens

Principle of longest substring
xtemp=ytemp
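The longest-substring principle can be sketched with Python's `re` module. The token classes below are assumptions made for this example, not the token set of any particular language. Python tries alternatives in order rather than picking the longest overall match, so the sketch relies on greedy quantifiers inside each token class to consume the longest possible lexeme.

```python
import re

# Hypothetical token classes for illustration; alternatives are tried in order.
TOKEN_SPEC = [
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER", r"[0-9]+"),
    ("ASSIGN", r"="),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(text):
    """Return (kind, lexeme) pairs; greedy matching keeps 'xtemp' as one token."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":  # whitespace only separates tokens
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("xtemp=ytemp"))
```

The scanner reads `xtemp` as a single identifier rather than as `x` followed by `temp`, and `=` acts as a delimiter even though no whitespace surrounds it.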

Syntactic Analysis and BNF

The legal organization of tokens into statements is described by a context free grammar (CFG).
BNF (Backus-Naur Form) is a metasyntax for expressing CFGs. It is the usual notation for a programming language's grammar.
Set of productions (also called rules)

<expr> ::= <expr> + <term>

| <term>

Terminal symbols

if

+

begin

Non terminal symbols

<statement>

<compilation-unit>

Example - Grammar for a Small Language

This is an example of a grammar that generates expressions with infix operators.
Expressions consist of operands and operators.

<program> ::= begin <stmt_list> end

<stmt_list> ::= <stmt>
| <stmt> ; <stmt_list>

<stmt> ::= <var> := <expression>

<var> ::= A
| B
| C

<expression> ::= <var> + <var>
| <var> - <var>
| <var>

Derivation of
begin A := A + B end

<program> => begin <stmt_list> end

=> begin <stmt> end

=> begin <var> := <expression> end

=> begin A := <expression> end

=> begin A := <var> + <var> end

=> begin A := A + <var> end

=> begin A := A + B end

This is a leftmost derivation.
Example of top down parsing.
Each string in the derivation is called a sentential form.
Example from Sebesta 96
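The derivation above can be reproduced mechanically. In this sketch the grammar is stored as a Python dict, and the hypothetical helper `leftmost_derivation` always expands the leftmost nonterminal, using a caller-supplied list that says which alternative to pick at each step.

```python
# Grammar for the small language; nonterminals are the dict keys.
GRAMMAR = {
    "<program>": [["begin", "<stmt_list>", "end"]],
    "<stmt_list>": [["<stmt>"], ["<stmt>", ";", "<stmt_list>"]],
    "<stmt>": [["<var>", ":=", "<expression>"]],
    "<var>": [["A"], ["B"], ["C"]],
    "<expression>": [["<var>", "+", "<var>"], ["<var>", "-", "<var>"], ["<var>"]],
}


def leftmost_derivation(start, choices):
    """Yield each sentential form, expanding the leftmost nonterminal every step.

    `choices` lists, in order, the index of the alternative to use at each step.
    """
    form = [start]
    yield " ".join(form)
    for choice in choices:
        index = next(i for i, symbol in enumerate(form) if symbol in GRAMMAR)
        form[index:index + 1] = GRAMMAR[form[index]][choice]
        yield " ".join(form)


steps = list(leftmost_derivation("<program>", [0, 0, 0, 0, 0, 0, 1]))
print("\n=> ".join(steps))
```

Every intermediate string in `steps` is a sentential form; the last one contains only terminals.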

Parse Tree is a Visualization of a Derivation
show the analysis

Abstract Syntax Tree

Can be produced directly by a Parser.
Shows only result
Eliminates redundant information.
Each node's degree depends on the arity of the operator.

Expression Tree for A := A + B

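Traversing the tree yields three notations:

preorder: prefix expression

postorder: postfix expression

inorder: infix expression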
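The three traversal orders of an expression tree can be sketched as follows. The `Node` class and the tree shape for `A := A + B` are assumptions made for this example, treating `:=` as a binary operator at the root.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def preorder(node):
    # Operator first, then operands: prefix notation.
    return [] if node is None else [node.label] + preorder(node.left) + preorder(node.right)


def inorder(node):
    # Left operand, operator, right operand: infix notation.
    return [] if node is None else inorder(node.left) + [node.label] + inorder(node.right)


def postorder(node):
    # Operands first, then operator: postfix notation.
    return [] if node is None else postorder(node.left) + postorder(node.right) + [node.label]


tree = Node(":=", Node("A"), Node("+", Node("A"), Node("B")))
print(" ".join(preorder(tree)))   # := A + A B
print(" ".join(inorder(tree)))    # A := A + B
print(" ".join(postorder(tree)))  # A A B + :=
```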

A Rule for <if-statement>

<stmt> ::= … | <if_stmt>

<if_stmt> ::= if <logic_expr> then <stmt>

| if <logic_expr> then <stmt> else <stmt>

What is the parse tree for the sentential form:

if <logic_expr> then if <logic_expr> then <stmt> else <stmt> ?

When two different parse trees exist for the same sentential form, the grammar is ambiguous

Sometimes languages are ambiguous

Why is this a problem?
How can this problem be handled?

C handles the ambiguity by adding a disambiguating rule:
each else matches the closest unmatched if

Modula-2 removes the ambiguity by changing the grammar:

<stmt> ::= if <logic_expr> then <stmt> end | if <logic_expr> then <stmt> else <stmt> end

There are 2 different derivations that derive different strings

<stmt> => if <logic_expr1> then <stmt> end

=> if <logic_expr1> then if <logic_expr2> then <stmt1> else <stmt2> end end

VS
<stmt> => if <logic_expr1> then <stmt> else <stmt2> end

=> if <logic_expr1> then if <logic_expr2> then <stmt1> end else <stmt2> end

A more general solution:

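<stmt> ::= <matched> | <unmatched>

<matched> ::= if <logic_expr> then <matched> else <matched>
| any non-if statement

<unmatched> ::= if <logic_expr> then <stmt>
| if <logic_expr> then <matched> else <unmatched>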

There is only one possible parse tree. Draw it for yourself.

Some languages are inherently ambiguous: every grammar that generates them is ambiguous.

{ aⁿbⁿcᵐdᵐ | n, m ≥ 1 } ∪ { aⁿbᵐcᵐdⁿ | n, m ≥ 1 }

Operators

What is an operator?

Operators are functions written with special symbols, e.g.

+ & ^

Can be

part of the grammar : C
defined in a library : Haskell

What should the order of symbols and variables be for an operator?

Most languages use infix notation

5 + 4

Forth uses postfix notation

5 4 +

Lisp uses prefix notation

+ 5 4

Infix notation

Infix notation needs associativity, precedence and parentheses to resolve ambiguities.
Which operator does "b" belong to?

a + b * c

Operator Precedence

Another ambiguous grammar: G1.

::=

a + b * c ??

Unambiguous equivalent grammar: G2

::=

a + b * c ??
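The difference between an ambiguous and an unambiguous expression grammar can be checked by counting parse trees. The two grammars in the docstrings below are assumptions: a typical ambiguous grammar and its usual layered rewrite, not necessarily the G1 and G2 of these notes. Both functions are hypothetical helpers that expect a well-formed token list alternating operands and operators.

```python
def g1_trees(tokens):
    """All parse trees under the assumed ambiguous grammar
    <expr> ::= <expr> + <expr> | <expr> * <expr> | id"""
    if len(tokens) == 1:
        return [tokens[0]]
    trees = []
    for i in range(1, len(tokens) - 1, 2):  # operators sit at odd positions
        for left in g1_trees(tokens[:i]):
            for right in g1_trees(tokens[i + 1:]):
                trees.append((tokens[i], left, right))
    return trees


def g2_trees(tokens):
    """All parse trees under the assumed layered grammar
    <expr> ::= <expr> + <term> | <term>
    <term> ::= <term> * id | id"""

    def term(ts):
        if len(ts) == 1:
            return [ts[0]]
        if len(ts) >= 3 and ts[-2] == "*":
            return [("*", left, ts[-1]) for left in term(ts[:-2])]
        return []

    trees = list(term(tokens))
    for i in range(1, len(tokens) - 1, 2):
        if tokens[i] == "+":
            for left in g2_trees(tokens[:i]):
                for right in term(tokens[i + 1:]):
                    trees.append(("+", left, right))
    return trees


tokens = ["a", "+", "b", "*", "c"]
print(g1_trees(tokens))  # two trees: b can belong to + or to *
print(g2_trees(tokens))  # one tree: * binds tighter than +
```

Introducing one nonterminal per precedence level is what forces `*` below `+` in the tree.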

What would you do to the grammar to add the '-' and '/' operators?

Associativity of Operators:

A := A + B + C

Don't confuse with (semantic) associative rules in mathematics
Consider the following grammar:

::=

A rule is left recursive if LHS appears at the beginning of RHS.
A left recursive rule yields left associativity.
Right recursive rule yields right associativity.

Note: Prefix and postfix notations do not produce ambiguous expressions and therefore precedence and associativity are unnecessary.
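A small postfix evaluator illustrates why no precedence or associativity rules are needed, assuming every operator takes exactly two operands. The function name `eval_postfix` and the operator set are choices made for this sketch.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def eval_postfix(text):
    """Evaluate a space-separated postfix expression with a stack."""
    stack = []
    for token in text.split():
        if token in OPS:
            right = stack.pop()  # the most recent operand is the right one
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            stack.append(float(token))
    return stack.pop()


print(eval_postfix("5 4 +"))      # 9.0
print(eval_postfix("2 3 4 * +"))  # 14.0, i.e. 2 + (3 * 4)
print(eval_postfix("2 3 + 4 *"))  # 20.0, i.e. (2 + 3) * 4
```

The order of the tokens alone decides which operands each operator receives, so the two groupings are written as two different strings.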

Evaluation policies

Expressions are evaluated by repeated application of rewrite rules
Each application of a rewrite rule is called a reduction
An expression to which a rewrite rule can be applied is a reducible expression (or redex)
Example: the redexes of

75/5 + 3*2

are

75/5 and 3*2

When an expression contains two or more redexes, the choice of the redex(es) to be reduced is governed by an evaluation policy

leftmost
rightmost
parallel
innermost
outermost

Can be specified in the language or left to the compiler writer
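The leftmost and rightmost policies can be compared on the example 75/5 + 3*2. This sketch makes several assumptions: expressions are nested tuples `(operator, left, right)`, a redex is an operator node whose operands are both numbers, and `redexes`, `reduce_at`, and `evaluate` are hypothetical helper names.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def redexes(expr, path=()):
    """Yield, left to right, the path to every redex in a nested-tuple expression."""
    if isinstance(expr, tuple):
        _, left, right = expr
        if isinstance(left, tuple) or isinstance(right, tuple):
            yield from redexes(left, path + (1,))
            yield from redexes(right, path + (2,))
        else:
            yield path  # both operands are numbers, so this node can be reduced


def reduce_at(expr, path):
    """Apply one rewrite step to the redex found at `path`."""
    if not path:
        op, left, right = expr
        return OPS[op](left, right)
    parts = list(expr)
    parts[path[0]] = reduce_at(expr[path[0]], path[1:])
    return tuple(parts)


def evaluate(expr, policy):
    """Reduce until a number remains; return every intermediate expression."""
    trace = [expr]
    while isinstance(expr, tuple):
        candidates = list(redexes(expr))
        path = candidates[0] if policy == "leftmost" else candidates[-1]
        expr = reduce_at(expr, path)
        trace.append(expr)
    return trace


expression = ("+", ("/", 75, 5), ("*", 3, 2))  # 75/5 + 3*2
for policy in ("leftmost", "rightmost"):
    print(policy, evaluate(expression, policy))
```

Both policies reach the same value here and differ only in the order of the intermediate steps; in a language with side effects the order can change the result.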

Operators of Common Languages

Parsing Techniques and Tools:

EBNF is an alternative grammar notation with shorter productions

Additional meta symbols:

repetitions (zero or more) are enclosed in { }

options are enclosed in [ ]

BNF :

expr := expr + term | term
The left recursive rule implies a left-associative operator
Unrolling: ((expr + term) + term) + term
((term + term) + term) + ... term

Converting to EBNF:
expr := term { + term }
is understood to mean a left-associative operator, whereas
expr := { term @ } term
is understood to mean a right-associative operator, which can also be written with an option:
expr := term [ @ expr ]

Recursive-descent parser (top down)

Converting the grammars to a program

non-terminals are function calls

"match" terminals

What is the problem converting the following BNF production?

expr := expr + term | term

Why is replacing with the following incorrect?

expr := term + expr | term

EBNF notation suggests a solution

expr := term { + term}

like a while loop

recursive descent parser code from "Programming Languages" by Louden Fig4_12.c
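A minimal recursive-descent sketch in Python for `expr := term { + term }`. It is not the Louden code referenced above. It assumes that a term is a single operand token, and the helper names `parse_expr`, `match`, `term`, and `expr` are chosen for illustration. Each nonterminal becomes a function, terminals are consumed by `match`, and the EBNF repetition becomes a while loop that builds a left-associative tree.

```python
def parse_expr(tokens):
    """Parse  expr := term { + term }  and return a left-nested tuple tree."""
    position = 0

    def match(expected):
        # Consume one terminal or report a syntax error.
        nonlocal position
        if position < len(tokens) and tokens[position] == expected:
            position += 1
        else:
            raise SyntaxError(f"expected {expected!r} at token {position}")

    def term():
        # Assumption for this sketch: a term is a single operand token.
        nonlocal position
        if position >= len(tokens) or tokens[position] == "+":
            raise SyntaxError(f"expected an operand at token {position}")
        token = tokens[position]
        position += 1
        return token

    def expr():
        tree = term()
        while position < len(tokens) and tokens[position] == "+":  # { + term }
            match("+")
            tree = ("+", tree, term())  # the earlier result becomes the left child
        return tree

    tree = expr()
    if position != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[position]!r}")
    return tree


print(parse_expr(["a", "+", "b", "+", "c"]))  # ('+', ('+', 'a', 'b'), 'c')
```

A function written directly from the left recursive rule `expr := expr + term` would call itself before consuming any input and never terminate, which is why the loop form is used.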

Shift-reduce parser (bottom up)

Yacc is a widely used tool for generating parsers (http://dinosaur.compilertools.net/)

Example of a Shift Reduce Parser
Example based on Kozen, Automata and Computability
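The shift and reduce moves can be sketched by hand for a tiny grammar. This is not the Kozen example and not the table-driven method that Yacc generates: it is a greedy sketch that assumes the grammar E ::= E + T | T, T ::= id, and that reducing whenever a right-hand side appears on top of the stack is safe for this grammar. The name `shift_reduce` is hypothetical.

```python
# Productions as (right-hand side, left-hand side); longer handles are tried first.
RULES = [
    (["E", "+", "T"], "E"),
    (["T"], "E"),
    (["id"], "T"),
]


def shift_reduce(tokens):
    """Return (accepted, trace) for a token list, working bottom up."""
    stack, remaining, trace = [], list(tokens), []
    while True:
        for rhs, lhs in RULES:
            if stack[-len(rhs):] == rhs:
                stack[-len(rhs):] = [lhs]  # reduce: replace the handle by its LHS
                trace.append(f"reduce {' '.join(rhs)} -> {lhs}")
                break
        else:
            if not remaining:
                break  # nothing to reduce and nothing left to shift
            stack.append(remaining.pop(0))  # shift the next input token
            trace.append(f"shift {stack[-1]}")
    return stack == ["E"], trace


accepted, trace = shift_reduce(["id", "+", "id"])
print(accepted)
print("\n".join(trace))
```

The input is accepted when the whole token list has been reduced to the start symbol E. A generated LR parser makes the same two kinds of moves but consults a table to decide between them.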

What does the code mean?

Language reference manual
Translators

Formal method

Operational Semantics

description based on the operation of an actual or hypothetical machine

Axiomatic Semantics

Used to show correctness
Apply predicate calculus rules to preconditions and statements to show that postconditions hold

Denotational Semantics

Uses mathematical functions on programs to specify semantics.
Programs are translated into functions about which properties can be proved using mathematical theory of functions.
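The denotational idea can be sketched for the small assignment language used earlier: each program is translated into a Python function from states to states. The state representation (a dict from variable names to numbers) and the function names are assumptions made for this example.

```python
def meaning_of_expression(expression):
    """Denotation of an expression: a function from a state to a number."""
    if len(expression) == 1:
        (name,) = expression
        return lambda state: state[name]
    left, op, right = expression
    if op == "+":
        return lambda state: state[left] + state[right]
    return lambda state: state[left] - state[right]


def meaning_of_statement(var, expression):
    """Denotation of  var := expression : a function from a state to a new state."""
    value_of = meaning_of_expression(expression)
    return lambda state: {**state, var: value_of(state)}


def meaning_of_program(statements):
    """Denotation of a statement list: the composition of the statement functions."""
    functions = [meaning_of_statement(var, expr) for var, expr in statements]

    def run(state):
        for function in functions:
            state = function(state)
        return state

    return run


program = meaning_of_program([("A", ("A", "+", "B"))])  # begin A := A + B end
print(program({"A": 1, "B": 2, "C": 0}))  # {'A': 3, 'B': 2, 'C': 0}
```

Because the meaning of a program is an ordinary function, properties such as "B and C are never changed by this program" can be argued with the usual reasoning about functions.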