Programming Assignment 1:
Lexical Analyzer
Due Time: 11:59 AM (EST) on February 10, 2008
You may use Linux, Cygwin (Windows), UNIX, or any other operating system where LEX is supported. However, before you submit your assignment, you are required to run your program on the SunOS operating system so that we can grade your homework easily. Use an ssh client to connect to csy[01-12].cs.wmich.edu in order to use the SunOS system.
I. Problem Description
In the four CS5810 projects, you will have an opportunity to build a compiler for the language Bronco#, which is similar to C. In this first programming assignment, you will find all tokens and their corresponding lexemes. All tokens are based on the following description.
1.1 Keywords
"int"
"float"
"short"
"record"
"array"
"type"
"pointer"
"enum"
"void"
"new"
"end"
"while"
"if"
"of"
"else"
"print"
"ref"
"function"
"main"
"println"
"return"
The output for the keywords should be like (keyword_if, ) if the keyword if is read.
1.2 Lexical Elements
Your scanner will also be required to recognize several lexical elements and return their token names. Among these are:
1. INT_CONST: All non-negative integers. (No need to support scientific or hex notation.)
2. IDENTIFIER: All valid C variables/identifiers
3. REAL_CONST: All real numbers
4. STRING: Any (double) quoted sequence of characters
The output for the lexical elements should be like (IDENTIFIER, abc) if abc is read.
1.3 Symbols
Your scanner will also be required to recognize the following symbols:
";": token category SEMI_COLON
",": token category COMMA
"-": token category MINUS
"+": token category PLUS
"*": token category STAR
"!": token category EXCLAMA
"||": token category OR
"..": token category TO
"!=": token category NOTEQUAL
"&&": token category AND
"->": token category MEMBER_OF_POINT2
">=": token category GE
"<=": token category LE
">": token category GT
"<": token category LT
"==": token category EQ
"/": token category DIVISION
".": token category DOT
"=": token category ASSIGN_EQ
"(": token category LEFT_PARAN
")": token category RIGHT_PARAN
"{": token category LEFT_CUR_BRACE
"}": token category RIGHT_CUR_BRACE
"[": token category LEFT_SQ_BRKT
"]": token category RIGHT_SQ_BRKT
The output for the symbols should be like (OR, ||) if || is read.
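The three output formats described so far (keywords, lexical elements, and symbols) share one shape: a token name, a comma, and the lexeme, all inside parentheses. The following is a minimal sketch in C of helpers that a Lex action could call to build such a line. The function names format_token and format_keyword are hypothetical, and the sketch assumes that every keyword follows the keyword_if pattern shown above, with an empty lexeme slot.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "(NAME, lexeme)" into out.
   Returns the value of snprintf, i.e. the length the full text needs. */
int format_token(char *out, size_t size, const char *name, const char *lexeme)
{
    assert(out != NULL && name != NULL && lexeme != NULL);
    return snprintf(out, size, "(%s, %s)", name, lexeme);
}

/* Hypothetical helper for keywords. Assumption: the token name is
   "keyword_" followed by the keyword itself and the lexeme slot stays
   empty, as in "(keyword_if, )". */
int format_keyword(char *out, size_t size, const char *keyword)
{
    assert(out != NULL && keyword != NULL);
    return snprintf(out, size, "(keyword_%s, )", keyword);
}
```

In a Lex rule, the action for "||" might call format_token(buf, sizeof buf, "OR", yytext) and then print buf with puts. Keeping the formatting in one place makes it easier to match the sample output file exactly.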
1.4 Comments
There are two kinds of comments in Bronco#.
1. The line comment starts with // and ends at the end of the line.
2. The multi-line comment starts with /* and ends with */. Any characters between them, including newlines, should be ignored.
The output for a comment is the pair (COMMENT, content_of_comment).
1.5 Other Requirements
1. You should print out the total number of lines in the input file at the end of your output.
2. You should use patterns to consume the input string rather than having your program reanalyze it. (This is how you must find any comments in an input file.)
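Requirement 1 interacts with multi-line comments: when a single pattern matches an entire /* ... */ comment, the newlines inside it never reach a separate newline rule, so the action for that pattern has to add them to the running total. The following is a minimal sketch in C of a helper for this, meant to be called from a Lex action with the standard yytext and yyleng variables. The name count_newlines is hypothetical, and the sketch assumes that the line total is kept in a counter of your own rather than by a scanner facility.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the number of '\n' characters in the
   first len bytes of text. */
int count_newlines(const char *text, size_t len)
{
    size_t i;
    int n = 0;

    assert(text != NULL);
    for (i = 0; i < len; i++) {
        if (text[i] == '\n')
            n++;
    }
    return n;
}
```

An action for the multi-line comment pattern might then do something like line_count += count_newlines(yytext, yyleng), where line_count is a hypothetical global counter. How to count a final line that lacks a trailing newline is not stated in this handout, so compare your total against the sample output file.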
II. Testing Data:
1. This is the sample file and here is the corresponding output file.
2. Additional test files: p1_02.ds, p1_04.ds, and p1_05.ds.
III. Hand In:
Send your source file (*.l) to Bilal Abubark at bilal.abubakr@wmich.edu, and cc zijiang.yang@wmich.edu, with the subject line: Course Number:PA1: Your Name.