Introduction To Lexical Analysis In 4 Simple Points

img
Ajay Ohri
Share

Introduction

The compilerโ€™s scanner unit is theย lexical analyzerย or the first step of Lexical Analysis. Itย converts the input programโ€™s high-level grammar into a token sequence in operation called tokenization. Thus as in the diagram shown below, the output of the compiler is a token-sequence sent to the parser for syntax correction and analysis in a process known asย lexical analysis in compiler designย implemented with finite deterministic automata.ย 

  1. What is lexical analysis?
  2. What is a token?
  3. Example of tokens
  4. Example of Non-Tokens

1) What is lexical analysis?

Lexical analysisย groups sounds, symbols, or token into a string of or sequence of tokens/units, forming a syntax that is meaningful and known as a parsing operation in linguistics. However, computer science calls this process parsing and/or tokenizing, depending on its use. The differences between the parser andย output of the lexical analyzerย are that the parser is used for the analysis of the syntax while the lexical analyzer scans the programโ€™s input.

The lexer issues tokens, whereas the parser provides a codeโ€™s abstract representation. The parser updates the symbol table entries, whereas the lexical analysis process inserts tokens into the symbol table. A parse tree is produced as the output of the parser, whereas the lexical analyzer provides lexical errors orย issues in the lexical analysisย as its output.

ย 2) What is a token?

The lexical token is the output of the lexical analyzer and is a string having a unit of the grammar used in the programming language represented as a sequence of characters. Theย role of the lexical analyzerย and the process adopted by theย Lexical Analyzerย is that it first tokenizes the program dividing it into valid tokens and removing all characters of white space. It also removes comments and helps in providing the lexical errors based on the column and row numbers of the input data. Take, for example, the below diagram.ย This lexical analyzer example identifies lexical errors using the grammar of C or C and the automation machine providing the column and row number of the errors.

ย If we pass a lexical analysis example statement like a= c b through a lexical analyzer, the output it generates is a token sequence that appears as id=id id; where each of the LHS and RHS terms represent the details of a variable in its symbol table.

Now take the program below to explain the process of lexical analysis.

int main()

{

ย ย // 2 variables

ย ย int b, a;

ย ย b = 20;

ย return 0;

}

The valid tokens that are generated afterย recognition of tokens are ‘main’, ‘int’,'{‘,'(‘ ย ‘)’,’int’, ‘b’, ‘a’, ‘,’, ‘;’,’b’,’=’,’20’,’;’, ‘return’, ‘0’,’;’ and’}’.ย 

Comments have been omitted.ย 

Look at another example illustrated below as the printf statement. –

 

Here the printf statement has five valid tokens. The standard compiler used is the JavaCC compiler, which forms theย lexical analysis toolsย of the lexer scanner and parser all-in-one and can be used for both parsing and lexical analyzer operations. The use of lexical analysis has certain advantages like simple design, separation of the syntax and lexical analysis parts allowing one to simplify either of these operations and enhancing the portability and efficiency of the compiler.

3) Example of tokens

There are 3 types orย specification of tokensย examples, namely

Type tokens like number, real, id etc.

Punctuation token like void, IF, return, etc

Alphabetic tokens like keywords etc.

ย These groups can be understood by theย lexical analysisย examples below.

Keywords like while, for, if etc.

Identifiers like function name, variable name,etc.

Operators like ‘ย  ‘, ‘ ‘, ‘-‘ etc.

Separators like ‘;’ ‘,’โ€™.โ€™etc.

Hereโ€™s an example of working out.ย Count number of tokens in int max(int i)

  • The Lexical analyzer reads int and accepts and outputs a valid token.
  • It then readsย max,ย and since it accepts this as a valid function name issues a valid token.
  • This process continues until it finally issues seven tokens.
  • The tokens areย int, (, max, int, ), i and; not in that sequence of course!

4) Example of Non-Tokens

Having commented on tokenization and the various tokens, one will also need to define the non-tokens in the design of the lexical analyzer. These are part of the input but not issued valid tokens. For Ex: spaces between words, comments, macros, preprocessor directive, blanks, newline, tabs, etc. A Lexeme or token is used to refer to the sequence of characters matched to a form using a pattern aka token. The corresponding sequence of characters in the input that forms a single token is a lexeme. A Lexical error is the sequence of input characters not matching the token patterns inย lexical analysis. Lexical phase errors are identified in theย lexical analysisย programโ€™s execution.

Conclusion

The above article has explained lexical analysis, theย role of lexical analysis, the process of tokenization, examples, and non-examples of tokens and the differences between syntax and lexical analysis. Lexical analysisย is used with the source code 1 character at a time which it then converts into tokens or meaningful lexemes. In syntax analysis, the tokens generated are output as the parse tree.

If you are interested in making it big in the world of data and evolve as a Future Leader, you may consider our Business Analyst Training, a 10-month online program, in collaboration with IIM Indore!

Also Read

 

Related Articles

loader
Please wait while your application is being created.
Request Callback