The compiler's scanner, also called the lexical analyzer, performs the first phase of compilation: lexical analysis. It converts the characters of the input program into a sequence of tokens in an operation called tokenization. As the diagram below shows, the output of this phase is a token sequence that is passed to the parser for syntax analysis; in compiler design, lexical analysis is typically implemented with deterministic finite automata.
In linguistics, lexical analysis groups sounds, symbols, or characters into a sequence of meaningful units; computer science calls this process parsing or tokenizing, depending on the context. The difference between the parser and the lexical analyzer is that the lexical analyzer scans the program's input, while the parser analyzes its syntax.
The lexer emits tokens, whereas the parser builds an abstract representation of the code. The lexical analyzer inserts tokens into the symbol table, whereas the parser updates the symbol table entries. The parser produces a parse tree as its output, whereas the lexical analyzer reports lexical errors found during scanning.
A lexical token, the output of the lexical analyzer, is a sequence of characters that forms a single unit of the programming language's grammar. The role of the lexical analyzer is to tokenize the program: it divides the input into valid tokens, removing whitespace and comments along the way, and it reports lexical errors together with the row and column numbers at which they occur in the input. Take, for example, the diagram below, which identifies lexical errors using the grammar of C or C++ and a finite automaton, reporting the row and column number of each error.
If we pass a statement like a = c + b; through a lexical analyzer, the output it generates is a token sequence of the form id = id + id;, where each id refers to the details of a variable stored in the symbol table.
Now consider the program below to see the process of lexical analysis in action.
int main()
{
ย ย // 2 variables
ย ย int b, a;
ย ย b = 20;
ย return 0;
}
The valid tokens generated after recognition are 'int', 'main', '(', ')', '{', 'int', 'b', ',', 'a', ';', 'b', '=', '20', ';', 'return', '0', ';' and '}'.
Comments have been omitted.ย
Look at another example: a printf statement.
Here the printf statement has five valid tokens; for a statement such as printf("Hello");, these are printf, (, "Hello", ) and ;. A commonly used tool is JavaCC, a generator that produces both the lexer (scanner) and the parser, so it covers both lexical analysis and parsing. Separating lexical analysis from syntax analysis has clear advantages: it keeps the design simple, allows each phase to be simplified independently, and improves the portability and efficiency of the compiler.
There are three types of tokens in the classic specification, namely:
Type tokens, like number, real, id, etc.
Punctuation tokens, like IF, void, return, etc.
Alphabetic tokens, like keywords, etc.
These groups can be understood from the lexical analysis examples below.
Keywords, like while, for, if, etc.
Identifiers, like function names, variable names, etc.
Operators, like '+', '-', '*', etc.
Separators, like ';', ',', '.', etc.
Here is a worked example: count the number of tokens in int max(int i). The lexer produces int, max, (, int, i and ), so there are six tokens.
Having covered tokenization and the various tokens, we also need to define the non-tokens handled by the lexical analyzer. These are part of the input but are not issued as tokens: spaces between words, comments, macros, preprocessor directives, blanks, newlines, tabs, and so on. A lexeme is the sequence of characters in the input that matches the pattern for a token; the token is the category that the lexeme belongs to. A lexical error is a sequence of input characters that matches no token pattern; such errors are identified during the lexical analysis phase.
This article has explained lexical analysis, its role, the process of tokenization, examples and non-examples of tokens, and the differences between syntax analysis and lexical analysis. Lexical analysis reads the source code one character at a time and converts it into tokens, or meaningful lexemes; syntax analysis then turns those tokens into a parse tree.