The use of symbol-state tables
A. C. Day
Computer Centre, University College London, 19 Gordon Street, London WC1
This paper describes how a certain kind of table may be used to check the syntax of a string of
symbols. In its simplest form, a symbol-state table closely resembles a finite state machine, and
requires a large amount of space, often to store duplicated information. By means of the
subroutine principle such duplication may be avoided, and recursive use of the table achieved,
making possible the checking of recursive structures. When syntax checking is not the only aim, or
when checking needs to be performed which is beyond the scope of the table, sections of program
coding may be called from the table. Applications have included the writing in FORTRAN of
a syntax checker for FORTRAN, and the automatic proof reading of dictionary entries.
(Received January 1970)
Arrays are often found to be useful as an alternative way
of storing programming information, rather than coding
a great many conditional instructions. An example of
this is the use of decision tables, in which the fulfilment
of a set of conditions leads to the execution of a set of
operations, the conditions and the operations being
specified by the table. Another example is that of
transition matrices, as used in compiler writing. Usually
in this case the next symbol in the input stream and the
symbol at the top of a stack are used to denote the row
and column of the matrix. Their intersection gives the
place in the matrix at which is stored the address of a
subroutine to handle the incoming symbol. This is
shown diagrammatically in Fig. 1.
The tables described in this paper are also used to store
supplementary programming information, and are useful
aids when any kind of syntax checking needs to be
performed. In a symbol-state table the next symbol in
the input stream denotes the column, and the current
state gives the row. The intersection of these in the
table yields a datum which may be the new state, an
error indication, or certain other kinds of information
(see Fig. 2). If the datum in the table is always a new
state, then the symbol-state table is equivalent to a finite
state machine with no output. If the datum is always a
subroutine address, then the table is a transition matrix.
However, there are many other possibilities. I will deal
here with the practical possibilities rather than the
theoretical classifications.
[Fig. 1. Transition matrix: the input symbol and the stack symbol select the cell holding a subroutine address]
[Fig. 2. Symbol-state table: the input symbol and the current state select a cell]

The Computer Journal Volume 13 Number 4 November 1970

Examples of simple symbol-state tables
An elementary example should make clear the way in which a symbol-state table may be used for
syntax checking. Let us suppose that it is necessary to check the syntax of FORTRAN GOTO
statements. We will further suppose that the keyword GOTO has already been identified, so
that all that needs to be done is to check the remaining non-blank characters of the statement.
These will be our input symbols, which we will consider to be terminated by a further symbol
called EOS (end of statement). A subroutine (which we will call NEXT) is needed to deliver the
type of the next symbol. The types will be:
1. Alphabetic characters.
2. Numeric digits.
3. Left bracket.
4. Right bracket.
5. Comma.
6. EOS.
For simplicity we will first assume that the only permitted forms are the simple GOTO and the
computed GOTO, i.e. the forms:
    GOTO label
    GOTO (label, label. . .), name
where label represents a string of 1 to 5 decimal digits, and name must begin with an alphabetic
character, which is then followed by 0 to 5 alphabetic or numeric characters.
A symbol-state table which will check the syntax of these GOTO statements is shown in Fig. 3.
Certain cells are left blank for greater readability. These would normally be filled with
negative numbers, indicating an error condition. A zero cell indicates that the whole statement
has been checked, with no errors detected. A positive cell is to be interpreted as the new state.
On the left-hand side of the table in Fig. 3 there are two columns of comments which have been
found useful when constructing large or complex tables. The problems which arise then are similar
to those encountered when programming in absolute non-symbolic machine code. The left-hand
column of comments gives some indication of the symbols which have been found immediately
before reaching that state. The right-hand column shows which symbols are to be expected (and
therefore permitted).

FOLLOWS      | EXPECTS                   | STATE | ALPH. | NUM. | (   | )   | ,   | EOS
             |                           |       | 1     | 2    | 3   | 4   | 5   | 6
GO TO        | label or (label,          | 1     |       | 2    | 3   |     |     |
digit        | digit or EOS              | 2     |       | 2    |     |     |     | 0
(            | label,                    | 3     |       | 4    |     |     |     |
digit        | digit or ,                | 4     |       | 4    |     |     | 5   |
label,       | label                     | 5     |       | 6    |     |     |     |
digit        | digit, , or )             | 6     |       | 6    |     | 7   | 5   |
)            | , name                    | 7     |       |      |     |     | 8   |
,            | alph. char.               | 8     | 9     |      |     |     |     |
part of name | alph. char., digit or EOS | 9     | 9     | 9    |     |     |     | 0
Fig. 3. Checking simple and computed GO TO

The table in Fig. 3 must be used in conjunction with a driving program, of course. A suitable
section of FORTRAN coding for performing this task is given in Prog. 1. It is assumed that the
table in Fig. 3 has been loaded into the array called TABLE in Prog. 1. This array has been
dimensioned with 99 rows to allow for expansion as we deal with more complex examples.

      DIMENSION TABLE(99,6)
      INTEGER STATE, TYPE, TABLE
C SET INITIAL STATE
      STATE = 1
C GET THE TYPE OF THE NEXT INPUT SYMBOL
    1 CALL NEXT (TYPE)
C RETRIEVE THE NEXT STATE FROM THE TABLE
      STATE = TABLE (STATE, TYPE)
C TEST FOR ERROR, EXIT OR CONTINUE
      IF (STATE) 3, 2, 1
C SUCCESSFUL EXIT - NO ERRORS FOUND
    2 CONTINUE
      .
C ERROR EXIT
    3 CONTINUE
Prog. 1

We will now see how the table in Fig. 3, driven by the coding above, checks the FORTRAN
statement:
    GOTO (3, 426, 79), IPT
The table is entered with an initial state of 1 when the next symbol in the input stream is the
one immediately following the GOTO. When the cell in the table is positive, it is taken to be
the state for examining the next symbol. The sequence of states may be shown as follows:

Next symbol | Symbol type | State | Cell of table
(           | 3           | 1     | 3
3           | 2           | 3     | 4
,           | 5           | 4     | 5
4           | 2           | 5     | 6
2           | 2           | 6     | 6
6           | 2           | 6     | 6
,           | 5           | 6     | 5
7           | 2           | 5     | 6
9           | 2           | 6     | 6
)           | 4           | 6     | 7
,           | 5           | 7     | 8
I           | 1           | 8     | 9
P           | 1           | 9     | 9
T           | 1           | 9     | 9
EOS         | 6           | 9     | 0
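
The driving loop of Prog. 1 is short enough to restate in a modern language. The following is a minimal sketch in Python rather than the paper's FORTRAN; the names `FIG3`, `symbol_types` and `check` are hypothetical, the dictionary is a transcription of Fig. 3 in which an absent cell plays the part of a blank (error) cell, and `symbol_types` stands in for subroutine NEXT.

```python
ALPH, NUM, LPAR, RPAR, COMMA, EOS = 1, 2, 3, 4, 5, 6
ERROR = -1  # any negative number marks an error cell

# Fig. 3 by rows; cells that are blank in the figure are simply absent.
FIG3 = {
    1: {NUM: 2, LPAR: 3},
    2: {NUM: 2, EOS: 0},
    3: {NUM: 4},
    4: {NUM: 4, COMMA: 5},
    5: {NUM: 6},
    6: {NUM: 6, RPAR: 7, COMMA: 5},
    7: {COMMA: 8},
    8: {ALPH: 9},
    9: {ALPH: 9, NUM: 9, EOS: 0},
}

PUNCTUATION = {"(": LPAR, ")": RPAR, ",": COMMA}


def symbol_types(text):
    """Stand-in for NEXT: yield the type of each non-blank character, then EOS."""
    for ch in text:
        if ch == " ":
            continue
        if ch.isalpha():
            yield ALPH
        elif ch.isdigit():
            yield NUM
        else:
            yield PUNCTUATION.get(ch, 0)  # 0 matches no column, so it is an error
    yield EOS


def check(text, table=FIG3):
    """Return True if the text after the keyword GOTO is accepted by the table."""
    state = 1
    for kind in symbol_types(text):
        state = table[state].get(kind, ERROR)
        if state < 0:      # error cell
            return False
        if state == 0:     # zero cell: whole statement checked
            return True
    return False


print(check("(3, 426, 79), IPT"))
print(check("(3), I"))
```

As in Prog. 1, the loop body is a fetch, a table look-up and a three-way test; everything specific to the GOTO statement lives in the table.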
In Fig. 3 there are 38 cells left blank, to be filled with negative error flags. By choosing
different negative integers for different error conditions, we are able to diagnose up to 38
different types of error. Of course, such detail is hardly necessary, and several cells may be
given the same error number. At the error exit, we may print out different messages according to
the error number, or we may print out a standard diagnostic together with the error number itself.
The checking performed by the table in Fig. 3 is not adequate for many purposes. For instance,
we do not test to see that names and labels do not exceed their maximum permitted lengths. Nor
do we ensure that the name is integer in type. However, this is intended merely as a simple
example to show the workings of a symbol-state table.
Note that the driving loop in this example is only three FORTRAN statements in length. It is
true that one of these three is a call to a subroutine, but this subroutine (NEXT) need not be
very complex, and it would be needed whatever method of checking was used. FORTRAN code is
conserved because the table is in effect interpretive code. This concept of the table as code is
a useful one, as it may be developed to cover subroutines within the body of the table, which may
in fact be called recursively, and also calls from the table to sections of program coding in
order to perform special actions which are beyond the scope of the table. These ideas will be
developed in the remainder of the article.
Table subroutines
A further type of GOTO statement exists in FORTRAN, namely, the assigned GOTO. This has the
format:
    GOTO name, (label, label. . .)
Fig. 4 shows the amendments which need to be made to Fig. 3 in order for the table to test the
syntax of all three types of GOTO. Rows of the table in Fig. 3 which are not mentioned in Fig. 4
are assumed to remain the same.

FOLLOWS      | EXPECTS                 | STATE | ALPH. | NUM. | (   | )   | ,   | EOS
             |                         |       | 1     | 2    | 3   | 4   | 5   | 6
GOTO         | label, (label, or name, | 1     | 10    | 2    | 3   |     |     |
part of name | alph. char., digit or , | 10    | 10    | 10   |     |     | 11  |
name,        | (label, label. . .)     | 11    |       |      | 12  |     |     |
(            | label,                  | 12    |       | 13   |     |     |     |
digit        | digit or ,              | 13    |       | 13   |     |     | 14  |
label,       | label                   | 14    |       | 15   |     |     |     |
digit        | digit, , or )           | 15    |       | 15   |     | 16  | 14  |
)            | EOS                     | 16    |       |      |     |     |     | 0
Fig. 4. Amendment to Fig. 3 for checking assigned GO TO in addition to simple and computed GO TO

At this point there is considerable redundancy in the table, as rows 3-6 are identical with rows
12-15. These two sections are both checking for the structure:
    (label, label. . .)
We cannot directly combine these two sections, because after rows 3-6 we must look for
    , name
whereas after rows 12-15 only the end of statement is permitted. The redundancy and waste of
space which is apparent here becomes progressively more of a problem as the syntax which is
tested becomes more detailed.
This is obviously the kind of problem which, when coding, can be handled by a subroutine which is
called from two different places. The redundancy in the symbol-state table under consideration
can be overcome by means of a table analogue for a subroutine. (Throughout the rest of this paper
such an analogue will be written as subroutine in italics to distinguish it from a coded
subroutine.) Instead of having two blocks of rows which are identical, only one version of these
rows is kept, and when this is entered, the return address is stored. The return address is here
the row which should be used as the old state on leaving the subroutine.

FOLLOWS         | EXPECTS                   | STATE | ALPH. | NUM. | (   | )   | ,   | EOS
                |                           |       | 1     | 2    | 3   | 4   | 5   | 6
GOTO            | label, (label, or name,   | 1     | 6     | 2    | 903 |     |     |
digit           | digit or EOS              | 2     |       | 2    |     |     |     | 0
)               | , name                    | 3     |       |      |     |     | 4   |
,               | name                      | 4     | 5     |      |     |     |     |
part of name    | alph. char., digit or EOS | 5     | 5     | 5    |     |     |     | 0
part of name    | alph. char., digit or ,   | 6     | 6     | 6    |     |     | 7   |
name,           | (label. . .)              | 7     |       |      | 908 |     |     |
(label. . .)    | EOS                       | 8     |       |      |     |     |     | 0
(               | label, label              | 9     |       | 10   |     |     |     |
digit           | digit or ,                | 10    |       | 10   |     |     | 11  |
label,          | label                     | 11    |       | 12   |     |     |     |
digit           | digit, , or )             | 12    |       | 12   |     | 13  | 11  |
)               | anything                  | 13    | 0     | 0    | 0   | 0   | 0   | 0
Fig. 5. Checking simple, assigned and computed GOTOs using a subroutine (rows 9-13)
      DIMENSION TABLE(99,6)
      INTEGER STATE, TYPE, TABLE, RET
      .
C SET INITIAL STATE AND RETURN ADDRESS
      STATE = 1
      RET = 0
C GET THE TYPE OF THE NEXT INPUT SYMBOL
    1 CALL NEXT (TYPE)
C LOOK UP THE NEXT STATE IN THE TABLE
    2 STATE = TABLE (STATE, TYPE)
C TEST FOR ERROR, EXIT OR CONTINUE
      IF (STATE) 6, 4, 3
C POSITIVE STATE - IS IT MORE THAN 100
    3 IF (STATE .LT. 100) GO TO 1
C SUBROUTINE CALL - UNPACK RETURN ADDR.
      RET = MOD (STATE, 100)
      STATE = STATE/100
      GO TO 1
C IF RETURN ADDR. IS NOT ZERO, GO TO IT
    4 IF (RET .EQ. 0) GO TO 5
      STATE = RET
      RET = 0
      GO TO 2
C SUCCESSFUL EXIT - NO ERRORS FOUND
    5 CONTINUE
C ERROR EXIT
    6 CONTINUE
Prog. 2
The table in Fig. 5 is the equivalent of those in Figs. 3
and 4 in the checking it performs, but rows 9-13 of the
former take the place of the duplicated rows of the latter.
Rows 9-13 of Fig. 5 are a subroutine for checking two or
more labels, separated by commas, and terminated by a
right bracket. This subroutine is called from two cells
of the table, rows 1 and 7, column 3 in each case. In
order to call a subroutine two row addresses are needed,
one to indicate the starting row of the subroutine, and
one for the return row. This information is packed into
one cell in Fig. 5 by the arbitrary method of multiplying
the subroutine address by 100 and adding to it the return
address. Of course, two tables could be used, one of
which would contain only the subroutine addresses for
calls, but this would be very wasteful of space.
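
The packing rule can be written out as a pair of helpers. This is a small sketch in Python under the assumption, implied by the two-digit return field, that row numbers run from 1 to 99; the names `pack_call` and `unpack_call` are hypothetical.

```python
def pack_call(subroutine_row, return_row):
    """Pack a subroutine call into one cell: subroutine row * 100 + return row."""
    if not (1 <= subroutine_row <= 99 and 1 <= return_row <= 99):
        raise ValueError("rows must lie between 1 and 99")
    return subroutine_row * 100 + return_row


def unpack_call(cell):
    """Split a call cell into (subroutine row, return row)."""
    return divmod(cell, 100)


print(pack_call(9, 3))     # 903, the cell in row 1, column 3 of Fig. 5
print(unpack_call(908))    # (9, 8), the cell in row 7, column 3 of Fig. 5
```

Because every packed call is at least 100 and every ordinary state is at most 99, a single comparison with 100 is enough to tell the two kinds of positive cell apart.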
A zero cell in the table now has two meanings. When
executing a subroutine it means 'go to the return address'
(as in row 13 of Fig. 5). Outside a subroutine it means
'exit from the table' (as in row 8, column 6 of Fig. 5).
The FORTRAN coding needed to drive the table will
now have to be amended. A variable RET will be used
to store the return address. Each time a positive datum
is obtained from the table it is now necessary to test
whether it is greater than 100. The coding will now be
that of Prog. 2.
In FORTRAN the assigned and computed GOTO
statements have in common the string
(label, label. . .)
It will be seen from Fig. 5 that the subroutine is not called
until the first common symbol (the left bracket) is
encountered. It usually happens that we do not know
whether the subroutine needs to be called until we reach
the first symbol which could have been checked by that
subroutine. In Fig. 5 the left bracket is checked in
rows 1 and 7, so that it is pointless to have the subroutine
check it over again. Therefore I adopt the convention
that on a call to a subroutine the next symbol is obtained
from the input stream ready for the subroutine to check
it. Thus in Fig. 5 the first symbol to be checked by the
subroutine is the first numeric digit inside the brackets.
A similar convention needs to be adopted on returning
from a subroutine. In the example used here, we know
that control should pass from the subroutine as soon as
the right bracket is found, so we could return from
row 12, column 4 of Fig. 5, obtaining the next symbol
from the input stream as we do so. However, the more
usual case is that we only realise that we ought to have
returned from the subroutine when examining the first
symbol which is not to be checked by that subroutine.
Therefore the convention I adopt is to return on the first
symbol not handled by the subroutine, and not to get the
next symbol from the input stream as control is passed
back. This is why, in the coding of Prog. 2, when a
return address is used as the next state, control is passed
to statement 2 rather than to statement 1. This means
that in Fig. 5 row 13 appears to be wasted. Its purpose
is really to get the next symbol beyond the right bracket,
in order to return with this symbol to the return address
(either row 8 or row 3), where this symbol will be checked.
It is not necessary to put error codes in row 13, e.g. to
trap an erroneous left bracket. All such checking can be
done when the return address is reached, i.e. in this case,
row 3 or row 8.
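
The two conventions (fetch a new symbol on a call, keep the current symbol on a return) can be seen at work in the following sketch, which mirrors Prog. 2 in Python rather than FORTRAN. The names `FIG5`, `symbol_types` and `check` are hypothetical; the dictionary transcribes Fig. 5 with blank cells omitted.

```python
ALPH, NUM, LPAR, RPAR, COMMA, EOS = 1, 2, 3, 4, 5, 6

# Fig. 5 by rows; rows 9-13 are the table subroutine.
FIG5 = {
    1: {ALPH: 6, NUM: 2, LPAR: 903},
    2: {NUM: 2, EOS: 0},
    3: {COMMA: 4},
    4: {ALPH: 5},
    5: {ALPH: 5, NUM: 5, EOS: 0},
    6: {ALPH: 6, NUM: 6, COMMA: 7},
    7: {LPAR: 908},
    8: {EOS: 0},
    9: {NUM: 10},
    10: {NUM: 10, COMMA: 11},
    11: {NUM: 12},
    12: {NUM: 12, COMMA: 11, RPAR: 13},
    13: {kind: 0 for kind in range(1, 7)},
}

PUNCTUATION = {"(": LPAR, ")": RPAR, ",": COMMA}


def symbol_types(text):
    """Stand-in for NEXT: yield the type of each non-blank character, then EOS."""
    for ch in text:
        if ch == " ":
            continue
        if ch.isalpha():
            yield ALPH
        elif ch.isdigit():
            yield NUM
        else:
            yield PUNCTUATION.get(ch, 0)
    yield EOS


def check(text, table=FIG5):
    kinds = symbol_types(text)
    state, ret = 1, 0
    kind = next(kinds, EOS)
    while True:
        cell = table[state].get(kind, -1)
        if cell < 0:                      # error cell
            return False
        if cell == 0:
            if ret == 0:                  # outside a subroutine: exit from the table
                return True
            state, ret = ret, 0           # return: the same symbol is looked up again
            continue
        if cell >= 100:                   # call: unpack subroutine row and return row
            state, ret = divmod(cell, 100)
        else:
            state = cell
        kind = next(kinds, EOS)           # calls and ordinary moves fetch a new symbol


print(check("(3, 426, 79), IPT"))
print(check("K, (1, 2)"))
```

The `continue` on a return corresponds to the jump to statement 2 in Prog. 2, and the shared fetch at the bottom of the loop corresponds to the jump to statement 1.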
Once again, this is merely an example to show what
can be done by means of symbol-state tables. In this
case we have saved 3 rows out of 16 by using a subroutine. When the syntax to be checked is more complex, the savings which accrue by using subroutines
become very considerable indeed. The FORTRAN
loop which is executed when there are no calls, errors or
returns is still only four statements in length.
Nested and recursive calls
The FORTRAN coding for use with the table in Fig. 5 was written with the assumption that only one
subroutine would be called at a time. It is very useful to be able to have nested calls. As a
call to a symbol-state table subroutine does not change the table in any way, these calls may in
fact be recursive with no additional machinery.
Recursive techniques are particularly useful for checking recursive structures. As an example,
let us consider expressions which are defined by the following BNF grammar:
    <name> ::= A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
    <exprn> ::= <name>|(<exprn>)|<exprn><op><exprn>
    <op> ::= +|-|*|/
The third rule is a recursive definition of <exprn> in terms of itself. Examples of expressions
defined by this grammar are:
    F
    A + B
    ((G)/(D - E)*M + (((P))))
The syntax of all expressions defined by this grammar may be checked by the symbol-state table in
Fig. 6.

FOLLOWS   | EXPECTS        | STATE | NAME | OP  | (   | )   | EOS | OTHER
          |                |       | 1    | 2   | 3   | 4   | 5   | 6
beginning | name or (exprn | 1     | 2    |     | 103 |     |     |
exprn     | op, ) or EOS   | 2     |      | 1   |     | 0   | 0   |
(exprn    | )              | 3     |      |     |     | 2   |     |
Fig. 6. Checking expressions by recursive calls

For both nested and recursive calls the return addresses must be kept in a stack. FORTRAN coding
to drive Fig. 6 is therefore a little more complicated, but Prog. 3 is sufficient.
Subroutine NEXT also needs to be amended in order to deliver the types of symbols required by
Fig. 6. An extra column (column 6) has been added to Fig. 6 in order to cope with invalid symbols
which are encountered by NEXT. When NEXT finds a symbol which it cannot assign to any of the five
types, it makes it type 6. All of the cells in column 6 contain error codes, so that the invalid
symbol is trapped by the same mechanism which finds symbols in invalid positions.
One weakness of Fig. 6 is that both a return from a call, and exit from the table, can occur when
either a right bracket or the end of statement is reached. This means that excess right brackets
could cause exit from the table before the whole statement has been checked. This could have been
avoided by adding extra rows to the table, so that the first call to the subroutine is separated
from all other recursive calls. For interest, this table is given in Fig. 7. However, the economy
of Fig. 6 may be preserved simply by the addition of the statement labelled 5 in Prog. 3, to check
on final exit from the table that the end of statement has in fact been reached.

[Fig. 7. Expanded to check expressions by recursive calls]

      DIMENSION TABLE(99,6), STACK(50)
      INTEGER STATE, TYPE, TABLE, STACK
C SET INITIAL STATE AND LEVEL OF CALLS
      STATE = 1
      LEVEL = 0
C GET THE TYPE OF THE NEXT INPUT SYMBOL
    1 CALL NEXT (TYPE)
C LOOK UP THE NEXT STATE IN THE TABLE
    2 STATE = TABLE (STATE, TYPE)
C TEST FOR ERROR, EXIT OR CONTINUE
      IF (STATE) 6, 4, 3
C POSITIVE STATE - IS IT MORE THAN 100
    3 IF (STATE .LT. 100) GO TO 1
C SUBROUTINE CALL - INCREMENT LEVEL
      LEVEL = LEVEL + 1
      IF (LEVEL .GT. 50) GO TO 7
C UNPACK RETURN AND SUBROUTINE ADDR.
      STACK (LEVEL) = MOD (STATE, 100)
      STATE = STATE/100
      GO TO 1
C IF STACK IS NOT EMPTY, RETURN
    4 IF (LEVEL .EQ. 0) GO TO 5
      STATE = STACK (LEVEL)
      LEVEL = LEVEL - 1
      GO TO 2
C EXIT - NO ERRORS FOUND - TEST IF EOS
    5 IF (TYPE .NE. 5) GO TO 6
      .
C ERROR EXIT
    6 CONTINUE
C CALLS ARE NESTED TOO DEEPLY
    7 CONTINUE
Prog. 3

As an example, let us take the expression
    ((G)/D - E)*M + P
and trace the sequence of states as this expression is checked. In each state the stack contains
the number of values given by the variable LEVEL, but here only the last value placed in the
stack will be shown. The table being used is that in Fig. 6. Note that on a call to a subroutine
the new state is used to give both the next old state, and also the return address which is
pushed down on the stack. On return from a subroutine no new symbol is obtained from the input
stream, but the next old state is derived by 'popping' the top of the stack. The sequence of
states will be as in the following table, in which the top of the stack and the level are shown
as they stand after the cell has been acted upon, and each return is followed by a second line
for the same symbol.

Next symbol | Symbol type | Old state | New state | Top of stack | Level
(           | 3           | 1         | 103       | 3            | 1
(           | 3           | 1         | 103       | 3            | 2
G           | 1           | 1         | 2         | 3            | 2
)           | 4           | 2         | 0         | 3            | 1
            |             | 3         | 2         | 3            | 1
/           | 2           | 2         | 1         | 3            | 1
D           | 1           | 1         | 2         | 3            | 1
-           | 2           | 2         | 1         | 3            | 1
E           | 1           | 1         | 2         | 3            | 1
)           | 4           | 2         | 0         |              | 0
            |             | 3         | 2         |              | 0
*           | 2           | 2         | 1         |              | 0
M           | 1           | 1         | 2         |              | 0
+           | 2           | 2         | 1         |              | 0
P           | 1           | 1         | 2         |              | 0
EOS         | 5           | 2         | 0         |              | 0

The power of symbol-state tables is seen in the fact that only three rows are sufficient to check
recursive structures of considerable complexity. Rows 1-3 of Fig. 6 may themselves be called as a
subroutine if such expressions occur at different places in higher level structures. In that case
the FORTRAN coding given in Prog. 3 is sufficient for both the higher level structures and the
expression subroutine; no more coding is needed for the driver program than that given.
Action calls
Just as writers of programs in high level languages find it convenient at times to call
subroutines written in assembly language, so writers of symbol-state tables sometimes find it
necessary to perform part of the syntax checking by means of specially written program statements
rather than by means of rows in the matrix. In either case the motivation may be economy (of
space or time), or in order to do something which is not possible in the basic medium. It is
possible for a symbol-state table to call sections of program coding, control usually being
handed back to the table again. A convenient name for such a process is action call.
The need for this may be illustrated using Fig. 5.
Column 1 of rows 1 and 4, and columns 1 and 2 of
rows 5 and 6, are used to check for a FORTRAN
variable name, which must begin with an alphabetic
character, which may then be followed by up to five
alphabetic or numeric characters. All that the table in
Fig. 5 does is to check that the first character is alphabetic, and that any subsequent characters are alphabetic
or numeric. No check is made on the length of the
name, or its type (which for an assigned or computed
GOTO must be integer). The length of the name could
be checked by means of a subroutine, but this would use
up an extra five rows in the symbol-state table. At the
expense of an extra column in the table a test could be
made as to whether the first character of the name was
between I and N, but the type may have been set by an
explicit type statement. The only way to test the type
for sure is to keep a table of names and their attributes,
which a symbol-state table is unable to do.
What is needed is a section of program coding which
can be called from the table, and which will process a
name, checking its length and type. This could be
done by modifying Fig. 5, row 1, column 1, to the
number 10106. The convention would be that the
ten-thousands digit (1) indicates that this is an action
call. The next two digits (01) specify that action 1 is
required. The final digits (06) give the row to which
control should be returned after the action call. The
contents of columns 1 and 2 in row 6 are now irrelevant,
as the coding for the action call will process all alphabetic
and numeric characters belonging to the name. The
same action call will be required from row 4, column 1,
so this cell in the table will be changed to 10105.
In the same way, Fig. 5 does not check labels sufficiently, and could be supplemented with action coding
for this purpose. No FORTRAN label may be more
than five decimal digits in length. Also any thorough
checking of a FORTRAN program will need to keep a
table of labels, and must amend the table for all labels
referenced in GOTO statements to ensure that the
program segment contains these labels, and that they are
attached to executable statements. To check a label in
these ways could be the purpose of action 2. Then
row 1, column 2 of Fig. 5 would be changed to 10202,
meaning 'action call 2, return address 2'. All other
checking of labels in Fig. 5 would similarly use action 2.
The revised form of Fig. 5 is given in Fig. 8.
Obviously a FORTRAN driving program for use with
the table in Fig. 8 must be different from that used with
the table of Fig. 5. If the next state retrieved from the
table is greater than 100, then a further test must be made
to see whether it is greater than 10000, in which case the
three numbers must be unpacked from the cell. A computed GOTO based on the required action (here 1 or 2)
leads to the appropriate action coding.
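
One way to combine the return stack of Prog. 3 with the three-field unpacking just described is sketched below, in Python rather than FORTRAN. Several things here are assumptions rather than the paper's own coding: the names `FIG8`, `run`, `scan` and `ACTIONS`; the use of a newline character as the EOS symbol; and the rule that an action starts at the current symbol, consumes what belongs to it, and hands the first unconsumed symbol to the return row. The two actions check only length (six characters for a name, five digits for a label); the type and label-table checks mentioned above are left out.

```python
ALPH, NUM, LPAR, RPAR, COMMA, EOS = 1, 2, 3, 4, 5, 6
END = "\n"  # sentinel character standing in for the EOS symbol

# Fig. 8 by rows; blank cells are omitted. 1AARR = action AA, return row RR.
FIG8 = {
    1: {ALPH: 10106, NUM: 10202, LPAR: 903},
    2: {EOS: 0},
    3: {COMMA: 4},
    4: {ALPH: 10105},
    5: {EOS: 0},
    6: {COMMA: 7},
    7: {LPAR: 908},
    8: {EOS: 0},
    9: {NUM: 10210},
    10: {COMMA: 11},
    11: {NUM: 10212},
    12: {COMMA: 11, RPAR: 13},
    13: {kind: 0 for kind in range(1, 7)},
}


def classify(ch):
    """Stand-in for NEXT: map one character to a column of Fig. 8."""
    if ch == END:
        return EOS
    if ch.isalpha():
        return ALPH
    if ch.isdigit():
        return NUM
    return {"(": LPAR, ")": RPAR, ",": COMMA}.get(ch, 0)


def scan(symbols, pos, accept, limit):
    """Consume symbols while accept(ch); return the new position, or None if too long."""
    end = pos
    while accept(symbols[end]):   # END is neither a letter nor a digit, so this stops
        end += 1
    return end if end - pos <= limit else None


ACTIONS = {
    1: lambda symbols, pos: scan(symbols, pos, str.isalnum, 6),  # name
    2: lambda symbols, pos: scan(symbols, pos, str.isdigit, 5),  # label
}


def run(table, text, classify, actions, eos_type):
    symbols = [ch for ch in text if ch != " "] + [END]
    pos, state, stack = 0, 1, []
    while pos < len(symbols):
        kind = classify(symbols[pos])
        cell = table[state].get(kind, -1)
        if cell < 0:                         # error cell
            return False
        if cell == 0:
            if not stack:                    # exit: accept only at end of statement
                return kind == eos_type
            state = stack.pop()              # return: same symbol, no fetch
            continue
        if cell >= 10000:                    # action call
            pos = actions[(cell // 100) % 100](symbols, pos)
            if pos is None:
                return False
            state = cell % 100               # return row sees first unconsumed symbol
            continue
        if cell >= 100:                      # table subroutine call
            stack.append(cell % 100)
            state = cell // 100
        else:
            state = cell
        pos += 1                             # calls and ordinary moves fetch a symbol
    return False


print(run(FIG8, "(3, 426, 79), IPT", classify, ACTIONS, EOS))
print(run(FIG8, "123456", classify, ACTIONS, EOS))
```

Because return rows are kept on a stack, the same function should also drive a recursive table such as Fig. 6 when it is given a classifier and an EOS column to suit that table.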
STATE | ALPH. | NUM.  | (   | )   | ,   | EOS
      | 1     | 2     | 3   | 4   | 5   | 6
1     | 10106 | 10202 | 903 |     |     |
2     |       |       |     |     |     | 0
3     |       |       |     |     | 4   |
4     | 10105 |       |     |     |     |
5     |       |       |     |     |     | 0
6     |       |       |     |     | 7   |
7     |       |       | 908 |     |     |
8     |       |       |     |     |     | 0
9     |       | 10210 |     |     |     |
10    |       |       |     |     | 11  |
11    |       | 10212 |     |     |     |
12    |       |       |     | 13  | 11  |
13    | 0     | 0     | 0   | 0   | 0   | 0
Fig. 8. Fig. 5 modified to include action calls for checking names and labels. (Note that rows
2, 5 and 8 could now be condensed into one row. This has not been done in order to preserve
similarity with Fig. 5.)

Multiple symbol-state tables
There are certain problems for which more than one symbol-state table may be needed. For example,
when checking FORTRAN statements, the FORMAT statement will need a different symbol-state table
(and a different subroutine to deliver the type of the next input symbol) from all other
statements. This is because in most FORTRAN statements there is no significant difference in type
between such characters as A, F and X, whereas in the FORMAT statement they must be treated as
different types of symbols. The number of columns for a table checking FORMAT statements will
normally be greater than for one checking other statements. The two tables could only be combined
at the expense of a considerable wastage of space.
It may even be necessary at times to have a hierarchy of symbol-state tables, the higher tables
checking the order of items whose internal structure has been checked by lower tables. For
example, let us consider a hierarchy of tables for checking the syntax of FORTRAN. The lowest
level tables will check the syntax of statements. There will probably be two of these, one to
check FORMAT statements, and the other to check all other statements. For both these tables the
symbols will be characters in the input stream.
A higher level table will be needed to check for the ordering of statements within program
segments. A suitable table for checking that this conforms to the specification of ASI FORTRAN is
shown in Fig. 9. In this case the symbols are FORTRAN statements which have already been checked
using the lowest level tables. The table in Fig. 9 has no subroutines, so the hundreds digit is
used to indicate action calls. The actions required are shown beneath the table. A higher level
table still could be used in which the symbols are program segments, and in which a test is made
that one and only one main program, and not more than one BLOCK DATA subprogram, occur. However,
this can be done more economically by keeping flags which are set by action calls from the table
in Fig. 9. The point is that complex structures may be checked by using a hierarchy of tables,
thereby increasing their power.

STATE | 1   | 2   | 3   | 4   | 5   | 6 | 7   | 8 | 9   | 10  | 11  | 12
1     | 4   | 2   | 4   | 4   | 5   | 5 | 504 | 1 | 7   | 6   | 401 | 801
2     | 204 | 102 | 2   | 302 | 302 | 3 | 302 | 2 | 302 | 302 | 1   | 802
3     | 204 | 102 | 102 | 303 | 303 | 3 | 303 | 3 | 303 | 303 | 1   | 803
4     | 204 | 202 | 4   | 4   | 5   | 5 | 504 | 4 | 7   | 6   | 701 | 804
5     | 204 | 202 | 104 | 104 | 5   | 5 | 505 | 5 | 7   | 6   | 701 | 805
6     | 204 | 202 | 106 | 106 | 106 | 6 | 506 | 6 | 7   | 6   | 601 | 806
7     | 204 | 202 | 107 | 107 | 107 | 7 | 507 | 7 | 507 | 506 | 1   | 807
[The column headings, which name the type of statement for each of columns 1 to 12, are not legible.]
Action calls:
100 Error message: statement out of order.
200 Error message: END line missing.
300 Error message: statement is illegal in BLOCK DATA.
400 Error message: null subprogram.
500 Error message if the statement is not labelled.
600 Error message: END not preceded by transfer of control.
700 Error message: subprogram contains no executable statement.
800 Error message: unrecognised statement.
Fig. 9. Checking for the order of FORTRAN statements

Applications
From the examples given here it will be apparent that
symbol-state tables are a great help in designing syntax
checkers for a programming language. Compilers do
not do all the syntax checking which may be desired, as
they are almost always biased towards some manufacturer's version of the language, and hardly ever check
for an internationally agreed standard. Consequently
there arises a need for syntax checkers written in a high
level language (in order to be machine independent)
which test the syntax of programs in a high level language,
e.g. a FORTRAN program to check the syntax of a
FORTRAN program and flag every statement which
does not conform to the ASI standard. This could be
a great help in establishing whether or not a particular
program is machine independent. The size of such a
syntax checker can be greatly reduced by using symbol-state tables, without impairing the speed.
Another application has been that of processing text
material in a natural language. A project was undertaken at the Computer Centre of University College
London (under a grant from the School of Oriental and
African Studies) to produce the final volume of Sir
Ralph Turner's Comparative Dictionary of Indo-Aryan
Languages (1966). This was to be a phonetic analysis,
listing all those headwords which contained examples of
certain sequences of symbols. The 19,000 Sanskrit
headwords from the dictionary were punched on to
paper tape, then copied on to magnetic tape. The
Sanskrit is represented in the dictionary by a Romanised
transliteration which includes many 'diacritics', i.e.
marks above and below the letters. For computer input
these composite symbols had to be coded as the letter
followed by special characters representing the diacritics,
e.g.
t became T
§ became S/
a became A/=
A magnetic tape had to be produced for automatic
typesetting on which every printable symbol (composite
or not) had to be represented by an integer, e.g. t was
represented as 62, a as 13, and so on. The problem was
to recognise groups of characters in the input stream as
the corresponding printable symbols, and to do it
efficiently.
A symbol-state table was written with a subroutine
which built up the printable symbols. Each time a
letter was found in the input stream, the subroutine was
called. Action calls from the subroutine as the letter
and any following diacritics were processed, built up the
resulting single integer. In fact, as the subroutine was
called, a count was initialised to a certain value depending
on the letter which had been encountered. When a
diacritic was found, the count was incremented by an
amount depending on the diacritic. When a character
was encountered which could not be part of the printable
symbol, control was handed back from the subroutine,
and at the same time the resulting count was placed in
the output stream. This process enabled some proofreading to be performed, as certain invalid sequences of
characters were trapped by the symbol-state table.
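
The counting scheme can be illustrated with a short Python sketch. The starting values in `BASE` and the increments in `INCREMENT` are invented for the illustration and are not the codes used for the dictionary; the function name `encode` is likewise hypothetical. Only the mechanism follows the description: a letter starts a count, each diacritic adds to it, and the count is output when a character arrives which cannot belong to the symbol.

```python
# Hypothetical code values, chosen only to make the example concrete.
BASE = {"A": 10, "S": 50, "T": 60}   # count initialised by the letter
INCREMENT = {"/": 1, "=": 2}         # amount added by each diacritic mark


def encode(text):
    """Turn letters with trailing diacritic marks into one integer per printable symbol."""
    out = []
    count = None                      # None means no symbol is being built
    for ch in text:
        if ch in BASE:
            if count is not None:     # a new letter ends the previous symbol
                out.append(count)
            count = BASE[ch]
        elif ch in INCREMENT:
            if count is None:         # a diacritic with no letter before it
                raise ValueError("diacritic %r does not follow a letter" % ch)
            count += INCREMENT[ch]
        else:
            raise ValueError("unexpected character %r" % ch)
    if count is not None:
        out.append(count)
    return out


print(encode("A/=S/T"))
```

The `ValueError` cases correspond to the proof-reading role described above: a sequence that cannot be a printable symbol is rejected rather than silently encoded.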
Other possible uses of symbol-state tables which have
suggested themselves are pattern-matching, parsing, or
even (with suitable action calls) as the core of an interpreter. The great drawback to the tables presented here
is the labour of constructing them by hand. However,
it should not be too difficult to write programs which
will accept rules in an appropriate form, and which will
build a table to check statements according to those rules.
These tables provide a powerful technique which should
not be confined to compiler writers, or to those who
program in assembly languages.
Acknowledgements
My thanks are due to Paul A. Samet, who first introduced me to simple symbol-state tables. Alan Shaw has
assisted me under grants from the National Computing
Centre (for work on syntax checkers) and from the School
of Oriental and African Studies (for work mentioned
above).
References
CHAPIN, N. (1967). Parsing of Decision Tables, CACM, Vol. 10, No. 8, pp. 507-512.
GRIES, D. (1968). Use of Transition Matrices in Compiling, CACM, Vol. 11, No. 1, pp. 26-34.
FELDMAN, J., and GRIES, D. (1968). Translator Writing Systems, CACM, Vol. 11, No. 2, pp. 77-113.
