Automatic GUI Creation
Generating Java source code from formal descriptions

2. Analyzing Element Descriptions

In this section the structure and analysis of the formal element descriptions is explained. The section consists of four parts—the three first containing background material. The first part introduces the element descriptions and their syntax. The second part explains how grammars work and how they can be used to analyze a character stream. The third part deals with implementing such an analysis in Java. The last part, finally, reveals the techniques used and design decisions made in the prototype analyzer.

2.1 Formal element descriptions

Most of the elements in the Ericsson IP telephony system are configured and managed remotely with the Simple Network Management Protocol (SNMP, Case et al., 1990). SNMP is most commonly used to configure routers, gateways and other network elements on the Internet, and is an Internet Activities Board recommendation described in RFC 1157.

The SNMP standard requires all manageable elements to have their interface described formally in a Manageable Information Base (MIB), a format described in RFC 1155 (Rose & McCloghrie, 1990). Each MIB typically contains a list of field and table names associated with their data types, access conditions and unique identifiers.

The ASN.1 syntax

The MIB files are normal text files written in Abstract Syntax Notation One (ASN.1, ISO 8824), an ISO standard for declaring variables, composite types and other structures. Actually only a subset of the full ASN.1 syntax is used in the MIB files, as defined in RFC 1155 (Rose & McCloghrie, 1990). The fields must also be declared using the macros defined in RFC 1212 (Rose & McCloghrie, 1991) and RFC 1215 (Rose, 1991).

The ASN.1 language is not a full programming language, but only contains type and structure declarations. Specialized encoders then use the structural declarations when encoding or decoding the actual data, the idea being that declarations and low-level representations should be separated. Currently there exists a number of such encoders, the most commonly undoubtedly being the Basic Encoding Rules (BER) used with SNMP.

Two simple ASN.1 examples

The syntax of ASN.1 is somewhat different from that of a normal programming language, as can be seen in example 1. The example shows a simple declaration, written with the RFC 1212 extensions to the pure syntax. It is worth noting the lack of delimiter characters, something that makes recovery from input errors more difficult.

cfgNumberLength OBJECT-TYPE
        SYNTAX INTEGER
        ACCESS read-write
        STATUS mandatory
        DESCRIPTION
        "The number of digits that must follow
        the VG route number in order to form a
        full number."
        DEFVAL { 3 }
::= { config 1 }

Example 1. Declaration of an integer field cfgNumberLength with a default value of 3. The code also contains the access rights, the implementation status, and a short description. Finally a unique identifier is defined, in order to avoid referring to the field by its name.

A more complicated declaration is shown in example 2, at least from the perspective of automatic code generation. The field is declared as having a data representation that does not correspond to the actual data content, causing the automatic code generator to select an inadequate or ill suited visual presentation. Overcoming such difficulties is hardly possible without the usage of sophisticated heuristics or full natural language processing.

cfgLGK1 OBJECT-TYPE
        SYNTAX DisplayString
        ACCESS read-write
        STATUS mandatory
        DESCRIPTION
        "This is the IP address of the primary Local
        Gatekeeper used by this voicegateway.
        The address takes the form X.X.X.X where
        X is an integer in the range of 0 and 255."
        DEFVAL { "" }
::= { config 2 }

Example 2. Declaration of a normal string field cfgLGK1 containing characters. The string contents should be interpreted as forming a valid IP address, something only mentioned in the description. This is a typical case when the automatic code generation will not be able to present the field data in an adequate way.

Some disadvantages with ASN.1

ASN.1 has several disadvantages compared to structure declarations in common programming languages, such as C or Pascal. The declarations tend to be large, non-compact and cluttered with keywords. Much repetition is also required for composite structures, as each part is declared twice—first with name and type inside the composite structure, and then afterwards with all the required parameters.

The typical ASN.1 file is rather large, which stems in part from the use of long keywords like OBJECT-TYPE, SYNTAX and DESCRIPTION—being easy to read for the untrained eye, but requiring much unnecessary and error-prone copying when writing the MIB files.

More implementation relevance has the free ordering of the constructs in ASN.1, something that cause much trouble when analyzing MIB files. The analysis must normally be made in at least two passes, as identifiers may be read before being defined or declared.

Another serious implementation problem is the fact that the basic types have no size limitation to their contents. Ranges can optionally be specified, but if no limit is specified the ranges are always supposed to be infinite. The use of infinite integers, infinite precision floats, and infinite length strings inevitably cause implementation difficulties, as all actual programming languages specify limits to their basic types. If special care is not taken in all steps of implementation, the result may well be hidden size dependencies that can be hard to track down.

2.2 Grammars and parsing

In order to analyze a stream of characters a special technique, called parsing, is normally used. The parsing is divided into three steps, or passes—lexical analysis, syntactic analysis, and semantic analysis. In the lexical analysis, adjacent characters are grouped together into tokens, more or less like letters are grouped together into words when we read a normal text ^{[Note 1]}. In the syntactic analysis the stream of tokens is analyzed according to a grammar and a parse tree, i.e. a hierarchical tree grouping the tokens, is created. In the semantic analysis, finally, the information is extracted from the parse tree and converted into some internal representation. During all three stages errors in the input can be detected and reported.

Grammars and productions

The grammar is absolutely central to the parsing as it controls the input syntax and the structure of the parse tree (sometimes also called syntax tree). The choice of grammar also controls the parser behavior, as different grammar formats implies different implementation techniques. But to understand this we must first learn more about grammars.

A grammar is said to consist of a series of productions, i.e. rules, controlling the structure of the input. As an example a sentence production could state that a sentence should consist of a noun-phrase, a verb-phrase and a period (see example 3). These parts in turn can be either tokens or references to other productions in the same grammar. The complete grammar consists of all the productions and tokens used therein.

Sentence ::= NounPhrase VerbPhrase "."

Example 3. A BNF description of a sentence consisting of a noun-phrase, a verb phrase, and a period. “Sentence” is the production being defined. “NounPhrase” and “VerbPhrase” are names of other productions (not shown here), and “.” is a reference to a period token.

The grammar itself is normally expressed in BNF (Backus-Naur Form), which is a syntax for specifying context free grammars. Each production in BNF consists of a two parts—the production name and it’s definition—separated by a “::=” symbol. The definition of the production (to the right of this symbol) consists of the sequence of tokens and productions that defines it.

The productions can also be more complicated, as would be the case of a paragraph that consists of one or more sentences. In this case two alternative productions would be needed to express the full relation, as shown in example 4. The productions may also refer to themselves in various ways.

Paragraph ::= Paragraph Sentence
            | Sentence

Example 4. Two productions expressing that a paragraph must consist of one or more sentences. Note the usage of “|” to simplify writing alternative productions with the same name.

Different types of grammars

There are various ways to express the grammar, even if using only strict BNF. One of the major differences is if the productions in the grammar are left- or right-recursive, i.e. if the productions may refer to themselves in the first or in the last position. Other differences include the number of tokens that have to be known in advance to select among the alternative productions, and whether the input string should be read from left-to-right or from right-to-left.

The two main families of grammars are the LL and LR grammars; both being read from left-to-right (which is what the first L means). The LL grammars are right-recursive, and in contrast the LR grammars are left-recursive. This is shown more clearly in example 5.

Paragraph ::= Paragraph Sentence
            | Sentence

Paragraph ::= Sentence Paragraph
            | Sentence

Example 5. Two different possibilities to express the paragraph rule from example 4. The first version is a LR form of the rule, while the second is the corresponding LL form.

The number of “look-ahead” tokens needed to choose among the alternative productions with the same name is normally stated in a parenthesis after the grammar family, as in LL(1), for a LL grammar with one token look-ahead. When specifying this number, the whole grammar must be analyzed and the highest number of tokens needed is the number that should be stated.

The LR grammars may still contain parts causing conflicts between alternative productions, at least for the type of parsers in use today. Therefore a subset of LR, called LALR, that contains some small limitations is normally used instead. As implementation complexity also rises with the usage of more look-ahead tokens, the two types of grammars most used in practice are LL(1) and LALR(1).

More ASN.1 disadvantages

The standard ASN.1 grammar is neither LL(1) nor LALR(1), normally requiring some rewriting before a parser can be implemented (Rinderknecht, 1995). The main problem is the ambiguities—the need for a look-ahead of several tokens, some identical tokens having different names, and the use of empty productions.

ObjIdComponent    ::= NameForm
                    | NumberForm
                    | NameAndNumberForm

NameForm          ::= <identifier>

NameAndNumberForm ::= <identifier> "(" NumberForm ")"

Example 6. A part of the original ASN.1 grammar showing the need for more than one token look-ahead. When inside the ObjIdComponent production and having an identifier as the next token, two rules may be chosen.

The usage of a look-ahead of several tokens is shown in example 6 and causes much trouble when attempting to implement a simple parser. More care has to be taken when creating a parser that handles a look-ahead of more than one token, and it also slows down parsing considerably.

Two other issues in the ASN.1 syntax that cause problems are the difficulty to recover from errors and the language extensibility. In many languages declarations are separated with the “;” character, allowing the parser to read until the next “;” upon error in the input. In ASN.1 the only separation is spaces, tabs and newlines, making error recovery difficult. The ASN.1 syntax can also be dynamically extended as macros are defined, something having the potential to make the parser infinitely complex, should one try to implement it. Most commercial ASN.1 parsers seem to contain only a set of macros, but ignoring the full macro syntax.

2.3 Writing a parser

There are several approaches to writing a parser, but the most common is undoubtedly to use a parser generator, i.e. a program that from a grammar can generate the source code of the parser. Two of the most popular parser generators are the lex and yacc tools (or the GNU equivalents flex and bison) that generate C source code for the lexical and syntactic analyzer. The lex and yacc tools also exist in versions that generate Java source code (JLex and Cup), but these versions offer no integration between the lexical and syntactical analyzers (Berk, 1997).

The classical tools for parser generation also have several other problems, especially when used in an object-oriented environment. For an example, they don’t support automatic parse tree generation, something that would have required too much memory at the time when they were developed. Today, however, memory is not a scarce resource, and multiple pass compilers have become more common. As a result of the lack of parse tree creation, semantic analysis and action code must be inserted in the grammar, making the separation between syntactic and semantic analysis hard to impossible. This mixing up of the later stages in the parsing typically causes problems with maintainability, extensibility and complexity.

Object-oriented parsing

In a modern object oriented language a different approach is called for, especially one that uses the object oriented techniques for encapsulation of data and functions in separate objects. It is also desirable to separate the syntactic and the semantic analysis in modules, making it possible to do small syntax changes without having to change the semantic analysis, and, more important, being able to change the analysis without regenerating the entire parser.

One approach to obtain a more object-oriented parser, is by using delegating compiler objects (Bosch, 1996). The grammar is then split into smaller parts, each part being handled by a separate compiler (or parser) object. When parsing the control is delegated between the various objects, effectively obtaining an object-oriented modularization.

The approach with delegating compiler objects makes the grammar more extensible, as only small changes must be made to add a new parser object. Changes in the compiler are also made simpler and safer, as modules are smaller and easier to overview. The main drawbacks are that no special division between syntactic and semantic analysis is made, and that no such parser generator tools are available for Java.

Another way to create an object-oriented parser is by generating a full parse tree, thereby being able to separate the syntactic and semantic analysis. The object-oriented paradigm also supplies a natural and secure model for the parse tree implementation, with each node as a separate object. This allows for more secure changes in the tree structure, reducing the risk for inconsistencies and errors.

JavaCC

The most commonly used parser generator for Java is probably JavaCC (previously called Jack), made freely available for binary downloads by Sun Microsystems (Sun, 1999). It generates a recursive-descent parser in Java from a LL(n) grammar, allowing look-ahead of multiple tokens when choosing productions. The grammar is written in a special format that allows Java code to be added to the grammar productions, much in the same way as when using yacc. It is also possible to define grammar rules in the form of pure Java code, although this is not a recommended practice.

The parser generated is divided into several classes that may be replaced or extended by the user. It is also reasonably fast, and generates understandable messages upon errors in the input. However, there is only a rudimentary support for error recovery during parsing. The parser will generally fail upon the first error, making it difficult to report several errors in a single iteration.

A more fundamental problem with JavaCC, though, is the lack of support for generating parse trees. When inserting the parse tree generation code into the grammar files, the latter becomes messy and cluttered with Java source code all over. This design choice is probably an inheritance from the earlier, more primitive parser generator tools for imperative languages and older systems.

There are a couple of tools that can add code for parse tree generation to the JavaCC grammar files automatically, but none of them have proved very convincing when used with the ASN.1 grammar. In general the parse trees produced are not strongly typed, difficult to extend, and not adapted to the application at hand. The extra step when generating the parser is also irritating, and may cause some problems with reserved keywords, etc.

SableCC

A more modern approach to parser generation has been taken by SableCC, a parser generator developed by Éntienne Gagnon at the McGill University (Gagnon, 1998). SableCC is available both as binary and source code, and is released under the GNU GPL license, thereby allowing others to modify and further develop the code.

SableCC generates a table-driven bottom-up parser for a LALR(1) grammar, just as the classical lex and yacc tools. It has many advantages compared to JavaCC, most notably being the automatic generation of the parse tree. The generated tree is also strongly typed, i.e. each grammar production is mapped onto a separate class. The child nodes can thus be accessed with specific methods in each class, having the name of the child node type. The parse tree generated is thus more robust to syntax changes, as node positions in productions may be changed freely.

The current version of SableCC has a number of disadvantages, though. The grammar can only contain lowercase production and token names, the token definitions are complex, and only strict LALR(1) grammars are accepted. Furthermore there is no support for error recovery, and the error messages generated during parsing are cryptic and filled with implementation details. Finally, as a result of the creation of named access methods for all child nodes in the parse tree, all alternatives in the grammar must be tagged with additional names to become uniquely identifiable. This includes both productions having the same name, and multiple identical references within a production. The usage of such tags in the grammar files can sometimes make them difficult to read.

2.4 The prototype MIB parser

When implementing the final parser for the prototype several small tests with both JavaCC and SableCC were performed, in order to decide which one to use. JavaCC was finally chosen as it would be easier to use and produce better error messages. Using SableCC would have required too much rewriting of the grammar, causing undesired complexity where none would have been needed, and it would also have rendered the parser less usable, as the error messages would have become more cryptic and difficult to understand.

The choice of JavaCC also came with the additional cost of implementing the parse tree generation, as it was important to attempt to split the syntactic and the semantic analysis into separate modules. Apart from this the grammar had to be rewritten and a full semantic analysis with error checking and information extraction had to be implemented. Rewriting the grammar One of the first problems encountered when implementing the parser was the lack of a good and freely available ASN.1 grammar. The first prototype was based on a grammar extracted from the lex and yacc sources of Snacc, a freely available ASN.1 compiler. Later the complete ASN.1:90 grammar (in Rinderknecht, 1995) was encountered and the necessary changes to correct some errors and add missing constructs could be made.

In order to create a reasonable simple ASN.1 parser, some parts of the grammar were rewritten—removing the possibility to define macros, but adding two widely used SNMP macros. The more compact EBNF syntax was used, instead of the standard BNF, in order to save some productions. The resulting grammar (the one implemented) can be found in appendix A.1. Syntax analysis and parse tree generation The rewritten grammar also had to be converted to the special format used by JavaCC, and Java code to create the parse tree had to be added. Basically semantic action code was added to each production in the grammar, creating parse tree nodes as each production is parsed. Only the nodes needed in the later stages of analysis were included in the generated parse tree, thereby effectively pruning the tree already upon creation.

The solution to create the parse tree “by hand” inside the syntactic analysis became necessary when using JavaCC, but proved to generate a syntax analyzer that was hard to read and modify. The parse tree, on the other hand, became well adjusted to the task at hand and could easily be modified when needed.

The parse tree framework was strongly influenced by the models found in SableCC (Gagnon, 1998, p. 52ff). Each node in the parse tree was represented by a separate object, and support for parse tree walker (or iterator) classes was also implemented. Having separate tree walker classes made the implementation of a multi-pass semantic analyzer fairly simple and straightforward, as all the details of tree walking had been isolated in separate classes.

One major difference in the SableCC model, however, is that the parse trees use strong typing, i.e. each production and token type is implemented in a separate class. The prototype parse trees, on the other hand, do not use strong typing, as generating that number of classes by hand would have been a too repetitive and error-prone work. Instead each node is assigned a unique number representing it’s type, and child nodes are accessed through general-purpose methods. Semantic analysis and data structures After the creation of the parse tree structure, the semantic analysis processes the tree in two passes. In the first pass all symbols (field names, types, etc) are identified and created, and in the second pass they are filled with actual information. This division was needed as symbols may be used before being declared.

The symbols created during the semantic analysis were modeled closely after the ASN.1 symbol structures, making the analyzer fairly simple and easy to implement. This design choice proved to have disadvantages later on, as it became difficult to separate different types of symbols from each other. Also some special ASN.1 deficits were inherited, as for example the table symbols having only indirect access to the column data types.

Some changes and additions were made to the original model as the prototype developed, but these were not quite enough to make symbol handling easy in the rest of the prototype. There is still need for a redesigned symbol structure, especially one where different types of symbols are split into a class hierarchy. The creation of these more specialized symbols should probably be placed as a third pass in the semantic analysis, converting all the old symbols to more specialized ones.

Notes

[Note 1] This is intended to reflect an intuitive idea on how we read texts, and is not to be understood as a proven fact. Psychological studies, on the contrary, show that the reading process is mostly based on direct image recognition of the words. This is, however, outside the scope of this thesis.

Automatic GUI Creation Generating Java source code from formal descriptions