COMPILER ORGANISATION
---------------------

.OVERVIEW OF THE COMPILATION PROCESS
-----------------------------------

..INPUT TO THE COMPILER
---------------------

The MIDL compiler (MIDL) takes as input interface definition statements
written in the Microsoft Interface Definition Language. The statements
are contained in a file with the ".IDL" extension (called the IDL file).
In addition, an Attribute Configuration file with the ".ACF" extension
(called the ACF file) is read as input after the IDL file has been
processed.

..THE COMPILER PASSES
-------------------

The compiler is organised into 4 distinct passes: m1, m2, m3 and m4
(meaning midl 1, midl 2 and so on). The responsibilities of the passes
are as follows:

    m1 : Parsing the IDL file.
    m2 : Parsing the ACF file. (optional)
    m3 : Optimisation. (optional)
    m4 : Code generation.

In addition, a driver program (m0) handles parsing of the command line
and, depending upon the user-specified options, invokes the passes. The
big picture for the compiler looks like this:

    (FIGURE)

..OUTPUT FROM THE COMPILER
------------------------

The output from the MIDL compiler is a set of ".c" and ".h" files
containing the stub routines and prototypes. These files are then linked
with the client and server modules of the application. APP.IDL will
produce:

    app_c.c and app_c.h for the client side
    app_s.c and app_s.h for the server side

.A LOOK AT THE COMPILER PASSES
-----------------------------

..THE COMPILER DRIVER (m0)
---------------------

TBD

..PASS1 (m1)
------------

Pass 1 (m1) of the compiler is a parsing pass, the input to which is a
stream of tokens produced by the lexical analyser (lexer). The two main
components of m1 are thus the parser and the lexer.

The lexer is encapsulated in a function called yylex(). The lexer is
called by the parser to supply tokens as and when needed. It is a
hand-coded lexer which operates off a state-transition table.
Starting from state 0, the lexer accepts a character and makes a
transition to a new state depending upon the character. This process
repeats until a whole token has been recognised. White space, comments
and newlines are ignored by the lexer. Tokens returned by the lexer can
be keywords, identifiers, numeric and string constants, and the
characters which form the syntax of the IDL.

Keyword recognition is the responsibility of the lexer. A keyword is
initially recognised as an identifier. A preallocated keyword table is
used to distinguish between keywords and identifiers: if an identifier
is found in the keyword table, the lexer returns it as a keyword.

The parser is automatically generated by the yacc parser generator. The
input to the parser generator is a source file containing grammar
productions for the IDL. The output is a ".c" file which contains the
parsing tables along with the driver routine for the parser. The parser
is encapsulated in a function called yyparse(). Recognition of a valid
syntactic construct consists of reductions of the grammar rules
specified in the grammar file (grammar.y). Interspersed with the grammar
rules are action routines, coded by the compiler writers, each of which
is executed when the production containing it is reduced.

Apart from parsing, m1 is responsible for building the type graph, the
type data base and the symbol table. The process of building the type
graph and the symbol table is described elsewhere (WHERE??). Suffice it
to say that as and when productions are recognised, the type graph and
symbol table are built. As types are defined, they are also entered in
the type data base. The type data base contains reference counts which
indicate the usage of each type and are useful in determining the kind
of marshalling that the type will undergo. Type sub-graphs corresponding
to procedures (loosely classified as signatures) are also entered into
the type data base.
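
The lexer scheme described above (a state-transition table driven from
state 0, with keywords first scanned as identifiers and then looked up
in a preallocated table) can be sketched as follows. This is only an
illustration under stated assumptions, not the MIDL lexer itself: the
function name next_token(), the token codes, the character classes and
the two keyword entries are all hypothetical, and comments and string
constants are left out to keep the table small.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token codes; single characters are returned as themselves. */
enum { TOK_EOF = 0, TOK_IDENTIFIER = 256, TOK_NUMBER, TOK_INTERFACE, TOK_STRUCT };

enum state  { S_START, S_IDENT, S_NUMBER, S_ACCEPT };  /* S_START is state 0 */
enum cclass { C_SPACE, C_LETTER, C_DIGIT, C_OTHER, C_COUNT };

/* transition[state][class]; S_ACCEPT ends the token and has no row. */
static const enum state transition[S_ACCEPT][C_COUNT] = {
    /*              space     letter    digit     other    */
    /* S_START  */ { S_START,  S_IDENT,  S_NUMBER, S_ACCEPT },
    /* S_IDENT  */ { S_ACCEPT, S_IDENT,  S_IDENT,  S_ACCEPT },
    /* S_NUMBER */ { S_ACCEPT, S_ACCEPT, S_NUMBER, S_ACCEPT },
};

/* Preallocated keyword table; the entries are illustrative only. */
static const struct { const char *name; int code; } keyword_table[] = {
    { "interface", TOK_INTERFACE },
    { "struct",    TOK_STRUCT    },
};

static enum cclass classify(char c)
{
    unsigned char u = (unsigned char)c;
    if (isspace(u))             return C_SPACE;
    if (isalpha(u) || c == '_') return C_LETTER;
    if (isdigit(u))             return C_DIGIT;
    return C_OTHER;
}

/* An identifier found in the keyword table is returned as a keyword. */
static int lookup_keyword(const char *text)
{
    size_t i;
    for (i = 0; i < sizeof keyword_table / sizeof keyword_table[0]; i++)
        if (strcmp(text, keyword_table[i].name) == 0)
            return keyword_table[i].code;
    return TOK_IDENTIFIER;
}

/* Scans one token from *cursor, copies its spelling to text, advances *cursor. */
int next_token(const char **cursor, char *text, size_t size)
{
    const char *p = *cursor;
    enum state state = S_START;
    size_t length = 0;

    assert(size > 1);
    for (;;) {
        enum state next = (*p == '\0') ? S_ACCEPT
                                       : transition[state][classify(*p)];
        if (next == S_ACCEPT)
            break;
        if (next != S_START && length + 1 < size)
            text[length++] = *p;          /* white space is skipped, not stored */
        state = next;
        p++;
    }
    if (state == S_START && *p != '\0')
        text[length++] = *p++;            /* single syntax character */
    text[length] = '\0';
    *cursor = p;

    if (state == S_IDENT)  return lookup_keyword(text);
    if (state == S_NUMBER) return TOK_NUMBER;
    return length ? (unsigned char)text[0] : TOK_EOF;
}
```

A caller would invoke next_token() repeatedly on the same cursor until
it returns TOK_EOF. In the real compiler the equivalent routine is
yylex(), which the yacc-generated parser calls with no arguments; the
explicit cursor and buffer parameters here are a simplification so that
the sketch needs no global state.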

..PASS2 (m2)
------------

Pass 2 is responsible for parsing the ACF file. This process follows the
IDL parsing. The driver (m0) checks for the presence of an ACF file and
invokes pass 2 (m2) if needed. The parser for pass 2 is also generated
by the parser generator, and the same lexical analyser is used. Since
the parser generator names the generated parser function yyparse(), the
m2 parser clashes by name with the m1 parser. The proposed scheme for
preventing the clash is to redefine yyparse as (say) yy2parse, and
similarly to rename any global variables belonging to the parser module,
in an include file which is included when the generated parser is
compiled.

m2 verifies that the ACF conforms to the syntax and semantics of NIDL.
m2 acts on the type graph generated by m1 and qualifies it with the
attributes collected from the ACF. No new data structures are introduced
by m2.

..PASS3 (m3)
------------

Pass 3 is the optimisation pass. Working off the type data base, m3
determines the kind of marshalling that each type or procedure will
undergo. Details of the optimisation are described in chapter (WHERE??)
of this document.

..PASS4 (m4)
------------

m4 generates the stub code and include files for the application.
Details of this pass are TBD.

.CODE ORGANISATION
------------------

The above description describes the logical layout of the passes. The
physical code layout generally parallels the logical layout, with some
other considerations in mind. The most important consideration is that
the compiler is expected to work under the DOS, OS/2 and NT
environments. For DOS, the size of the compiler code in memory is an
important factor in deciding the layout of the compiler. The more code
that resides in memory, the less memory is available for the compiler's
internal storage, and the smaller the IDL file that can be compiled.

Since all the passes are fairly independent of each other and share only
the data structures among themselves, it is proposed that we overlay the
various passes, with the driver (m0) residing in memory at all times and
fanning out control. Overlaying is explicit under DOS, with the modules
partitioned into various overlays at link time. Under NT and OS/2 the
same effect can be achieved by placing the various passes in segments
which can be loaded on demand. Again, m0 is always loaded.

Although overlaying is not of immediate concern, it serves well to be
prepared for a quick change to it if needed. Keeping this in mind, it is
proposed to organise the compiler passes such that each pass is (almost)
self-sufficient. Each compiler pass will be organised into modules, with
one global entry point into each module. In other words, calls from one
module to another WILL NOT (hopefully) happen. The advantage is that in
the overlaid scheme, overlay (or segment) swapping will not occur. This
does not apply to globally accessed routines such as the type graph
driver and symbol table driver routines. Such routines will be part of
the m0 module, which stays in memory all the time.

Thus all but one of the routines defined in every module will be static.
Following this discipline will mean that a later change to the overlaid
scheme will be quick and painless. The other, though relatively less
significant, advantage is that static routines within a module can be
"near" routines, thus saving code space.
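
The module discipline proposed above (one global entry point per module,
every other routine static) might look like the following in C. This is
a minimal sketch and not code from the compiler: the entry point name
m3_run(), the pass_result structure and the integer type ids are
hypothetical stand-ins chosen only to show the shape of such a module.
The DOS-specific "near" qualifier is omitted so that the sketch remains
standard C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary handed back to the driver (m0). */
struct pass_result {
    int types_seen;
    int errors;
};

/* Internal helper: static, so no other module can call it. */
static int is_valid_type_id(int type_id)
{
    return type_id >= 0;
}

/* Internal helper: static, so all calls to it stay inside this module. */
static void visit_type(int type_id, struct pass_result *result)
{
    if (is_valid_type_id(type_id))
        result->types_seen++;
    else
        result->errors++;
}

/* The one global entry point of the module, called by the driver.
 * Returns the number of errors found. */
int m3_run(const int *type_ids, size_t count, struct pass_result *result)
{
    size_t i;

    assert(result != NULL);
    assert(type_ids != NULL || count == 0);
    result->types_seen = 0;
    result->errors = 0;
    for (i = 0; i < count; i++)
        visit_type(type_ids[i], result);
    return result->errors;
}
```

Because only m3_run() has external linkage, the linker sees a single
symbol for the module, and moving the module into its own overlay or
segment later does not require finding and rerouting calls into its
helpers.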