5. The REXX/imc internal data structures

Contents

Introductory text

REXX/imc has several data structures including the program, the label
table, the symbol table, and the program stack.  Each is kept as a block of
"malloc"ed memory which is grown dynamically as necessary (the "growing" is
usually performed by the mtest and dtest macros, mentioned above).

(a) The Source

    The source code is stored in memory as it appears in the REXX program
    on disk, except that newline characters are translated into zero bytes.
    The address of each line (numbered from 1 to the number of lines in the
    program) is stored in an array (which is another block of "malloc"ed
    memory), and accessed using the construction "source[num]".  The
    zeroth element of source stores the file name of the source, which is
    stored in a separately allocated block of memory.  The block allocated
    for the source itself is guaranteed to start at source[1].

(b) The Program

    The "program", or preprocessed source, is stored as a list of
    instructions, and prog[num] is a structure containing details about the
    "num"th instruction.  There are "stmts" instructions in this list.  Note
    that "THEN", "ELSE" and "OTHERWISE" are counted as individual
    instructions, and that null clauses do not appear in the list of
    instructions.  prog[0] contains some pointers to the starts of things
    (namely the block allocated for the tokenised program itself, the block
    allocated for the source, and the beginning of the first comment in the
    source).  prog[stmts] is a null instruction which points to the end of
    the source.
    
    Each instruction is obtained from the source by removing all comments
    and labels; stripping all unnecessary blanks and collapsing multiple
    spaces into one; concatenating lines together whenever the continuation
    character "," appears; tokenising keywords into special characters, and
    translating "^" (the variant NOT operator) into "\" (the real NOT
    operator).  The source is also checked for mismatched quotes, invalid
    labels and invalid characters during preprocessing.

    Whereas "source" always contains the source of the program being
    interpreted, "prog" can sometimes contain the list of instructions
    obtained from an "interpret" instruction.  In this case, the source
    pointer in prog[0] is used instead of source[1] when freeing the source.
    Also, certain instructions (such as "call") are required to save the
    current program temporarily and go back to the original program, by
    searching for it on the program stack.

    The current instruction number within the program is stored in the global
    variable ppc.  The current character position within the program line is
    stored in a local variable, often called lineptr.

    Each prog[num] consists of the following structure, called "program":

       int num;         The line number in the source where the instruction
                        originates - used when tracing source instructions.
       char *source;    The start of this instruction in the source
       char *sourcend;  The end of this instruction in the source
       int related;     Reserved for future use
       char *line;      The address of the tokenised instruction

    The source between "source" and "sourcend" is the current instruction
    including any comments on the same line but not including labels.
    Comments on separate lines and labels will appear after "sourcend" or
    before "source".

(c) The Labels

    Labels within a program are separated off and stored in a table while
    the program is being preprocessed.  The table is stored in a block of
    memory at address labptr and of length lablen.  While the table is being
    built up the length of data so far is elapbtr.  The table consists of
    the concatenation of elements in the following format:

       int    total length of this element
       int    instruction number of label
       char[] name of label, terminated with a zero character and padded [*]

    The last element in the table is followed by the integer zero.

    [*] In all the data structures used by REXX/imc, variable-length string
    entries are padded (with random bytes) to a multiple of 4 bytes, in
    order that later data is aligned to the Sun SPARCstation's requirements.
    The alignment is done by macros in const.h and may be changed.

(d) The Calculator Stack

    The calculator stack is simply a list of values in temporary storage
    during the evaluation of an expression.  Expressions are in effect
    translated to reverse polish notation (e.g. 1+2 becomes 1 2 +), with
    each operation stacking and removing values as appropriate.  The stack
    is stored in a block of memory addressed by cstackptr and of cstacklen
    bytes.  The total size of the data on the stack is ecstackptr.  Each
    element on the stack consists of a string (padded as necessary [*])
    followed by its integer length.

(e) The Workspace

    The workspace is a block of memory which is used by various parts of the
    interpreter to store transient data.  It is addressed by workptr and is
    of length worklen.  Some routines (especially num()) maintain the length
    of the currently stored data in eworkptr.

(f) The Symbol Table

    The symbol table is a multiple-level associative array containing the
    definitions of all the REXX symbols which have been assigned values.  It
    is stored in a block of memory addressed by vartab and of length varlen.

    Each level of the table is a separate entity, referring to the active
    variables of a program fragment which used the PROCEDURE instruction to
    hide the earlier levels, except that each level may contain pointers to
    earlier levels as a result of the PROCEDURE EXPOSE instruction.

    The start of each level is stored in the array varstk[] as an integer
    offset from the start of the entire table.  The number of levels is
    varstkptr, and the number of elements in the array is varstklen.  The
    total length of all current levels of the symbol table is
    varstk[varstkptr+1].

    Each level of the symbol table contains all currently active simple
    symbols and stems as a binary search tree.  The elements of the tree are
    in the following format (as described by the varent data type):

       int    total length of this element
       int    position of the left child (as offset from start of level)
       int    position of the right child (as offset from start of level)
       int    length of the symbol's name
       int    amount of memory allocated to hold the symbol's value
       int    actual length of the symbol's value
       char[] the symbol's name, padded [*]
       char[] space for the symbol's value

    If the symbol is a stem, the terminating dot of the name is not stored,
    but the most significant bit of the first character of the name is set.

    If the length of the value is negative, then the symbol is considered
    to be undefined; it has been deleted by DROP or some other process.

    If the "amount of memory" element of the structure is negative, then
    this symbol is actually located in an earlier level; the negative value
    gives the level number, starting at -1 which means the earliest level
    (i.e. the first in the memory block).

    Each stem entry in the symbol table has a value containing a default
    value and a mini-symbol table.  The default value is stored first, in
    the format:

       int    the amount of space allocated to the default value
       int    the actual length of the default value
       char[] space for the default value

    Immediately following that are entries for all the tails, in exactly the
    same format as for the main symbol table.

(g) The Program Stack

    The program stack holds information relating to currently active control
    structures, e.g. function calls and DO-END blocks.  It is stored in a
    block of memory addressed by pstackptr and of length pstacklen.  The
    length of all data currently on the program stack is epstackptr.  The
    stack holds a sequence of various entries in the following formats, and
    the number of entries in the current function is held in pstacklev:

    for a simple DO-END block (type 0), a SELECT block (type 2),
    a DO WHILE or DO FOREVER block or (type 8), a "struct minstack" entry:

       int stmt    the statement number where the block started
       char *pos   the character where the block started (used for finding the
                   WHILE or UNTIL part of a repetitive DO instruction)
       int len     the length of this structure (16)
       int type    the type value (as mentioned above)

    for a repetitive DO with a control variable, an extended "struct
    minstack" entry:

       int stmt    the statement number where the block started
       char *pos   the character where the block started
       char[]      the limit value, padded [*] (empty string for "no limit")
       int         the length of the limit value
       char[]      the step value, padded [*]
       int         the length of the step value
       char[]      the name of the control variable, padded [*]
       int         the length of the control variable name
       int         the value which appeared after FOR, or -1 for "no value"
       int len     the total length of this structure
       int type    the number 10.

    for a repetitive block introduced by "DO count", a "struct forstack" entry:

       int stmt    the statement number where the block started
       char *pos   the character where the block started
       int fornum  the value specified after "DO"
       int len     the length of this structure (20)
       int type    the number 15.

    for an internal function call, a "struct procstack2" entry:

       int stmt    the statement number where the function call ocurred
       char *csp   the caller's cstackptr
       int ecsp    the caller's ecstaclptr
       int csl     the caller's cstacklen
       char trc    the caller's trace flag
       char tim    the caller's timestamp flag
       char form   the caller's NUMERIC FORM (0 for SCIENTIFIC)
                   the compiler will probably insert one byte here
       int digits  the caller's NUMERIC DIGITS
       int fuzz    the caller's "NUMERIC DIGITS minus NUMERIC FUZZ"
       long mic    the caller's timestamp microseconds value
       long sec    the caller's timestamp seconds value
       int address1 the caller's environment number (for ADDRESS)
       int address2 the caller's alternate environment number
       program *prg the program which was being interpreted at the time
       int stmts   the number of statements in that program
       int len     the length of this structure (60)
       int type    the number 11, becoming 12 when the PROCEDURE instruction
                   has been executed
    Note: the program is stored in case it was generated by "interpret"; if
    that is so then the "real" program will be reinstated before the call.

    for an INTERPRET instruction, a "struct interpstack" entry:

       int stmt     the statement number where the INTERPRET occurred
       program *prg the program which was being interpreted at the time
       int stmts    the number of instructions in that program
       int len      the length of this structure (20)
       int type     the number 14.

    for a command typed during interactive trace mode, a "struct
    interactstack" entry (note that a "struct interpstack" entry also
    appears above this):

       int stmt    the statement number where the interruption occurred
       char *csp   the interrupted program's cstackptr
       int ecs     the interrupted program's ecstackptr
       int csl     the interrupted program's cstacklen
       int len     the length of this structure (16)
       int type    the number 16.

    Occasionally, the interpreter stacks a program line with the sole intent
    of having it appear in the traceback.  Such an entry would be in the
    format of a "struct errorstack":

       int stmt     the statement number where the error occurred
       program *prg the program where the error occurred
       int stmts    the number of statements in this program
       int len      the length of this structure (20)
       int type     the number 20

    Note that external calls are no longer stored on the program stack.  For
    most purposes, they can be seen to be executed by a separate incarnation
    of the interpreter.

(h) The "Hash" Tables

    There are three "hash" tables, each arranged in a similar way to one
    level of the symbol table; however the value of each symbol in the hash
    table is a single void* pointer rather than a string of characters.
    There are three hash tables, each occupying a separate block of memory
    addressed by hashptr[i] and of length hashlen[i] (for i=0,1,2).  They
    are used to store environment variables, details of open files, and
    details about loaded functions respectively.  Each element of a hash
    table is arranged as follows (and as indicated in the "hashent" data
    type):

       int    length of this element
       int    position of the left child
       int    position of the right child
       void*  value of the element
       char[] name of the element, terminated by a zero byte and padded [*]

    The value of each element may be one of the following:

    In hash table 0:  a pointer to the string "NAME=VALUE" which has been
    used in a putenv() call.

    In hash table 1:  either a null pointer, or a pointer to a block of
    memory which contains a "struct fileinfo" followed by a zero-terminated
    filename (which may be empty if the file name is unknown).  The "struct
    fileinfo" is described in const.h.

    In hash table 2:  a pointer to a "funcinfo" structure containing the
    address of a loaded function.  One of the functions loaded from each
    function package also contains the handle which was returned from
    dlopen().

(i) The Signal Stack

    REXX stores the context of each invocation of the interpreter()
    function, so that it can be restored whenever an EXIT or a caught signal
    is encountered.  The context is stored in the array sigstack[], which at
    any time has sigstacklen elements allocated to it.  The number of the
    highest used element of sigstack[] is interplev.  Each entry of the
    signal stack contains:

       short bits    The combined bits for all SIGNAL ON traps which are on
       short bitson  ...all SIGNAL ON traps which were executed at this level
       short callon  ...all CALL ON traps which are on
       short delay   ...all conditions which are delayed
       char type     1 if a "signal" trap just occurred, 2 if "call", 0 if none
       char which    The number for the condition which just occurred (if any)
       char *data    A description for for the condition which just occurred
       int ppc[6]    The statement numbers to jump to for all the conditions
                     or minus the statement number to flag with "label not
		     found" if a condition occurs
       jmp_buf jmp   The context of this level of the interpreter.

    Note: a negative number in ppc[i] means that the trap instruction for
    this condition named a non-existent label.  -ppc[i] will be the
    statement number of this trap instruction.  This means that we can
    delay the "label not found" error until it actually happens: moreover,
    we can include the condition trap instruction in the traceback.
    However, an interpreted condition trap instruction is not guaranteed to
    stay around, and therefore it is reported as an error immediately if it
    names a non-existent label.

(j) The Shell's Hash Table

    The builtin shell keeps a record of where it found each command in a
    hash table (yes, a real one ;-) ).  The variable "hashtable" points to
    an array of bucket pointers, and the variable "hashbuckets" holds the
    number of pointers in the table.  A bucket pointer is null if there are
    no entries in the bucket; otherwise it points to a hashitem structure,
    representing the first item in the bucket.  That item may in turn point
    to another item in the bucket.  The items in each bucket are kept in
    alphabetical order, to reduce the average time taken for an unsuccessful
    search (or an insertion).  Each item contains the following information:

       struct _hashitem *next  The next item in the bucket, or null
       int hits                Number of times this has been found
       int expense             Position within $PATH
       int dot                 Whether dot occurred in the path before this
       int data                Offset from end of header to data
       char[]                  The key; a string of characters ending with 0.
       char[]                  The data; another nul-terminated string.