Saturday, February 18, 2012

A SPACE by any other name does not compile the same!

The title of this post is with my apologies to the Immortal Bard of Avon...

IBM's Application Programming Language Reference for COBOL defines the reserved word SPACE as "Represents one or more blanks or spaces; treated as a nonnumeric literal". A bit further on in the manual we discover that "A nonnumeric literal is a character-string enclosed in quotation marks ("), and can contain any allowable character from the character set of the computer." We are also told that a nonnumeric literal can be enclosed in apostrophes ('). And this is so, at least on the surface where programmers toil away creating their masterpieces. The weary programmer, faced with an inflexible deadline, might choose to code ' ' or " " instead of SPACE, thereby saving keystrokes.

But some of the time the programmer saves in the writing is given back during the execution. Consider how the compiler (in this example COBOL 370 1.1.1) treats the following statements:

01  SOME-VARIABLES.
    05  PIC-X-1     PIC  X.
    05  PIC-X-30    PIC  X(30).

    MOVE ' '   TO PIC-X-1.
    MOVE SPACE TO PIC-X-1.

    MOVE ' '   TO PIC-X-30.
    MOVE SPACE TO PIC-X-30.
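
For anyone who wants to reproduce the comparison, the fragments above can be wrapped in a complete program along the following lines. This is only a minimal sketch: the program name SPACETST is hypothetical, the apostrophe delimiters assume the APOST option, and I am assuming the compiler's LIST option is in effect so that the listing includes the generated assembler code. Offsets and registers in your own listing may well differ from those shown in this post.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. SPACETST.
      * SPACETST IS A HYPOTHETICAL NAME USED ONLY FOR THIS SKETCH.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  SOME-VARIABLES.
           05  PIC-X-1     PIC  X.
           05  PIC-X-30    PIC  X(30).
       PROCEDURE DIVISION.
      * FIRST PAIR: ONE-BYTE RECEIVING FIELD.
           MOVE ' '   TO PIC-X-1.
           MOVE SPACE TO PIC-X-1.
      * SECOND PAIR: THIRTY-BYTE RECEIVING FIELD.
           MOVE ' '   TO PIC-X-30.
           MOVE SPACE TO PIC-X-30.
           GOBACK.
```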

While the two statements in each pair of MOVE statements are functionally equivalent, only the first pair generates identical object code. Here is the generated code as taken from the compile listing.

    MOVE ' '   TO PIC-X-1.
9240 A000               MVI  0(10),X'40'             
    MOVE SPACE TO PIC-X-1.
9240 A000               MVI  0(10),X'40' 

As the prosecutor in My Cousin Vinny would say... EYEdentical! The compiler recognizes PIC-X-1 as a one-byte field and generates an MVI (Move Immediate) instruction to populate the field with X'40', the EBCDIC code for a space.

When the field is more than one byte in length, an MVI instruction alone cannot populate the entire field, and the compiler generates different code. This time, however, the two forms of the constant are treated differently:

    MOVE ' '   TO PIC-X-30.
9240 A001               MVI  1(10),X'40'
D21C A002 C000          MVC  2(29,10),0(12)

    MOVE SPACE TO PIC-X-30.
D21D A001 C000          MVC  1(30,10),0(12)

The first MOVE is straightforward: the compiler populates the first byte with an MVI instruction and then completes the process with an MVC (Move Character) instruction that moves a constant composed of 29 spaces to the remainder of the field. The second operand of the MVC instruction illustrates one of the efficiencies IBM introduced beginning with its COBOL II compiler. The compiler generates what are referred to as SYSLITs, or System Literals, and uses these constants to initialize fields greater than one byte in size with a single instruction. The SYSLIT in the second MOVE is the same as that in the first, but the referenced length of 30 is one byte more than in the first MOVE (the MVC length byte holds the length minus one, hence X'1D' rather than X'1C' in the object code).

Very interesting! The programmer saved keystrokes by coding ' ' instead of SPACE, but that cost an extra instruction every time the statement executes. The increase in execution time is admittedly small, and the programmer's time is more expensive, so they might be excused for writing this type of code. But why not issue a change-all command to change every occurrence of ' ' to SPACE once the program is written? Makes sense, does it not?

What would make even more sense, in my opinion, is for the compiler to recognize that ' ' (or " ", depending on the APOST/QUOTE compiler option in effect) is physically equivalent to SPACE and generate the same code for the second pair of MOVE statements. IBM does not check for what amounts to a special case of a nonnumeric literal and thus misses an opportunity to further optimize the resulting program.

Why should IBM be concerned about optimizing the program in this instance? The programmer coded ' ' instead of SPACE and that is what they got. They asked that one space be moved to the field, and they relied on the compiler to blank-pad the remainder of the field if it was more than one byte in length.

Fair enough, you say? I would say the same but for another observation I made when coding the example program for this blog posting. There is another way for the programmer to move spaces to a field, namely the INITIALIZE statement (also introduced with COBOL II). The INITIALIZE statement sets each field to the default value for its PICTURE specification. For PIC X that means spaces. Note that I did not write SPACES. I didn't use the reserved word SPACES because, judging by the generated code, IBM does not use it.

Here is the generated code for INITIALIZE SOME-VARIABLES:

9240 A000               MVI  0(10),X'40'
9240 A001               MVI  1(10),X'40'
D21C A002 C000          MVC  2(29,10),0(12)

Apparently, the IBM compiler is "expanding" the INITIALIZE statement into the following individual MOVE statements:

    MOVE ' '   TO PIC-X-1.
    MOVE ' '   TO PIC-X-30.

rather than the following statements which would be more efficient:

    MOVE SPACE TO PIC-X-1.
    MOVE SPACE TO PIC-X-30.
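
If the extra instruction matters, one option that follows from the comparison above is to code the MOVE SPACE statements by hand in place of INITIALIZE, at least for a group made up entirely of PIC X fields. The sketch below checks only that the two approaches leave the same contents in the group; the program name INITCHK and the SAVED-COPY field are hypothetical, and the generated code would still need to be confirmed in your own listing.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. INITCHK.
      * INITCHK AND SAVED-COPY ARE HYPOTHETICAL NAMES FOR THIS SKETCH.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  SOME-VARIABLES.
           05  PIC-X-1     PIC  X.
           05  PIC-X-30    PIC  X(30).
       01  SAVED-COPY      PIC  X(31).
       PROCEDURE DIVISION.
      * FILL THE GROUP WITH NON-SPACES, THEN LET INITIALIZE CLEAR IT.
           MOVE ALL 'Z' TO SOME-VARIABLES.
           INITIALIZE SOME-VARIABLES.
           MOVE SOME-VARIABLES TO SAVED-COPY.
      * FILL IT AGAIN, THEN CLEAR IT WITH HAND-CODED MOVE SPACE.
           MOVE ALL 'Z' TO SOME-VARIABLES.
           MOVE SPACE TO PIC-X-1.
           MOVE SPACE TO PIC-X-30.
      * COMPARE THE TWO RESULTS AS 31-BYTE ALPHANUMERIC ITEMS.
           IF SOME-VARIABLES = SAVED-COPY
               DISPLAY 'SAME CONTENTS'
           ELSE
               DISPLAY 'DIFFERENT CONTENTS'
           END-IF.
           GOBACK.
```

Both paths leave 31 spaces in the group, so the program should display SAME CONTENTS. The difference discussed in this post lies only in the instructions the compiler generates to get there.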

Treating ' ' or " " (and also X'40') as identical to SPACE/SPACES does not seem to me to be such a difficult task, but then I'm not privy to all the myriad decisions made during the decades-long history of COBOL compiler development.