The title of this post comes with my apologies to the Immortal Bard of Avon...
IBM's Application Programming Language Reference for COBOL defines the reserved word SPACE as "Represents one or more blanks or spaces; treated as a nonnumeric literal". A bit further on in the manual we discover that "A nonnumeric literal is a character-string enclosed in quotation marks ("), and can contain any allowable character from the character set of the computer." We are also told that a nonnumeric literal can be enclosed in apostrophes ('). And this is so, at least on the surface where programmers toil away creating their masterpieces. The weary programmer, faced with an inflexible deadline, might choose to code ' ' or " " instead of SPACE, thereby saving keystrokes.
But some of the time the programmer saves in the writing is given back during execution. Consider how the compiler (in this example, COBOL/370 1.1.1) treats the following statements:
01  SOME-VARIABLES.
    05  PIC-X-1   PIC X.
    05  PIC-X-30  PIC X(30).
MOVE ' ' TO PIC-X-1.
MOVE SPACE TO PIC-X-1.
MOVE ' ' TO PIC-X-30.
MOVE SPACE TO PIC-X-30.
While the statements within each pair are functionally equivalent, only the first pair generates identical object code. Here is the generated code as taken from the compile listing.
MOVE ' ' TO PIC-X-1.
9240 A000 MVI 0(10),X'40'
MOVE SPACE TO PIC-X-1.
9240 A000 MVI 0(10),X'40'
As the prosecutor in My Cousin Vinny would say... EYEdentical! The compiler recognizes PIC-X-1 as a one-byte field and generates an MVI (Move Immediate) instruction to populate the field with the EBCDIC code for a space.
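For readers who want to see where those hex digits come from, here is a small sketch of my own (not IBM tooling) that decodes the four bytes of an SI-format instruction such as the MVI above: one opcode byte, one immediate byte, then a 4-bit base register and a 12-bit displacement.

```python
# Decode a System/370 SI-format MVI instruction: opcode (1 byte),
# I2 immediate (1 byte), then B1 (4 bits) and D1 (12 bits).

def decode_mvi(hexword):
    """Decode a hex string such as '9240A000' into MVI D1(B1),X'I2'."""
    b = bytes.fromhex(hexword)
    assert b[0] == 0x92, "not an MVI opcode"
    i2 = b[1]                           # immediate byte to be stored
    base = b[2] >> 4                    # B1: base register number
    disp = ((b[2] & 0x0F) << 8) | b[3]  # D1: 12-bit displacement
    return f"MVI {disp}({base}),X'{i2:02X}'"

print(decode_mvi("9240A000"))  # MVI 0(10),X'40'
```

Byte X'40' is the EBCDIC space, register 10 is the base the compiler assigned, and displacement 0 locates PIC-X-1, which is exactly what the compile listing shows.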
When the variable field is more than one byte in length an MVI instruction cannot alone populate the entire field and the compiler generates different code. But now the constant is interpreted and treated differently:
MOVE ' ' TO PIC-X-30.
9240 A001 MVI 1(10),X'40'
D21C A002 C000 MVC 2(29,10),0(12)
MOVE SPACE TO PIC-X-30.
D21D A001 C000 MVC 1(30,10),0(12)
The first MOVE is straightforward: populate the first byte with an MVI instruction, then complete the process with an MVC (Move Character) instruction that moves a constant composed of 29 spaces into the remainder of the field. The second operand of the MVC illustrates one of the efficiencies IBM introduced beginning with its COBOL II compiler. The compiler generates what are referred to as SYSLITs, or System Literals, and uses these constants to initialize fields greater than one byte in size with a single instruction. The SYSLIT in the second MOVE is the same as in the first case, but the referenced length of 30 is one byte more than in the first MOVE. Very interesting! The programmer saved keystrokes by coding ' ' instead of SPACE, but that resulted in an increase in program execution time. The increase is admittedly small, and the programmer's time is more expensive, so they might be excused for writing this type of code. But why not issue a change-all command to change every occurrence of ' ' to SPACE once the program is written? Makes sense, does it not?
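The length difference between the two MVCs is easy to verify by decoding the six bytes of an SS-format instruction, again a sketch of my own: opcode, an L field holding the operand length minus one, then base/displacement pairs for each operand.

```python
# Decode a System/370 SS-format MVC instruction: opcode (1 byte),
# L = length - 1 (1 byte), B1/D1 (2 bytes), B2/D2 (2 bytes).

def decode_mvc(hexword):
    """Decode a hex string such as 'D21CA002C000' into MVC D1(L,B1),D2(B2)."""
    b = bytes.fromhex(hexword)
    assert b[0] == 0xD2, "not an MVC opcode"
    length = b[1] + 1                            # L field holds length - 1
    b1, d1 = b[2] >> 4, ((b[2] & 0x0F) << 8) | b[3]
    b2, d2 = b[4] >> 4, ((b[4] & 0x0F) << 8) | b[5]
    return f"MVC {d1}({length},{b1}),{d2}({b2})"

# The two MVCs from the listing: 29 bytes of the SYSLIT vs. the full 30.
print(decode_mvc("D21CA002C000"))  # MVC 2(29,10),0(12)
print(decode_mvc("D21DA001C000"))  # MVC 1(30,10),0(12)
```

The length bytes X'1C' (28, so 29 bytes moved) and X'1D' (29, so 30 bytes moved) are the only difference that matters: the SPACE version fills the whole field in one instruction, while the ' ' version needs the extra MVI first.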
What would make even more sense, in my opinion, is for the compiler to recognize that ' ' (or " ", depending on the APOST/QUOTE compiler option taken) is physically equivalent to SPACE and generate the same code for the second pair of MOVEs. IBM does not check for what amounts to a special case of nonnumeric literal and thus misses an opportunity to further optimize the resulting program.
Why should IBM be concerned about optimizing the program in this instance? The programmer coded ' ' instead of SPACE and that is what they got. They asked that one space be moved to the field and they relied on the compiler to blank pad the remainder of the field if it was more than one byte in length.
Fair enough, you say? I would say the same but for another observation I made while coding the example program for this blog posting. There is another way for the programmer to move spaces to a field, namely the INITIALIZE statement (also introduced with COBOL II). The INITIALIZE statement causes fields to be initialized with the default value for each field's PICTURE specification. For PIC X that means spaces. Note that I did not write SPACES; I didn't use the reserved word SPACES because IBM does not use it.
Here is the generated code for INITIALIZE SOME-VARIABLES:
9240 A000 MVI 0(10),X'40'
9240 A001 MVI 1(10),X'40'
D21C A002 C000 MVC 2(29,10),0(12)
Apparently, the IBM compiler is "expanding" the INITIALIZE statement into the following individual MOVE statements:
MOVE ' ' TO PIC-X-1.
MOVE ' ' TO PIC-X-30.
rather than the following statements which would be more efficient:
MOVE SPACE TO PIC-X-1.
MOVE SPACE TO PIC-X-30.
Treating ' ' or " " (and also X'40') as identical to SPACE/SPACES does not seem to me to be such a difficult task, but then I'm not privy to all the myriad decisions made during the decades-long history of COBOL compiler development.
COBOL and Software Archaeology
Thoughts about the arcane niche of software archaeology. Just what happens when source code is processed by IBM language compilers, assemblers, and pre-processors and how to reconstruct lost source code from the zeros and ones that comprise executable code.
Saturday, February 18, 2012
Friday, December 24, 2010
COBOL/370 compile with OPTIMIZE and TEST?
I'm just now recovering a program that was originally compiled using COBOL/370 V1.2.0.
The signature indicates that OPTIMIZE and TEST were both specified as options to the compiler. Analysis of the load module confirms that SYM records are present (confirming the TEST option), and analysis of the CSECT confirms that Procedure Division code has been relocated (confirming the OPTIMIZE option).
Currently I do not have COBOL/370 V1.2.0; I have only V1.1.1. Compiling the recovered source with this version produces this warning message:
IGYOS4022-W The "OPTIMIZE" option was discarded due to option conflict resolution. The "TEST" option from "PROCESS/CBL" statement took precedence.
So what exactly is going on here?
Did IBM slip up with V1.2.0 and allow both OPT and TEST?
Did I interpret the option bits in the program signature incorrectly? It would certainly be nice to have the V1.2.0 compiler and to confirm that it allowed OPT and TEST concurrently.
It would also be nice to understand why OPT and TEST are deemed to be in conflict since the program being recovered has apparently been running just fine for the past 12 years.
Sunday, August 15, 2010
Who I am...
I've been digging into machine code for about 30 years.
It started in 1979 after I graduated college and began work as a systems programmer for AT&T General Departments. I was learning assembly language coding during the day and coincidentally taking an advanced undergraduate course in cryptology during the evening.
At some point I posed a question to Tom Storms, my mentor at the time, along the lines of "If the assembler creates machine code from source code, then what program takes machine code and recreates the source code?" His reply was that there was no such program and, as mentors are sometimes wont to do, he added that it couldn't be done. Now, having just gone through a course describing the myriad ways of encoding and enciphering messages, and the equally myriad ways of decoding and deciphering those messages without benefit of code books or keys (aka cryptanalysis, or codebreaking), I almost immediately took up Tom's implicit challenge. The result was a batch disassembler that examined the machine code version of a program and churned out an assembler source code file.
Upon showing my work to Tom, he mentioned that IBM had a debug facility that featured a command to disassemble the code. He then proceeded to show me how it worked, and we discovered that, unlike my program, it could not distinguish base registers, and therefore could not generate labels for branch points or operands, and it could not skip over constants embedded in the machine code!
Eventually, I used my disassembler on a COBOL program to see if the machine code version of that program matched the source code version. I ran the machine code version through the disassembler while the applications programmer recompiled the source code version with an option that produced a listing showing what machine code the compiler had generated. The versions did not match, and this told the applications fellow what he needed to know. At that point it struck me: if I could take machine code and recreate assembler code, what was to prevent me from taking machine code and recreating COBOL code, provided the executable program had originally been created from COBOL source code?
Assembler source code (with the exception of macros) has a one-for-one correspondence between assembler instruction and machine instruction, and thus is very much like enciphering. COBOL (like all other higher-level languages) generates a one-to-many correspondence between COBOL statement and machine instructions, and is more along the lines of encoding. All I need do is analyze the patterns of machine code that COBOL statements generate and then match them against the machine code in order to recover the COBOL source code.
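The pattern-matching idea above can be illustrated with a toy sketch of my own (the post does not show my actual tooling): a "code book" entry is a byte pattern with wildcards standing in for the base/displacement bytes that vary from program to program, and recovery means testing recovered machine code against such patterns.

```python
# Toy sketch of matching recovered machine code against a "code book"
# pattern. None acts as a wildcard for bytes (base register and
# displacement) that differ from one compiled program to another.

# Hypothetical code-book entry: MVI d(b),X'40', i.e. opcode 0x92
# followed by the EBCDIC space, with addressing bytes wildcarded.
MVI_SPACE = [0x92, 0x40, None, None]

def matches(pattern, code):
    """True if the code bytes begin with the (wildcarded) pattern."""
    return len(code) >= len(pattern) and all(
        p is None or p == c for p, c in zip(pattern, code)
    )

# The MVI from the earlier listing matches; an MVC does not.
print(matches(MVI_SPACE, bytes.fromhex("9240A000")))      # True
print(matches(MVI_SPACE, bytes.fromhex("D21CA002C000")))  # False
```

A real recovery tool must do far more, of course: resolve base registers to data areas, group instruction sequences into candidate COBOL statements, and track compiler-version differences, but the matching step itself looks much like this.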
Thirty years on and I am still doing just that.
You can think of a COBOL compiler as a code book used to encode instructions for the computer. The programmer writes the instruction in language that a human can read and understand and then the compiler encodes those instructions into machine code which the computer reads and "understands." Simply stated, programmers work with the source code while computers work with the machine code. What happens if the machine code version of the program is lost? The programmer simply recompiles the source code. What happens if the source code is lost? The computer still runs the program but now the programmer has only an indecipherable version should changes be required.
Recovering the source code is the only way to put the programmer back in touch with and in control of the program.
Most interesting (and oftentimes frustrating) for me is that the code book is constantly changing. New versions, releases, and maintenance levels of the COBOL compiler continue to be released. New features and statements are added to the COBOL language. The practical implication for source recovery is that I must continually keep up with the changes to the code book.
It has been quite an adventure and I hope to share some of the discoveries I've made on the journey with all those sharing an interest in programming and cryptology.