Sunday, August 15, 2010

Who I am...

I've been digging into machine code for about 30 years.

It started in 1979, after I graduated from college and began work as a systems programmer for AT&T General Departments. I was learning assembly language coding during the day and, coincidentally, taking an advanced undergraduate course in cryptology in the evening.

At some point I posed a question to Tom Storms, my mentor at the time, along the lines of "If the assembler creates machine code from source code, then what program takes machine code and recreates the source code?" His reply was that there was no such program and, as mentors are sometimes wont to do, he added that it couldn't be done. Now, having just gone through a course describing the myriad ways of encoding and enciphering messages, and the equally myriad ways of decoding and deciphering those messages without benefit of code books or keys (a.k.a. cryptanalysis or codebreaking), I almost immediately took up Tom's implicit challenge. The result was a batch disassembler that examined the machine code version of a program and churned out an assembler source code file.

When I showed my work to Tom, he mentioned that IBM had a debug facility that featured a command to disassemble code. He then proceeded to show me how it worked, and we discovered that, unlike my program, it could not distinguish base registers and therefore could not generate labels for branch points or operands, and it could not skip over constants embedded in the machine code!
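To give a flavor of what "generating labels for branch points" involves, here is a minimal sketch. It is not my disassembler and not IBM's: the two-byte instruction format, the opcode values and the function name are all invented for illustration. The idea is simply to make one pass to find every address a branch refers to, and a second pass to print the source with a label at each of those addresses.

```python
# Toy instruction set, invented for this sketch: every instruction is two
# bytes, an opcode followed by a one-byte operand (an address).
OPCODES = {0x01: "LOAD", 0x02: "STORE", 0x03: "BRANCH"}
BRANCH = 0x03


def disassemble(code):
    """Return assembler-style source lines for a toy machine code program."""
    # Pass 1: collect every address that a BRANCH instruction targets.
    targets = sorted(
        {code[i + 1] for i in range(0, len(code) - 1, 2) if code[i] == BRANCH}
    )
    labels = {addr: f"L{n}" for n, addr in enumerate(targets, start=1)}

    # Pass 2: emit one source line per instruction, with a label in front of
    # any instruction that is a branch target.
    lines = []
    for i in range(0, len(code) - 1, 2):
        opcode, operand = code[i], code[i + 1]
        name = OPCODES.get(opcode, "DATA")  # unknown opcodes shown as data
        arg = labels[operand] if opcode == BRANCH else f"{operand:#04x}"
        lines.append(f"{labels.get(i, ''):<4}{name} {arg}")
    return lines


program = bytes([0x01, 0x10, 0x02, 0x11, 0x03, 0x00])
for line in disassemble(program):
    print(line)
```

A real disassembler has much more to contend with, such as variable-length instructions, base registers and constants mixed in with the code, none of which this toy attempts.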

Eventually, I used my disassembler on a COBOL program to see if the machine code version of that program matched the source code version. I ran the machine code version through the disassembler while the applications programmer recompiled the source code version with an option that produced a listing showing what machine code the compiler had generated. The versions did not match, and this told the applications fellow what he needed to know. At that point it struck me: if I could take machine code and recreate assembler code, what was to prevent me from taking the machine code and recreating COBOL code, provided the executable program had originally been created from COBOL source code?

Assembler source code (with the exception of macros) generates a one-to-one correspondence between assembler instruction and machine instruction and thus is very much like enciphering. COBOL (like all other higher-level languages) generates a one-to-many correspondence between COBOL statement and machine instructions and is more along the lines of encoding. All I needed to do was analyze the patterns of machine code that COBOL statements generate and then match those patterns against the machine code of a program in order to recover its COBOL source code.
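The matching idea can be sketched in a few lines. This is only an illustration under heavy assumptions: the "code book" below is invented, the instruction stream is already reduced to mnemonics, and real compilers generate far more varied code than a fixed table can capture. Each entry maps a sequence of machine instructions to the kind of COBOL statement assumed to have produced it, and the longest matching pattern wins.

```python
# Hypothetical code book: instruction sequences mapped to the COBOL statement
# type assumed to generate them. These patterns are invented for illustration
# and are not taken from any real compiler.
CODE_BOOK = {
    ("PACK", "PACK", "AP", "UNPK"): "ADD",
    ("CLC", "BC"): "IF",
    ("MVC",): "MOVE",
}


def recover_statements(mnemonics):
    """Greedy longest-match of code book patterns against an instruction stream."""
    patterns = sorted(CODE_BOOK, key=len, reverse=True)  # try longest first
    statements = []
    i = 0
    while i < len(mnemonics):
        for pattern in patterns:
            if tuple(mnemonics[i:i + len(pattern)]) == pattern:
                statements.append(CODE_BOOK[pattern])
                i += len(pattern)
                break
        else:
            # No pattern matched: flag the instruction for manual analysis.
            statements.append("?" + mnemonics[i])
            i += 1
    return statements


stream = ["MVC", "PACK", "PACK", "AP", "UNPK", "CLC", "BC", "BALR"]
print(recover_statements(stream))  # ['MOVE', 'ADD', 'IF', '?BALR']
```

The one-to-many nature of encoding shows up in the table itself: a single statement such as ADD stands for several machine instructions, which is why the match has to consider sequences rather than single instructions.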

Thirty years on and I am still doing just that.

You can think of a COBOL compiler as a code book used to encode instructions for the computer. The programmer writes the instructions in a language that a human can read and understand, and then the compiler encodes those instructions into machine code, which the computer reads and "understands." Simply stated, programmers work with the source code while computers work with the machine code. What happens if the machine code version of the program is lost? The programmer simply recompiles the source code. What happens if the source code is lost? The computer still runs the program, but now the programmer has only an indecipherable version should changes be required.

Recovering the source code is the only way to put the programmer back in touch with and in control of the program.

Most interesting (and oftentimes frustrating) for me is that the code book is constantly changing. New versions, releases and maintenance levels of the COBOL compiler continue to be released. New features and statements are added to the COBOL language. The practical implication for source recovery is that I must continually keep up with the changes to the code book.

It has been quite an adventure and I hope to share some of the discoveries I've made on the journey with all those sharing an interest in programming and cryptology.