Given the following assembler code, we never have seen before:
push 34h call check_stacksize push ebx push ecx push esi push edi push ebp mov ebp, esp sub esp, 1Ch mov esi, eax mov [ebp+dt], edx cmp [eax+person.index], 0 jl short 1BBA5 cmp [eax+person.name], 0 jz short 1BB6B
What does it do and how do we find out? The code snippet is quite simple, so we could simply trace it ourselves, following the flow of control through the jumps and calls and deduct from that, what its purpose is and how it works in detail. As you can see, we have some semantic information („person.index“, „person.name“) which is crucial to understand what the code does.
But what do we do if your program gets more and more complicated, something like in Figure 1? It will be much much more complicated to deduct the purpose and function of the code there. Why?
Let’s look at how we normally approach the question „what does this code do?“ (independent of whether it is written in Assembler, Basic or any other language).
The first aspect is the purpose of a piece of code. Because we can assume that it was written by a human being, we can also assume that the structure of the program – even if the compiler did heavy optimization – still reflects the author’s approach to the problem. There will propably be a modularization, a separation of code by task and timing. Even in assembler code, statements following each other will – most of the time – have something to do with each other, work on the same task. Therefore we can identify chunks of code which can be understood as actions and tasks (open a file, read a specific object from file, etc.). And, most important, there will be a sense of purpose. Code doesn’t just sit there for the sake of being, it has a purpose and this purpose is connected to the purpose of the whole program. If this code looks into the properties of a „person object“, it will propably do something relevant with that data. By examining which properties the code reads and writes, what other actions are performed in its vicinity and which kinds of objects it combines, we can narrow down the set of possible purposes of that code. To sum this up: Even for a piece of assembler code it is possible to split it into chunks and understand the whole piece in a more abstract sense, assign a purpose to it.
The second aspect of the question „what does it do?“ is connected to the specific behaviour of the code for a given state of the system. For example, if we have a initialized person and we run our code on it, how will the state of the person change exactly? This second aspect of the question often is much more complicated to answer than the first, because it normally requires a line-by-line tracing of the program flow. Why? Because in assembler code, the state of the program is largely represented by the state of the CPU. The CPU does own only a limited number of state variables. It is like reading a C listing in which the number of variables has been limited to six variables. There will be a lot of value-switching and temporarily parking. And all this state we have to keep in mind during tracing. The longer and complicated our code gets, the more state we have to keep in mind. A possible solution might be to break down the problem into smaller chunks by identifying partitions of the code which work indepentently of each other. But the very task of ruling out the dependency between to pieces of code often required a deep understanding of the state of the CPU and the memory at as specific point of execution.
It seems as if we cannot avoid tracing code. Debuggers were invented for that case. They introduce the great advantage that they keep track of the state (CPU and memory) for us. We can concentrate on the semantically relevant points in the code and forget the rest.
But what if we don’t have a full, executable program at our hand? If we had, we would also have to worry about whether our piece of code is executed at all. We would have to provide a lot of state (hardware state, memory, operating system, execution environment) just to execute our little snippet! What if we just want to understand understand this little, isolated piece of code? A classical debugger is always restricted to follow a single path of execution, given its state. But we might want to know what happens if we take the other road at a specific junction. Does the result of the execution even depend on a specific state variable?
Those questions cannot be answered by classical debugging techniques. What we require is
- Minimal state. We want to provide only those state variables which are important for the behaviour of our code. If the state of the graphics card is never queried, we don’t want to provide it.
- Alternative paths. We want to be able to overview all the possible alternative routes of execution through our code and their dependency on the „input variables“.
What we want might be symbolic execution. A technique developed to (automatically) prove the correctness of programs. And this is the idea, I currently work on. There are frameworks to do symbolic execution of binary programs, already (https://github.com/feliam/pysymemu). But I don’t want to do it with a complete program bug simply an isolated piece of assembler code.