Computer Systems Organization (Spring 2018)

Lab-3: Binary Mystery

Due: 3/25

In this lab, we give you five object files, ex1_sol.o, ex2_sol.o, ..., ex5_sol.o, and withhold their corresponding C sources. Each object file implements one mystery function (e.g. ex1_sol.o defines function ex1). Your task is to deduce what each mystery function does from its x86-64 assembly code, and to write a C function that accomplishes the same thing for each of the five.

Obtaining the lab

As usual, do a git pull upstream master in your lab directory. This lab's files are located in the binarylab/ subdirectory. The files you will be modifying are ex{1-5}.c.

Uncover the mystery of assembly

The object files whose assembly code you seek to understand are in the binarylab/objs/ subdirectory. Put your C code corresponding to objs/ex1_sol.o in file ex1.c. Put your C code corresponding to objs/ex2_sol.o in file ex2.c, and so on.

Suppose you set out to figure out what function ex1 (implemented in objs/ex1_sol.o) does. There are two approaches to do this. You should use them both to help uncover the mystery.

Approach 1:
Disassemble the object files. Read the assembly and try to understand what the function tries to achieve. To disassemble objs/ex1_sol.o, type:
```
$ objdump -d objs/ex1_sol.o
```
Approach 2:
Run the function ex1 in gdb.
You might wonder how to run ex1 in gdb: without knowing its function signature, it is hard to write your own C code that invokes ex1 correctly. Instead, use the tester that we have given you to run ex1 and observe how it executes.
To run the test with the given ex1 function, you need to link the test object file objs/tester.o with the given objs/ex*_sol.o files. We have made this step easy by including appropriate Makefile rules. When you type make, two binary executables are generated: tester and tester-sol. The executable tester links our tester file with your object files ex*.o, which are generated from your ex*.c files. The executable tester-sol links our tester file with the given object files objs/ex*_sol.o. Thus, when you run ./tester-sol, the tester invokes the given functions, and all tests should pass.
Run gdb tester-sol. Stop the execution when the function ex1 is invoked. Disassemble the function. Execute the instructions one by one. Form a hypothesis about what the function signature is and what the function does. Verify your hypothesis during execution by examining register values and memory contents. To learn more about gdb, see the section on debug in the recitation notes.
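
When forming a hypothesis about a signature, it helps to recall the System V AMD64 calling convention used on x86-64 Linux: the first six integer or pointer arguments arrive in %rdi, %rsi, %rdx, %rcx, %r8 and %r9, in that order, and an integer result is returned in %rax. The sketch below is a made-up candidate function, not the real ex1; its name, signature and behavior are assumptions chosen only to show how arguments map to registers.

```c
#include <assert.h>

/* Hypothetical candidate, NOT the real ex1: the name, signature, and
 * behavior are invented for illustration.
 * Under the System V AMD64 calling convention:
 *   arr          -> %rdi (a pointer, all 64 bits are used)
 *   n            -> %esi (an int, the lower 32 bits of %rsi)
 *   return value -> %rax
 */
long candidate_sum(const long *arr, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];   /* an 8-byte load relative to %rdi */
    }
    return total;
}

/* Quick self-check of the candidate on a known input. */
void check_candidate(void)
{
    long data[] = {1, 2, 3, 4};
    assert(candidate_sum(data, 4) == 10);
    assert(candidate_sum(data, 0) == 0);
}
```

If the disassembly reads 8-byte elements through %rdi and compares a counter against %esi, a signature of this shape would be a reasonable first guess to verify in gdb.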

Do not try to match assembly

Do not try to match the object code of your C function line by line against the code in ex*_sol.o. Doing so is painful and unnecessary. Differences in compiler versions, compilation flags, and small details of the C code all result in different object code without affecting the code's semantics. Therefore, trying to find a C function that generates the same object code is likely futile.
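
To illustrate the point, here is a minimal sketch of two C functions that compute the same result with different source code, and that a compiler would most likely turn into different object code. Both functions are hypothetical examples and are unrelated to ex1 through ex5; the code assumes unsigned int is 32 bits wide, as it is on x86-64 Linux.

```c
#include <assert.h>

/* Hypothetical examples, unrelated to the lab's mystery functions.
 * Assumption: unsigned int is 32 bits wide. */

/* Counts set bits by testing each of the 32 bit positions in turn. */
int count_bits_loop(unsigned int x)
{
    int count = 0;
    for (int i = 0; i < 32; i++) {
        if (x & (1u << i)) {
            count++;
        }
    }
    return count;
}

/* Counts set bits by repeatedly clearing the lowest set bit. */
int count_bits_clear(unsigned int x)
{
    int count = 0;
    while (x != 0) {
        x &= x - 1;   /* clears the lowest 1 bit of x */
        count++;
    }
    return count;
}

/* Checks that the two versions agree on a few inputs. */
void check_same_semantics(void)
{
    unsigned int inputs[] = {0u, 1u, 0xF0F0u, 0x80000000u, 0xFFFFFFFFu};
    for (int i = 0; i < 5; i++) {
        assert(count_bits_loop(inputs[i]) == count_bits_clear(inputs[i]));
    }
}
```

Either version would be an equally valid answer if the mystery function counted bits; what matters is that the behavior matches, not the instructions.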

Test your solution

After you've finished each function (remember to remove the assert(0) statement), you can test its correctness as follows:

```
$ make
$ ./tester
Testing Ex1...
Ex1: your implementation passes the test
Testing Ex2...
Ex2: your implementation passes the test
Testing Ex3...
Ex3: your implementation passes the test
Testing Ex4...
Ex4: your implementation passes the test
Testing Ex5...
Ex5: your implementation passes the test
```

The above output occurs when all your ex* functions pass the test.

To test multiple times, run ./tester -r. The -r option runs the tester using a new seed for its random number generator.

Some of you might want to skip around and implement the five ex* functions in arbitrary order. This is a good strategy if you are stuck on some function. To test just ex2, type ./tester -t 2. The same works for the other functions.

Note: Passing the test does not guarantee that you will get a perfect grade (i.e. your implementation is not necessarily correct). During grading, we may use a slightly different test or manually examine your source code to determine its correctness.

Explanations of some unfamiliar assembly and other notes

For this lab, you need to review the lecture notes and textbook to refresh your understanding of x86 assembly. Below is some additional information, not covered in the lecture notes, that is also helpful for this lab.

One can explicitly refer to lower-order bits of the registers. The names that you may find used in this lab are:

register : names for its lower-order portions
%rax     : %eax  (lower 32 bits), %ax   (lower 16 bits), %al   (lower 8 bits)
%rcx     : %ecx  (lower 32 bits), %cx   (lower 16 bits), %cl   (lower 8 bits)
%rdx     : %edx  (lower 32 bits), %dx   (lower 16 bits), %dl   (lower 8 bits)
%rbx     : %ebx  (lower 32 bits), %bx   (lower 16 bits), %bl   (lower 8 bits)
%r8      : %r8d  (lower 32 bits), %r8w  (lower 16 bits), %r8b  (lower 8 bits)
...
%r15     : %r15d (lower 32 bits), %r15w (lower 16 bits), %r15b (lower 8 bits)

Note: gdb does not recognize %r8b as a valid register name. Instead, print register %r8 and read off its lowest 8 bits to obtain the value of %r8b (for example, p/x $r8 & 0xff).
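
The relationship between a 64-bit register and its lower-order names can be modeled in C as truncation. The sketch below is only an illustration; the function names low32, low16 and low8 are made up for this example and do not appear in the lab files.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers that model reading a sub-register.
 * Given the 64-bit contents of a register such as %rax or %r8: */

/* Value seen through the 32-bit name (%eax, %r8d). */
uint32_t low32(uint64_t reg) { return (uint32_t)reg; }

/* Value seen through the 16-bit name (%ax, %r8w). */
uint16_t low16(uint64_t reg) { return (uint16_t)reg; }

/* Value seen through the 8-bit name (%al, %r8b). */
uint8_t low8(uint64_t reg) { return (uint8_t)reg; }

/* Worked example on one register value. */
void check_subregisters(void)
{
    uint64_t r8 = 0x1122334455667788ULL;
    assert(low32(r8) == 0x55667788u);
    assert(low16(r8) == 0x7788u);
    assert(low8(r8) == 0x88u);
}
```

For instance, if gdb prints %r8 as 0x1122334455667788, then %r8b holds 0x88.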

Often in the disassembled output, you will encounter instructions without a size suffix, for example mov instead of movl or movq (where l or q is the suffix). In these cases, treat the missing suffix as the one that corresponds to the size of the destination register operand. For example, mov $1, %ebx is equivalent to movl $1, %ebx and mov %rax, %rbx is equivalent to movq %rax, %rbx.
The movzbl instruction moves the 1-byte source operand (the b suffix) to the 4-byte destination operand (the l suffix) with zero extension. The movslq instruction moves the 4-byte source operand to the 8-byte destination operand with sign extension. That is, if the source operand is negative in two's complement (i.e. has 1 in its most significant bit), the instruction fills the most significant 4 bytes of the destination with 1s; otherwise it fills them with 0s. There are more details on zero extension and sign extension on pages 184-185 of the textbook.
The two-byte instruction "repz retq" behaves identically to the one-byte instruction retq.
If you disassemble an object file (e.g. "objdump -d objs/ex1_sol.o"), you should not expect valid addresses for functions, because linking has not yet happened. If you want to see valid function addresses (i.e. those that appear as the operand of the call instruction), disassemble the binary executable (tester or tester-sol) or disassemble in gdb.
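
The zero extension and sign extension performed by movzbl and movslq correspond to ordinary C conversions between integer types of different widths. The sketch below uses made-up function names (zero_extend_byte and sign_extend_int); a compiler may emit movzbl and movslq for conversions like these, but the exact instructions depend on the compiler and its flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, named for this example only. */

/* Zero extension (what movzbl does): 1 byte -> 4 bytes,
 * the upper 3 bytes of the result are always 0. */
uint32_t zero_extend_byte(uint8_t b) { return (uint32_t)b; }

/* Sign extension (what movslq does): 4 bytes -> 8 bytes,
 * the upper 4 bytes are copies of the source's sign bit. */
int64_t sign_extend_int(int32_t v) { return (int64_t)v; }

/* Worked examples showing the resulting bit patterns. */
void check_extensions(void)
{
    /* 0xF0 has its top bit set, yet zero extension keeps it positive. */
    assert(zero_extend_byte(0xF0) == 0x000000F0u);

    /* -2 is 0xFFFFFFFE in 32 bits; the upper 4 bytes become all 1s. */
    assert((uint64_t)sign_extend_int(-2) == 0xFFFFFFFFFFFFFFFEULL);

    /* 5 is non-negative; the upper 4 bytes become all 0s. */
    assert((uint64_t)sign_extend_int(5) == 0x0000000000000005ULL);
}
```

Seeing movzbl in the disassembly therefore hints at an unsigned 1-byte value being widened, while movslq hints at a signed int being widened to a 64-bit type such as long.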

For those of you who want to go out into the world and explore other object files, the official Intel instruction set manual is useful. Note that the Intel manual reverses the order of source and destination operands (i.e. destination operand first, source operand last). In the lecture notes and in gdb/objdump's disassembled output, the destination operand appears last. This difference comes from two assembly syntaxes, AT&T syntax and Intel syntax. The GNU software (gcc, gdb, etc.) and the lecture notes use AT&T syntax, which puts the destination operand last; the Intel manual (of course) uses Intel syntax, which puts the destination operand first.

Handin Procedure

To hand in your files, simply commit and push them to github.com:

```
$ git commit -am "Finish lab"
$ git push origin
```

We will fetch your lab files from github.com at the specified deadline and grade them.