1 - Binaries, dreaded Binaries!
What to implement
Functions
init_ijvm
destroy_ijvm
get_text
get_text_size
get_constant
In order to pass test1, correctly implement all of the functions listed above.
Introduction
In this chapter, you will be tasked with parsing an IJVM binary file. IJVM binaries, like other file formats, consist of bytes, arranged in a meaningful way:
4 BYTES | MAGIC NUMBER
4 BYTES | CONSTANT POOL ORIGIN
4 BYTES | CONSTANT POOL SIZE
SIZE BYTES | CONSTANT POOL DATA
4 BYTES | TEXT ORIGIN
4 BYTES | TEXT SIZE
SIZE BYTES | TEXT DATA
IJVM binaries consist of a 4-byte magic number followed by 2 blocks, namely the constant pool block and the text block. The constant pool block contains the constants and the text block contains the executable code. Each block starts with a 4-byte origin, which can safely be ignored, followed by another 4-byte size signifying the number of bytes the data consists of. The rest of the block contains the actual data.
Note: the 4-byte origin is unused in this project. It is a leftover from the original IJVM file format, where it indicated where a block should be loaded in memory.
Hint: both blocks have the same basic layout, so creating a reusable function that parses a block and stores its data in a buffer will make your code more readable.
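The hint above could be sketched roughly as follows. This is only an illustration, not the required design: the names block_t, read_be32 and read_block are hypothetical and do not appear in ijvm.h, and you are free to store the data differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical container for one parsed block (not part of ijvm.h). */
typedef struct {
    uint32_t size; /* number of bytes in data */
    uint8_t *data; /* heap-allocated copy of the block data, NULL if size is 0 */
} block_t;

/* Reads one 4-byte big-endian word from fp. Returns 1 on success, 0 on failure. */
int read_be32(FILE *fp, uint32_t *out)
{
    uint8_t b[4];
    if (fread(b, sizeof(uint8_t), 4, fp) != 4) {
        return 0;
    }
    *out = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
    return 1;
}

/* Parses the origin, size and data of one block, starting at the current
 * file position. Returns 1 on success, 0 on failure. On success the caller
 * owns blk->data and must free() it. */
int read_block(FILE *fp, block_t *blk)
{
    uint32_t origin;
    uint32_t size;

    assert(fp != NULL && blk != NULL);
    blk->size = 0;
    blk->data = NULL;

    if (!read_be32(fp, &origin) || !read_be32(fp, &size)) {
        return 0;
    }
    (void)origin; /* the origin is read only to skip past it */

    if (size == 0) {
        return 1; /* an empty block is valid */
    }
    blk->data = malloc(size);
    if (blk->data == NULL) {
        return 0;
    }
    if (fread(blk->data, sizeof(uint8_t), size, fp) != size) {
        free(blk->data);
        blk->data = NULL;
        return 0;
    }
    blk->size = size;
    return 1;
}
```

With a helper like this, parsing the file after the magic number comes down to calling read_block twice: once for the constant pool and once for the text.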
For a more intuitive view of the IJVM binary file format, see our interactive binary explorer below.
Binary Explorer
To make this process somewhat easier, here's a visual guide for an example program. Press the bytes to view their role.
NOTE: The entire file is in BIG-ENDIAN, while your computers work in LITTLE-ENDIAN. TL;DR: endianness is about byte ordering, i.e. whether [0x01, 0x00, 0x00, 0x00] should be read as 1 or as 0x01000000. Big-endian means the first byte is the most significant ("biggest") one, little-endian means the first byte is the least significant ("smallest") one.
Offset | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
0 | 1D | EA | DF | AD | 00 | 01 | 00 | 00 |
8 | 00 | 00 | 00 | 0C | FF | FF | FF | FF |
16 | 00 | 00 | 00 | 02 | 00 | 00 | 00 | 03 |
24 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 0F |
32 | 10 | 70 | 59 | 10 | FF | 60 | 59 | 59 |
40 | 10 | FF | 64 | FD | FD | FD | FD | |
JAS file ↪ | IJVM file ↪ |
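To connect the table to the file layout, the sketch below copies the 47 bytes of the example program into an array and locates both blocks in it. The names layout_t, be32_at and locate_sections are made up for this illustration. Your own parser will read from a file with fread rather than from an in-memory array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The 47 bytes of the example program, copied from the table above. */
const uint8_t example_binary[47] = {
    0x1D, 0xEA, 0xDF, 0xAD, 0x00, 0x01, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x0C, 0xFF, 0xFF, 0xFF, 0xFF,
    0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00, 0x03,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0F,
    0x10, 0x70, 0x59, 0x10, 0xFF, 0x60, 0x59, 0x59,
    0x10, 0xFF, 0x64, 0xFD, 0xFD, 0xFD, 0xFD
};

/* Hypothetical summary of where each part of the binary lives. */
typedef struct {
    uint32_t magic;
    uint32_t pool_size;  /* size of the constant pool data in bytes */
    size_t pool_offset;  /* offset of the first constant pool byte */
    uint32_t text_size;  /* size of the text data in bytes */
    size_t text_offset;  /* offset of the first text byte */
} layout_t;

/* Reads a big-endian 32-bit word starting at bytes[offset]. */
uint32_t be32_at(const uint8_t *bytes, size_t offset)
{
    return ((uint32_t)bytes[offset] << 24) | ((uint32_t)bytes[offset + 1] << 16) |
           ((uint32_t)bytes[offset + 2] << 8) | (uint32_t)bytes[offset + 3];
}

/* Fills in *out from an in-memory binary. Returns 1 on success, 0 if the
 * buffer is too short for the sizes it claims. */
int locate_sections(const uint8_t *bytes, size_t len, layout_t *out)
{
    size_t text_header;

    assert(bytes != NULL && out != NULL);
    if (len < 12) {
        return 0; /* magic + pool origin + pool size do not fit */
    }
    out->magic = be32_at(bytes, 0);
    /* bytes 4..7 hold the constant pool origin, which is ignored */
    out->pool_size = be32_at(bytes, 8);
    out->pool_offset = 12;
    if (out->pool_size > len - 12) {
        return 0;
    }
    text_header = out->pool_offset + out->pool_size;
    if (len - text_header < 8) {
        return 0; /* text origin + text size do not fit */
    }
    out->text_size = be32_at(bytes, text_header + 4);
    out->text_offset = text_header + 8;
    if (out->text_size > len - out->text_offset) {
        return 0;
    }
    return 1;
}
```

For the example program this yields a constant pool of 12 bytes (three constants) starting at offset 12, and 15 bytes of text starting at offset 32.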
Endianness
When reading any data type larger than 1 byte in size from a file, the value that is obtained depends on the endianness of the system: the order bytes are stored in. Bytes can be stored in two different ways: big-endian and little-endian.
In big-endian, the bytes are stored from most significant (big-end) to least significant. This reflects the way numbers are written on paper: we start with the most significant digit. In little-endian, the bytes are stored in the opposite order, from least significant (little end) to most significant; the bytes are stored backwards. Suppose we have the number 0x1DEADFAD, where the most significant byte is 1D (numbers prefixed with 0x are in hexadecimal). The order of the bytes is then:
// Little-endian
0xAD,0xDF,0xEA,0x1D
// Big-endian
0x1D,0xEA,0xDF,0xAD
The following program illustrates the difference between little and big endian, and you can compile it to know the endianness of your machine. If you understand this program and why it prints what it prints, you are in good shape. If parts of it are unclear, refer back to the introduction to C.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint8_t bytes[] = {0xAD, 0xDF, 0xEA, 0x1D};
    // bytes[0] is now 0xAD, bytes[1] is 0xDF etc.

    // Interpret the bytes above as a word by copying them into a uint32_t.
    // memcpy is used instead of casting the uint8_t pointer to a uint32_t
    // pointer, because dereferencing such a pointer is undefined behavior in C.
    uint32_t word;
    memcpy(&word, bytes, sizeof(word));

    uint32_t reversed = 0x1DEADFAD; // hex number where 1D is the most significant byte
    uint32_t straight = 0xADDFEA1D; // hex number where AD is the most significant byte

    if (word == reversed) {
        printf("This is a little endian machine\n");
    } else if (word == straight) {
        printf("This is a big endian machine\n");
    } else {
        printf("This machine defies any logic\n");
    }
    return 0;
}
Ensuring the right endianness
The x86-64 or ARM architecture that your system is probably based on uses little-endian integers, while the IJVM binary file format uses big-endian integers. As a result, when you read a word from the file, you will get the bytes in the wrong order: you will read in 0xADDFEA1D, while 0x1DEADFAD was the actual number. To ensure the right endianness of words and shorts, you can either:
- Interpret a series of bytes as a word (or short) and then swap the byte order.
- Convert a series of bytes to a word (or short) byte for byte, interpreting the first byte as most significant.
Swapping the byte order
If you read in words from a file, then the bytes are in the wrong order.
To reverse the order, i.e. to convert an integer from little-endian to big-endian, use the swap functions declared in include/util.h, which are implemented in src/util.c. The definitions of these are as follows:
uint32_t swap_uint32(uint32_t num)
{
    return ((num >> 24) & 0xff) | ((num << 8) & 0xff0000) |
           ((num >> 8) & 0xff00) | ((num << 24) & 0xff000000);
}
There is also a version for a short (16 bits):
uint16_t swap_uint16(uint16_t num)
{
    return (uint16_t)(((num >> 8) & 0xff) | ((num << 8) & 0xff00));
}
Please refer to this discussion thread for an explanation of the function above.
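As a rough illustration of the "interpret, then swap" approach, the sketch below copies four raw file bytes into a word and swaps it only when the host is little-endian. The names swap32, host_is_little_endian and word_from_file_bytes are hypothetical; swap32 repeats the body of swap_uint32 so that the example compiles on its own.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same byte swap as swap_uint32, repeated to keep the sketch self-contained. */
uint32_t swap32(uint32_t num)
{
    return ((num >> 24) & 0xff) | ((num << 8) & 0xff0000) |
           ((num >> 8) & 0xff00) | ((num << 24) & 0xff000000);
}

/* Returns 1 if the least significant byte of a word is stored first. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Interprets four bytes, in file order, as a word in host byte order, then
 * swaps the result if the host does not store words big-endian. */
uint32_t word_from_file_bytes(const uint8_t *raw)
{
    uint32_t word;
    assert(raw != NULL);
    memcpy(&word, raw, sizeof(word));
    return host_is_little_endian() ? swap32(word) : word;
}
```

The runtime check is optional for this project, since your program only has to work on a little-endian machine; it is included here to show why an unconditional swap would give the wrong result on a big-endian host.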
Converting bytes into an integer
It is also possible to convert bytes into an integer using the read functions declared in include/util.h, which are implemented in src/util.c. The definitions of these are:
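uint32_t read_uint32(uint8_t* buf) {
    // Cast each byte to uint32_t before shifting, so that a first byte of
    // 0x80 or higher is not shifted into the sign bit of an int.
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}
uint16_t read_uint16(uint8_t* buf) {
    return (uint16_t)((buf[0] << 8) | buf[1]);
}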
The example below breaks down the above methods:
int byte_1, byte_2, byte_3, byte_4;
byte_1 = 0x1D; // 0x0000001D
byte_2 = 0xEA; // 0x000000EA
byte_3 = 0xDF; // 0x000000DF
byte_4 = 0xAD; // 0x000000AD
// shifting 0x0000001D left by 24 bits (or 3 bytes) results in 0x1D000000
byte_1 = 0x1D << 24;
// shifting 0x000000EA left by 16 bits (or 2 bytes) results in 0x00EA0000
byte_2 = 0xEA << 16;
// shifting 0x000000DF left by 8 bits (or 1 byte) results in 0x0000DF00
byte_3 = 0xDF << 8;
// shifting 0x000000AD left by 0 bits (or 0 bytes) results in 0x000000AD
byte_4 = 0xAD;
int number = byte_1 | byte_2 | byte_3 | byte_4; // 0x1DEADFAD
Note that shift left (l << s) is defined as an operation which (efficiently) multiplies l by 2^s (irrespective of endianness), so the operations above state "interpret the first byte as the most significant byte, the second byte as the second most significant, etc.".
Hence, converting bytes to an integer this way will also work correctly on a big endian machine (see here for a blog advocating this method). This is not true for the method of interpreting the bytes as words and then switching the byte order which is explained above. However, we will not run your program on a big endian machine, and your program does not have to work on one.
Reading the file
Please ensure you have read the section on reading files in the C introduction.
To read in the file, make calls to fread to read the parts of the binary: the magic number, the constant pool origin, and so forth, until you have read and parsed the entire binary. The example below illustrates reading an integer from an IJVM binary:
uint8_t numbuf[4];
fread(numbuf, sizeof(uint8_t), 4, fp);
uint32_t number = read_uint32(numbuf);
To read a signed 16-bit or 32-bit number, simply read in a 16-bit or 32-bit unsigned number, and then cast it to a signed number of the same size.
When you are reading the constants, note that the number of constants is the constant pool size divided by 4, as each constant is a word (4 bytes).
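As a sketch of the arithmetic above, the helpers below compute the number of constants from the pool size and fetch the constant at a given index from the raw pool bytes. The names constant_count and constant_at are made up for this illustration; the signature of your own get_constant is the one given in ijvm.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each constant is one word (4 bytes), so the count is the size divided by 4. */
uint32_t constant_count(uint32_t pool_size)
{
    return pool_size / 4;
}

/* Returns the constant at the given index, read big-endian from the raw
 * constant pool data and interpreted as a signed word. */
int32_t constant_at(const uint8_t *pool, uint32_t pool_size, uint32_t index)
{
    const uint8_t *p;
    uint32_t raw;

    assert(pool != NULL);
    assert(index < constant_count(pool_size));

    p = pool + (size_t)index * 4; /* constant i starts at byte offset 4 * i */
    raw = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
          ((uint32_t)p[2] << 8) | (uint32_t)p[3];
    return (int32_t)raw;
}
```

For a 12-byte pool holding FF FF FF FF, 00 00 00 02 and 00 00 00 03, this gives three constants: -1, 2 and 3.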
Testing
To complete this task, you have to pass test1. To run this test, run make run_test1. We check if you read in the file correctly by calling your init_ijvm function with one of the two ijvm files from files/task1/, and then check if the text and constants have been read in correctly by calling your get_text, get_text_size and get_constant functions. You can view the source code for the test in tests/test1binary.c and the JAS files (which are compiled to IJVM files) in files/task1/.
Hints
Make sure you have read the introduction to C.
Do not program in main.c, put all of your code in machine.c instead.
Do not compile your code manually, refer to the manual instead.
Get familiar with ijvm.h.
Start with implementing the init_ijvm function, as the rest of the functions depend on information obtained as a result of parsing the binary.
It is a good idea to use dprintf (explained in machine.c) to debug your code.