Hello world

This article was published on the 24th of April 2019. This article was updated on the 30th of April 2020.

The final goal of this article is to create a program that prints Hello World! using assembly instructions. First, the tooling is discussed, after which the sections of a binary are explained, together with example code of the same constructs in a high level language. Lastly, the Hello World! program is constructed step by step using x86 assembly that is executable on x86 and x86_64 Linux distributions.

Note that all applications, aside from the platform specific linker, can be used on multiple platforms, including Microsoft’s Windows and numerous Linux distributions.

The set-up

To assemble assembly code into an object file, the Netwide ASseMbler (NASM) will be used. The object file is then linked using the GNU Linker, which is included on most Linux distributions. To see if the linker is present on your system, simply type ld in the terminal. The expected output is given below.

libra@laptop:~$ ld
ld: no input files

To install NASM on a Debian-based system, use sudo apt install nasm. Those who use a different system can check their distribution’s repositories or download the latest release from the NASM website. When NASM is executed without any parameters, the following output should be given.

libra@laptop:~$ nasm
nasm: error: no input file specified
type `nasm -h' for help

Last, but certainly not least, a text editor with proper syntax highlighting for assembly language is preferred, although not required. The suggested editor for this course is SciTE. To install it for your preferred platform, see the SciTE website.

Compiling

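To assemble an x86 assembly file with NASM on an x86_64 system, a few extra flags are required: -f elf32 instructs NASM to emit a 32-bit ELF object file, and -m elf_i386 instructs the linker to produce a 32-bit executable. Below is a small script that builds and executes an x86 ELF binary based on the source code that resides within program.asm.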

nasm -f elf32 ./program.asm -o program.o
ld -m elf_i386 ./program.o -o ./program
./program

Sections

Within the assembly language, multiple sections can be used. Three standardised sections will be discussed below.

.data

In this section, data is allocated and initialised. The size of the data that is allocated depends on the declaration directive. The directive is similar for all sizes, as it always starts with a d, which stands for declare.
When using db (declare byte), a single byte (8 bits) is declared. Additionally, one can use dw to declare a word (which is equal to two bytes, or 16 bits), dd to declare a double word (which is equal to four bytes, or 32 bits) or dq to declare a quad word (which is equal to eight bytes, or 64 bits).
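
As an illustration of the four directives, the sketch below declares one value of each size. The label names are made up for this example and do not occur in the Hello World! program. Since x86 is little-endian, multi-byte values end up in memory with their least significant byte first.

```nasm
; data-sizes.asm: one declaration per directive (hypothetical label names)
section .data
    oneByte:   db 0x41                ; 1 byte:  41
    oneWord:   dw 0x4142              ; 2 bytes: 42 41
    oneDword:  dd 0x41424344          ; 4 bytes: 44 43 42 41
    oneQword:  dq 0x4142434445464748  ; 8 bytes: 48 47 46 45 44 43 42 41
```

Together, these four declarations occupy 1 + 2 + 4 + 8 = 15 bytes of the .data section.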

.bss

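In this section, space is reserved to be used during the run time of the program. In short, this part of the memory has no initial value in the binary and can only be given a value during run time. To allocate memory, one should use the resb directive, which stands for reserve byte. The directives resw, resd and resq reserve words, double words and quad words respectively.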

Trying to assign a value to some part of the .bss section results in a warning, and the value is ignored, as can be seen below.

./helloworld.asm:19: warning: attempt to initialize memory in BSS section `.bss': ignored
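
To illustrate how reserved memory is meant to be used, the sketch below reserves space in the .bss section and only writes to it during run time. The labels buffer and counter are made up for this example. The final three instructions end the program with the value of counter as exit status; the system call that is used for this is explained in the Hello World! section of this article.

```nasm
; bss-example.asm: reserve memory, then fill it at run time (hypothetical names)
section .bss
    buffer:   resb 64            ; reserve 64 bytes, no initial value given
    counter:  resd 1             ; reserve one double word (4 bytes)

section .text
    global _start
    _start:
        mov dword [counter], 5   ; store the value 5 at run time
        mov byte [buffer], 'A'   ; store a single byte at the start of buffer
        mov eax, 1               ; sys_exit
        mov ebx, [counter]       ; exit status: the value that was just stored
        int 80h
```

When this is built with the script from the Compiling section and executed, running echo $? afterwards should print 5.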

.text

In this section all the instructions are placed, which makes it the executable part of the binary. Labels and directives can also be declared here; these are used by the assembler and the linker.

A high level language example

Technically speaking, this example is not completely correct, but it provides some insight into the difference between the .bss and .data sections.

public class Example {
 
    int x;
    int y = 5;
 
    public Example() {
        x = 5;
    }
}

In this example, the variable x is similar to the .bss section. The variable y is similar to the .data section. The constructor of the class is similar to the .text section.

The variable x is declared first, and is only assigned a value during run time. The fact that the compiler might change the code at some stage is out of scope for this example. The variable y is declared and immediately initialised with a value. The constructor is part of the executable code; as such, it represents the assembly instructions within the .text section in this example.

Hello World!

The sample program contains only two sections: .data and .text. The labels are declared in the .data section, as can be seen below.

section .data
    hello:     db 'Hello world!',10d
    helloLength:  equ $-hello

To create a label that refers to a part of memory that is allocated, one needs to use the following syntax: labelName: sizeToAllocate ‘value’.
The name of the first label is hello, after which the size is specified by using db, which stands for declare byte. Then, the value that is (eventually) placed at the address of the label is given. In this case, the value is equal to Hello world!.

Additionally, another byte is declared using the same db directive: 10d. The d suffix specifies that the number should be interpreted as a decimal number. Writing 0ah or 0x0a is also possible, as it is the same value in a hexadecimal representation. Note that writing ah without the leading zero is not possible, because that is the name of the higher half of the 16-bit accumulator register ax. The value 10d, or 0ah, is equal to the newline character, as can be seen in the ASCII table.

The second label, which is named helloLength, is set equal to the length of the preceding string. The assembly position (the offset at which the assembler outputs instructions and data) at which hello resides is subtracted from $.
The dollar sign equals the current assembly position. The calculation is performed during assembly, so the binary only contains the resulting constant. If hello resides at address 90 and the current position is equal to 100, then the result equals 100 – 90 = 10.
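
The same calculation can be tried on a smaller scale. The sketch below uses made-up labels and shows that the result depends on where the equ line is placed: it measures everything between the given label and the current position. An equ line does not output any bytes itself, so it does not move the position.

```nasm
; lengths.asm: measuring data with $ (hypothetical label names)
section .data
    first:         db 'abc'         ; 3 bytes, at offsets 0 to 2
    firstLength:   equ $ - first    ; 3 - 0 = 3
    second:        db 'defgh', 10   ; 6 bytes, at offsets 3 to 8
    secondLength:  equ $ - second   ; 9 - 3 = 6
    bothLength:    equ $ - first    ; 9 - 0 = 9
```

This is why helloLength is declared directly below hello: any data declared in between would be counted as well.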

Below, the assembly instructions that reside within the .text section of the code are given. Each instruction is explained below the code.

section .text
    global _start
    _start:
        mov eax,4
        mov ebx,1
        mov ecx,hello
        mov edx,helloLength
        int 80h

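The directive global makes the symbol _start visible outside of the object file, which allows the linker to find it. By default, the linker uses _start as the entry point of the program. The label on the line below the directive marks where the code of the symbol starts.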

The first four instructions are all mov instructions, after which the int 80h instruction is used. This instruction raises interrupt 80h, which is how a 32-bit program requests a system call from the Linux kernel. Lists of all system calls on x86 Linux systems can be found online. The value in the eax register defines which system call is made. In this case, the value 4 is moved into the register, which represents the sys_write system call. Its function signature is given below.

sys_write(unsigned int fd, const char * buf, size_t count)

The next three instructions set the values for the file descriptor in ebx (abbreviated as fd in the function signature above), the buffer’s address in ecx (a pointer to the buffer) and the length of the buffer in edx (defined as the count in the function signature above). The file descriptor stdout is represented by the value 1.
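
To show the effect of the file descriptor argument, the sketch below is a variation on the program in this article that writes to stderr (file descriptor 2) instead of stdout. The message and its labels are made up for this example. The exit code at the end is required to end the program cleanly, and is covered in the remainder of this section.

```nasm
; stderr.asm: sys_write with a different file descriptor (hypothetical names)
section .data
    message:        db 'Something went wrong', 10
    messageLength:  equ $ - message

section .text
    global _start
    _start:
        mov eax, 4              ; system call number of sys_write
        mov ebx, 2              ; fd: 2 is stderr
        mov ecx, message        ; buf: the address of the message
        mov edx, messageLength  ; count: the amount of bytes to write
        int 80h
        mov eax, 1              ; system call number of sys_exit
        mov ebx, 0              ; status: 0
        int 80h
```

When executed as ./program 2>/dev/null, nothing should be printed, since the only output is sent to stderr.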

Lastly, the program should be shut down cleanly. The code for this is given below.

        mov eax,1
        mov ebx,0
        int 80h

Again, a system call is made. This time the called function is sys_exit, which only requires a single argument. The function signature is given below.

sys_exit(int status)

The value 0 is moved into the register ebx. This indicates a successful termination of the program. When executing the program, the observed result should be identical to the one that is given below.

Hello world!
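
The status that is passed to sys_exit can be observed from the shell. The sketch below is a minimal program, not part of the Hello World! program, that does nothing but exit with the arbitrarily chosen status 42.

```nasm
; exit-status.asm: exit with a status other than zero
section .text
    global _start
    _start:
        mov eax, 1    ; system call number of sys_exit
        mov ebx, 42   ; status: an arbitrary value for this example
        int 80h
```

After running the program, echo $? prints the exit status of the last command, which should be 42 in this case.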

Disassembling the program

To check the file type of the compiled binary, the file utility is used. The result is given below.

helloworld:     ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, not stripped

This verifies that the output is, indeed, an x86 ELF binary.

With radare2, one can disassemble the program to see the used assembly code. Opening and automatically analysing a binary with radare2 is possible using the -A flag, as can be seen in the output below.

libra@laptop:~/Documents/assembly-language-programming/hello-world$ r2 -A ./helloworld
[x] Analyze all flags starting with sym. and entry0 (aa)
[x] Analyze function calls (aac)
[x] Analyze len bytes of instructions for references (aar)
[x] Check for objc references
[x] Check for vtables
[x] Type matching analysis for all functions (aaft)
[x] Use -AA or aaaa to perform additional experimental analysis.

With afl all functions can be listed, which shows that there is only a single function present, as can be seen in the output below.

[0x08048080]> afl
0x08048080    1 34           entry0

Using pdf one can print the disassembly of the function. Set the variable emu.str to true before issuing the pdf command to see more information in the disassembly output, which is given below.

[0x08048080]> e emu.str = true
[0x08048080]> pdf
            ;-- section..text:
            ;-- .text:
            ;-- _start:
            ;-- eip:
/ (fcn) entry0 34
|   entry0 ();
|           0x08048080      b804000000     mov eax, 4                  ; [01] -r-x section size 34 named .text
|           0x08048085      bb01000000     mov ebx, 1
|           0x0804808a      b9a4900408     mov ecx, loc.hello          ; 0x80490a4 ; "Hello world!"
|           0x0804808f      ba18000000     mov edx, 0x18               ; loc.helloLength
|           0x08048094      cd80           int 0x80                    ; 4 = write (1, "Hello world!", 24)
|           0x08048096      b801000000     mov eax, 1
|           0x0804809b      bb00000000     mov ebx, 0
\           0x080480a0      cd80           int 0x80                    ; 1 = exit (0)

The instructions do not differ from the assembly code. Lastly, the r2dec decompiler can be used to view the same code as pseudo C code. This is done using the pdd command. The output is given below.

[0x08048080]> pdd
/* r2dec pseudo C output */
/* ./helloworld @ 0x8048080 */
#include <stdint.h>
 
int32_t entry0 (void) {
    /* [01] -r-x section size 34 named .text */
    eax = sys_write (0x1, "Hello world!", 0x18);
    eax = sys_exit (0x0);
}

Both the system calls are recognised and written in their C equivalent.


To contact me, you can e-mail me at [info][at][maxkersten][dot][nl], send me a PM on Reddit, or DM me on Twitter @Libranalysis.