BIOS Boot to D — The Art of Machinery

After my previous post on using D for C-like programming, I wondered about going deeper. What’s the minimum it would take to run C-like D code on a PC? Could it run straight from a BIOS bootloader?

If you seriously want to make an OS, you’re much better off using an existing bootloader like U-Boot or GRUB, or at least using UEFI. But doing it this way is interesting because the x86 PC has an insane level of backwards compatibility, and booting from the BIOS to a modern high-level language is like doing an archaeological dig through the past 40 years of computing history.

Super Quick Crash Course in Assembly Language

Assembly language is just a more readable machine code. Every assembly instruction translates to a binary machine code instruction, and vice versa. I use Intel syntax (specifically NASM syntax), which would probably be a de facto standard if GNU toolchains (gcc, and so on) didn’t use AT&T syntax for ~~hysterical~~ historical reasons. I first learned this stuff hacking around on MS-DOS, and I still prefer Intel syntax.

This code:

mov ax, 0x20

is like

ax = 32;

except that ax isn’t a normal variable in memory, it’s a hardware register in the CPU, but it holds integer values in much the same way. 0x20 is just the number 32 written in the hexadecimal number system, rather than the usual decimal system. Either system can be used, but hexadecimal just happens to be more natural when working with low-level stuff. There are plenty of tutorials explaining how to convert between the two, and you possibly already have some kind of calculator on your computer that can do the conversion for you.

Normal variables are just numbered addresses in memory. You put a value into some addressed location, and it’s up to you to remember where you put it. Actually, the NASM assembler lets us give labels to the addresses, mimicking variables in a high-level language, but it all turns into raw addresses in machine code. If you disassemble machine code back into assembly, you lose any labels.

mov byte [0x8000], 42  ; Store a value in memory
mov byte [the_magic_number], 42  ; Same thing, but we've defined a meaningful label for the address

There are no control structures like if statements and while loops, only goto statements (or “jumps” in x86 assembly). For example, this code

byte foo = 6;
// stuff
if (foo == 6)
{
        // do something
}

would be implemented as something like

foo: db 6
; stuff
cmp byte [foo], 6  ; Compare with 6
jne skip_do_something  ; Jump (if) Not Equal
; do something
skip_do_something:

Once again, in machine code, jumps actually go to raw memory addresses, but the assembler lets us give them meaningful labels. The cmp instruction sets bits in a special register called the flags register, and that’s how the jne instruction detects the result.
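
To make the mapping concrete, here’s a small sketch of the same compare-and-jump shape written in D, which still has goto. The function name classify and its return values are made up for illustration; the only point is that the if condition gets inverted into a conditional jump over the body.

```d
// Hypothetical example: an if statement written the way the assembly does it.
// Try it with: dmd -main -unittest -run sketch.d
int classify(ubyte foo)
{
    int result = 0;
    if (foo != 6)               // cmp byte [foo], 6 / jne skip_do_something
        goto skip_do_something;
    result = 1;                 // "do something"
skip_do_something:
    return result;
}

unittest
{
    assert(classify(6) == 1);   // the body runs
    assert(classify(5) == 0);   // the body is jumped over
}
```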

That’s enough assembly to follow this post, but if you want a deeper understanding, there are good tutorials around. One more thing that I should mention is that even if something might seem like syntactically good assembly, it might not actually be an operation supported by the hardware. So often a little extra juggling around is needed to get things done.

The Journey Begins

When you first power on a PC, the Basic Input/Output System (BIOS) is the first thing that starts. At least, I’ll pretend it is because modern systems are still compatible with that. The BIOS first does a Power On Self Test (POST) of the basic hardware. Older BIOSes used to make a cheerful chirp through the PC speaker when this passed, but that seems to have gone out of fashion. (Update: It’s more like PC speakers have gone out of fashion. Thanks agent-plaid) Next the BIOS goes through each disk in some order, loading the first 512 bytes and checking if the last two bytes are 0x55 and 0xaa. That’s the marker for a bootable disk. Why 512 bytes? Because, for performance reasons, disks work in 512-byte chunks called sectors, instead of operating byte-by-byte. Why 0x55 and 0xaa? I don’t know; that’s just the way it is. (Update: crimaniak explains that it’s to avoid booting from a bad drive.) Anyway, the raw 512 bytes from the bootable disk are copied verbatim to address 0x7c00, and then the BIOS starts the CPU executing from there. This is where our job begins.
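
The bootable-disk check described above is simple enough to sketch in a few lines of D. This is only a model of what the BIOS does, runnable on a normal host; the names sectorSize and isBootable are my own and not part of any BIOS interface.

```d
// Hypothetical model of the BIOS check for a bootable first sector.
// Try it with: dmd -main -unittest -run sketch.d
enum sectorSize = 512;

bool isBootable(const(ubyte)[] sector) pure nothrow @nogc @safe
{
    // The marker is the byte 0x55 followed by 0xaa at the very end of the sector.
    return sector.length == sectorSize
        && sector[sectorSize - 2] == 0x55
        && sector[sectorSize - 1] == 0xAA;
}

unittest
{
    ubyte[sectorSize] sector;           // all zeros: not bootable
    assert(!isBootable(sector[]));

    sector[510] = 0x55;
    sector[511] = 0xAA;
    assert(isBootable(sector[]));       // marker present: bootable
}
```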

This bootloader is simpler than ones like GRUB, so all the boot code will be in the first 512 bytes of the disk image, and the OS payload (the D code) will sit directly after it.

First Steps

Let’s trace through the code.

We land at 0x7c00 with the x86 chip running in what’s called 16b real mode. This is backwards compatible with very early x86 devices, and is designed for old computer systems that run one program at a time, with full access to the entire machine. This might sound ideal for what we’re doing, but Intel engineers realised that “real mode” wasn’t the future of computing around the time of the 286 (1982), and many x86 features since then are not supported in real mode. The most obvious limitation is that we can only access approximately 1MiB of RAM. So, one of the major tasks of the bootloader is to reconfigure the x86 to stop pretending to be almost-40-year-old hardware. But first, there’s a little housekeeping.

    mov [boot_disk], dl

The bootloader could have been installed on any disk, so the BIOS helpfully tells us where it found it by putting the disk number into the machine register dl. We’ll need to know this when we load the rest of the disk image, so we save it now before we accidentally clobber it.

Now, I said we’re in 16b mode, and a 16b pointer can address up to $2^{16} = 65536$ bytes. How do we access 1MiB of RAM? Well, 16b x86 has a slightly quirky addressing system in which memory is split into segments and each 16b pointer is interpreted as an offset into the current segment, as defined by various 16b segment registers. Doesn’t that mean pointers are 32b total, and can address 4GiB? No, because the segments are actually made to overlap (each segment starts just 16 bytes after the previous one), and we can only address 1MiB. I guess 4GiB seemed like an absurdly huge amount of memory back in the day. Anyway, if you’ve ever wondered what the “near” and “far” keywords in old C code mean, they’re for dealing with this mess. The flat memory addressing of modern systems is bliss. For simplicity, I’m going to set all the segment registers to zero and, as much as possible, just work in the bottom 65536 bytes as a flat memory space, at least until we break out of real mode.
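
As a side note, the real mode address calculation is just “segment times 16, plus offset”. Here’s a sketch in D that runs on an ordinary host; linearAddress is a name I’ve made up for illustration.

```d
// Hypothetical helper modelling real mode segment:offset addressing.
// Try it with: dmd -main -unittest -run sketch.d
uint linearAddress(ushort segment, ushort offset) pure nothrow @nogc @safe
{
    return (cast(uint) segment << 4) + offset;
}

unittest
{
    // Overlapping segments: two different pairs name the boot sector address.
    assert(linearAddress(0x0000, 0x7C00) == 0x7C00);
    assert(linearAddress(0x07C0, 0x0000) == 0x7C00);

    // The largest pair lands only slightly past the 1MiB mark (0x100000).
    assert(linearAddress(0xFFFF, 0xFFFF) == 0x10FFEF);
}
```

With every segment register at zero, as in the code that follows, the linear address is simply the offset.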

    cli  ; Disable interrupts
    xor ax, ax  ; Equivalent to mov ax, 0
    mov ds, ax
    mov es, ax
    mov ss, ax
    mov sp, 0x7000  ; Set stack pointer
    sti  ; Enable interrupts

While setting these registers I disabled interrupts, which are like special system function calls that are called by number, not name. The crucial thing is that they can be triggered any time by hardware events (hence the name “interrupt”), so we need to disable them when doing certain sensitive tasks. The really sensitive thing here is setting the address of the stack, which is temporary working space used by functions (including interrupts). The stack pointer actually points to the top of the stack, and data is stored moving downwards. I’ve put it just below the boot code in memory because I’m putting nothing else there. By the way, this x86 memory map is handy if you’re wondering where things can go.
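
If the downward-growing stack seems backwards, this little D model might help. It’s a host-side sketch only: RealModeStack is a made-up type, and memory is just a byte array standing in for RAM.

```d
// Hypothetical model of a real mode stack that starts at 0x7000.
// Try it with: dmd -main -unittest -run sketch.d
struct RealModeStack
{
    ubyte[] memory;
    ushort sp;

    void push(ushort value)
    {
        sp -= 2;                                    // the stack grows downwards
        memory[sp] = cast(ubyte)(value & 0xFF);     // x86 stores the low byte first
        memory[sp + 1] = cast(ubyte)(value >> 8);
    }

    ushort pop()
    {
        immutable value = cast(ushort)(memory[sp] | (memory[sp + 1] << 8));
        sp += 2;
        return value;
    }
}

unittest
{
    auto stack = RealModeStack(new ubyte[0x8000], 0x7000);
    stack.push(0x1234);
    assert(stack.sp == 0x6FFE);                     // moved down by two bytes
    assert(stack.memory[0x6FFE] == 0x34 && stack.memory[0x6FFF] == 0x12);
    assert(stack.pop() == 0x1234);
    assert(stack.sp == 0x7000);                     // back where it started
}
```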

    ; Activate A20
    mov ax, 0x2401
    int 0x15

Okay, this is a crazy one. Even if the very early x86 machines hadn’t had that strange addressing scheme, they’d still only be able to access 1MiB of RAM because, in the actual hardware, the chips only had 20 address lines, labelled A0 – A19, and $2^{20} = 1\,\mathrm{Mi}$. (For comparison, modern machines have 64b pointers, but typically only 48 of those bits are actually used for addressing.) Any address above the 1MiB limit in software would “overflow” the hardware address lines and wrap around, just like integer variables can overflow in software today. When Intel brought out the 286, they wanted to support more memory, so they did the natural thing and added more address lines. When IBM tried to build PCs based on this new chip, they discovered that existing software was relying on the address overflow behaviour for performance hacks. They were faced with a choice:

Tell customers that their software would crash on the new 286-based PCs
Tell the hardware engineers to make things work

Of course, they went with the second option. A special logic gate was added to the motherboard that would force the A20 line to zero by default to mimic the old behaviour. New software that wanted to access more memory would need to explicitly enable the A20 address line. Adding a whole new hardware controller for a single-bit gate seemed excessive, so it was hacked into the existing keyboard controller system, which meant code that enabled the A20 line had to be careful not to interfere with the keyboard hardware. Actually, real-world PCs ended up with various methods for controlling the A20 gate, but I was lazy and went with standard #N+1 and used a designated BIOS interrupt.
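
The effect of the gate can be modelled in a couple of lines. This is a host-side sketch in D, not hardware code: physicalAddress is a made-up name, and the model only captures the “force bit 20 to zero” behaviour described above.

```d
// Hypothetical model of the A20 gate: when disabled, address bit 20 is forced to zero.
// Try it with: dmd -main -unittest -run sketch.d
uint physicalAddress(uint linear, bool a20Enabled) pure nothrow @nogc @safe
{
    enum uint a20Bit = 1u << 20;
    return a20Enabled ? linear : (linear & ~a20Bit);
}

unittest
{
    // Segment 0xFFFF, offset 0x0010 is the first byte past 1MiB.
    immutable uint linear = (0xFFFF << 4) + 0x0010;
    assert(linear == 0x10_0000);

    assert(physicalAddress(linear, false) == 0);          // wraps around to address 0
    assert(physicalAddress(linear, true) == 0x10_0000);   // reaches memory above 1MiB
}
```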

As Intel developed more complex and sophisticated chips, A20 gate logic had to be built into the x86 itself to be compatible with older systems. So, in the 21st century, x86 chips boot up with crippled address lines thanks to a chain of backwards compatibility that originates in what seemed like a really cool performance hack to developers when WordStar was new technology. Actually, a few years ago Intel announced that they were dropping support for the A20 gate, so one day all this will be like a bad dream.

Loading the OS, Entering Protected Mode and Launching

Now we can start work on the two main jobs of the bootloader: loading the OS from disk and entering 32b “protected mode” so that it’ll run properly.

    .load_loop:
        call load_payload_or_execute  ; function call
        cmp word [payload_sectors_remaining], 0
        jnz .load_loop  ; Jump (if) Not Zero

There’s a bit of a dilemma with loading the OS. The BIOS provides disk drivers in the form of interrupts, which is really handy. The downside is that they only work in real mode because protected mode breaks compatibility with BIOS interrupts. But in real mode we can only access the bottom ~1MiB of RAM. So, if we want to support OS payloads bigger than 1MiB, and we don’t want to write our own disk drivers (which would require a lot more code and testing), we’ll have to switch back and forth from real mode to protected mode, copying a bit of the OS at a time.

Switching between real mode and protected mode isn’t trivial, so it would be nice to put this code into reusable functions. There’s a problem with that too, though: real mode and protected mode use different stacks, so the functions would have to enter through one stack and return through the other. This isn’t impossible, but it’s a little hairy and the approach I use instead is simpler. The loading and running process looks like this in pseudo-code:

repeat
        load_payload_or_execute()
until payload_sectors_remaining == 0

function load_payload_or_execute()
        -- load chunk of OS into low-memory buffer
        -- enter 32b protected mode
        -- copy OS chunk into high memory
        if payload_sectors_remaining == 0
        then
                run_os()
        end
        -- return to 16b real mode
end

Note that load_payload_or_execute() always returns back to real mode, and we can be in one of two states:

there’s still more OS data to be loaded, in which case we need to run the routine again to load more
the OS was fully loaded and was executed, in which case we’re done

Let’s look at the implementation in detail.

Read From Disk

There’s not much to see here, so I won’t dwell on this part. It uses the BIOS disk driver to load up to 64KiB at a time into a buffer in memory that’s accessible from real mode. Most of the code is just keeping track of pointers and lengths.

        ; Figure out how many sectors to load
        mov word [disk_address_packet+dapa_num_sectors], sectors_per_read
        cmp word [payload_sectors_remaining], sectors_per_read
        ja .full_read  ; Jump (if) Above
        mov ax, [payload_sectors_remaining]
        mov [disk_address_packet+dapa_num_sectors], ax
        .full_read:

        ; Read data
        mov si, disk_address_packet
        mov dl, [boot_disk]
        mov cx, 10  ; Up to 10-1=9 attempts
        .retry:
        dec cx
        jz shutdown
        mov ah, 0x42  ; Reload on every attempt: a failed call returns its error code in ah
        int 0x13
        jc .retry

        ; Keep track of how much we've read so far and where to put the next chunk of data
        xor eax, eax
        mov ax, [disk_address_packet+dapa_num_sectors]
        sub [payload_sectors_remaining], ax
        add [disk_address_packet+dapa_lba_lo], eax
        adc dword [disk_address_packet+dapa_lba_hi], 0
        shl ax, 7  ; Number of sectors -> number of 4B dwords
        mov [last_read_dwords], eax

This code is slow, and it only seems to work on hard disks (not floppies or USB drives) and on newer BIOSes, but it’s good enough for demo purposes.
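
The bookkeeping in that routine is easier to follow in a high-level language. Here’s a host-side sketch in D that mirrors it. The value 128 for sectorsPerRead is my assumption (it’s what 64KiB works out to with 512-byte sectors), the starting sector and the payload size of 300 sectors are made up, and the function names are hypothetical.

```d
// Hypothetical model of the chunked loading arithmetic.
// Try it with: dmd -main -unittest -run sketch.d
enum ushort sectorsPerRead = 128;   // assumption: 128 * 512 bytes = 64KiB per BIOS call

// How many sectors the next BIOS call should ask for.
ushort nextReadSize(ushort remaining) pure nothrow @nogc @safe
{
    if (remaining > sectorsPerRead)
        return sectorsPerRead;
    return remaining;
}

// Same conversion as "shl ax, 7": 512 bytes per sector / 4 bytes per dword = 128.
uint sectorsToDwords(ushort sectors) pure nothrow @nogc @safe
{
    return cast(uint) sectors << 7;
}

unittest
{
    ushort remaining = 300;     // made-up payload size in sectors
    uint lba = 1;               // assumption: the payload starts right after the boot sector
    uint totalDwords = 0;
    uint reads = 0;
    while (remaining != 0)
    {
        immutable ushort count = nextReadSize(remaining);
        remaining -= count;
        lba += count;
        totalDwords += sectorsToDwords(count);
        ++reads;
    }
    assert(reads == 3);                 // 128 + 128 + 44 sectors
    assert(lba == 301);
    assert(totalDwords == 300 * 128);
}
```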

Entering 32b Protected Mode

Entering protected mode is itself super simple, but there’s a fair bit of preparation needed to make it work properly in practice.

        cli
        mov [real_mode_sp], sp

We’re going to do some delicate messing around here, so the first thing we’ll do is disable interrupts. We need to swap to a different stack for protected mode, so we’ll save the current (real mode) stack pointer so we can swap back later when we return to real mode.

    lgdt [gdtr]

This is a special opcode that loads the new Global Descriptor Table (GDT) for protected mode. Remember I talked about segments before? Well, segments in real mode are just a way to access more RAM than is possible with a single 16b pointer. Each segment represents a hard-wired region of RAM, and it doesn’t really matter which overlapping segment you use to access a specific byte.

With protected mode, Intel decided to make segments configurable. They can be mapped to different memory regions, made read only, and configured for various modes and protection levels. The idea was that operating systems could set up different segments for different purposes: some for the OS kernel, and others for tasks. In reality, the major OSes ended up using a later system called paging to do the same job better, but the segment approach is good enough here.

You can read more about the GDT here. I’ll just say I’ve set up three segments that all treat the entire 32b memory address space as one flat chunk of memory: one configured for data, one configured for 32b code mode, and one configured for 16b code mode. The 16b segment is only needed briefly for returning to real mode, but more on that later.
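
To give a feel for what a GDT entry looks like, here’s a sketch in D that packs one 8-byte descriptor. The bit layout is the standard x86 one, but the function name makeDescriptor and the particular access and flag values are my own choices for a conventional flat setup. They are assumptions, not values copied from the bootloader.

```d
// Hypothetical encoder for one 8-byte GDT segment descriptor.
// Try it with: dmd -main -unittest -run sketch.d
ulong makeDescriptor(uint base, uint limit, ubyte access, ubyte flags) pure nothrow @nogc @safe
{
    ulong d = 0;
    d |= limit & 0xFFFF;                            // limit bits 0..15
    d |= cast(ulong)(base & 0xFF_FFFF) << 16;       // base bits 0..23
    d |= cast(ulong) access << 40;                  // access byte
    d |= cast(ulong)((limit >> 16) & 0xF) << 48;    // limit bits 16..19
    d |= cast(ulong)(flags & 0xF) << 52;            // granularity and size flags
    d |= cast(ulong)(base >> 24) << 56;             // base bits 24..31
    return d;
}

enum ubyte accessCode = 0x9A;   // present, ring 0, executable, readable
enum ubyte accessData = 0x92;   // present, ring 0, data, writable
enum ubyte flags32 = 0xC;       // 4KiB granularity, 32b default size
enum ubyte flags16 = 0x8;       // 4KiB granularity, 16b default size

unittest
{
    // Base 0 and limit 0xFFFFF pages of 4KiB cover the whole 4GiB address space.
    assert(makeDescriptor(0, 0xFFFFF, accessCode, flags32) == 0x00CF_9A00_0000_FFFF);
    assert(makeDescriptor(0, 0xFFFFF, accessData, flags32) == 0x00CF_9200_0000_FFFF);
    assert(makeDescriptor(0, 0xFFFFF, accessCode, flags16) == 0x008F_9A00_0000_FFFF);
}
```

The base and limit fields are scattered around the descriptor because the format was extended from the 286’s smaller version while staying compatible with it.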

Now to enter protected mode:

        ; Enable protected mode
        mov eax, cr0
        or al, 1
        mov cr0, eax

        jmp seg_code32:.enter_32b
        .enter_32b: bits 32
        mov ax, seg_data
        mov ds, ax
        mov es, ax
        mov fs, ax
        mov gs, ax
        mov ss, ax
        mov esp, stack_address

cr0 is one of a set of special CPU control registers, and we set the lowest bit to enable protected mode. Because protected mode is pretty much configured using the segments in the GDT, this doesn’t actually have any effect until we explicitly reload the segment registers. The first jmp simply jumps to the next instruction, but using the 32b mode segment for code. All the other segments get set to the data segment, and we also set up the new stack while we’re at it.
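
The cr0 manipulation is plain bit twiddling. As a host-side sketch in D (the function names are made up, and real code has to use the mov instructions shown above because cr0 isn’t ordinary memory):

```d
// Hypothetical model of setting and clearing the protected mode bit in cr0.
// Try it with: dmd -main -unittest -run sketch.d
enum uint cr0ProtectionEnable = 1u << 0;    // the lowest bit of cr0

uint withProtectedMode(uint cr0) pure nothrow @nogc @safe
{
    return cr0 | cr0ProtectionEnable;       // same effect as "or al, 1"
}

uint withRealMode(uint cr0) pure nothrow @nogc @safe
{
    return cr0 & ~cr0ProtectionEnable;      // same effect as "and al, 0xfe"
}

unittest
{
    immutable uint cr0 = 0x10;              // made-up starting value with one unrelated bit set
    assert(withProtectedMode(cr0) == 0x11); // only the lowest bit changes
    assert(withRealMode(0x11) == 0x10);     // and it can be cleared again
}
```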

We’re in 32b mode and officially more advanced than a 286 now :)

Copy OS Payload to Higher Memory

Previously we loaded up to 64KiB from disk to a buffer. Now that we aren’t restricted to 1MiB of RAM, we can copy it to its final resting place. Remember, this will happen several times for a big disk image.

        mov eax, disk_buffer
        mov esi, eax
        mov edi, [payload_write_ptr]
        mov ecx, [last_read_dwords]
        rep movsd
        mov [payload_write_ptr], edi
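
In D terms, rep movsd behaves roughly like the loop below: copy ecx dwords from esi to edi, moving forwards (assuming the direction flag is clear). This is a host-side sketch, and copyDwords is a made-up name. Returning the advanced destination pointer plays the role of saving edi back to payload_write_ptr.

```d
// Hypothetical D equivalent of the "rep movsd" copy step.
// Try it with: dmd -main -unittest -run sketch.d
uint* copyDwords(uint* dest, const(uint)* src, size_t count) pure nothrow @nogc
{
    foreach (i; 0 .. count)
        dest[i] = src[i];
    return dest + count;    // where the next chunk should be written
}

unittest
{
    uint[4] buffer = [1, 2, 3, 4];      // stands in for the low-memory disk buffer
    uint[8] highMemory;                 // stands in for the payload's final location
    uint* writePtr = highMemory.ptr;

    // Two chunks are copied back to back, as happens with a big disk image.
    writePtr = copyDwords(writePtr, buffer.ptr, 4);
    writePtr = copyDwords(writePtr, buffer.ptr, 4);

    assert(highMemory == [1, 2, 3, 4, 1, 2, 3, 4]);
    assert(writePtr == highMemory.ptr + 8);
}
```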

Execute the Payload

        cmp word [payload_sectors_remaining], 0
        jnz .skip_payload_execute

As described before, if we haven’t loaded the entire disk image yet, we’ll have to go back to real mode and load some more. Let’s assume we’re done loading.

        push dword 0x20  ; Push value onto stack working space (as argument for next function call)
        call move_irqs
        add esp, 4  ; Restore stack after function call (esp, because this is 32b code)

This is another PC snafu that needs to be fixed up. I’m not sure exactly what went wrong here, but in protected mode there’s a conflict between the interrupts that IBM reserved for hardware and the incompatible interrupts that Intel reserved for CPU events (like executing an unrecognised instruction). Both companies reserved interrupts starting from number 0. The good news is that hardware interrupts aren’t hard-wired to the x86 chip. A hardware device makes an IRQ (“interrupt request”) that the Programmable Interrupt Controller (PIC) maps to an actual interrupt. We can reprogram the PIC to map IRQs to another set of interrupt numbers that’s free. The next free space is interrupts starting from number 0x20, and that’s a popular place to put them.

The move_irqs function isn’t very interesting, so I won’t go into it here. But note the way the stack is used for handling the function argument. This is one of many function calling conventions used in practice.
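
For the curious, remapping usually means re-running the PIC’s initialisation sequence with new vector offsets. The sketch below lists the conventional sequence of port writes for the two 8259 PICs found in a PC. It is not the bootloader’s move_irqs: the name picRemapSequence, the PortWrite type and the two-offset signature are all made up, placing the slave PIC’s interrupts 8 above the master’s is an assumption, and the short delays that real code often inserts between writes are left out. Returning a list instead of touching hardware lets it run on a normal host.

```d
// Hypothetical listing of the port writes that remap the two 8259 PICs.
// Try it with: dmd -main -unittest -run sketch.d
struct PortWrite
{
    ushort port;
    ubyte value;
}

enum ushort masterCommand = 0x20;
enum ushort masterData = 0x21;
enum ushort slaveCommand = 0xA0;
enum ushort slaveData = 0xA1;

PortWrite[8] picRemapSequence(ubyte masterOffset, ubyte slaveOffset)
{
    return [
        PortWrite(masterCommand, 0x11),         // start initialisation, a 4th word follows
        PortWrite(slaveCommand, 0x11),
        PortWrite(masterData, masterOffset),    // first interrupt number for IRQs 0-7
        PortWrite(slaveData, slaveOffset),      // first interrupt number for IRQs 8-15
        PortWrite(masterData, 0x04),            // the slave PIC hangs off IRQ2
        PortWrite(slaveData, 0x02),             // the slave's cascade identity
        PortWrite(masterData, 0x01),            // 8086 mode
        PortWrite(slaveData, 0x01),
    ];
}

unittest
{
    immutable sequence = picRemapSequence(0x20, 0x28);
    assert(sequence[2] == PortWrite(0x21, 0x20));   // IRQ0 now raises interrupt 0x20
    assert(sequence[3] == PortWrite(0xA1, 0x28));   // IRQ8 now raises interrupt 0x28
}
```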

        ; Mask all IRQs for now
        mov al, 0xff
        out 0x21, al  ; Hardware port for master PIC
        out 0xa1, al  ; Hardware port for slave PIC

This bootloader doesn’t install any hardware drivers for handling hardware interrupts, so we disable (“mask”) all IRQs in the PIC. The payload can unmask interrupts for any hardware it has drivers for.

We’re almost ready to run the payload now.

        mov edi, bss_start
        mov ecx, bss_size
        xor al, al
        rep stosb

Here we clear the .bss section of the payload. The payload contains code and data from the D code, but a lot of variables don’t need to be initialised to any special value, so as an optimisation they don’t get put into the binary image. They get put into this .bss section, which needs to be generated at runtime as an area of memory filled with zeros.
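
The rep stosb step amounts to a memset. As a host-side sketch in D (clearBss is a made-up name, and the byte array stands in for the memory between bss_start and bss_start + bss_size):

```d
// Hypothetical D equivalent of clearing the .bss region with "rep stosb".
// Try it with: dmd -main -unittest -run sketch.d
void clearBss(ubyte* start, size_t size) pure nothrow @nogc
{
    start[0 .. size] = 0;   // fill the whole region with zero bytes
}

unittest
{
    ubyte[16] memory = 0xFF;            // pretend RAM full of leftover garbage
    clearBss(memory.ptr + 4, 8);        // made-up region: 8 bytes starting at offset 4

    assert(memory[3] == 0xFF);          // bytes before the region are untouched
    assert(memory[4] == 0 && memory[11] == 0);
    assert(memory[12] == 0xFF);         // bytes after the region are untouched
}
```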

    finit

One last step. Here we reset the floating point hardware. Early x86 chips only natively supported integer operations, and floating point had to be done (slowly) in software. Intel brought out x87 “math coprocessors” as special optional chips that could be added to boards to accelerate floating point calculations. Because they were separate chips, using them was a little more complicated than using the native integer operations. Eventually the x87s were built into the x86 chip itself, starting with the 486, but they still have the same quirky interface for backwards compatibility.

Basically, the floating point hardware carries a fair bit of internal state, and programs that do any floating point calculations will get weird results if they’re not in sync with that state. Of course, we could just try to avoid using floating point, but this one reset operation saves headaches.

    call systemEntry

Finally! We’re calling into the D code!

In D Land

I’ll talk more about writing hardware drivers and bare-metal D in the next part, so for now I’ll just leave some proof-of-principle code here. It dumps a message to the serial port (COM1).

module hello;
@nogc:
nothrow:

extern(C) void systemEntry()
{
        foreach (char c; "Hello from D\n")
        {
                putChar(c);
        }
}

enum kCom1DataPort = 0x3f8;
enum kCom1LineStatusPort = 0x3fd;

void putChar(char c)
{
        ubyte line_status;
        do
        {
                line_status = inPortByte(kCom1LineStatusPort);
        } while (!(line_status & (1 << 5)));  // Wait for transmit buffer empty flag

        outPortByte(kCom1DataPort, cast(ubyte) c);
}

void outPortByte(ushort port, ubyte b)
{
        asm @nogc nothrow
        {
                mov DX, port;
                mov AL, b;
                out DX, AL;
        }
}

ubyte inPortByte(ushort port)
{
        ubyte result;
        asm @nogc nothrow
        {
                mov DX, port;
                in AL, DX;
                mov result, AL;
        }
        return result;
}

Shutting Down

When the D function finishes, we land back inside the bootloader code, thanks to the magic of stacks.

cli

We disable interrupts again in case the D code enabled them. (It didn’t, but it doesn’t matter.)

Remember that we need to go back into real mode so that this payload-executing function can return safely. Before we do that, we need to move hardware interrupts back where they were.

        push dword 0x0
        call move_irqs
        add esp, 4  ; esp, because we're still in 32b code here

Now we’re going back to 16b real mode before returning from the function. Remember, we do this regardless of whether we actually ran the OS payload. It seems to take a few steps: going to 16b mode, setting real mode in the control register, then activating real mode by reloading the segment registers.

    .skip_payload_execute:

        ; Break back into 16b real mode
        cli
        jmp seg_code16:.enter_16b
        .enter_16b: bits 16
        mov eax, cr0
        and al, 0xfe
        mov cr0, eax
        jmp 0:.real_mode
        .real_mode:
        xor ax, ax
        mov ds, ax
        mov es, ax
        mov ss, ax
        mov esp, [real_mode_sp]
        lidt [real_mode_idtr]
        sti

        ret  ; Return back to the loading loop

The lidt resets any interrupt handlers the D code might have set up, so we can use BIOS interrupts again. More on that next time.

Once the function returns, we break out of the loading loop and enter the shutdown sequence. Conveniently, because we’re back in real mode, we can use the BIOS’s APM driver to do this. If it doesn’t work, we’ll just make the CPU hang.

shutdown:
        ; APM shutdown
        mov ax, 0x5300
        xor bx, bx
        int 0x15
        jc shutdown_fail

        mov ax, 0x5303
        xor bx, bx
        int 0x15
        jc shutdown_fail

        mov ax, 0x5308
        mov bx, 0x0001
        mov cx, 0x0001
        int 0x15
        jc shutdown_fail

        mov ax, 0x5307
        mov bx, 0x0001
        mov cx, 0x0003
        int 0x15

        shutdown_fail:
        cli
        hlt
        jmp shutdown_fail

Putting it All Together

That’s a lot of theory. You can find the complete code in my repo. Let’s put it into action:


$ # Assemble the bootloader
$ nasm -f elf32 -o boot.o boot.asm
$ # Compile the D code
$ dmd -c -boundscheck=off -release -betterC -debuglib= -defaultlib= -m32 hello.d
$ # Rip out runtime dependencies
$ objcopy -R '*.eh_frame' -R '*.d_dso_*' -R '*.__dmd_personality_v0' --strip-unneeded hello.o
$ # Build the disk image using a linker script
$ ld -T image.ld -o image.bin --gc-sections -m elf_i386 hello.o boot.o
$ # Boot it up!
$ qemu-system-x86_64 -serial stdio -drive file=image.bin,format=raw
Hello from D
$

That’s a lot of work to get a hello message. Of course, “hello world” is never a real test of a platform, so I’ve been doing some more experiments. I’ll write more about bare-metal D in followup posts, but here’s a little demo for now:

Youtube link

Update: see the programming subreddit for discussion of this article.