In this post, I'll be detailing how C and C++ can be programmed in a sane and memory-safe way. We'll be going through some custom memory allocator strategies that provide simple ways of dissolving common memory problems.

This post is inspired by my C++ core library called Presheaf, whose source code is freely available. The main goal of the library is to deliver the programming experience I wanted but that C couldn't fully provide.

Over the last few years C has earned a reputation as a memory-unsafe language due to its lack of built-in constraints ensuring proper object lifetime and ownership management - the kind you commonly find in newer, trendier languages. You might ask why bother learning memory strategies in C if you can simply use language Z, which has a borrow-checker or a garbage collector: well, language Z is not suited for all use cases - certainly not for mine. The borrow-checker is a pain to deal with, and most memory strategies in that land, like ECS, boil down to making the compiler shut up by hiding intricate lifetimes inside longer-living objects. Do I even need to talk about the garbage collector approach?

Building your own memory system

When writing any API, one should ask oneself about the scope and audience of that API. In the case of a memory management system, the main concern is memory safety. How many safety guard-rails should we build, and how much hand-holding does our end user need?

Guard-rails obviously come with their own performance costs. For instance, if we wanted a very robust system that disallows any use-after-free, we wouldn't be returning direct pointers to memory to our users. Instead, we could hand out handles and manage their coherence internally (check this post, by Andre Weissflog, for more information). This would require every allocation to have a unique ID, every read request to be checked against that ID, and so on.

In my personal case I don't really need all of this hand-holding and these training wheels. All I need is a performant system that can be easily managed and ensures some level of memory safety. The Presheaf library leans into this very philosophy - obvious protocols exist between caller and callee, and if the caller wants to break one of these protocol assumptions, the callee won't try to stop them out of fear of the programmer's stupidity. In summary, the programmer is always treated as an intelligent being who knows what they are doing. As simple and obvious as this philosophy may seem, the current state of the software industry has taken a turn in favour of laziness, with the excuse that "developer experience" is king. I certainly don't follow this way of thinking; in fact, the end user is king, and performance is queen.

Alignment: rules for memory reads

CPU memory reads cannot be done willy-nilly at arbitrary addresses. Modern architectures are optimised to read contiguous memory with a certain alignment - which makes stepping through memory a regular task (as opposed to jumping around randomly). This alignment is always a power of two and depends on the size of the memory unit (e.g. a struct member) we want to read. In C++ you can query the memory alignment of a given type T using alignof(T) (in C, you can use _Alignof(T)).

For instance, if we have an array of floats (each float with a size of 4 bytes), the address of the nth element of the array in memory should be of the form 4n + c, where c is the address of the first element of the array. Hence we say that the alignment of a float is 4 bytes.
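
You can check these values yourself. The following is a quick sketch assuming a typical 64-bit platform - the exact numbers are implementation-defined, so treat them as illustrative:

#include <cstdint>

// Typical values on a 64-bit platform (implementation-defined in general).
static_assert(alignof(float)   == 4, "floats are read on 4-byte boundaries");
static_assert(alignof(double)  == 8, "doubles are read on 8-byte boundaries");
static_assert(alignof(uint8_t) == 1, "single bytes can live anywhere");
static_assert(alignof(void*)   == 8, "pointers are 8 bytes wide and 8-byte aligned");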

For structs, the compiler may need to add padding in order to satisfy alignment requirements. A lost art in programming is arranging struct members for optimal alignment. Let's see this in practice.

Suppose I have a struct Foo declared as follows:

struct Foo {
    uint8_t* memory;
    uint32_t allocation_count;
    double   some_metric;
    float    some_other_metric;
};

From the point of view of the compiler, the actual memory layout of Foo has to guarantee that the alignment of each struct member is valid. Thus, in reality, the bytes composing Foo are laid out as follows:

struct Foo {
    uint8_t* memory;             // 8 bytes.
    uint32_t allocation_count;   // 4 bytes.
    // ---------------------------> Invisible padding of 4 bytes.
    double   some_metric;        // 8 bytes.
    float    some_other_metric;  // 4 bytes.
    // ---------------------------> Invisible padding of 4 bytes.
};

Notice how we're wasting 8 bytes of memory for each instance of our struct.

But really, why does the compiler put that padding between our members? For the sake of contradiction, suppose it didn't. The member Foo::some_metric (which has an alignment of 8 bytes) would then live at offset 12 (8 bytes for memory plus 4 for allocation_count) - and 12 is not a multiple of 8. Reading the struct in 8-byte-aligned steps from its start, we would hit offsets 0, 8, 16, ... and never land on our target; by offset 16 we would already have passed it by 4 bytes. For this exact reason, the compiler has to arrange memory for the worst case scenario: the largest alignment among the members becomes the alignment of the structure itself. In our case, the compiler sees that the largest alignment is 8 bytes.

One thing I haven't explained is why the compiler also puts 4 bytes of padding after the last member. If you have a contiguous array of Foo instances, each element has to start at an address satisfying the 8-byte alignment of Foo - and if it weren't for those last 4 bytes, the size of Foo wouldn't be a multiple of its alignment, so the next Foo in line would start at an address that isn't 8-byte aligned, misaligning its members. Once again, the compiler has to account for the worst case scenario.

Fixing our wasteful memory usage is simple: we just rearrange the members, taking their sizes and alignments into account:

struct Foo {
    uint8_t* memory;             // 8 bytes.
    double   some_metric;        // 8 bytes.
    uint32_t allocation_count;   // 4 bytes.
    float    some_other_metric;  // 4 bytes.
};

Analogously, every time we allocate memory in our custom allocators we'll have to make sure to account for the necessary alignment restrictions.
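
If you want to convince yourself of the savings, a couple of static_asserts will do. This is a sketch assuming a typical 64-bit target (the struct names are mine, used only for the comparison; the 32- and 24-byte figures follow from the layouts shown above):

#include <cstdint>

struct FooBefore {              // Original member order.
    uint8_t* memory;
    uint32_t allocation_count;
    double   some_metric;
    float    some_other_metric;
};

struct FooAfter {               // Reordered members.
    uint8_t* memory;
    double   some_metric;
    uint32_t allocation_count;
    float    some_other_metric;
};

// 8 + 4 + (4 padding) + 8 + 4 + (4 padding) = 32 bytes versus 8 + 8 + 4 + 4 = 24 bytes.
static_assert(sizeof(FooBefore) == 32, "padded layout");
static_assert(sizeof(FooAfter)  == 24, "packed layout");
static_assert(alignof(FooBefore) == 8 && alignof(FooAfter) == 8, "both are 8-byte aligned");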

Arena Allocator

The simplest allocator - yet sufficient for almost all use-cases - is the arena memory allocator. Its construction is ridiculously simple: a pointer to the block of memory being managed, the total capacity of that block, and an offset, relative to the start of the block, marking where the free space begins:

struct Arena {
    uint8_t* memory;
    size_t   capacity;
    size_t   offset;
};

With only this much information to work with, the arena can only accumulate allocations for a certain period and then free all of the allocated memory at once. This constraint is perfect for modelling the concept of a lifetime: objects allocated in the same arena share a common end of lifetime - namely, the moment the arena has its offset reset. This lets you trivially deal with the lifetime issues that languages like Rust try to solve by imposing rules in their compilers.
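
For concreteness, here is a minimal sketch of how such an arena could be set up and reset. The function names (arena_init, arena_clear) are mine and not necessarily Presheaf's API:

// The arena merely views a buffer owned by the caller; it never allocates or
// frees the underlying memory itself.
Arena arena_init(uint8_t* memory, size_t capacity) {
    Arena arena;
    arena.memory   = memory;
    arena.capacity = capacity;
    arena.offset   = 0;
    return arena;
}

// "Freeing" every allocation at once is nothing more than resetting the offset.
void arena_clear(Arena* arena) {
    arena->offset = 0;
}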

It should be noted that your style of programming with arenas may differ from the typical programming you see out there. It is common to see programs where ownership of memory is poorly defined over the course of the program's lifetime. For this exact reason, people in the land of "Modern C++" resort to "smart" pointers - instead of solving the root cause, this "solution" merely papers over poorly designed software at the price of runtime cost and, consequently, absurdly horrible performance.

When programming with arenas, one commonly thinks in groups of allocations (hence lifetime groups) and in chunks of objects. Work is mainly done with these chunks in mind, improving spatial and temporal cache locality. This way of programming is sometimes called "data oriented programming", but in reality it is simply the way computers were designed to be used - we don't even need a term for it; it's merely non-pessimised programming if you think about it.

Many of the made-up problems created by modern software practices simply dissolve when you design your program with memory arenas in mind instead of reasoning at the level of individual objects. For instance, ownership and lifetime problems become almost a non-issue.

Allocating memory blocks

When allocating a new block of memory in the arena, we have to account for the alignment of the structure or array being allocated. Thanks to the simplicity of the arena, computing the next address that satisfies the required alignment is a pretty simple task:

uintptr_t align_forward(uintptr_t ptr_addr, size_t alignment) {
    size_t mod_align = ptr_addr & (alignment - 1);  // Same as `ptr_addr % alignment`.
    if (mod_align != 0) {
        ptr_addr += alignment - mod_align;
    }
    return ptr_addr;
}

where the parameter alignment is assumed to be a power of two.
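
If you want to guard that assumption in debug builds, the usual bit trick works: a power of two has exactly one bit set. A small helper (hypothetical, not part of the code above) could look like this:

#include <cassert>
#include <cstddef>

// A power of two has a single set bit, so clearing the lowest set bit
// (`x & (x - 1)`) must yield zero.
inline bool is_power_of_two(size_t x) {
    return (x != 0) && ((x & (x - 1)) == 0);
}

// Example: assert(is_power_of_two(alignment)); at the top of align_forward.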

Having this auxiliary function at hand, making a new allocation is straightforward:

uint8_t* arena_alloc_align(Arena* arena, size_t size_bytes, uint32_t alignment) {
    if (arena == nullptr || arena->capacity == 0 || size_bytes == 0) {
        return nullptr;
    }

    uintptr_t memory_addr    = reinterpret_cast<uintptr_t>(arena->memory);
    uintptr_t new_block_addr = align_forward(memory_addr + arena->offset, alignment);

    if (new_block_addr + size_bytes > arena->capacity + memory_addr) {
        // Not enough free memory.
        return nullptr;
    }

    // Commit the new block of memory.
    arena->offset = static_cast<size_t>(size_bytes + new_block_addr - memory_addr);

    uint8_t* new_block = reinterpret_cast<uint8_t*>(new_block_addr);
    memset(new_block, 0, size_bytes);

    return new_block;
}

You should also create procedures for the remaining operations: clearing the arena, resizing an already allocated block of memory, and so on.
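
As an illustration, resizing could be sketched like this (my own sketch on top of the arena above, not Presheaf's actual implementation): if the block being resized was the most recent allocation we grow it in place; otherwise we allocate a fresh block and copy the old contents over.

#include <cstring>  // memcpy, memset.

uint8_t* arena_realloc_align(Arena* arena, uint8_t* old_block, size_t old_size,
                             size_t new_size, uint32_t alignment) {
    if (arena == nullptr || new_size == 0) {
        return nullptr;
    }
    if (old_block == nullptr || old_size == 0) {
        return arena_alloc_align(arena, new_size, alignment);
    }

    uint8_t* arena_end = arena->memory + arena->offset;
    if (old_block + old_size == arena_end) {
        // The block is the last allocation: grow (or shrink) it in place.
        size_t block_offset = static_cast<size_t>(old_block - arena->memory);
        if (block_offset + new_size > arena->capacity) {
            return nullptr;  // Not enough free memory.
        }
        if (new_size > old_size) {
            memset(old_block + old_size, 0, new_size - old_size);
        }
        arena->offset = block_offset + new_size;
        return old_block;
    }

    // Otherwise the old block stays where it is (wasted until the next reset)
    // and its contents are copied into a freshly allocated block.
    uint8_t* new_block = arena_alloc_align(arena, new_size, alignment);
    if (new_block != nullptr) {
        memcpy(new_block, old_block, old_size < new_size ? old_size : new_size);
    }
    return new_block;
}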

Temporary allocations with checkpoints

Being able to free all memory only at once can become a painful restriction when you want to perform temporary computations that shouldn't stick around in the allocator. To overcome this, we can create a checkpoint system that records the current state of the arena and is able to restore it once the memory allocated from the checkpoint onwards isn't needed anymore. This amounts to a simple implementation like the following:

struct ArenaCheckpoint {
    size_t saved_offset;
};

ArenaCheckpoint arena_make_checkpoint(Arena* arena) {
    return ArenaCheckpoint{arena->offset};
}

void arena_restore_state(Arena* arena, ArenaCheckpoint* checkpoint) {
    arena->offset = checkpoint->saved_offset;
}

You can certainly extend the behaviour of checkpoints. For instance, you can create a kind of auto-restoring checkpoint with the use of destructors, as sketched below. Moreover, I would recommend adding debug checks to ensure the correctness of the checkpoint being restored (e.g. does it come from the same arena? is the offset valid? etc.).
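
A minimal sketch of such an auto-restoring checkpoint (the type name ScopedArenaCheckpoint is mine, not Presheaf's):

struct ScopedArenaCheckpoint {
    Arena* arena;
    size_t saved_offset;

    explicit ScopedArenaCheckpoint(Arena* a) : arena(a), saved_offset(a->offset) {}
    ~ScopedArenaCheckpoint() { arena->offset = saved_offset; }

    // Non-copyable: restoring the same checkpoint twice would be a bug.
    ScopedArenaCheckpoint(const ScopedArenaCheckpoint&)            = delete;
    ScopedArenaCheckpoint& operator=(const ScopedArenaCheckpoint&) = delete;
};

// Usage: everything allocated inside the scope is reclaimed when it ends.
// {
//     ScopedArenaCheckpoint scratch{&arena};
//     uint8_t* tmp = arena_alloc_align(&arena, 1024, 16);
//     // ... temporary work with tmp ...
// }   // Offset restored here; `tmp` must not be used past this point.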

Stack Allocator

A stack allocator is nothing more than a contiguous memory block which we divide in order to offer memory space to consumers. In order to avoid memory fragmentation, we only allow the last allocated block to be freed.

Headers: storing relevant information

Each memory block allocated by our stack allocator will be accompanied by a header that will carry some basic information about the memory block it's associated with.

struct StackAllocHeader {
    size_t padding;
    size_t capacity;
    size_t previous_offset;
};

Let me explain what each of these fields means:

  • padding: The distance, in bytes, from the end of the previously allocated memory block to the start of the current one. It accounts both for the alignment requirement of the current block and for the space taken up by its header.
  • capacity: The total capacity, in bytes, of the current memory block.
  • previous_offset: The offset, relative to the start of the allocator's memory, of the previously allocated block. This is what lets us traverse the stack backwards.

You can visualise the header members as follows:

         previous offset              |alignment|              |------capacity------|
                |                     |         |              ^                    ^
                v                     v         v              |                    |
|previous header|previous memory block|+++++++++|current header|current memory block|
                                      ^                        ^
                                      |---------padding--------|

Allocator Structure

On to the stack allocator proper! The basic layout of the allocator looks like this:

struct Stack {
    uint8_t* memory;
    size_t   capacity;
    size_t   offset;
    size_t   previous_offset;
};

  • memory: Pointer to the start of the memory block managed by the allocator.
  • capacity: The maximum capacity in bytes of the allocator's memory block.
  • offset: The offset, relative to memory, to the start of the region available for allocations.
  • previous_offset: The offset, relative to memory, to the start of the last allocated memory block.

This can be visualised by the following diagram:

                                         offset
                                           |
                                           v
  |header 1|memory 1|++++|header 2|memory 2|++free space++|
  ^                               ^                       ^
  |                               |                       |
memory                         previous                  end
  |                             offset                    |
  |                                                       |
  |                                                       |
  |--------------------- capacity ------------------------|

The allocator can be either the owner or merely the manager of the memory pointed to by Stack::memory. In Presheaf I opted for using allocators as mere managers, so you need to initialise them with a valid pointer to a previously allocated block of memory.
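
Initialisation then mirrors the arena sketch from earlier: the stack just wraps a caller-provided buffer (again, the name stack_init is my own sketch, not necessarily Presheaf's):

Stack stack_init(uint8_t* memory, size_t capacity) {
    Stack stack;
    stack.memory          = memory;
    stack.capacity        = capacity;
    stack.offset          = 0;
    stack.previous_offset = 0;
    return stack;
}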

Allocating Blocks

Each block provided by the stack allocator consists of an alignment gap relative to the end of the previous block, a header, and the memory block itself that was requested by the user.

      |alignment|    memory
      |         |      |
      v         v      v
| ... |+++++++++|header|memory block| ... |
      ^                ^
      |                |
      |----padding-----|

The block of memory is preceded by a padding that comprises both the alignment needed for the memory block and its corresponding header.

In order to compute the padding needed by the block we can implement the following function:

size_t padding_with_header(
    uintptr_t  ptr_addr,
    size_t     alignment,
    size_t     header_size,
    size_t     header_alignment) {
    // First make room for the header, respecting its own alignment requirement.
    size_t padding    = 0;
    size_t mod_header = ptr_addr & (header_alignment - 1);  // Same as `ptr_addr % header_alignment`.
    if (mod_header != 0) {
        padding += header_alignment - mod_header;
    }
    padding += header_size;
    ptr_addr += padding;

    // Now align the start of the user-facing memory block itself. The header has
    // to be dealt with first: if we added its size last, the block could end up
    // misaligned whenever the requested alignment is larger than the header's own.
    size_t mod_align = ptr_addr & (alignment - 1);  // Same as `ptr_addr % alignment`.
    if (mod_align != 0) {
        padding += alignment - mod_align;
    }

    return padding;
}

It should be stressed that the alignment and header_alignment parameters should always be powers of two.

To allocate a new block of memory in the stack, we can proceed as follows:

uint8_t* stack_alloc_align(Stack* stack, size_t size_bytes, uint32_t alignment) {
    if (stack == nullptr || stack->capacity == 0 || size_bytes == 0) {
        return nullptr;
    }

    size_t current_capacity = stack->capacity;
    size_t current_offset   = stack->offset;

    uint8_t* free_memory = stack->memory + current_offset;

    size_t padding = padding_with_header(
        reinterpret_cast<uintptr_t>(free_memory),
        alignment,
        sizeof(StackAllocHeader),
        alignof(StackAllocHeader));

    if (padding + size_bytes > current_capacity - current_offset) {
        return nullptr;  // Not enough memory...
    }

    // Address to the start of the new block of memory.
    uint8_t* new_block = free_memory + padding;

    // Write to the header associated with the new block of memory.
    StackAllocHeader* new_header = reinterpret_cast<StackAllocHeader*>(new_block - sizeof(StackAllocHeader));
    new_header->padding          = padding;
    new_header->capacity         = size_bytes;
    new_header->previous_offset  = stack->previous_offset;

    // Update the stack offsets.
    stack->previous_offset = current_offset + padding;
    stack->offset          = current_offset + padding + size_bytes;

    memset(new_block, 0, size_bytes);
    return new_block;
}
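
With the header in place, freeing the most recent block (the only operation the stack model allows) is just a matter of rewinding the offsets. The name stack_pop is mine; treat this as a sketch of the idea rather than Presheaf's exact API:

void stack_pop(Stack* stack) {
    if (stack == nullptr || stack->offset == 0) {
        return;  // Nothing to free.
    }

    // The last block starts at `previous_offset`, and its header sits right before it.
    uint8_t*          last_block  = stack->memory + stack->previous_offset;
    StackAllocHeader* last_header = reinterpret_cast<StackAllocHeader*>(last_block - sizeof(StackAllocHeader));

    // Rewind past the block, its header, and its alignment padding, then point
    // `previous_offset` back at the block allocated before this one.
    stack->offset          = stack->previous_offset - last_header->padding;
    stack->previous_offset = last_header->previous_offset;
}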

Final comments

Manual memory management systems are actually quite fun and interesting. It is very simple to construct a concise, easy to use, safe, and performant memory system that will remove most of the common headaches created by bad software practices. It goes without saying that a good memory system frees you from dealing with individual lifetimes, ownership problems, and the paranoia that malloc inherently generates when used throughout a codebase.

Most of the time, all that you need is an arena. You can also combine the use of the arena with a stack allocator for more intricate memory arrangements.

If you wish to see the actual implementation of these allocators in the Presheaf library, please refer to the source code. The code written in this post has a very C-like flavour, which was intentional: it offloads unnecessary complexity and makes the code very easy to port to C.

Further reading material for the nerds