How to Embed a Large File into a Program

Imagine your program needs to read from a large file to function, but you want to ship it as a single executable for security or ease of distribution. In this article, we will explore how to embed such a file into your program.

Scenario

Suppose we have a file, model.bin, that is essential for our program to work. Typically, you would package the executable and the model file together. However, this approach has drawbacks:

  • Cumbersome packaging: Managing two separate files can be inconvenient. If either is misplaced, the program fails.
  • Security concerns: A client could replace the model file, leading to unintended behavior.
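
For comparison, the two-file approach usually means reading model.bin from disk at startup. The following is a minimal sketch of that baseline; load_file is a hypothetical helper introduced here for illustration and is not part of the program developed in this article:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads an entire file into a heap buffer.
 * Returns NULL (leaving *size_out untouched) if anything fails.
 * The caller owns the returned buffer and must free() it. */
unsigned char *load_file(const char *path, size_t *size_out) {
  FILE *f = fopen(path, "rb");
  if (!f) return NULL; /* e.g. the model file was misplaced */

  unsigned char *buf = NULL;
  size_t len = 0, cap = 0;
  for (;;) {
    if (len == cap) { /* buffer full: double its capacity */
      size_t new_cap = cap ? cap * 2 : 4096;
      unsigned char *tmp = realloc(buf, new_cap);
      if (!tmp) { free(buf); fclose(f); return NULL; }
      buf = tmp;
      cap = new_cap;
    }
    size_t n = fread(buf + len, 1, cap - len, f);
    len += n;
    if (n == 0) break; /* end of file or read error */
  }
  if (ferror(f)) { free(buf); fclose(f); return NULL; }
  fclose(f);
  *size_out = len;
  return buf;
}
```

If model.bin is missing, load_file returns NULL and the program has nothing to work with, which is exactly the failure mode a single-file build avoids.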

Our goal is to embed model.bin directly into the executable, creating a single-file solution.

For simplicity, let’s generate model.bin as 64 KB of random data. In practice, it could be several gigabytes (e.g., for large language models):

head -c65536 /dev/random > model.bin

Here’s what our program will do with the file’s content:

// main.c
#include <stdio.h>

const unsigned char *data_start() { /* TODO */ }

unsigned int data_size() { /* TODO */ }

int main(){
  const unsigned char *data = data_start();
  unsigned int size = data_size();

  printf("model size: %u bytes\n", size);
  for (size_t i = 0; i < 4; i++) {
    printf("%02x ", data[i]);
  }
  printf("\n...\n");
  for (size_t i = 0; i < 4; i++) {
    printf("%02x ", data[size - 1 - i]);
  }
  printf("\n");
  return 0;
}

Method 1

The easiest method is to serialize the model file into a C source file, model.c, and compile that into an object file:

xxd -i model.bin > model.c

This generates a C array containing the raw bytes of model.bin:

// model.c
unsigned char model_bin[] = {
    0x50, 0xa4, 0x3e, 0x34, /* ... */ 0x81, 0x15, 0x4b, 0xce
};
unsigned int model_bin_len = 65536;
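
To make the generated file less of a black box, here is a rough sketch of how such a source file could be produced by hand. emit_c_array is a hypothetical helper that only approximates the layout of xxd -i output (12 bytes per line, then a length variable); it is not how xxd itself is implemented:

```c
#include <stdio.h>

/* Hypothetical helper: writes `len` bytes from `buf` to `out` as a C array
 * named `name`, followed by an unsigned int `<name>_len`.
 * Returns 0 on success, -1 if the stream reported a write error. */
int emit_c_array(FILE *out, const char *name,
                 const unsigned char *buf, size_t len) {
  fprintf(out, "unsigned char %s[] = {\n", name);
  for (size_t i = 0; i < len; i++) {
    /* two-space indent at the start of each 12-byte row */
    const char *lead = (i % 12 == 0) ? "  " : " ";
    /* no comma after the last byte; line break after every 12th byte */
    const char *tail = (i + 1 == len) ? "\n" : (i % 12 == 11) ? ",\n" : ",";
    fprintf(out, "%s0x%02x%s", lead, buf[i], tail);
  }
  fprintf(out, "};\nunsigned int %s_len = %zu;\n", name, len);
  return ferror(out) ? -1 : 0;
}
```

Each input byte turns into about six characters of source text (a space, 0x50, and a comma), so the generated file is several times larger than the binary it encodes.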

Compile this into an object file:

gcc -c model.c -o model.o

Modify main.c to use this embedded array:

// main.c
#include <stdio.h>

extern unsigned char model_bin[]; // defined in model.c (non-const, matching the xxd output)
extern unsigned int model_bin_len; // defined in model.c

const unsigned char *data_start() { return model_bin; }

unsigned int data_size() { return model_bin_len; }

int main(){
  const unsigned char *data = data_start();
  unsigned int size = data_size();

  printf("model size: %u bytes\n", size);
  for (size_t i = 0; i < 4; i++) {
    printf("%02x ", data[i]);
  }
  printf("\n...\n");
  for (size_t i = 0; i < 4; i++) {
    printf("%02x ", data[size - 1 - i]);
  }
  printf("\n");
  return 0;
}

Now we can compile and run the program to verify:

gcc -o embed main.c model.o -Wall
./embed

model size: 65536 bytes
50 a4 3e 34 
...
81 15 4b ce 

This method is simple, but it has a serious limitation: it does not scale to large files. Not only is model.c roughly six times the size of model.bin, but the compiler may also refuse to compile it once it gets too large. For example, with a model of about 1 GB, Clang (invoked here as gcc) fails with the following error:

gcc -c model.c -o model.o

error: sorry, unsupported: file 'model.c' is too large for Clang to process
fatal error: error opening file '<invalid buffer>': 
2 errors generated.

Method 2 (for Linux)

Another way is to use objcopy to directly generate the object file without generating the source file.

# on x64 system
objcopy -I binary -O elf64-x86-64 model.bin model.o

To use this object file, we first need to examine what symbols are defined.

nm model.o
0000000000010000 D _binary_model_bin_end
0000000000010000 A _binary_model_bin_size
0000000000000000 D _binary_model_bin_start

The symbols are slightly different this time. The data starts at _binary_model_bin_start instead of model_bin. More importantly, _binary_model_bin_size is an absolute symbol (type A in the nm output), not a variable like model_bin_len, so we compute the size as _binary_model_bin_end - _binary_model_bin_start instead.

// main.c
#include <stdio.h>

extern const unsigned char _binary_model_bin_start[]; // from obj file
extern const unsigned char _binary_model_bin_end[]; // from obj file

const unsigned char *data_start() { return _binary_model_bin_start; }

unsigned int data_size() { 
  return _binary_model_bin_end - _binary_model_bin_start; 
}

int main(){
  const unsigned char *data = data_start();
  unsigned int size = data_size();

  printf("model size: %u bytes\n", size);
  for (size_t i = 0; i < 4; i++) {
    printf("%02x ", data[i]);
  }
  printf("\n...\n");
  for (size_t i = 0; i < 4; i++) {
    printf("%02x ", data[size - 1 - i]);
  }
  printf("\n");
  return 0;
}

Let’s verify:

gcc -o embed main.c model.o -Wall
./embed

model size: 65536 bytes
50 a4 3e 34 
...
81 15 4b ce
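
One caveat for the multi-gigabyte case mentioned at the start: an unsigned int is typically 32 bits wide, so a size computed this way would wrap around for data of 4 GiB or more. A hedged sketch of a wider variant follows; span_size is a hypothetical helper name, not something objcopy provides:

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the half-open range [start, end).
 * Both pointers must point into (or one past the end of) the same object,
 * with end >= start. */
size_t span_size(const unsigned char *start, const unsigned char *end) {
  return (size_t)(end - start);
}
```

With this, data_size() could return size_t and call span_size(_binary_model_bin_start, _binary_model_bin_end), and the printf format would change from %u to %zu.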

Method 3 (for macOS/arm64)

Unfortunately, method 2 does not work on macOS. Instead, we can write a small assembly file, model.s, that pulls in the bytes of model.bin with the .incbin directive.

; model.s
.section __DATA,__const
.global __binary_model_bin_start
.global __binary_model_bin_end
.align 2 ; 2^2 = 4-bytes alignment

__binary_model_bin_start:
.incbin "model.bin" ; include bytes from model.bin file
__binary_model_bin_end:

Now we can assemble this into an object file and inspect its symbols:

as model.s -o model.o
nm model.o

0000000000010000 D __binary_model_bin_end
0000000000000000 D __binary_model_bin_start
0000000000000000 t ltmp0
0000000000000000 d ltmp1

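Everything else is the same as in method 2: the same main.c and the same compilation step. The labels in model.s carry an extra leading underscore because the macOS toolchain prefixes every C identifier with an underscore, so _binary_model_bin_start in C refers to __binary_model_bin_start in the object file.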

gcc -o embed main.c model.o -Wall
./embed

model size: 65536 bytes
50 a4 3e 34 
...
81 15 4b ce
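
Whichever method is used, the four-byte spot check above can be extended to a full comparison against the original file during development. This is a minimal sketch under the assumption that model.bin is still available on disk; matches_file is a hypothetical helper, not part of the original program:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the file at `path` holds exactly the
 * `size` bytes at `data`, 0 if the contents differ, and -1 if the file
 * cannot be opened or read. */
int matches_file(const unsigned char *data, size_t size, const char *path) {
  FILE *f = fopen(path, "rb");
  if (!f) return -1;

  unsigned char chunk[4096];
  size_t offset = 0, n;
  int result = 1;
  while (result == 1 && (n = fread(chunk, 1, sizeof chunk, f)) > 0) {
    /* file is longer than the embedded data, or the bytes differ */
    if (n > size - offset || memcmp(chunk, data + offset, n) != 0)
      result = 0;
    else
      offset += n;
  }
  if (ferror(f))
    result = -1;
  else if (result == 1 && offset != size)
    result = 0; /* file is shorter than the embedded data */
  fclose(f);
  return result;
}
```

For example, main could call matches_file(data_start(), data_size(), "model.bin") and expect 1 when the embedded copy is intact.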

Hopefully, this article helped you embed a large binary into your program or library. Happy coding!