How to Embed a Large File into a Program
Imagine your program needs to read from a large file to function, but you want to ship it as a single executable for security or ease of distribution. In this article, we will explore how to embed such a file into your program.
Scenario
Suppose we have a file, model.bin, that is essential for our program to work. Typically, you would package the executable and the model file together. However, this approach has drawbacks:
- Cumbersome packaging: Managing two separate files can be inconvenient. If either is misplaced, the program fails.
- Security concerns: A client could replace the model file, leading to unintended behavior.
Our goal is to embed model.bin directly into the executable, creating a single-file solution.
For simplicity, let’s generate model.bin as 64 KB of random data. In practice, it could be several gigabytes (e.g., for large language models):
head -c65536 /dev/random > model.bin
Here’s what our program will do with the file’s content:
// main.c
#include <stdio.h>
const unsigned char *data_start() { /* TODO */ }
unsigned int data_size() { /* TODO */ }
int main(){
    const unsigned char *data = data_start();
    unsigned int size = data_size();
    printf("model size: %u bytes\n", size);
    // first four bytes
    for (size_t i = 0; i < 4; i++) {
        printf("%02x ", data[i]);
    }
    printf("\n...\n");
    // last four bytes, in file order
    for (size_t i = 0; i < 4; i++) {
        printf("%02x ", data[size - 4 + i]);
    }
    printf("\n");
    return 0;
}
Method 1
The easiest method is to serialize the model file as a C source file, model.c, and compile that into an object file. First, generate the source with xxd:
xxd -i model.bin > model.c
This generates a C array containing the raw bytes of model.bin:
// model.c
unsigned char model_bin[] = {
0x50, 0xa4, 0x3e, 0x34, /* ... */ 0x81, 0x15, 0x4b, 0xce
};
unsigned int model_bin_len = 65536;
Compile this into an object file:
gcc -c model.c -o model.o
Modify main.c to use this embedded array:
// main.c
#include <stdio.h>
// declared non-const to match the definitions that xxd -i generates
extern unsigned char model_bin[];   // from model.o file
extern unsigned int model_bin_len;  // from model.o file
const unsigned char *data_start() { return model_bin; }
unsigned int data_size() { return model_bin_len; }
int main(){
    const unsigned char *data = data_start();
    unsigned int size = data_size();
    printf("model size: %u bytes\n", size);
    // first four bytes
    for (size_t i = 0; i < 4; i++) {
        printf("%02x ", data[i]);
    }
    printf("\n...\n");
    // last four bytes, in file order
    for (size_t i = 0; i < 4; i++) {
        printf("%02x ", data[size - 4 + i]);
    }
    printf("\n");
    return 0;
}
Now, we can compile and run the program to verify:
gcc -o embed main.c model.o -Wall
./embed
model size: 65536 bytes
50 a4 3e 34
...
81 15 4b ce
This method is simple, but it has a serious limitation: it does not scale to large files. The generated model.c is roughly 6x the size of model.bin, and the compiler will refuse to compile it once it gets too large. For example, with a model of about 1 GB, Clang (invoked here through the gcc command, as on macOS) fails with the following error:
gcc -c model.c -o model.o
error: sorry, unsupported: file 'model.c' is too large for Clang to process
fatal error: error opening file '<invalid buffer>':
2 errors generated.
Method 2 (for Linux)
Another way is to use objcopy to generate the object file directly from model.bin, without an intermediate source file:
# on x64 system
objcopy -I binary -O elf64-x86-64 model.bin model.o
To use this object file, we first need to examine which symbols it defines:
nm model.o
0000000000010000 D _binary_model_bin_end
0000000000010000 A _binary_model_bin_size
0000000000000000 D _binary_model_bin_start
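The symbol names differ from those in method 1: the data now starts at _binary_model_bin_start instead of model_bin. More importantly, there is no ready-made length variable (_binary_model_bin_size is an absolute symbol, not a variable holding the length), so we compute the size as _binary_model_bin_end - _binary_model_bin_start.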
// main.c
#include <stdio.h>
extern const unsigned char _binary_model_bin_start[]; // from obj file
extern const unsigned char _binary_model_bin_end[];   // from obj file
const unsigned char *data_start() { return _binary_model_bin_start; }
unsigned int data_size() {
    // the distance between the two markers is the size of the embedded data
    return (unsigned int)(_binary_model_bin_end - _binary_model_bin_start);
}
int main(){
    const unsigned char *data = data_start();
    unsigned int size = data_size();
    printf("model size: %u bytes\n", size);
    // first four bytes
    for (size_t i = 0; i < 4; i++) {
        printf("%02x ", data[i]);
    }
    printf("\n...\n");
    // last four bytes, in file order
    for (size_t i = 0; i < 4; i++) {
        printf("%02x ", data[size - 4 + i]);
    }
    printf("\n");
    return 0;
}
Let’s verify:
gcc -o embed main.c model.o -Wall
./embed
model size: 65536 bytes
50 a4 3e 34
...
81 15 4b ce
Method 3 (for macOS/arm64)
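Unfortunately, method 2 does not work on macOS. Instead, we can write an assembly file, model.s, that pulls in the file's bytes with the .incbin directive. The symbols carry an extra leading underscore because macOS prefixes C symbol names with one, so __binary_model_bin_start in assembly is _binary_model_bin_start in C: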
; model.s
.section __DATA,__const
.global __binary_model_bin_start
.global __binary_model_bin_end
.align 2 ; 2^2 = 4-bytes alignment
__binary_model_bin_start:
.incbin "model.bin" ; include bytes from model.bin file
__binary_model_bin_end:
Now, we can assemble this into an object file and inspect its symbols:
as model.s -o model.o
nm model.o
0000000000010000 D __binary_model_bin_end
0000000000000000 D __binary_model_bin_start
0000000000000000 t ltmp0
0000000000000000 d ltmp1
Everything else is exactly the same as in method 2, with the same main.c and the same compilation step:
gcc -o embed main.c model.o -Wall
./embed
model size: 65536 bytes
50 a4 3e 34
...
81 15 4b ce
Hopefully, this article helped you embed a large binary into your program or library. Happy coding!