Getting started with DiskANN
DiskANN is a graph-based indexing and search system that can perform fast and accurate approximate nearest neighbor (ANN) search on large-scale vector datasets using a single node with limited memory and an SSD. DiskANN can index and search a billion points with high recall and low latency, while using much less memory than state-of-the-art ANN search algorithms. DiskANN is based on a novel graph index called Vamana that is more versatile than existing graph indices.
In this blog, we will run DiskANN end-to-end. First things first, let's clone the repo:
git clone https://github.com/microsoft/DiskANN.git
cd DiskANN
We will use the provided Dockerfile to build the Docker image:
docker build -t diskann .
Let’s run the image.
# make sure to allocate memory according to your system config
docker run -it --name diskann --memory="4g" diskann
Now, we will follow the example here closely within the docker container. Let’s install additional packages:
apt-get update && apt-get install -y wget neovim htop
First, we need a dataset. We will use the 1M SIFT dataset (ANN_SIFT1M) as our starting point.
cd /app/DiskANN
mkdir -p build/data && cd build/data
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xf sift.tar.gz
ls -lh sift
The dataset consists of 4 files:
- sift_base.fvecs: the database vectors against which search is performed
- sift_learn.fvecs: a smaller database used for parameter tuning
- sift_query.fvecs: query vectors
- sift_groundtruth.ivecs: ground truth
For this exercise, let’s use the smaller set, i.e., the sift_learn.fvecs file, to build the index. This file contains 100k vectors of dimension 128 in float32 type. We first need to convert the vector files into a format DiskANN can read.
cd .. # /app/DiskANN/build
tests/utils/fvecs_to_bin float data/sift/sift_learn.fvecs data/sift/sift_learn.fbin
This converts the fvecs format into the fbin format. Now that we have the data in bin format, we can print it to inspect it:
tests/utils/bin_to_tsv float data/sift/sift_learn.fbin /dev/stdout | less
We can see that the vectors, although stored as floats, have integer values. OK, let’s convert the query vectors into binary format as well:
tests/utils/fvecs_to_bin float data/sift/sift_query.fvecs data/sift/sift_query.fbin
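For reference, the two layouts differ only in where the dimensionality lives: fvecs repeats a 4-byte dimension header before every vector, while fbin stores a single (num_points, dim) header followed by the raw float32 data. A rough Python sketch of the conversion, using only the standard struct module (the exact fbin header convention is an assumption based on the description here, not DiskANN's source):

```python
import struct

def fvecs_to_fbin(fvecs_bytes: bytes) -> bytes:
    """Convert fvecs (per-vector dim header) to fbin (single global header)."""
    vectors = []
    offset = 0
    dim = 0
    while offset < len(fvecs_bytes):
        # each fvecs record: int32 dim, then dim float32 values
        (dim,) = struct.unpack_from("<i", fvecs_bytes, offset)
        offset += 4
        vectors.append(fvecs_bytes[offset: offset + 4 * dim])
        offset += 4 * dim
    # assumed fbin layout: int32 num_points, int32 dim, then all float32 data
    header = struct.pack("<ii", len(vectors), dim)
    return header + b"".join(vectors)

# two 2-dimensional vectors in fvecs form
raw = struct.pack("<iff", 2, 1.0, 2.0) + struct.pack("<iff", 2, 3.0, 4.0)
fbin = fvecs_to_fbin(raw)
print(struct.unpack_from("<ii", fbin, 0))  # header: (2, 2)
```

The payoff of the fbin layout is that the file is one contiguous matrix after an 8-byte header, which is convenient for memory-mapped reads.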
Because we will be using the sift_learn file instead of the sift_base file, we need to compute the ground truth ourselves, i.e., run an exact k-nearest-neighbor (KNN) search. This can be done with a brute-force algorithm, and DiskANN provides a utility for that.
./tests/utils/compute_groundtruth --help
Arguments:
-h [ --help ] Print information on arguments
--data_type arg data type <int8/uint8/float>
--dist_fn arg distance function <l2/mips>
--base_file arg File containing the base vectors in binary format
--query_file arg File containing the query vectors in binary format
--gt_file arg File name for the writing ground truth in binary
format, please don't append .bin at end if no
filter_label or filter_label_file is provided it
will save the file with '.bin' at end.else it will
save the file as filename_label.bin
--K arg Number of ground truth nearest neighbors to compute
--tags_file arg File containing the tags in binary format
We need to provide all the arguments, except for --tags_file. Let’s search for 100 nearest neighbors.
tests/utils/compute_groundtruth \
--data_type float \
--dist_fn l2 \
--base_file data/sift/sift_learn.fbin \
--query_file data/sift/sift_query.fbin \
--gt_file data/sift/sift_query_learn_gt100 \
--K 100
Since the database consists of only 100k vectors, the brute-force KNN computation is quite fast. After this command, you should find the sift_query_learn_gt100 file in the /app/DiskANN/build/data/sift folder. The file format is:
num_points (i32)
dim (i32), here equal to K, the number of neighbors
ids: num_points x dim matrix of i32
distances: num_points x dim matrix of float32
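Following that layout, here is a small Python sketch that parses such a buffer with nothing but the struct module (the helper name and the tiny synthetic input are illustrative, not part of DiskANN):

```python
import struct

def read_groundtruth(buf: bytes):
    """Parse the layout: header, then the id matrix, then the distance matrix."""
    num_points, k = struct.unpack_from("<ii", buf, 0)
    n = num_points * k
    ids = struct.unpack_from(f"<{n}i", buf, 8)
    dists = struct.unpack_from(f"<{n}f", buf, 8 + 4 * n)
    # regroup the flat arrays into one row of k entries per query
    ids = [ids[i * k:(i + 1) * k] for i in range(num_points)]
    dists = [dists[i * k:(i + 1) * k] for i in range(num_points)]
    return ids, dists

# tiny synthetic file: 1 query, 2 neighbors
buf = struct.pack("<ii", 1, 2) + struct.pack("<2i", 7, 3) + struct.pack("<2f", 0.5, 1.5)
ids, dists = read_groundtruth(buf)
print(ids, dists)  # [(7, 3)] [(0.5, 1.5)]
```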
In ANN search, there are two steps. The first is to build an index: a clever data structure over the database vectors that lets us quickly and accurately find approximate nearest neighbors for any given query. The second is to perform the actual search using that data structure.
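For contrast, the exact baseline that an index is trying to beat (and what compute_groundtruth did above) is a linear scan over the whole database. A minimal sketch in plain Python, using squared L2 distance to match the --dist_fn l2 option:

```python
def brute_force_knn(base, query, k):
    """Exact k-NN by scanning every base vector: O(n * d) work per query."""
    dists = []
    for idx, vec in enumerate(base):
        # squared L2 distance; the square root does not change the ranking
        d = sum((a - b) ** 2 for a, b in zip(vec, query))
        dists.append((d, idx))
    dists.sort()
    return [idx for _, idx in dists[:k]]

base = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
print(brute_force_knn(base, (0.9, 0.9), 2))  # nearest two: [1, 0]
```

A graph index like Vamana exists precisely to avoid this scan: it walks a small neighborhood of the graph instead of touching all n vectors.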
To do the first step, i.e., build the index, we invoke the build_disk_index utility:
./tests/build_disk_index --help
Arguments:
-h [ --help ] Print information on arguments
--data_type arg data type <int8/uint8/float>
--dist_fn arg distance function <l2/mips>
--data_path arg Input data file in bin format
--index_path_prefix arg Path prefix for saving index file
components
-R [ --max_degree ] arg (=64) Maximum graph degree
-L [ --Lbuild ] arg (=100) Build complexity, higher value results in
better graphs
-B [ --search_DRAM_budget ] arg DRAM budget in GB for searching the index
to set the compressed level for data while
search happens
-M [ --build_DRAM_budget ] arg DRAM budget in GB for building the index
-T [ --num_threads ] arg (=12) Number of threads used for building index
(defaults to omp_get_num_procs())
--QD arg (=0) Quantized Dimension for compression
--codebook_prefix arg Path prefix for pre-trained codebook
--PQ_disk_bytes arg (=0) Number of bytes to which vectors should be
compressed on SSD; 0 for no compression
--append_reorder_data Include full precision data in the index.
Use only in conjuction with compressed
data on SSD.
--build_PQ_bytes arg (=0) Number of PQ bytes to build the index; 0
for full precision build
--use_opq Use Optimized Product Quantization (OPQ).
--label_file arg Input label file in txt format for
Filtered Index build .The file should
contain comma separated filters for each
node with each line corresponding to a
graph node
--universal_label arg Universal label, Use only in conjuction
with label file for filtered index build.
If a graph node has all the labels against
it, we can assign a special universal
filter to the point instead of comma
separated filters for that point
--FilteredLbuild arg (=0) Build complexity for filtered points,
higher value results in better graphs
-F [ --filter_threshold ] arg (=0) Threshold to break up the existing nodes
to generate new graph internally where
each node has a maximum F labels.
--label_type arg (=uint) Storage type of Labels <uint/ushort>,
default value is uint which will consume
memory 4 bytes per filter
There are a ton of parameters here, so let’s stick with what the tutorial provides:
tests/build_disk_index \
--data_type float \
--dist_fn l2 \
--data_path data/sift/sift_learn.fbin \
--index_path_prefix data/sift/disk_index_sift_learn_R32_L50_A1.2 \
-R 32 \
-L 50 \
-B 0.003 \
-M 1
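It is worth sanity-checking what -B 0.003 implies. Taking a GB as 10^9 bytes here (DiskANN's exact convention may differ), a 0.003 GB search budget spread over 100k vectors leaves roughly 30 bytes of compressed (PQ) data per vector, versus 512 bytes for a full-precision float32 vector. A quick back-of-the-envelope:

```python
num_vectors = 100_000      # sift_learn size
dim = 128                  # SIFT dimensionality
search_budget_gb = 0.003   # the -B value passed above

# assumption: GB taken as 10^9 bytes for this estimate
bytes_per_vector = search_budget_gb * 1e9 / num_vectors
full_precision = dim * 4   # float32 is 4 bytes per dimension

print(round(bytes_per_vector), full_precision)  # 30 512
```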
Finally, we can run the search:
tests/search_disk_index \
--data_type float \
--dist_fn l2 \
--index_path_prefix data/sift/disk_index_sift_learn_R32_L50_A1.2 \
--query_file data/sift/sift_query.fbin \
--gt_file data/sift/sift_query_learn_gt100 \
-K 10 \
-L 10 20 30 40 50 100 \
--result_path data/sift/res \
--num_nodes_to_cache 10000
This outputs a nice table of search results along with recall:
L Beamwidth QPS Mean Latency 99.9 Latency Mean IOs CPU (s) Recall@10
======================================================================================================================
10 2 11142.41 1059.91 11968.00 9.10 5.30 80.49
20 2 6280.56 1880.91 3457.00 16.45 27.94 95.41
30 2 4275.08 2769.92 23823.00 24.03 50.41 98.19
40 2 3294.44 3604.62 14347.00 31.71 59.72 99.02
50 2 2652.86 4489.14 15791.00 39.45 54.30 99.40
100 2 1335.79 8939.39 30558.00 78.62 111.84 99.85
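Recall@10 here is the fraction of the true 10 nearest neighbors (from the ground-truth file) that appear among the 10 ids the index returned, averaged over all queries. A minimal sketch with made-up ids (the function name and toy data are illustrative):

```python
def recall_at_k(returned, groundtruth, k):
    """Average overlap between returned ids and true top-k ids across queries."""
    total = 0.0
    for res, gt in zip(returned, groundtruth):
        # order within the top-k does not matter, so compare as sets
        total += len(set(res[:k]) & set(gt[:k])) / k
    return total / len(returned)

# two toy queries: the first gets 2/2 neighbors right, the second 1/2
returned = [[4, 9], [1, 8]]
groundtruth = [[9, 4], [1, 2]]
print(recall_at_k(returned, groundtruth, 2))  # 0.75
```

Note the trend in the table: raising the search list size L trades queries per second for recall, which is the central knob when tuning DiskANN.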