Getting started with DiskANN

DiskANN is a graph-based indexing and search system that can perform fast and accurate approximate nearest neighbor (ANN) search on large-scale vector datasets using a single node with limited memory and an SSD. DiskANN can index and search a billion points with high recall and low latency, while using much less memory than state-of-the-art ANN search algorithms. DiskANN is based on a novel graph index called Vamana that is more versatile than existing graph indices.

In this blog, we will run DiskANN end-to-end. First things first, let's clone the repo:

git clone https://github.com/microsoft/DiskANN.git
cd DiskANN

We will use the provided Dockerfile to build the Docker image:

docker build -t diskann .

Let’s run the image.

# make sure to allocate memory according to your system config
docker run -it --name diskann --memory="4g" diskann

Now, we will follow the example here closely within the Docker container. Let's install additional packages:

apt-get install -y wget neovim htop

First, we need a dataset. We will use the 1M SIFT dataset (ANN_SIFT1M) as our starting point.

cd /app/DiskANN
mkdir build/data && cd build/data
wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xf sift.tar.gz
ls -lh sift

The dataset consists of four files:

  • sift_base.fvecs: the database vectors against which search is performed
  • sift_learn.fvecs: a smaller subset of the database, meant for parameter tuning
  • sift_query.fvecs: query vectors
  • sift_groundtruth.ivecs: ground truth nearest neighbors for the queries

For this exercise, let's use the smaller set, i.e., the sift_learn.fvecs file, to build the index. This file contains 100k vectors of dimension 128 in float32. We first need to convert the vector files into a format DiskANN can read.

cd .. # /app/DiskANN/build
tests/utils/fvecs_to_bin float data/sift/sift_learn.fvecs data/sift/sift_learn.fbin
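The conversion itself is simple repacking: fvecs stores a 4-byte dimension header before every vector, while fbin stores one global header (num_points, then dim) followed by the raw float data. A minimal Python sketch of the same conversion, for illustration only (this is not DiskANN's utility):

```python
import struct

import numpy as np


def fvecs_to_fbin(fvecs_path, fbin_path):
    """Repack .fvecs (a dim header per vector) into .fbin (one global header)."""
    raw = np.fromfile(fvecs_path, dtype=np.int32)
    dim = int(raw[0])                                    # every record starts with its dimension
    recs = raw.reshape(-1, dim + 1)                      # each row: [dim][dim x float32 bits]
    vecs = np.ascontiguousarray(recs[:, 1:]).view(np.float32)
    with open(fbin_path, "wb") as f:
        f.write(struct.pack("<ii", vecs.shape[0], dim))  # global header: num_points, dim
        vecs.tofile(f)
```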

This converts the fvecs format into the fbin format. Now that the data is in bin format, we can print it to inspect it.

tests/utils/bin_to_tsv float data/sift/sift_learn.fbin /dev/stdout | less

We can see that the vectors, although stored as floats, have integer values. OK, let's convert the query vectors into binary format as well.

tests/utils/fvecs_to_bin float data/sift/sift_query.fvecs data/sift/sift_query.fbin

Because we will be using the sift_learn file instead of the sift_base file, we need to compute the ground truth ourselves, i.e., run an exact k-nearest-neighbor (KNN) search. This can be done with a brute-force algorithm, and DiskANN provides a utility for it.
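For intuition, exact ground truth is just an exhaustive distance computation per query. A small NumPy sketch of brute-force KNN under L2 (illustrative only, not the DiskANN utility):

```python
import numpy as np


def brute_force_knn(base, queries, k):
    """Exact top-k under squared L2, one query at a time to keep memory flat."""
    ids = np.empty((len(queries), k), dtype=np.int32)
    dists = np.empty((len(queries), k), dtype=np.float32)
    base_sq = (base ** 2).sum(axis=1)                    # precompute ||b||^2 once
    for i, q in enumerate(queries):
        d = base_sq - 2.0 * (base @ q) + (q ** 2).sum()  # ||b - q||^2 expanded
        top = np.argpartition(d, k - 1)[:k]              # unordered top-k candidates
        top = top[np.argsort(d[top])]                    # order them by distance
        ids[i], dists[i] = top, d[top]
    return ids, dists
```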

./tests/utils/compute_groundtruth --help
Arguments:
  -h [ --help ]         Print information on arguments
  --data_type arg       data type <int8/uint8/float>
  --dist_fn arg         distance function <l2/mips>
  --base_file arg       File containing the base vectors in binary format
  --query_file arg      File containing the query vectors in binary format
  --gt_file arg         File name for the writing ground truth in binary 
                        format, please don't append .bin at end if no 
                        filter_label or filter_label_file is provided it 
                        will save the file with '.bin' at end.else it will 
                        save the file as filename_label.bin
  --K arg               Number of ground truth nearest neighbors to compute
  --tags_file arg       File containing the tags in binary format

We need to provide all the arguments except --tags_file. Let's search for the 100 nearest neighbors.

tests/utils/compute_groundtruth \
  --data_type float \
  --dist_fn l2 \
  --base_file data/sift/sift_learn.fbin \
  --query_file data/sift/sift_query.fbin \
  --gt_file data/sift/sift_query_learn_gt100 \
  --K 100

Since the database consists of only 100k vectors, the brute-force KNN computation is quite fast. After this command, you should find the sift_query_learn_gt100 file in the /app/DiskANN/build/data/sift folder. The file format is:

num_points (i32)
K (i32)
matrix of neighbor indices, [num_points x K] (i32)
matrix of distances, [num_points x K] (float32)
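Given that layout, the file can be parsed directly. A quick Python sketch for inspecting it (function and field names are mine; the layout is as described above):

```python
import numpy as np


def read_groundtruth(path):
    """Read the groundtruth file: header, then neighbor ids, then distances."""
    with open(path, "rb") as f:
        n, k = np.fromfile(f, dtype=np.int32, count=2)
        ids = np.fromfile(f, dtype=np.int32, count=n * k).reshape(n, k)
        dists = np.fromfile(f, dtype=np.float32, count=n * k).reshape(n, k)
    return ids, dists
```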

ANN involves two steps. The first is to build an index: organize all database vectors in a clever data structure so that, for any given query, we can quickly and accurately find its approximate nearest neighbors in the database. The second step is to perform the actual search using that data structure.

For the first step, i.e., building the index, we invoke the build_disk_index utility:

./tests/build_disk_index --help
Arguments:
  -h [ --help ]                      Print information on arguments
  --data_type arg                    data type <int8/uint8/float>
  --dist_fn arg                      distance function <l2/mips>
  --data_path arg                    Input data file in bin format
  --index_path_prefix arg            Path prefix for saving index file 
                                     components
  -R [ --max_degree ] arg (=64)      Maximum graph degree
  -L [ --Lbuild ] arg (=100)         Build complexity, higher value results in 
                                     better graphs
  -B [ --search_DRAM_budget ] arg    DRAM budget in GB for searching the index 
                                     to set the compressed level for data while 
                                     search happens
  -M [ --build_DRAM_budget ] arg     DRAM budget in GB for building the index
  -T [ --num_threads ] arg (=12)     Number of threads used for building index 
                                     (defaults to omp_get_num_procs())
  --QD arg (=0)                       Quantized Dimension for compression
  --codebook_prefix arg              Path prefix for pre-trained codebook
  --PQ_disk_bytes arg (=0)           Number of bytes to which vectors should be 
                                     compressed on SSD; 0 for no compression
  --append_reorder_data              Include full precision data in the index. 
                                     Use only in conjuction with compressed 
                                     data on SSD.
  --build_PQ_bytes arg (=0)          Number of PQ bytes to build the index; 0 
                                     for full precision build
  --use_opq                          Use Optimized Product Quantization (OPQ).
  --label_file arg                   Input label file in txt format for 
                                     Filtered Index build .The file should 
                                     contain comma separated filters for each 
                                     node with each line corresponding to a 
                                     graph node
  --universal_label arg              Universal label, Use only in conjuction 
                                     with label file for filtered index build. 
                                     If a graph node has all the labels against 
                                     it, we can assign a special universal 
                                     filter to the point instead of comma 
                                     separated filters for that point
  --FilteredLbuild arg (=0)          Build complexity for filtered points, 
                                     higher value results in better graphs
  -F [ --filter_threshold ] arg (=0) Threshold to break up the existing nodes 
                                     to generate new graph internally where 
                                     each node has a maximum F labels.
  --label_type arg (=uint)           Storage type of Labels <uint/ushort>, 
                                     default value is uint which will consume 
                                     memory 4 bytes per filter

There are a ton of parameters here, so let's stick with what the tutorial provides:

tests/build_disk_index \
  --data_type float \
  --dist_fn l2 \
  --data_path data/sift/sift_learn.fbin \
  --index_path_prefix data/sift/disk_index_sift_learn_R32_L50_A1.2 \
  -R 32 \
  -L 50 \
  -B 0.003 \
  -M 1

Finally, we can run the search:

tests/search_disk_index \
  --data_type float \
  --dist_fn l2 \
  --index_path_prefix data/sift/disk_index_sift_learn_R32_L50_A1.2 \
  --query_file data/sift/sift_query.fbin \
  --gt_file data/sift/sift_query_learn_gt100 \
  -K 10 \
  -L 10 20 30 40 50 100 \
  --result_path data/sift/res \
  --num_nodes_to_cache 10000

This outputs a nice table of search results along with recall:

     L   Beamwidth             QPS    Mean Latency    99.9 Latency        Mean IOs         CPU (s)       Recall@10
======================================================================================================================
    10           2        11142.41         1059.91        11968.00            9.10            5.30           80.49
    20           2         6280.56         1880.91         3457.00           16.45           27.94           95.41
    30           2         4275.08         2769.92        23823.00           24.03           50.41           98.19
    40           2         3294.44         3604.62        14347.00           31.71           59.72           99.02
    50           2         2652.86         4489.14        15791.00           39.45           54.30           99.40
   100           2         1335.79         8939.39        30558.00           78.62          111.84           99.85
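The Recall@10 column measures what fraction of each query's true 10 nearest neighbors appear in the returned top 10, averaged over all queries. A sketch of that metric on raw id arrays (how DiskANN computes it internally may differ in details):

```python
import numpy as np


def recall_at_k(gt_ids, result_ids, k):
    """Mean fraction of true top-k neighbors recovered in the returned top-k."""
    hits = sum(len(set(gt[:k]) & set(res[:k]))
               for gt, res in zip(gt_ids, result_ids))
    return hits / (k * len(gt_ids))
```

As expected, raising the search list size L trades queries per second for recall: at L=10 only ~80% of true neighbors are found, while L=100 recovers nearly all of them at roughly a tenth of the throughput.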