Numpy Tricks and A Strong Baseline for Vector Index
Tricks used to improve the index and query speed by 1.6x and 2.8x while keeping the memory footprint constant.
Tags: search · vector-index · numpy · memmap · article · code

Table of Contents

  • The Scalability Problem
  • numpy.memmap Instead of numpy.frombuffer
  • Batching with Care
  • Lifecycle of memmap
    • Zero-copy slicing
    • Memory-efficient Euclidean and Cosine
  • Removing gzip compression
  • Summary

For the vector indexing and querying part, Jina has implemented a baseline vector indexer called NumpyIndexer, an indexer purely based on numpy. The implementation is pretty straightforward: it writes vectors directly to disk and queries nearest neighbors via dot product. It is simple, requires no extra dependencies, and its performance is reasonable on small data. As the default vector indexer, we have been using it since day one in quick demos, toy examples, and tutorials.

Recently, this community issue caught my attention. I realized there was room for improvement, even for this baseline indexer. In the end, I managed to improve the index and query speed by 1.6x and 2.8x respectively, while keeping the memory footprint constant (i.e., invariant to the size of the index data). This blog post summarizes the tricks I used.


Author: @hanxiao — Founder @jina-ai | Creator of Fashion-MNIST & bert-as-service