For vector indexing and querying, Jina ships a baseline vector indexer called NumpyIndexer, which is built purely on numpy. The implementation is pretty straightforward: it writes vectors directly to disk and queries nearest neighbors via dot product. It is simple, requires no extra dependencies, and performs reasonably on small data. As the default vector indexer, we have been using it since day one when showcasing quick demos, toy examples, and tutorials.
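To make the idea concrete, here is a minimal sketch of a numpy-only indexer in that spirit. It is not Jina's actual NumpyIndexer code: the function names `write_vectors` and `query_top_k`, the raw float32 file layout, and the file name are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np


def write_vectors(path, vectors):
    """Dump vectors to disk as raw float32 (hypothetical on-disk layout)."""
    np.asarray(vectors, dtype=np.float32).tofile(path)


def query_top_k(path, dim, queries, k):
    """Load the whole index and return the top-k neighbors by dot product."""
    index = np.fromfile(path, dtype=np.float32).reshape(-1, dim)
    scores = np.asarray(queries, dtype=np.float32) @ index.T
    # Sort each row by descending score and keep the first k columns.
    top = np.argsort(-scores, axis=1)[:, :k]
    return top, np.take_along_axis(scores, top, axis=1)


rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 8)).astype(np.float32)
# Normalize so that the dot product equals cosine similarity.
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_vectors(path, vectors)

# Query with the first three indexed vectors; each should retrieve itself first.
idx, scores = query_top_k(path, 8, vectors[:3], k=5)
```

Note that this sketch reads the entire file into memory at query time and fully sorts every score row, so its memory use and query cost both grow with the index size. That is fine for small data and is exactly the kind of baseline behavior that leaves room for tuning.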
Recently, this community issue caught my attention. I realized there was room for improvement, even for this baseline indexer. In the end, I managed to improve the index and query speed by 1.6x and 2.8x, respectively, while keeping the memory footprint constant (i.e., invariant to the size of the index data). This blog post summarizes the tricks I used.