Vector search explained: how we match millions of images in seconds

What are vector embeddings, how does cosine similarity work, and how does Cloudflare Vectorize enable GeoPin to search millions of reference images in milliseconds?

The needle in a million haystacks

GeoPin’s reference database contains millions of geotagged street-level images covering the Netherlands. When you upload a photo, we need to find the most visually similar images in that database — and it needs to happen in milliseconds, not minutes. This is a vector search problem, and solving it efficiently is one of the key technical challenges behind the service.

What are embeddings?

An embedding is a numerical representation of data in a continuous vector space. In our case, each image is converted by the CosPlace model into a list of 512 numbers (a 512-dimensional vector). These numbers are not individually meaningful — you cannot look at dimension 247 and say “that represents the number of windows in the image.” Instead, the meaning is distributed across all dimensions collectively.

The crucial property of well-trained embeddings is that similar items are close together in vector space. Two photos taken at the same intersection produce vectors that are close together. Two photos from opposite ends of the country produce vectors that are far apart. The embedding translates visual similarity into mathematical proximity.

Think of it this way: if you could somehow visualise a 512-dimensional space (you cannot, but bear with the analogy), all the photos of a particular Amsterdam canal house would form a tight cluster. Photos from the same neighbourhood would sit in the broader vicinity. Photos from an entirely different city would be in a distant region of the space.

Measuring similarity: cosine similarity

Given two embedding vectors, we need a way to quantify how similar they are. The standard measure is cosine similarity, which calculates the cosine of the angle between two vectors. The result ranges from -1 (opposite directions) to 1 (same direction), with 0 meaning the vectors are orthogonal, which in practice indicates unrelated images.

Because our embeddings are L2-normalised (magnitude of 1), cosine similarity simplifies to just the dot product: comparing two 512-dimensional normalised vectors requires exactly 512 multiplications and 511 additions. For normalised vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, so the two measures produce the same ranking of nearest neighbours. We describe matches in terms of cosine similarity because it depends only on the direction of the vectors, which is how the embeddings encode place identity.
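
To make the two claims above concrete, here is a minimal sketch in plain Python. The vectors are random stand-ins rather than real CosPlace embeddings, and the helper names are ours for illustration. It checks that, for unit-length vectors, cosine similarity equals the dot product and that squared Euclidean distance equals 2 minus twice the cosine similarity.

```python
import math
import random

def dot(a, b):
    """Dot product: one multiplication per dimension, then a sum."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalise(v):
    """Scale a vector to unit length (magnitude 1)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """General form, valid for vectors of any magnitude."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Random stand-ins for two 512-dimensional embeddings (not real model output).
rng = random.Random(0)
DIM = 512
a = l2_normalise([rng.gauss(0, 1) for _ in range(DIM)])
b = l2_normalise([rng.gauss(0, 1) for _ in range(DIM)])

# For unit vectors the denominator is 1, so cosine similarity is the dot product.
assert math.isclose(cosine_similarity(a, b), dot(a, b), abs_tol=1e-9)

# ||a - b||^2 = 2 - 2 * cos(a, b), so sorting by distance (ascending)
# and by cosine similarity (descending) yields the same order.
assert math.isclose(squared_euclidean(a, b), 2 - 2 * dot(a, b), abs_tol=1e-9)
```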

The brute-force problem

If our database contained 1,000 images, finding the nearest neighbours would be trivial. But each comparison costs about 1,000 floating-point operations (512 multiplications and 511 additions), so a brute-force search over 2.5 million images already requires roughly 2.5 billion floating-point operations per query. The cost also grows linearly: double the database, double the query time. This is where approximate nearest neighbour (ANN) algorithms become essential.
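
As a rough illustration of what brute force means here, the sketch below scores every stored vector against the query and keeps the best k. The tiny random database and the function names are assumptions made for the example; the operation count simply multiplies the per-comparison cost by the number of vectors.

```python
import heapq
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalise(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def brute_force_top_k(query, database, k):
    """Score every stored vector and keep the k most similar.
    The work is proportional to len(database)."""
    scored = ((dot(query, vec), idx) for idx, vec in enumerate(database))
    return heapq.nlargest(k, scored)

def flops_per_query(num_vectors, dim=512):
    """dim multiplications plus (dim - 1) additions per comparison."""
    return num_vectors * (2 * dim - 1)

# Toy database: 1,000 random unit vectors in 8 dimensions (illustrative only).
rng = random.Random(1)
database = [l2_normalise([rng.uniform(-1, 1) for _ in range(8)])
            for _ in range(1000)]
query = database[42]  # querying with a stored vector should return itself first

top = brute_force_top_k(query, database, k=5)
print(top[0][1])                    # index of the best match
print(flops_per_query(2_500_000))   # operation count for 2.5 million vectors
```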

Approximate nearest neighbours: trading precision for speed

The insight behind ANN algorithms is that we rarely need the mathematically exact nearest neighbours. By accepting a small chance of missing a true nearest neighbour, ANN algorithms achieve dramatic speed-ups: query time grows roughly logarithmically with database size rather than linearly.

HNSW: Hierarchical Navigable Small World

The ANN algorithm powering our search is HNSW (Hierarchical Navigable Small World). It organises vectors into a multi-layered graph:

  • Level 0 (bottom): Every image is a node, connected to its nearest neighbours. Dense and precise, but slow to traverse.
  • Level 1: A random subset of images, connected to their nearest neighbours within the subset. Sparser, allowing larger jumps.
  • Level 2 and above: Progressively smaller subsets enabling even larger jumps.

To search, you start at the top level and greedily move towards the query vector. At each level, once you cannot get closer, you descend and continue with more precision. At Level 0, you are in the right neighbourhood and only explore locally.

This means searching 39 million vectors may only require visiting a few hundred nodes — a dramatic improvement over brute force.
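
The layered descent described above can be sketched in a few dozen lines. This is a deliberately simplified toy, not the production index and not a full HNSW implementation: it links each layer by brute force instead of inserting nodes incrementally, and it uses a purely greedy walk on every layer, whereas real HNSW keeps a list of candidates at Level 0. All names and parameters (`build_layers`, `m`, `level_prob`) are our own for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def build_layers(vectors, m=8, level_prob=0.25, seed=0):
    """Toy construction. A node reaches level L with probability level_prob**L,
    so each layer is a random subset of the one below. Within a layer, every
    node links to its m most similar members (brute-force linking)."""
    rng = random.Random(seed)
    max_level = {}
    for i in range(len(vectors)):
        level = 0
        while rng.random() < level_prob:
            level += 1
        max_level[i] = level
    layers = []
    for level in range(max(max_level.values()) + 1):
        members = [i for i in max_level if max_level[i] >= level]
        graph = {}
        for i in members:
            others = sorted((j for j in members if j != i),
                            key=lambda j: -dot(vectors[i], vectors[j]))
            graph[i] = others[:m]
        layers.append(graph)
    return layers

def search(query, vectors, layers):
    """Greedy descent: on each layer, move to a more similar neighbour until
    none exists, then drop to the layer below and continue from that node."""
    evaluated = set()

    def sim(i):
        evaluated.add(i)
        return dot(query, vectors[i])

    current = next(iter(layers[-1]))  # entry point: a node on the top layer
    best = sim(current)
    for graph in reversed(layers):
        improved = True
        while improved:
            improved = False
            for neighbour in graph[current]:
                s = sim(neighbour)
                if s > best:
                    best, current, improved = s, neighbour, True
    return current, best, len(evaluated)

rng = random.Random(7)
vectors = [normalise([rng.gauss(0, 1) for _ in range(16)]) for _ in range(500)]
layers = build_layers(vectors)
query = normalise([rng.gauss(0, 1) for _ in range(16)])

node, score, visited = search(query, vectors, layers)
print(f"best node {node}, similarity {score:.3f}, "
      f"evaluated {visited} of {len(vectors)} vectors")
```

Because the walk is greedy, it can stop at a node that is good but not the best. That is the approximation being traded for speed, and the candidate list used by real implementations exists to reduce how often it happens.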

Cloudflare Vectorize

GeoPin runs on Cloudflare’s infrastructure, and we use Cloudflare Vectorize as our vector database. This choice reflects our architectural principle of keeping everything on one platform for simplicity and performance.

Vectorize stores our pre-computed CosPlace embeddings alongside metadata (GPS coordinates, image identifiers, capture timestamps). When a query arrives, it performs the ANN search and returns the top-K nearest neighbours with similarity scores and metadata.

Key features: edge deployment across Cloudflare’s global network for low-latency queries; metadata filtering so searches can be constrained to specific regions or time ranges; and scalability as our reference database grows without us managing infrastructure.
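
To show what a metadata-constrained top-K query means, here is a small plain-Python sketch. It is not the Vectorize API and makes no claim about how Vectorize applies filters internally. The record type, field names, identifiers, and coordinates are all made up for illustration.

```python
import heapq
from dataclasses import dataclass

@dataclass
class ReferenceImage:
    # Hypothetical record layout: an identifier, an embedding, and GPS metadata.
    image_id: str
    embedding: list
    lat: float
    lon: float

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def filtered_top_k(query, records, k, lat_range, lon_range):
    """Keep only records inside the bounding box, then rank the survivors
    by similarity to the query and return the k best as (score, id) pairs."""
    candidates = (
        r for r in records
        if lat_range[0] <= r.lat <= lat_range[1]
        and lon_range[0] <= r.lon <= lon_range[1]
    )
    scored = ((dot(query, r.embedding), r.image_id) for r in candidates)
    return heapq.nlargest(k, scored)

# Two-dimensional toy embeddings; real embeddings have 512 dimensions.
records = [
    ReferenceImage("ams-001", [1.0, 0.0], 52.37, 4.90),
    ReferenceImage("ams-002", [0.8, 0.6], 52.36, 4.88),
    ReferenceImage("rtm-001", [1.0, 0.0], 51.92, 4.48),
]

# Restrict the search to a box around Amsterdam, so "rtm-001" is excluded
# even though its embedding matches the query perfectly.
results = filtered_top_k([1.0, 0.0], records, k=2,
                         lat_range=(52.3, 52.4), lon_range=(4.8, 5.0))
print([image_id for _, image_id in results])
```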

From vector match to location

The output of the vector search is a ranked list of reference images, each with a cosine similarity score and GPS coordinates. But this is not the final answer — it is the candidate set. Visually similar scenes can exist at different locations, which is why our pipeline follows the vector search with geometric verification using LightGlue to confirm that candidates show the same physical scene.

To give a concrete sense of scale:

  • Embedding size: 512 dimensions, 32-bit float = 2 KB per image
  • Database size: Millions of embeddings = several gigabytes of vector data
  • Query time: Typically under 50 milliseconds for top-100 retrieval
  • Recall: HNSW typically achieves 95% or higher recall compared to exact search
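
The first two figures and the recall metric can be reproduced with simple arithmetic. In this sketch the vector count of 3 million is only an example value, the size covers raw vector data alone (no metadata or index overhead), and `recall_at_k` is our own helper name for the usual definition: the fraction of the exact top-K that the approximate search also returned.

```python
def embedding_bytes(dim=512, bytes_per_value=4):
    """Raw size of one embedding: 512 values at 4 bytes (32-bit float) each."""
    return dim * bytes_per_value

def index_size_gib(num_vectors, dim=512, bytes_per_value=4):
    """Raw vector data only, excluding metadata and graph overhead."""
    return num_vectors * embedding_bytes(dim, bytes_per_value) / 1024 ** 3

def recall_at_k(approx_ids, exact_ids):
    """Share of the exact nearest neighbours found by the approximate search."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

print(embedding_bytes())                    # bytes per image embedding
print(round(index_size_gib(3_000_000), 2))  # GiB for an example 3 million vectors
print(recall_at_k([1, 2, 3, 5], [1, 2, 3, 4]))  # 3 of the 4 true neighbours found
```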

From the moment your photo’s embedding is computed, the relevant part of the Netherlands is identified in a fraction of a second. The geometric verification that confirms the match can then focus its computation on the most promising candidates.

Vector search is not glamorous technology. But without it, the entire pipeline would be impractically slow. It is the infrastructure that makes real-time geolocation at scale possible.