From photo to coordinates

When you upload a photo to GeoPin, you get back a set of GPS coordinates — typically within seconds. But behind that simple interaction lies a multi-stage pipeline combining deep learning, high-performance vector search and classical computer vision. Here is how it all works.

Stage 1: Visual embedding with CosPlace

The first step is converting your photo into something a computer can reason about spatially. We use CosPlace, a visual place recognition model developed at the Polytechnic University of Turin, to generate a compact numerical representation of each image.

CosPlace is built on a ResNet-152 backbone — a deep convolutional neural network with 152 layers. The model is trained on a massive dataset of geotagged street-level images and has learned to produce embeddings where images of the same place cluster together in vector space, regardless of changes in lighting, season, weather or camera angle.

The output is a 512-dimensional vector. Think of it as a fingerprint of the visual scene: two photos taken at the same intersection will have similar fingerprints, even if one was shot on a sunny summer morning and the other on a rainy winter evening.
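To make the fingerprint idea concrete, here is a minimal sketch of how two embeddings can be compared with cosine similarity. The vectors are made-up 4-dimensional stand-ins for real 512-dimensional embeddings, and the variable names are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy values only: real embeddings have 512 dimensions and come from the model.
summer_photo = [0.9, 0.1, 0.3, 0.2]
winter_photo = [0.8, 0.2, 0.35, 0.15]  # same intersection, different conditions
other_place = [0.1, 0.9, 0.05, 0.4]    # a different location

same = cosine_similarity(summer_photo, winter_photo)
different = cosine_similarity(summer_photo, other_place)
```

With these toy values, the two photos of the same intersection score far higher than the unrelated pair, which is exactly the property the embedding is trained to have.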

What makes CosPlace particularly effective is how it is trained. Earlier visual place recognition systems relied on contrastive training, which requires mining positive and negative image pairs across the whole dataset, a step that becomes very expensive at scale. CosPlace instead treats training as a classification problem: images are divided into groups by their GPS coordinates, and each group becomes a class. This allows the model to learn robust place representations without any pair mining.
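The grouping idea can be sketched in a few lines. This is a simplified illustration, not the CosPlace training code: it buckets images on a plain latitude/longitude grid, whereas CosPlace itself works with metric UTM coordinates and also takes the camera heading into account. The cell size and image records are hypothetical.

```python
import math
from collections import defaultdict

CELL_DEG = 0.0001  # hypothetical cell size, roughly 11 m in latitude

def place_class(lat, lon, cell_deg=CELL_DEG):
    """Map a GPS position to a grid cell; each cell acts as one training class."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

def group_by_place(images):
    """images: iterable of (image_id, lat, lon). Returns {cell: [image_id, ...]}."""
    groups = defaultdict(list)
    for image_id, lat, lon in images:
        groups[place_class(lat, lon)].append(image_id)
    return dict(groups)

# Made-up records: the first two were taken a few metres apart.
images = [
    ("img_001", 52.37021, 4.89512),
    ("img_002", 52.37025, 4.89518),
    ("img_003", 52.37095, 4.89512),
]
groups = group_by_place(images)
```

Because the class label follows directly from the coordinates, no search for matching or non-matching image pairs is needed before training can start.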

Stage 2: Vector search at scale

Once we have a 512-dimensional embedding for your search photo, we need to find the most similar embeddings in our reference database. This database contains millions of images covering the Netherlands at street level, each pre-processed through the same CosPlace model and stored alongside its known GPS coordinates.

Searching through millions of 512-dimensional vectors for the nearest neighbours sounds computationally expensive — and with a brute-force approach, it would be. This is where approximate nearest neighbour (ANN) search comes in.
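For reference, this is what the brute-force baseline looks like. The sketch below scores the query against every reference embedding, so the work grows linearly with the size of the database: N reference vectors of 512 dimensions cost about N × 512 multiply-adds per query. The function names and the tiny 2-dimensional data are illustrative assumptions.

```python
import heapq
import math

def normalise(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def brute_force_top_k(query, references, k):
    """references: list of (image_id, embedding). Returns the k best (score, image_id)."""
    q = normalise(query)
    scored = (
        (sum(a * b for a, b in zip(q, normalise(embedding))), image_id)
        for image_id, embedding in references
    )
    return heapq.nlargest(k, scored)  # every reference is visited exactly once

# Toy data: 2-dimensional vectors instead of 512-dimensional ones.
references = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.7, 0.7])]
top = brute_force_top_k([1.0, 0.1], references, k=2)
```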

We use Cloudflare Vectorize as our vector database. It implements the Hierarchical Navigable Small World (HNSW) algorithm, which builds a multi-layered graph structure over the vectors. Instead of comparing your query against every single reference image, HNSW navigates this graph to quickly locate the approximate nearest neighbours — trading a minimal loss in accuracy for dramatic speed improvements.
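The core idea of graph-based search can be shown with a toy example. The sketch below is a single-layer simplification for illustration only, not Vectorize's implementation and not full HNSW: it performs greedy routing on one graph, hopping to whichever neighbour is closest to the query until no neighbour improves. The graph, the long-range link and all names are made up.

```python
def greedy_search(query, vectors, neighbours, entry_point):
    """Greedy routing on one graph layer: move to the closest neighbour until stuck."""
    def dist(node):
        return sum((q - x) ** 2 for q, x in zip(query, vectors[node]))

    current = entry_point
    hops = [current]
    while True:
        closest = min(neighbours[current], key=dist)
        if dist(closest) >= dist(current):
            return current, hops  # no neighbour is closer: a (local) optimum
        current = closest
        hops.append(current)

# Toy graph: ten points on a line, each linked to its direct neighbours,
# plus one long-range shortcut between node 0 and node 5.
vectors = {i: [float(i)] for i in range(10)}
neighbours = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
neighbours[0].append(5)
neighbours[5].append(0)

nearest, hops = greedy_search([7.2], vectors, neighbours, entry_point=0)
```

The search reaches the nearest point after visiting only a handful of nodes, thanks to the shortcut. Greedy routing can stop in a local optimum, which is why the result is approximate; HNSW reduces that risk by stacking several layers and tracking more than one candidate at a time.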

The result: from millions of candidates, we retrieve the top-K most visually similar reference images in milliseconds. Each candidate comes with known GPS coordinates, giving us an initial set of location hypotheses.

Stage 3: Re-ranking and filtering

Not all candidates from the vector search are equally reliable. The cosine similarity score between embeddings gives us a global measure of visual resemblance, but it can be misled by scenes that look superficially similar without being the same location. A generic overpass in Amsterdam can look similar to one in Rotterdam at the embedding level.

Before proceeding to the expensive geometric verification, we apply a filtering step. Candidates below a similarity threshold are discarded. We also analyse the spatial distribution of top candidates — if many independent matches cluster in the same geographic area, that is a strong signal, even before verification.
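A minimal sketch of this filtering step might look as follows. The threshold value, the grid size and the candidate records are hypothetical, and counting candidates per grid cell is just one simple way to measure geographic clustering.

```python
import math
from collections import Counter

MIN_SIMILARITY = 0.6  # hypothetical threshold
CELL_DEG = 0.001      # hypothetical grid size, roughly 100 m in latitude

def filter_candidates(candidates, min_similarity=MIN_SIMILARITY):
    """Drop candidates whose cosine similarity falls below the threshold."""
    return [c for c in candidates if c["similarity"] >= min_similarity]

def densest_cell(candidates, cell_deg=CELL_DEG):
    """Return (cell, votes) for the grid cell containing the most candidates."""
    cells = Counter(
        (math.floor(c["lat"] / cell_deg), math.floor(c["lon"] / cell_deg))
        for c in candidates
    )
    return cells.most_common(1)[0]

# Made-up candidates: three near each other in Amsterdam, one in Rotterdam,
# and one weak match that should not survive the threshold.
candidates = [
    {"lat": 52.3702, "lon": 4.8952, "similarity": 0.82},
    {"lat": 52.3703, "lon": 4.8951, "similarity": 0.79},
    {"lat": 52.3701, "lon": 4.8953, "similarity": 0.74},
    {"lat": 51.9225, "lon": 4.4792, "similarity": 0.71},
    {"lat": 52.0907, "lon": 5.1214, "similarity": 0.40},
]
kept = filter_candidates(candidates)
cell, votes = densest_cell(kept)
```

Here three of the four surviving candidates fall into the same cell, so that area is the strongest hypothesis going into geometric verification.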

Stage 4: Geometric verification with LightGlue

This is where we move from “these scenes look alike” to “this is the same place.” Geometric verification establishes precise spatial correspondences between the search image and each candidate.

We use LightGlue, a lightweight feature matching model developed by ETH Zurich and Microsoft. The process works as follows:

Keypoint detection. We extract local features (keypoints and their descriptors) from both the search image and the candidate reference image using SuperPoint, a self-supervised keypoint detector. Each keypoint represents a distinctive visual element — a building corner, a texture pattern, the edge of a sign.
Feature matching. LightGlue takes the two sets of keypoints and predicts which ones correspond to the same physical point in the scene. Unlike traditional matchers that compare descriptors independently, LightGlue uses attention mechanisms to take the global arrangement of keypoints in both images into account, which makes it far more robust against viewpoint changes and occlusions.
Geometric consistency check. The matched keypoints must be geometrically consistent — they must satisfy the constraints of projective geometry (the epipolar constraint). We estimate a fundamental matrix using RANSAC and reject matches that do not fit the geometric model. If enough inliers survive this filtering, we have strong evidence that both images show the same physical scene.
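The inlier test at the heart of the geometric consistency check can be sketched as follows. This is an illustration under stated assumptions, not production code: it takes a fundamental matrix F as given and only shows how matches are scored against the epipolar constraint x2ᵀ F x1 = 0, using the Sampson distance. The RANSAC loop that repeatedly estimates F from random samples of matches is omitted; in practice a library routine such as OpenCV's findFundamentalMat handles that part. The matrix, the matches and the threshold are made up.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]

def transpose(m):
    return [[m[c][r] for c in range(3)] for r in range(3)]

def sampson_distance(F, p1, p2):
    """First-order approximation of the squared distance to the epipolar geometry."""
    x1 = [p1[0], p1[1], 1.0]  # homogeneous point in image 1
    x2 = [p2[0], p2[1], 1.0]  # homogeneous point in image 2
    Fx1 = mat_vec(F, x1)
    Ftx2 = mat_vec(transpose(F), x2)
    residual = sum(a * b for a, b in zip(x2, Fx1))  # x2^T F x1
    denom = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return residual ** 2 / denom

def count_inliers(F, matches, threshold):
    """Count matches whose Sampson distance is within the threshold."""
    return sum(1 for p1, p2 in matches if sampson_distance(F, p1, p2) <= threshold)

# Toy F for a camera that moved purely sideways: matching points must lie
# on the same image row (v1 == v2).
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
matches = [
    ((10.0, 20.0), (14.0, 20.0)),  # exactly consistent
    ((30.0, 40.0), (33.0, 40.5)),  # slightly noisy, still consistent
    ((50.0, 60.0), (52.0, 75.0)),  # wrong match
]
inliers = count_inliers(F, matches, threshold=1.0)  # hypothetical threshold
```

In this toy case two of the three matches agree with the geometry and the third is rejected.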

The number of verified geometric inliers serves as the confidence score for each individual match. A match with hundreds of geometrically consistent feature correspondences is almost certainly correct. A match with only a handful of inliers is far less reliable.

Putting it all together

The complete pipeline operates as a funnel:

Millions of reference images are indexed in vector space.
Hundreds of candidates are retrieved via approximate nearest neighbour search.
Dozens survive re-ranking and spatial filtering.
A few pass geometric verification with high confidence.

The final location estimate is derived from the verified matches. When multiple reference images from the same area pass geometric verification, we can triangulate an accurate position. The confidence score reflects both the number of verified matches and the geometric consistency of the evidence.
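One simple way to fuse several verified matches into a single position is an inlier-weighted average of their coordinates. The sketch below assumes that approach purely for illustration; the function name and the sample values are hypothetical, and averaging latitude and longitude directly is only reasonable because the matches lie close together.

```python
def estimate_position(verified):
    """verified: list of (lat, lon, inliers). Returns an inlier-weighted (lat, lon)."""
    total = sum(inliers for _, _, inliers in verified)
    lat = sum(lat * inliers for lat, _, inliers in verified) / total
    lon = sum(lon * inliers for _, lon, inliers in verified) / total
    return lat, lon

# Made-up verified matches: the first has three times as many inliers,
# so it pulls the estimate three times as hard.
verified = [
    (52.0000, 4.0000, 300),
    (52.0002, 4.0004, 100),
]
lat, lon = estimate_position(verified)
```

Weighting by inlier count means that a strongly verified reference image dominates the estimate, while weaker matches only nudge it.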

Why this architecture?

No single stage could do the job alone. Embeddings without verification produce too many false matches. Verification without embeddings is too slow. The combination gives us both speed and accuracy — metre-level precision, typically in under five seconds.