When we set out to build GeoPin, we knew accuracy would stand or fall with the quality of our reference index. A geolocation model is only as good as the data it can match against. For the Netherlands, that meant building a comprehensive visual index covering every province, every municipality, and as many streets, canals and country roads as possible. Today, our index contains more than five million images, and in this post we want to give a behind-the-scenes look at how we built it.

The multi-source challenge

No single imagery source covers the entire Netherlands. Government open-data portals, crowdsourced street photography, satellite and aerial imagery, and historical archives each have different strengths. Our pipeline had to ingest from all of these while handling widely varying formats, resolutions, metadata schemas and licence terms.

We started by cataloguing every publicly available source of geotagged imagery covering Dutch territory. The major contributors are municipal panoramic captures, open aerial photography from PDOK (Publieke Dienstverlening Op de Kaart), and community-contributed street photos. Each source required a specific ingestion adapter: some deliver tiled map layers, others offer bulk downloads of geotagged JPEGs, and a few expose streaming APIs.

Normalising this data was one of the hardest early engineering challenges. We wrote a common schema that captures GPS coordinates, capture timestamp, compass heading, camera parameters and source provenance. Every image entering the pipeline is first translated to this schema before anything else happens. That consistency pays off downstream when we need to filter, deduplicate or audit the index.
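
To make the idea concrete, here is a minimal sketch of what such a common schema and one ingestion adapter could look like. The class name, field names and the raw feed's keys (lat, lon, ts, compass, focal_mm) are all assumptions made for illustration; the post does not describe the actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical common schema; the field names are illustrative only."""
    latitude: float
    longitude: float
    captured_at: datetime          # always stored as timezone-aware UTC
    heading_deg: Optional[float]   # compass heading in [0, 360), if known
    camera: dict                   # free-form camera parameters
    source: str                    # provenance label for audits


def from_street_photo(raw: dict) -> ImageRecord:
    """Adapter for an imagined community photo feed with its own key names."""
    heading = raw.get("compass")
    return ImageRecord(
        latitude=float(raw["lat"]),
        longitude=float(raw["lon"]),
        captured_at=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        heading_deg=None if heading is None else float(heading) % 360.0,
        camera={"focal_length_mm": raw.get("focal_mm")},
        source="community-street-photos",
    )


record = from_street_photo(
    {"lat": "52.3702", "lon": "4.8952", "ts": 1700000000, "compass": 365}
)
```

Each source would get its own small adapter like this one, so that everything downstream only ever sees ImageRecord.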

Deduplication and quality filtering

Five million images sounds impressive, but raw volume means nothing if the index is full of duplicates, blurry frames or images that show nothing but the inside of a camera bag. Before an image reaches the embedding stage, it passes through a multi-step quality check.

First, we compute perceptual hashes to detect near-duplicates. Two photos taken one second apart from a moving vehicle are nearly identical, and retaining both would waste storage and slow retrieval without improving accuracy. Our deduplication step reduces some source datasets by as much as 40 percent.
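
As an illustration of the technique, the sketch below implements a simple average hash, one common kind of perceptual hash. It assumes each image has already been downsampled to an 8x8 grayscale grid, and the distance threshold of 5 bits is an arbitrary choice for the example. The post does not say which hash or threshold the production pipeline uses.

```python
from typing import Sequence


def average_hash(pixels: Sequence[Sequence[int]]) -> int:
    """Hash a small grayscale grid: one bit per pixel, set if above the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 5) -> bool:
    """Treat two images as near-duplicates if their hashes are close."""
    return hamming(a, b) <= max_distance


# Toy frames: a horizontal gradient, the same gradient slightly brighter,
# and a vertical gradient (a genuinely different picture).
frame_a = [[x * 32 for x in range(8)] for _ in range(8)]
frame_b = [[value + 3 for value in row] for row in frame_a]
frame_c = [[y * 32 for _ in range(8)] for y in range(8)]

hash_a, hash_b, hash_c = (average_hash(f) for f in (frame_a, frame_b, frame_c))
```

Here frame_a and frame_b hash identically despite the brightness shift, while frame_c differs in half of its 64 bits and is kept as a distinct image.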

Next, a lightweight classification model scores each image for blur, occlusion and exposure issues. Images falling below a quality threshold are flagged and excluded from the primary index, though we retain them in cold storage in case future models can extract value from lower-quality inputs.
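
The routing decision that follows the scoring can be sketched roughly as below. The score names, the convention that higher means better, and the 0.5 threshold are all assumptions for the example; the classifier itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class QualityScores:
    """Hypothetical per-image scores in [0, 1]; higher means better."""
    sharpness: float
    visibility: float   # 1.0 means nothing is occluding the scene
    exposure: float


def route(scores: QualityScores, threshold: float = 0.5) -> str:
    """Send an image to the primary index only if its worst score passes."""
    worst = min(scores.sharpness, scores.visibility, scores.exposure)
    return "primary_index" if worst >= threshold else "cold_storage"


good = route(QualityScores(sharpness=0.9, visibility=0.8, exposure=0.7))
blurry = route(QualityScores(sharpness=0.2, visibility=0.9, exposure=0.9))
```

Gating on the worst score means a single serious defect is enough to keep an image out of the primary index, and nothing is deleted because rejected images go to cold storage.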

Finally, we validate GPS metadata. Surprisingly, a non-trivial fraction of geotagged photos carry coordinates that are clearly wrong: points in the North Sea, coordinates rounded to whole degrees, or locations entirely outside the Netherlands. We check against administrative boundary polygons and discard anything that does not land on Dutch soil.
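
A minimal sketch of such a check is shown below, using a standard ray-casting point-in-polygon test. TOY_BOUNDARY is a crude rectangle invented for the example and is not a real administrative boundary. A real pipeline would load proper boundary polygons, which would also exclude the sea and neighbouring countries that this rectangle lets through.

```python
from typing import List, Optional, Tuple

Polygon = List[Tuple[float, float]]  # (longitude, latitude) vertices

# Toy rectangle only; NOT an actual boundary of the Netherlands.
TOY_BOUNDARY: Polygon = [(3.4, 51.3), (7.2, 51.3), (7.2, 53.5), (3.4, 53.5)]


def point_in_polygon(lon: float, lat: float, polygon: Polygon) -> bool:
    """Ray casting: count how many edges a ray going east from the point crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def validate_coordinates(lon: float, lat: float, boundary: Polygon) -> Optional[str]:
    """Return a rejection reason, or None if the coordinates look plausible."""
    if float(lon).is_integer() and float(lat).is_integer():
        return "rounded_to_whole_degrees"
    if not point_in_polygon(lon, lat, boundary):
        return "outside_boundary"
    return None


amsterdam = validate_coordinates(4.8952, 52.3702, TOY_BOUNDARY)
rounded = validate_coordinates(5.0, 52.0, TOY_BOUNDARY)
paris = validate_coordinates(2.3522, 48.8566, TOY_BOUNDARY)
```

Returning a reason string rather than a bare boolean makes it easier to audit later how many images each rule rejected.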

GPU-accelerated embedding pipeline

The core of GeoPin’s matching engine is CosPlace, a visual place recognition model that produces dense feature embeddings for each image. Generating embeddings for five million images is computationally intensive. Each image must be resized, normalised and passed through a deep neural network to produce a 512-dimensional feature vector.

We run this pipeline on multi-GPU nodes equipped with NVIDIA A100 accelerators. Batch processing is essential: by feeding images through the model in batches of 256, we keep GPU utilisation above 90 percent and can process roughly 1,200 images per second per GPU. At that rate, the full five-million-image corpus can be processed in under two hours on a four-GPU node.
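
As a back-of-the-envelope check on those figures, the snippet below computes the pure inference time implied by the quoted throughput. It assumes perfect scaling across the four GPUs and ignores image decoding, disk and network I/O and writes to the vector database, which is presumably where the remaining headroom in the two-hour figure goes.

```python
images = 5_000_000          # corpus size quoted in the post
per_gpu_rate = 1_200        # images per second per GPU
gpus = 4

# Assumes throughput scales linearly with the number of GPUs.
seconds = images / (per_gpu_rate * gpus)
minutes = seconds / 60

print(f"Inference-only lower bound: {minutes:.1f} minutes")
```

This comes out at roughly 17 minutes of model time, so the two-hour bound leaves a wide margin for everything around the forward pass.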

The embeddings are stored in a vector database optimised for approximate nearest-neighbour search. When a user uploads a search image, GeoPin generates an embedding using the same CosPlace model and retrieves the closest matches from the index. The top candidates are then re-ranked using a more compute-intensive scoring step that factors in geometric consistency and metadata plausibility.
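
The retrieve-then-re-rank flow can be sketched as follows. For clarity the example uses exact brute-force cosine similarity over three-dimensional toy vectors instead of an approximate nearest-neighbour index over 512-dimensional embeddings. The extra_score dictionary is a hypothetical stand-in for the geometric consistency and metadata plausibility signals, whose actual form the post does not describe.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Vector, index: Dict[str, Vector], k: int) -> List[Tuple[str, float]]:
    """Exact search; a production system would use an ANN index instead."""
    scored = [(image_id, cosine(query, vec)) for image_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def rerank(
    candidates: List[Tuple[str, float]], extra_score: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Re-order candidates by similarity plus a hypothetical extra signal."""
    rescored = [(i, s + extra_score.get(i, 0.0)) for i, s in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored


index = {"a": [1.0, 0.0, 0.0], "b": [0.9, 0.1, 0.0], "c": [0.0, 1.0, 0.0]}
candidates = top_k([1.0, 0.0, 0.0], index, k=2)
reranked = rerank(candidates, extra_score={"b": 0.1})
```

In this toy run the first stage ranks "a" ahead of "b" on similarity alone, and the second stage promotes "b" once the extra signal is added. Only the shortlisted candidates pay for the more expensive step.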

Cloudflare at the edge

Low latency matters. When a journalist or investigator uploads a photo, they want results in seconds, not minutes. We chose Cloudflare’s global network as the backbone for serving both the web application and the API.

Static assets and the Astro frontend are deployed via Cloudflare Pages, giving us instant global distribution with automatic cache invalidation on every deploy. API requests hit Cloudflare Workers first, where we handle authentication, rate limiting and request validation before forwarding to our GPU inference backend. This architecture means a request from Amsterdam and a request from New York both experience consistently low response times for everything except the actual model inference step, which runs in a centralised GPU cluster.

For the vector index itself, we use a tiered caching strategy. The most frequently queried embeddings, which tend to cluster around major cities like Amsterdam, Rotterdam and The Hague, are cached in memory on the inference nodes. Less common regions are loaded on demand from fast NVMe storage. This approach balances memory costs against lookup speed and keeps our p95 response time under two seconds for end-to-end geolocation queries.
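
A two-tier lookup of this kind might look roughly like the sketch below. It uses a least-recently-used eviction policy as a simple stand-in for frequency-based caching, and the class name, region keys and load_from_disk callback are all assumptions for the example.

```python
from collections import OrderedDict
from typing import Callable, List


class TieredEmbeddingCache:
    """In-memory hot tier in front of a slower loader (e.g. NVMe storage)."""

    def __init__(self, capacity: int, load_from_disk: Callable[[str], List[float]]):
        self._hot: "OrderedDict[str, List[float]]" = OrderedDict()
        self._capacity = capacity
        self._load = load_from_disk
        self.hits = 0
        self.misses = 0

    def get(self, region: str) -> List[float]:
        if region in self._hot:
            self._hot.move_to_end(region)   # mark as most recently used
            self.hits += 1
            return self._hot[region]
        self.misses += 1
        value = self._load(region)          # slow path: fetch on demand
        self._hot[region] = value
        if len(self._hot) > self._capacity:
            self._hot.popitem(last=False)   # evict least recently used
        return value


# Toy "disk" with one tiny vector per region.
disk = {"amsterdam": [0.1], "rotterdam": [0.2], "texel": [0.3]}
cache = TieredEmbeddingCache(capacity=2, load_from_disk=lambda region: disk[region])

for region in ["amsterdam", "rotterdam", "amsterdam", "texel", "rotterdam"]:
    cache.get(region)
```

Tracking hits and misses alongside the cache makes it straightforward to tune the capacity against observed query patterns.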

Keeping the index current

The Netherlands is not a static place. New buildings go up, old ones come down, roads get rerouted and seasonal changes alter the landscape. An index that was perfect in January may lose accuracy by July if it is never updated.

We run incremental index updates on a monthly cycle. New images from our source feeds are ingested, deduplicated, quality-checked and processed into embeddings, just like the initial corpus. We also reprocess images from areas where users have reported low accuracy, prioritising those regions in the next update window.

Index versioning is critical for reproducibility. Each index build receives a version identifier, and we can roll back to any previous version within minutes if a new build introduces a regression. This is especially important as we continuously benchmark accuracy against a held-out test set of manually verified geolocations.
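
The core of such a scheme is that builds are immutable and only a pointer to the active one moves. The sketch below shows that idea with a hypothetical in-memory registry. The class, its methods and the year-month version format are invented for illustration.

```python
from typing import List, Optional


class IndexRegistry:
    """Tracks published index builds and which one is currently served."""

    def __init__(self) -> None:
        self._builds: List[str] = []
        self._active: Optional[str] = None

    @property
    def active(self) -> Optional[str]:
        return self._active

    def publish(self, version: str) -> None:
        """Register a new build and start serving it."""
        if version in self._builds:
            raise ValueError(f"version {version!r} already published")
        self._builds.append(version)
        self._active = version

    def rollback(self, version: str) -> None:
        """Switch back to an earlier build; the build itself is untouched."""
        if version not in self._builds:
            raise KeyError(f"unknown version {version!r}")
        self._active = version


registry = IndexRegistry()
registry.publish("2024-01")
registry.publish("2024-02")
registry.rollback("2024-01")   # e.g. after a benchmark regression
```

Because rolling back only changes which build is served, it can be fast regardless of how large the index is, provided earlier builds are kept available.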

What we learned

Building a nationwide image index is as much a data engineering problem as it is a machine learning problem. The model gets the headlines, but the pipeline that feeds it determines how well it actually performs in production. Investing early in data quality, deduplication and metadata validation saved us countless hours debugging mysterious accuracy drops later on.

If you are building something similar, our advice is simple: treat your data pipeline with the same care as your model architecture. The best model in the world cannot compensate for a messy, incomplete index.

In a future post, we will dive deeper into how CosPlace embeddings work and why we chose this architecture over alternatives like NetVLAD and patch-based retrieval. Stay tuned.