“How accurate is it?” is the first question everyone asks about an AI geolocation service. It is also the hardest question to answer honestly, because accuracy depends heavily on what you are trying to geolocate, where it is located, and what “accurate” means in your context. Rather than giving you a single number, we want to walk through how we measure performance, what our benchmarks show, and where the technology has genuine limitations.
How we measure accuracy
Geolocation accuracy is typically measured in distance error: how far is the predicted location from the true location? But a single average distance is misleading. A system that is accurate to 10 metres in Amsterdam but off by 5 kilometres in rural Drenthe would show a reasonable average while being useless for half the country.
We use multiple complementary metrics:
Median error — the distance at which half the predictions are closer and half are further away. This is more representative than mean error, which gets skewed by outliers.
Recall at distance thresholds — what percentage of queries yield a result within 25 metres, 100 metres, 1 kilometre and 5 kilometres of the true location. This tells you the probability of getting a useful answer at different precision levels.
Confidence correlation — how well does our reported confidence score predict actual accuracy? A confidence score is only useful if high-confidence results are genuinely more accurate than low-confidence results.
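The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not GeoPin's evaluation code: the function names, the haversine distance and the single-split confidence comparison are all assumptions made for the example.

```python
import math
from statistics import median

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def median_error(errors_m):
    """Median distance error: half the predictions are closer, half are further."""
    return median(errors_m)


def recall_at(errors_m, thresholds_m=(25, 100, 1000, 5000)):
    """Fraction of queries whose error is within each distance threshold."""
    n = len(errors_m)
    return {t: sum(e <= t for e in errors_m) / n for t in thresholds_m}


def recall_by_confidence(errors_m, confidences, split, threshold_m=100):
    """Recall within threshold_m for results at or above `split` versus below it.

    A useful confidence score should give a clearly higher first value.
    Returns None for a group that has no results.
    """
    def frac(xs):
        return sum(e <= threshold_m for e in xs) / len(xs) if xs else None

    high = [e for e, c in zip(errors_m, confidences) if c >= split]
    low = [e for e, c in zip(errors_m, confidences) if c < split]
    return frac(high), frac(low)
```

A single split is the crudest possible check of confidence correlation. Binning confidence into several ranges and reporting recall per bin gives a fuller picture of whether the score is well calibrated.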
CosPlace: the foundation
GeoPin’s visual place recognition is built on CosPlace, a model developed at the Polytechnic University of Turin specifically for visual geolocalisation. CosPlace uses a ResNet-152 backbone to produce 512-dimensional embeddings that capture the visual identity of places.
Unlike classification models that assign images to discrete location categories, CosPlace learns a continuous embedding space where visually similar places are close together. It can match locations it has never seen during training, as long as visually similar reference images exist in our database.
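To make the idea of matching in an embedding space concrete, here is a minimal retrieval sketch. It assumes embeddings are plain non-zero lists of floats and uses a brute-force cosine-similarity search; the function names are hypothetical, the toy vectors in the usage note have 3 dimensions rather than 512, and a production index would use an approximate nearest-neighbour library instead.

```python
import math


def l2_normalise(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings (1.0 means same direction)."""
    return sum(x * y for x, y in zip(l2_normalise(a), l2_normalise(b)))


def top_k(query, reference, k=3):
    """Return the k most similar reference entries, best first.

    reference is a list of (location, embedding) pairs; the result is a
    list of (similarity, location) pairs.
    """
    scored = [(cosine_similarity(query, emb), loc) for loc, emb in reference]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]
```

With toy references `[("A", [1, 0, 0]), ("B", [0, 1, 0]), ("C", [0.9, 0.1, 0])]` and the query `[1, 0, 0]`, `top_k(..., k=2)` ranks A first and C second. The query does not need to have been seen in training; it only needs a visually similar neighbour in the reference set.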
On the Pittsburgh 30k benchmark, CosPlace with ResNet-152 achieves recall@1 above 90%. On the more challenging MSLS (Mapillary Street Level Sequences) dataset, which includes appearance changes across seasons, performance is lower but competitive with state-of-the-art methods.
Our Netherlands-specific benchmarks
Academic benchmarks are useful for comparing models but do not tell you how well the system performs on Dutch streets. We maintain an internal evaluation set of 5,000 geotagged photos from across the Netherlands, stratified by province and urban/rural classification. These photos are held out from our reference index to prevent data leakage.
Here is what we observe on our evaluation set:
City centres (Amsterdam, Rotterdam, The Hague, Utrecht): Median error of roughly 15-30 metres. Recall within 100 metres exceeds 75%. Dense reference imagery and distinctive architecture contribute to strong performance. Canal houses, shopping districts and major intersections are particularly well matched.
Suburbs: Median error of roughly 40-80 metres. Recall within 100 metres is around 55-65%. Repetitive residential styles are harder to distinguish, but street patterns and plantings still provide useful signals.
Rural areas: Median error rises to 150-500 metres. Recall within 100 metres drops to 25-35%. Open farmland, generic country roads and sparse reference coverage all contribute to lower accuracy. However, recall within 1 kilometre remains above 60%, which is often sufficient for rural investigation contexts where block-level precision is not expected.
Industrial zones and ports: Performance varies significantly. Distinctive infrastructure like cranes, silos and specialised buildings matches well. Generic warehouse districts are harder.
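A per-category breakdown like the one above can be produced by grouping evaluation results by stratum before computing the metrics. The sketch below assumes each result is a `(stratum, error_in_metres)` pair; the function name and record layout are illustrative, not GeoPin's internal format.

```python
from collections import defaultdict
from statistics import median


def summarise_by_stratum(records, threshold_m=100):
    """Group (stratum, error_m) records and report metrics per stratum."""
    groups = defaultdict(list)
    for stratum, error_m in records:
        groups[stratum].append(error_m)
    return {
        stratum: {
            "n": len(errs),
            "median_m": median(errs),
            # Fraction of queries in this stratum within threshold_m.
            "recall": sum(e <= threshold_m for e in errs) / len(errs),
        }
        for stratum, errs in groups.items()
    }
```

Reporting the sample count `n` next to each figure matters: a stratum with few photos can show a misleadingly good or bad median.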
The verification step makes the difference
The numbers above reflect the full GeoPin pipeline, including our geometric verification stage with DISK feature extraction and LightGlue matching. Without this step, relying solely on CosPlace embedding similarity, accuracy drops noticeably, and the rate of high-confidence false positives rises in particular.
Geometric verification catches cases where two locations look globally similar but differ in structural details such as window patterns and roofline geometry. The verification stage eliminates 30-40% of incorrect top candidates, significantly improving precision.
When our system returns a high-confidence result, say one with more than 50 matched geometric features, the probability of that result being within 50 metres of the true location exceeds 85%. Results with fewer than 20 matched features are significantly less reliable and should be treated as approximate.
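The sketch below shows one way matched-feature counts could be used to re-rank candidates and label their reliability. It is an illustration under stated assumptions: the `Candidate` type, the tie-breaking rule and the "medium" tier for 20 to 50 matches are hypothetical, and the counts are assumed to come from the geometric verification stage. Only the 50 and 20 thresholds come from the text.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    location: str
    similarity: float      # embedding similarity from the retrieval stage
    matched_features: int  # matches surviving geometric verification


def confidence_tier(matched_features, high=50, low=20):
    """Map a matched-feature count to a coarse reliability label."""
    if matched_features > high:
        return "high"
    if matched_features < low:
        return "low"
    return "medium"  # assumption: the text does not name this middle band


def rerank(candidates):
    """Order candidates by matched features, breaking ties by similarity."""
    return sorted(
        candidates,
        key=lambda c: (c.matched_features, c.similarity),
        reverse=True,
    )
```

In this scheme, a candidate with the highest embedding similarity but only 8 matched features falls below a candidate with lower similarity and 72 matches. That is the kind of globally similar but structurally different false positive that verification is meant to catch.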
How does this compare to humans?
GeoGuessr players represent the best human benchmark. Top players can place a photo to within a few hundred metres using language clues, road markings, vegetation, sun position and cultural context. For the Netherlands, experts can recognise provinces from road surface texture and distinguish canal belts by bridge railing styles.
In a head-to-head comparison, the picture is nuanced. For distinctive urban locations, humans and GeoPin perform comparably. Humans excel at reading text (shop signs, street name signs) and understanding cultural context. GeoPin excels at consistency, speed and searching millions of reference images no human can memorise.
GeoPin wins clearly on scale. A human expert spends 2-5 minutes per image; GeoPin delivers results in seconds. For batch-processing hundreds of photos, automated geolocation is the only viable option. Humans win clearly on textual clues — a road sign saying “Appingedam 5 km” is trivial for a human and difficult for a visual matching system.
Honest limitations
Transparent benchmarking means acknowledging where the technology falls short.
Seasonal variation. Winter photos may not match against summer reference images. We include images from multiple seasons in our index, but cross-seasonal coverage is not uniform.
Construction and change. Buildings get renovated, streets get redesigned. Weekly index updates help, but there will always be lag.
Interior photos. GeoPin handles street-level exterior geolocation. Interior photos do not match against our reference imagery.
Unusual angles. Aerial photos, extreme close-ups and heavily cropped images perform poorly against street-level references.
What the numbers mean for you
For OSINT or journalistic verification, high-confidence results in urban areas are generally reliable, while rural results are best treated as approximate starting points. In either case, verify with additional evidence and use the confidence score as a reliability indicator.
We publish these benchmarks because accuracy claims without methodology are meaningless. You deserve to know what the tool can and cannot do before you rely on it.