Visual geolocation has made more progress in the past three years than in the decade before. Models that once struggled to identify the correct country can now pinpoint a location to within a few hundred metres, and the pace of innovation shows no sign of slowing. At GeoPin, we spend a significant portion of our time tracking the research frontier and thinking about where this technology is headed. In this post, we share our perspective on the trends that will shape AI-powered geolocation over the coming years.

Multimodal models

The most impactful near-term development is the shift from models that rely on the pixels of a single image to multimodal systems that combine visual information with other data types.

Today, GeoPin analyses the pixels of a photo and nothing else. But a photo rarely exists in isolation. It may have a caption, a timestamp, a username or surrounding text that provide contextual clues. A multimodal model could process all of this information jointly. Imagine a system that sees a photo of a canal, reads the caption “morning walk before work,” notes the timestamp of 7:45 CET, and integrates all three signals to refine its prediction. The visual features constrain the location to a set of canal streets; the timestamp and caption further constrain the likely neighbourhood based on commuting patterns and sunrise times.
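One simple way to picture this kind of joint reasoning is late fusion: each signal produces its own probability over candidate locations, and the scores are multiplied and renormalised. The sketch below is illustrative only and does not describe how GeoPin works today. The function name `fuse_signals`, the candidate labels and all numbers are made up, and the multiplication assumes the signals are conditionally independent.

```python
def fuse_signals(*distributions):
    """Multiply per-signal probabilities for each candidate and renormalise."""
    # Only candidates scored by every signal can be compared fairly.
    candidates = set(distributions[0])
    for dist in distributions[1:]:
        candidates &= set(dist)

    fused = {}
    for candidate in candidates:
        p = 1.0
        for dist in distributions:
            p *= dist[candidate]
        fused[candidate] = p

    total = sum(fused.values())
    if total == 0:
        raise ValueError("signals share no candidate with non-zero probability")
    return {candidate: p / total for candidate, p in fused.items()}


# Hypothetical scores: the visual signal cannot separate streets a and b,
# while the caption and timestamp (merged here into one "context" signal) favour a.
visual = {"canal_street_a": 0.4, "canal_street_b": 0.4, "canal_street_c": 0.2}
context = {"canal_street_a": 0.6, "canal_street_b": 0.3, "canal_street_c": 0.1}

fused = fuse_signals(visual, context)
best = max(fused, key=fused.get)
print(best, round(fused[best], 3))
```

In this toy case the context signal breaks a tie that the visual signal alone leaves open. A real multimodal model would learn the interaction between signals rather than multiply fixed scores.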

Large vision-language models such as those in the GPT and Gemini families have already demonstrated impressive geographic reasoning when given photos and textual prompts. The challenge is combining this broad reasoning capability with the precision of a specialised place recognition model like CosPlace. We expect to see hybrid architectures emerge that use a vision-language model for coarse reasoning and a dedicated retrieval model for fine-grained localisation.
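A coarse-to-fine hybrid of this kind could be wired together roughly as follows. This is a sketch under assumptions, not a description of an existing system. The coarse stage is represented only by its output (a set of plausible regions, as a vision-language model might produce), and `coarse_to_fine`, the reference tuples and the toy three-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_to_fine(query_vec, references, plausible_regions, top_k=3):
    """Rank references by similarity, restricted to the coarse stage's regions.

    references: list of (ref_id, region, embedding) tuples.
    """
    # Stage 1 (coarse): discard references outside the plausible regions.
    shortlist = [ref for ref in references if ref[1] in plausible_regions]
    # Stage 2 (fine): rank the remaining references by embedding similarity.
    ranked = sorted(shortlist, key=lambda ref: cosine(query_vec, ref[2]), reverse=True)
    return [ref[0] for ref in ranked[:top_k]]


# Toy data with made-up identifiers and vectors.
references = [
    ("amsterdam_001", "Noord-Holland", [0.9, 0.1, 0.0]),
    ("utrecht_014", "Utrecht", [0.8, 0.2, 0.1]),
    ("assen_007", "Drenthe", [0.95, 0.05, 0.0]),
]
result = coarse_to_fine([1.0, 0.1, 0.0], references, {"Noord-Holland", "Utrecht"}, top_k=2)
print(result)
```

The design choice here is that the coarse stage only narrows the search space. The final ranking still comes from the retrieval embeddings, so a mistake in coarse reasoning can exclude the right answer but cannot reorder the candidates that remain.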

Satellite and aerial image fusion

Street-level photography offers a ground-level perspective but has inherent blind spots. Rural areas, private land and locations far from roads are underrepresented in street imagery datasets. Satellite and aerial imagery provides complementary coverage: a top-down view that captures landscape patterns, plot boundaries, building footprints and infrastructure layouts.

Fusing these two perspectives is technically challenging because they represent fundamentally different viewpoints of the same location. A building that appears as a tall facade in a street photo is a flat rectangle in a satellite image. Recent research on cross-view geolocalisation has made significant progress on this problem, with models trained to place ground-level and aerial images into a shared embedding space where the same location has similar representations regardless of viewpoint.
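One common way to train such a shared embedding space is a triplet objective, which pulls a ground-level embedding towards the aerial embedding of the same location and pushes it away from the aerial embedding of a different one. The text above does not say which loss any particular model uses, so treat this as a generic illustration. The function names, the margin value and the two-dimensional vectors are assumptions chosen for readability.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(ground, aerial_same, aerial_other, margin=0.5):
    """Zero once the matching aerial view is closer than the other by at least `margin`."""
    positive_distance = euclidean(ground, aerial_same)
    negative_distance = euclidean(ground, aerial_other)
    return max(0.0, positive_distance - negative_distance + margin)


ground = [1.0, 0.0]
aerial_same = [0.9, 0.1]

# A clearly different location is already far away: nothing left to learn.
easy = triplet_loss(ground, aerial_same, [0.0, 1.0])

# A look-alike location sits too close: the loss is positive and would drive an update.
hard = triplet_loss(ground, aerial_same, [0.8, 0.2])

print(easy, round(hard, 4))
```

In practice the embeddings would come from two neural encoders, one per viewpoint, and the loss would be averaged over many triplets.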

For the Netherlands, this is particularly promising. The country has excellent aerial coverage through programmes like the AHN (Actueel Hoogtebestand Nederland) LiDAR dataset and regular aerial photography campaigns. Integrating these with our existing street-level index could dramatically improve coverage in rural provinces such as Drenthe and Zeeland, where street imagery is sparse.

Real-time video geolocation

Photos are static snapshots, but much of the visual content shared online today is video. Livestreams, drone footage, dashcam recordings and short-form social media clips all contain rich sequential information that a single frame cannot capture.

Video geolocation introduces both opportunities and challenges. On the opportunity side, successive frames provide temporal consistency: the model can track features across frames, accumulate evidence over time, and use motion cues to infer direction and speed of travel. A ten-second clip of driving through a street contains far more information than any single frame from that clip.

On the challenge side, real-time processing demands are substantial. Geolocating a single image takes our pipeline roughly one to two seconds. If frames were handled one at a time, keeping pace with 30-frames-per-second video at the same accuracy would require roughly a 30- to 60-fold increase in throughput, or more realistically, a fundamentally different architecture that shares computation across frames and only fully processes keyframes.

We are actively prototyping a video geolocation pipeline that uses lightweight tracking between keyframes and full CosPlace inference on selected frames. Early results are encouraging: by processing one in every fifteen frames and interpolating between predictions, we can achieve near-real-time video geolocation with only a modest accuracy penalty compared to frame-by-frame processing.
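The keyframe idea can be sketched in a few lines. This is a simplified illustration rather than the prototype itself: it omits the tracking step entirely, and `keyframe_indices`, `interpolate_track` and the coordinates are hypothetical. It assumes a full prediction is available for each keyframe and fills the frames in between by linear interpolation of latitude and longitude, which is a reasonable approximation only over short distances.

```python
import bisect


def keyframe_indices(total_frames, stride=15):
    """Frames that receive full inference: one in every `stride` frames."""
    return list(range(0, total_frames, stride))


def interpolate_track(keyframe_fixes, total_frames):
    """Return a (lat, lon) estimate for every frame.

    keyframe_fixes: dict mapping frame index -> (lat, lon) predicted on that keyframe.
    """
    keys = sorted(keyframe_fixes)
    track = []
    for frame in range(total_frames):
        if frame <= keys[0]:
            track.append(keyframe_fixes[keys[0]])   # before the first keyframe
            continue
        if frame >= keys[-1]:
            track.append(keyframe_fixes[keys[-1]])  # at or after the last keyframe
            continue
        hi = bisect.bisect_right(keys, frame)       # first keyframe after this frame
        k0, k1 = keys[hi - 1], keys[hi]
        t = (frame - k0) / (k1 - k0)
        (lat0, lon0), (lat1, lon1) = keyframe_fixes[k0], keyframe_fixes[k1]
        track.append((lat0 + t * (lat1 - lat0), lon0 + t * (lon1 - lon0)))
    return track


# Made-up predictions for a 31-frame clip.
keys = keyframe_indices(31)  # frames 0, 15 and 30
fixes = dict(zip(keys, [(52.0, 4.0), (52.3, 4.3), (52.6, 4.6)]))
track = interpolate_track(fixes, 31)
print(len(keys), "full inferences for", len(track), "frames")
```

With a stride of fifteen, a 31-frame clip needs three full inferences instead of 31, and every other frame gets an interpolated estimate.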

Temporal awareness

Locations change over time. A street photographed in 2020 may look different in 2026 due to construction, renovation, seasonal variation or urban development. Current geolocation models treat each image as timeless, matching it against the index without considering when the search image or the reference image was captured.

Future models will incorporate temporal awareness. If a search image shows a building under construction, the model should preferentially match against reference images from a similar period rather than images showing the completed building. This requires both temporally tagged reference data and model architectures that can reason about time.
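A very simple form of temporal preference can be expressed as a re-ranking step on top of visual similarity. The sketch below assumes the capture date of the query is known or has been estimated, which will often not be the case, and it uses a made-up linear penalty per year of difference. `temporal_rerank`, the reference identifiers and the scores are all hypothetical.

```python
from datetime import date


def temporal_rerank(matches, query_date, penalty_per_year=0.02):
    """Re-rank matches so that references captured near `query_date` are preferred.

    matches: list of (ref_id, visual_similarity, capture_date) tuples.
    """
    def adjusted_score(match):
        _, similarity, capture_date = match
        gap_years = abs((query_date - capture_date).days) / 365.25
        return similarity - penalty_per_year * gap_years

    return sorted(matches, key=adjusted_score, reverse=True)


# Made-up example: the completed building scores slightly higher visually,
# but the construction-era reference is much closer in time to the query.
matches = [
    ("completed_2025", 0.82, date(2025, 6, 1)),
    ("construction_2021", 0.80, date(2021, 5, 1)),
]
ranked = temporal_rerank(matches, query_date=date(2021, 8, 1))
print([ref_id for ref_id, _, _ in ranked])
```

A learned model could go further and infer the likely period from the image content itself, but even this kind of fixed penalty depends on the reference data carrying reliable capture dates.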

For the Netherlands, where urban development proceeds rapidly and is well-documented, temporal indexing could also enable new applications. Urban planners could track neighbourhood changes over time. Historians could date archival photos by matching them against timestamped reference imagery. Insurance claims investigators could verify when damage to a property occurred by comparing against temporal snapshots of the same location.

Privacy-conscious geolocation

As geolocation technology becomes more powerful, privacy considerations become more pressing. The ability to determine where a photo was taken has legitimate and valuable applications in journalism, disaster response and heritage preservation, but it also raises concerns about surveillance and unwanted tracking.

The research community is exploring several approaches to responsible geolocation. One direction is deliberately limiting the precision of predictions, for example through differential-privacy-style noise or by returning an approximate region rather than exact coordinates when a precise location is not necessary. Another is consent-based access control, in which the image owner sets granularity permissions. A third approach involves federated architectures where the image never leaves the user’s device; instead, the embedding is computed locally and only the anonymous feature vector is sent for matching.
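Returning an approximate region can be as simple as snapping a prediction to the centre of a grid cell whose size depends on the permitted granularity. The sketch below is an assumption-laden illustration: the permission levels, cell sizes and function names are invented, and deterministic rounding of this kind is plain coarsening that does not by itself provide a formal differential-privacy guarantee.

```python
import math

# Hypothetical permission levels mapped to grid cell sizes in degrees.
# None means the exact prediction may be returned.
GRANULARITY_DEGREES = {"exact": None, "neighbourhood": 0.01, "city": 0.1}


def coarsen(lat, lon, cell_degrees):
    """Snap a coordinate to the centre of its grid cell."""
    def snap(value):
        return math.floor(value / cell_degrees) * cell_degrees + cell_degrees / 2

    return snap(lat), snap(lon)


def apply_permission(lat, lon, level):
    """Return the prediction at the precision allowed by the owner's permission level."""
    cell = GRANULARITY_DEGREES[level]
    return (lat, lon) if cell is None else coarsen(lat, lon, cell)


# Made-up prediction: every point in the same cell maps to the same output.
city_level = apply_permission(52.3731, 4.8922, "city")
print(city_level)
```

Because all predictions inside a cell collapse to one point, the response reveals the cell but not where within it the photo was taken.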

At GeoPin, we have designed our system with privacy in mind from the beginning. We do not store uploaded images beyond what is needed for processing. We do not build profiles of users or their queries. And we are actively researching on-device embedding computation that would allow sensitive use cases to benefit from geolocation without transmitting the original image.

Beyond the Netherlands

GeoPin launched with full coverage of the Netherlands because focus produces better results than breadth. By concentrating on one country, we could build a dense, high-quality index and optimise our model for the specific visual characteristics of Dutch landscapes and architecture.

But the underlying technology is not inherently limited to the Netherlands. The same pipeline (data ingestion, quality filtering, GPU-accelerated embedding and vector retrieval) can be applied to any geography where sufficient reference imagery is available. Belgium, with its similar urban landscape, is a logical next step. Germany, France and the broader European Union offer progressively larger markets with strong demand for geolocation in journalism, law enforcement and cultural heritage.

Our roadmap envisions phased geographic expansion, with each new country receiving the same depth of indexing and accuracy benchmarking we applied to the Netherlands. We would rather cover ten countries with high accuracy than fifty with moderate accuracy.

What this means for users

For current GeoPin users, these trends translate into a product that becomes progressively more capable. You can expect improved accuracy on difficult categories like night images and rural photos, new input modalities like video, and expanded geographic coverage — all delivered through the same API and web interface you use today.

The future of AI-powered geolocation is not just about better models. It is about building responsible, transparent and genuinely useful tools that serve the people who need them: journalists verifying breaking news, researchers studying urban change, families tracing their heritage, and organisations protecting their communities. That mission drives everything we build.