The challenge of visual place recognition

Imagine showing a computer two photos: one taken on a sunny July afternoon, the other at the same intersection during a foggy November morning. The lighting is completely different. Some trees have leaves in one photo and bare branches in the other. Different cars are parked on the street. A shop has changed its sign. For a human, with enough attention, these are recognisably the same place. For a naive computer vision system, they might look entirely different.

This is the core challenge of visual place recognition (VPR): determining whether two images depict the same location despite dramatic changes in appearance. CosPlace, introduced in 2022 by Gabriele Berton and colleagues at the Polytechnic University of Turin, represents a significant advance in solving this problem at large scale.

Why ResNet-152?

CosPlace can be trained with several backbones. The variant discussed here uses ResNet-152, a convolutional neural network architecture with 152 layers. The choice is worth examining.

ResNet introduced skip connections: shortcuts that let a block's input bypass its layers and be added back to its output. This addressed the degradation problem, in which making a plain network deeper could make its accuracy worse rather than better. With skip connections, depth becomes an asset. ResNet-152 learns progressively more abstract visual features:

Early layers detect low-level features: edges, corners, colour gradients, simple textures.
Middle layers combine these into higher-level patterns: window shapes, rooflines, road surfaces, vegetation textures.
Deep layers encode complex spatial arrangements: the relationship between a building facade and the street in front of it, the pattern of a neighbourhood’s layout.

For place recognition, this depth matters. Recognising a place is not about matching a single feature — it is about encoding the overall spatial arrangement of many features. A 152-layer network has the capacity to build rich, layered representations that capture these complex relationships.
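
The skip connection described above can be illustrated with a deliberately simplified sketch. It assumes small fully connected layers in plain Python rather than the convolutions and batch normalisation a real ResNet block uses, and the names (residual_block, w1, w2) are illustrative only. The point it shows is that a block whose learned weights contribute nothing still passes its input through unchanged, which is why stacking more blocks need not hurt.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(weights, v):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(v, w1, w2):
    """Compute relu(v + F(v)), where F is linear -> ReLU -> linear.

    Simplification: real ResNet blocks use convolutions and batch norm.
    """
    f = linear(w2, relu(linear(w1, v)))
    # The skip connection: add the block's input to the block's output.
    return relu([a + b for a, b in zip(v, f)])


# With all-zero weights F(v) is zero, so the block reduces to the identity
# (for non-negative inputs): the input survives untouched.
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]
y = residual_block(x, zeros, zeros)
```

Without the addition of v, the same zero-weight block would output all zeros and destroy the signal, which gives some intuition for why very deep plain networks were hard to train.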

The choice of ResNet-152 over newer architectures like Vision Transformers (ViTs) reflects a practical trade-off. While ViTs have shown strong performance on many vision tasks, ResNet-152 is a well-understood architecture with mature, heavily optimised inference tooling. For a production system processing thousands of images, that predictability and efficiency matter.

The training innovation: groups, not pairs

CosPlace’s key innovation is not the backbone architecture but how the model is trained. Earlier VPR methods suffered from a persistent bottleneck: pair mining.

Traditional metric learning for place recognition required carefully selecting training pairs — images of the same place (positive pairs) and images of different places (negative pairs). The quality of these pairs had a dramatic effect on the resulting model. Too-easy negatives, and the model would not learn to handle challenging cases. Too-hard negatives, and training would collapse. Finding the right balance required expensive mining procedures that scaled poorly.

CosPlace sidesteps this entirely with a group-based classification approach. Here is how it works:

1. Geographic partitioning. The training area is divided into a grid of geographic cells. Each cell represents a distinct “place.”

2. Group assignment. All training images are assigned to groups based on the geographic cell in which they were photographed. Images in the same cell belong to the same group.

3. Classification training. The model is trained as a classifier: given an image, predict which geographic cell it belongs to. This is a standard classification task that requires no explicit pair mining.

4. Embedding extraction. After training, the classification head is discarded. The output of the penultimate layer — a 512-dimensional vector — serves as the image embedding for retrieval.
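
Steps 1 and 2 above can be sketched in a few lines. This is a minimal illustration, not the CosPlace implementation: it assumes each image already has a position in metres (for example UTM easting and northing), and the cell size, function names and sample file names are all hypothetical.

```python
import math
from collections import defaultdict

CELL_SIZE_M = 10.0  # hypothetical cell side length in metres


def cell_id(easting_m, northing_m, cell_size=CELL_SIZE_M):
    """Map a position in metres to the (column, row) of its grid cell."""
    return (math.floor(easting_m / cell_size),
            math.floor(northing_m / cell_size))


def build_labels(images):
    """Turn (name, easting, northing) records into classification labels.

    Images that fall in the same cell receive the same integer label.
    """
    cells = defaultdict(list)
    for name, easting, northing in images:
        cells[cell_id(easting, northing)].append(name)
    # Number the occupied cells in sorted order to get stable class indices.
    label_of = {cell: i for i, cell in enumerate(sorted(cells))}
    return {name: label_of[cell]
            for cell, names in cells.items()
            for name in names}


images = [
    ("a.jpg", 3.0, 4.0),   # cell (0, 0)
    ("b.jpg", 8.5, 1.0),   # cell (0, 0), same place as a.jpg
    ("c.jpg", 12.0, 4.0),  # cell (1, 0), a different place
]
labels = build_labels(images)
```

The resulting labels are what the classifier in step 3 is trained to predict. No positive or negative pairs are ever selected; the grid does that work implicitly.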

This approach has several elegant properties. The geographic grouping naturally creates a training signal — images in the same cell should produce similar embeddings, images in different cells should not — without complex pair selection logic. The classification loss is stable and well-understood, making training straightforward. And the approach scales easily to larger datasets and finer geographic resolutions.

What the embedding captures

The 512-dimensional vector that CosPlace produces is packed with information but is not human-interpretable. Individual dimensions do not correspond to identifiable concepts like “has a church” or “near water.” The information is distributed across all dimensions in complex combinations.

What the embedding is roughly invariant to: changes in lighting and weather, seasonal variation, transient objects (parked cars, pedestrians), moderate viewpoint changes, and time of day.

What it roughly encodes: spatial layout of permanent structures, architectural style, road and infrastructure patterns, vegetation type and terrain features.

No model is perfectly invariant to all appearance changes or perfectly sensitive to all location-relevant features. CosPlace represents a learned trade-off between ignoring irrelevant variation and retaining location-discriminative information.

CosPlace vs. earlier approaches

NetVLAD (2016) combined a CNN with a VLAD aggregation layer but required careful triplet mining and produced large descriptor vectors. GeM (generalised mean pooling) simplified aggregation, but the retrieval models built on it were still trained with contrastive losses that required pair mining.

CosPlace achieves competitive or superior performance while being dramatically simpler to train. The group-based classification eliminates pair mining entirely, and the 512-dimensional embeddings are compact enough for efficient retrieval. On the Pitts250k benchmark, CosPlace achieves over 90% recall@1: for more than 90% of query images, the top-ranked result is a correct match.
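
To make the recall@1 figure concrete, here is a small sketch of how recall@k is computed from ranked retrieval results. The query and place identifiers are made up, and how a result counts as "correct" (typically a distance threshold around the query's true position) is benchmark-specific and simply taken as given here.

```python
def recall_at_k(ranked_results, correct, k=1):
    """Fraction of queries with at least one correct result in the top k.

    ranked_results: query id -> list of reference ids, best match first.
    correct: query id -> set of reference ids that count as a correct match.
    """
    hits = sum(
        1
        for query, ranked in ranked_results.items()
        if any(ref in correct[query] for ref in ranked[:k])
    )
    return hits / len(ranked_results)


# Toy data: four queries, each with its two best-ranked references.
ranked = {
    "q1": ["p3", "p1"],
    "q2": ["p7", "p2"],  # correct place only appears at rank 2
    "q3": ["p5", "p9"],
    "q4": ["p4", "p8"],
}
correct = {"q1": {"p3"}, "q2": {"p2"}, "q3": {"p5"}, "q4": {"p4"}}

r1 = recall_at_k(ranked, correct, k=1)  # 3 of 4 queries hit at rank 1
r2 = recall_at_k(ranked, correct, k=2)  # all 4 queries hit within rank 2
```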

From research to production

Deploying CosPlace in production involves practical considerations that go beyond what research papers cover. We run inference through ONNX Runtime for optimised execution speed. All embeddings are L2-normalised before storage, reducing cosine similarity to a simple dot product compatible with our vector database. Every image in our coverage of the Netherlands is pre-processed and stored with its embedding and GPS coordinates.
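
The claim that L2 normalisation reduces cosine similarity to a dot product is easy to check. The sketch below uses tiny three-dimensional vectors in place of 512-dimensional embeddings and a brute-force search in place of a vector database. All names and values are illustrative and say nothing about our actual storage layer.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed the long way, on unnormalised vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def best_match(query, index):
    """Return the name of the stored unit vector most similar to the query."""
    q = l2_normalise(query)
    # On unit vectors the dot product equals the cosine similarity.
    return max(index, key=lambda name: dot(q, index[name]))


# Normalise once at indexing time, so queries only need dot products.
raw = {"ref_a": [3.0, 4.0, 0.0], "ref_b": [0.0, 1.0, 5.0]}
index = {name: l2_normalise(vec) for name, vec in raw.items()}

query = [6.0, 7.0, 1.0]
match = best_match(query, index)
```

A production index would replace the linear scan in best_match with an approximate nearest-neighbour structure, but the similarity being computed is the same.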

Quality control matters too. Images with heavy occlusion, extreme exposure or camera artefacts are filtered out before indexing, since the quality of the reference database directly affects retrieval results.

CosPlace gives GeoPin a robust, efficient foundation for visual place recognition. It converts the messy, variable visual world into a tidy mathematical space where similar places are near and different places are far. Everything in our pipeline builds on that foundation.