LoQI-VPR

Distillation Improves Visual Place Recognition for Low Quality Images

New York University

Snapshot

A knowledge-distillation approach improves visual place recognition accuracy on low-quality images by learning from high-quality data, achieving significant recall improvements across diverse VPR methods and datasets.


Retrieval Figure

Method

Overview Image

Distillation Architecture: Our goal is to enable existing VPR methods to extract more representative global descriptors from lower-quality images. We therefore apply knowledge distillation at the descriptor-extraction stage of VPR: for any VPR method, a student descriptor extractor that processes low-quality images is trained to approximate the output of its teacher counterpart, which processes the corresponding high-quality images.
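For illustration, below is a minimal PyTorch-style sketch of one descriptor-level distillation step, assuming hypothetical `student` and `teacher` modules that each map a batch of images to global descriptors of the same dimension; the actual training pipeline, loss choices, and descriptor normalization used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, img_low, img_high, optimizer):
    """One sketch of a descriptor-level distillation step: the student sees the
    low-quality image and is pushed toward the global descriptor that the
    frozen teacher extracts from the corresponding high-quality image."""
    teacher.eval()
    with torch.no_grad():
        d_teacher = teacher(img_high)      # (B, D) global descriptors
    d_student = student(img_low)           # (B, D)

    # Simple MSE distillation on L2-normalized descriptors; other losses
    # (e.g., ICKD, triplet) and their combinations are also explored.
    loss = F.mse_loss(F.normalize(d_student, dim=-1),
                      F.normalize(d_teacher, dim=-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```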


Performance Improvements after Distillation

For low-quality images (\( I^l \)), the loss combination producing the highest recall is compared against a fine-tuning baseline. For both distilled and fine-tuned weights, the change in VPR recall is reported as a delta relative to each method's performance with pretrained weights on \( I^l \). Within each dataset and method, green marks the greatest improvement at each R@N, and red marks any decrease relative to pretrained performance. Recall rates with pretrained weights on unmodified high-quality images (\( I^h \)) are provided for reference.

| VPR Method   | Configuration          | Mapillary SLS               | Nordland                    | Tokyo 24/7                  |
|              |                        | R@1    R@2    R@5    R@10   | R@1    R@2    R@5    R@10   | R@1    R@2    R@5    R@10   |
| MixVPR       | pretrained (\( I^h \)) | 82.73  86.67  89.73  91.65  | 57.79  64.13  71.49  76.41  | 87.30  89.52  92.06  93.65  |
|              | pretrained (\( I^l \)) | 71.87  76.61  81.22  84.19  | 31.05  36.12  44.13  51.23  | 66.03  73.33  78.73  82.22  |
|              | finetuned              | +4.43  +3.86  +3.60  +2.99  | +13.37 +14.28 +14.64 +14.46 | +9.52  +7.62  +6.03  +5.40  |
|              | ICKD                   | +4.33  +3.69  +3.20  +2.93  | +15.00 +16.59 +16.78 +15.94 | +8.25  +6.03  +6.35  +6.67  |
| CricaVPR     | pretrained (\( I^h \)) | 74.74  80.92  86.09  88.75  | 87.64  90.65  94.24  95.80  | 90.16  92.38  95.56  96.19  |
|              | pretrained (\( I^l \)) | 68.14  74.41  80.24  83.19  | 63.51  69.71  77.46  82.64  | 74.29  79.68  84.76  87.94  |
|              | finetuned              | -0.12  -0.64  -0.77  -0.23  | -2.97  -2.93  -2.97  -2.79  | +5.08  +4.44  +2.54  +1.90  |
|              | ICKD                   | +0.89  +1.39  +1.11  +1.14  | +11.63 +10.80 +8.95  +7.03  | +6.98  +5.08  +6.03  +4.76  |
| DINOv2 SALAD | pretrained (\( I^h \)) | 89.20  92.40  94.70  95.84  | 88.08  90.94  94.13  95.98  | 97.14  97.46  98.73  99.05  |
|              | pretrained (\( I^l \)) | 84.60  88.85  91.89  93.68  | 67.90  73.88  80.40  84.24  | 89.21  92.06  95.87  96.51  |
|              | finetuned              | -0.22  -0.57  -0.60  -0.76  | +0.04  -0.40  -0.76  -0.04  | -0.63   0.00  -0.95   0.00  |
|              | ICKD                   | +0.52  +0.45  +0.20  +0.30  | +1.56  +1.81  +1.09  +1.20  | +1.59  +1.27  +0.32   0.00  |
| NetVLAD      | pretrained (\( I^h \)) | 49.29  55.24  62.07  67.29  |  5.51   6.67   8.62  11.38  | 60.63  63.81  69.21  74.29  |
|              | pretrained (\( I^l \)) | 32.60  37.67  44.92  50.13  |  1.99   3.01   4.93   7.07  | 27.94  33.02  41.90  47.94  |
|              | finetuned              | +0.03  +0.01  -0.02  -0.01  |  0.00   0.00   0.00   0.00  |  0.00   0.00   0.00   0.00  |
|              | MSE                    | +4.46  +5.32  +5.09  +5.01  | +0.54  +0.43  -0.18  -0.72  | +8.25  +6.35  +6.03  +5.08  |
| AnyLoc       | pretrained (\( I^h \)) | 56.51  62.31  68.28  72.89  | 12.90  16.23  20.18  24.28  | 88.25  91.11  94.92  96.83  |
|              | pretrained (\( I^l \)) | 48.04  56.45  63.84  69.17  | 10.11  12.97  17.86  21.92  | 83.49  87.30  91.75  95.87  |
|              | ICKD + Triplet         | +0.99  +1.54  +2.68  +2.44  | +1.74  +1.88  +1.52  +2.54  | -5.71  -1.59  +1.59  -0.95  |

This table summarizes the effectiveness of the proposed loss functions for various VPR methods across different datasets. Distillation generally yields positive recall improvements over pretrained weights, and outperforms fine-tuning in both consistency and magnitude. The results for MixVPR and CricaVPR on Nordland show the clearest advantage of distillation. Cases where distillation did not improve recall, such as NetVLAD on Nordland or AnyLoc on Tokyo 24/7, suggest method-dataset interactions that warrant further analysis.
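For clarity, R@N counts a query as successful if any of its top-N retrieved database images depicts the query's true place. Below is a minimal sketch of this metric, assuming L2-normalized global descriptors and a hypothetical `gt_matches` list giving the set of correct database indices for each query; this is an illustration, not the paper's evaluation code.

```python
import numpy as np

def recall_at_n(query_desc, db_desc, gt_matches, ns=(1, 2, 5, 10)):
    """query_desc: (Q, D), db_desc: (M, D) L2-normalized descriptors;
    gt_matches[i] is the set of database indices showing query i's place."""
    sims = query_desc @ db_desc.T            # cosine similarity via dot product
    ranked = np.argsort(-sims, axis=1)       # best match first
    hits = {n: 0 for n in ns}
    for i, order in enumerate(ranked):
        for n in ns:
            if gt_matches[i] & set(order[:n].tolist()):
                hits[n] += 1
    return {n: 100.0 * hits[n] / len(query_desc) for n in ns}
```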

Loss Combinations Image

The figure above further illustrates the impact of different loss combinations on VPR recall rates across methods. The consistent trends indicate that ICKD and MSE losses are generally effective, with a clear advantage for MixVPR and CricaVPR. While NetVLAD sees greater benefits from triplet loss on Nordland, ICKD and MSE prove to be the most reliable choices for maximizing VPR performance across methods and datasets.
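For concreteness, here is a minimal sketch of an ICKD-style inter-channel correlation loss and one possible ICKD + triplet combination, assuming 4-D encoder feature maps and precomputed anchor/positive/negative global descriptors; the exact feature level, normalization, and loss weighting used in our experiments may differ.

```python
import torch
import torch.nn.functional as F

def inter_channel_correlation(feat):
    """feat: (B, C, H, W) encoder feature map -> (B, C, C) channel correlation."""
    b, c, h, w = feat.shape
    f = F.normalize(feat.view(b, c, h * w), dim=-1)   # unit-norm per channel
    return torch.bmm(f, f.transpose(1, 2))            # channel-to-channel similarity

def ickd_loss(student_feat, teacher_feat):
    """Match the student's inter-channel correlation structure to the teacher's."""
    return F.mse_loss(inter_channel_correlation(student_feat),
                      inter_channel_correlation(teacher_feat))

def ickd_plus_triplet(student_feat, teacher_feat,
                      anchor, positive, negative, margin=0.1, alpha=1.0):
    """Example combination (cf. the AnyLoc row above): ICKD on encoder features
    plus a triplet margin loss on global descriptors, weighted by alpha."""
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return alpha * ickd_loss(student_feat, teacher_feat) + triplet
```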


Activation Maps



The activation maps below illustrate the effect of distillation on feature extraction for three VPR methods: CricaVPR, NetVLAD, and MixVPR. By highlighting where each method's encoder focuses, these maps show how distillation shifts attention toward more informative, distinctive scene features while suppressing less relevant elements such as sky and repetitive foreground patterns. As a general reference for activation-map calculation applicable to VPR, a sketch of one such computation is given after the figure.

Activation map grid: query images shown alongside the corresponding CricaVPR, NetVLAD, and MixVPR activation maps (three examples).
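As referenced above, here is a minimal sketch of one generic way to compute such an activation map, assuming an encoder that returns a (B, C, H, W) feature map; it illustrates the general idea rather than the exact formulation used to produce the figures.

```python
import torch
import torch.nn.functional as F

def activation_map(encoder, image):
    """Channel-wise energy of the encoder's last feature map, upsampled to the
    input resolution and min-max normalized for overlay on the query image."""
    encoder.eval()
    with torch.no_grad():
        feat = encoder(image)                              # assumed (B, C, H, W)
    energy = feat.pow(2).mean(dim=1, keepdim=True)         # (B, 1, H, W)
    energy = F.interpolate(energy, size=image.shape[-2:],
                           mode="bilinear", align_corners=False)
    mn = energy.amin(dim=(-2, -1), keepdim=True)
    mx = energy.amax(dim=(-2, -1), keepdim=True)
    return (energy - mn) / (mx - mn + 1e-8)                # heat map in [0, 1]
```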


Navigation Demo Video

This video demonstrates the navigation capabilities of the VPR system. Observe how the system accurately guides the user through complex environments with real-time visual recognition and localization.


BibTeX


@misc{yang2024distillationimprovesvisualplace,
      title={Distillation Improves Visual Place Recognition for Low Quality Images}, 
      author={Anbang Yang and Ge Jin and Junjie Huang and Yao Wang and John-Ross Rizzo and Chen Feng},
      year={2024},
      eprint={2310.06906},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2310.06906}, 
}

Acknowledgements

This work is supported in part by NSF Grants 2238968 and 2345139; by the National Eye Institute and Fogarty International Center under Grants R21EY033689, R33EY033689, and R01EY036667; and by NYU IT High Performance Computing resources, services, and staff expertise. We thank Zezheng Li and Liyuan Geng for their valuable assistance with test data preprocessing, which helped facilitate the progress of this work.