Distillation Improves Visual Place Recognition for Low-Quality Queries

New York University,
Brooklyn, NY 11201, USA

VPR4LQQ: We introduce a knowledge-distillation approach that improves Visual Place Recognition (VPR) performance on compressed query images. Using the Pitts250k dataset and our custom indoor dataset, we demonstrate the efficacy of our model, especially when trained with the Inter-Channel Correlation Knowledge Distillation (ICKD) loss. The video shows a comparative animation highlighting our model's advantage over the NetVLAD baseline in various scenarios.

Abstract

The shift to online computing for real-time visual localization often requires streaming query images/videos to a server for visual place recognition (VPR), where fast video transmission may result in reduced resolution or increased quantization. This compromises the quality of global image descriptors, leading to decreased VPR performance. To improve the low recall rate for low-quality query images, we present a simple yet effective method that uses high-quality queries only during training to distill better feature representations for deep-learning-based VPR, such as NetVLAD. Specifically, we use a mean squared error (MSE) loss between the global descriptors of queries with different qualities, and an inter-channel correlation knowledge distillation (ICKD) loss over their corresponding intermediate features. We validate our approach using both the Pittsburgh 250k dataset and our own indoor dataset with varying quantization levels. By fine-tuning NetVLAD parameters with our distillation-augmented losses, we achieve notable VPR recall-rate improvements for low-quality queries, as demonstrated in our extensive experimental results. We believe this work not only pushes forward VPR research but also provides valuable insights for applications needing dependable place recognition under resource-limited conditions.

Method

Our approach to Visual Place Recognition (VPR) leverages knowledge distillation to make a student network mimic a more capable teacher network. Rooted in the NetVLAD algorithm, we employ a dual-branch distillation model comprising a student branch and a teacher branch, both built on the VGG-16 architecture. The student branch processes low-quality images, while the teacher processes high-quality images. To train the student, we evaluated three loss functions: an Inter-Channel Correlation Knowledge Distillation (ICKD) loss, a Mean Squared Error (MSE) loss, and a Weakly Supervised Triplet Ranking Loss. However, our experiments indicated that a composite loss excluding the triplet loss yielded the best results. The final loss function is a weighted sum of the ICKD and MSE losses.
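For concreteness, the sketch below illustrates how such a composite objective could be computed in PyTorch. It is a minimal illustration under stated assumptions, not our exact training code: the branch interfaces (`student` and `teacher` each returning an intermediate convolutional feature map plus a NetVLAD descriptor), the choice of distillation layer, and the weight `lambda_ickd` are placeholders for exposition.

```python
import torch
import torch.nn.functional as F

def ickd_loss(feat_s, feat_t):
    # Inter-channel correlation distillation: match the C x C channel
    # correlation matrices of student and teacher feature maps.
    # feat_*: (B, C, H, W) intermediate features, e.g. the last VGG-16 conv block.
    b, c, h, w = feat_s.shape
    fs = feat_s.view(b, c, -1)                             # (B, C, H*W)
    ft = feat_t.view(b, c, -1)
    corr_s = torch.bmm(fs, fs.transpose(1, 2)) / (h * w)   # (B, C, C)
    corr_t = torch.bmm(ft, ft.transpose(1, 2)) / (h * w)
    return F.mse_loss(corr_s, corr_t)

def distillation_loss(student, teacher, lq_img, hq_img, lambda_ickd=1.0):
    # Weighted sum of descriptor MSE and ICKD over intermediate features.
    # `student`/`teacher` are assumed to return (conv_feature, vlad_descriptor).
    feat_s, desc_s = student(lq_img)           # low-quality query branch
    with torch.no_grad():                      # teacher branch stays frozen
        feat_t, desc_t = teacher(hq_img)       # high-quality query branch
    loss_mse = F.mse_loss(desc_s, desc_t)      # global NetVLAD descriptors
    loss_ickd = ickd_loss(feat_s, feat_t)      # intermediate VGG-16 features
    return loss_mse + lambda_ickd * loss_ickd
```

During fine-tuning, only the student branch would typically be updated with this loss, while the teacher branch keeps its pretrained NetVLAD weights fixed.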

Method Illustration

Experiment and Result

We validated our distillation model using the Pitts250k dataset, a cornerstone of VPR research, and a custom dataset collected on the 6th floor of the Lighthouse Guild. The Pitts250k dataset was split into training, database, and validation sets and downsampled to several distinct resolutions. Our dataset emphasizes the effects of video bitrate and resolution, and required extensive video processing, including downsampling and quantization. After processing, frames were extracted, and ground-truth locations were obtained using a combination of OpenVSLAM and the Aligner GUI.
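As a rough illustration of the degradation step, the snippet below downsamples extracted frames and re-encodes them with aggressive JPEG quantization. The file paths, scale factor, and quality level are placeholders, not the exact settings used; the actual dataset preparation also involved video bitrate transcoding and pose annotation with OpenVSLAM and the Aligner GUI.

```python
import os
import cv2

def degrade_frame(frame, scale=0.5, jpeg_quality=20):
    # Simulate a low-quality query: spatially downsample the frame, then
    # re-encode it with aggressive JPEG quantization.
    h, w = frame.shape[:2]
    small = cv2.resize(frame, (int(w * scale), int(h * scale)),
                       interpolation=cv2.INTER_AREA)
    ok, buf = cv2.imencode(".jpg", small, [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR) if ok else small

# Extract frames from a walkthrough video and save degraded copies.
os.makedirs("frames_lq", exist_ok=True)
cap = cv2.VideoCapture("walkthrough.mp4")     # placeholder path
idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imwrite(f"frames_lq/{idx:06d}.jpg", degrade_frame(frame))
    idx += 1
cap.release()
```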

Result figure

The following videos and figures present heatmaps indicating the regions of interest for four models trained with different loss settings. Our model (MSE+ICKD loss) predominantly concentrates on critical regions while overlooking areas prone to visual aliasing. This focused attention may underpin the superior performance of this loss setting.

Attention Map
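The exact attention visualization used for these figures is not detailed here; below is a minimal sketch of one common way to render such heatmaps from a branch's intermediate feature map, with all variable names and blending weights illustrative.

```python
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def attention_heatmap(feature_map, image_bgr):
    # Collapse a (1, C, H, W) feature map to a spatial attention map and
    # overlay it on the original image for qualitative inspection.
    attn = feature_map.pow(2).mean(dim=1, keepdim=True)           # (1, 1, H, W)
    attn = F.interpolate(attn, size=image_bgr.shape[:2],
                         mode="bilinear", align_corners=False)
    attn = attn.squeeze().detach().cpu().numpy()
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    heat = cv2.applyColorMap((attn * 255).astype(np.uint8), cv2.COLORMAP_JET)
    return cv2.addWeighted(image_bgr, 0.5, heat, 0.5, 0)
```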

BibTeX

BibTex Code Here