NYC-Event-VPR  

A Large-Scale High-Resolution Event-Based Visual Place Recognition Dataset in Dense Urban Environments

Submitted to ICRA 2024

New York University, Brooklyn, NY 11201, USA



Dataset visualization




Downtown Manhattan in event frame, reconstructed frame, and RGB frame.

Abstract

Visual place recognition (VPR) enables autonomous robots to identify previously visited locations, which contributes to tasks like simultaneous localization and mapping (SLAM). VPR faces challenges such as accurate retrieval of nearest-neighbor images and appearance changes in scenery.

Event cameras, also known as dynamic vision sensors, are a new sensor modality for VPR and offer a promising solution to these challenges thanks to their unique attributes: high temporal resolution (1 MHz clock), ultra-low latency (on the order of microseconds), and high dynamic range (>120 dB). These attributes make event cameras less susceptible to motion blur and more robust under variable lighting, making them well suited to addressing VPR challenges. However, the scarcity of event-based VPR datasets, partly due to the novelty and cost of event cameras, hampers their adoption.

To fill this data gap, our paper introduces the NYC-Event-VPR dataset to the robotics and computer vision communities, featuring the Prophesee IMX636 HD event sensor (1280x720 resolution) combined with an RGB camera and a GPS module. It encompasses over 13 hours of geotagged event data spanning 260+ kilometers across New York City, covering diverse lighting and weather conditions, day/night scenarios, and multiple visits to various locations.
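Each recording pairs the event stream with GPS fixes, i.e., the data are geotagged. As a rough sketch of how a fixed-length event window could be associated with the nearest GPS fix in time, the snippet below assumes the events and GPS log have already been decoded into NumPy structured arrays with fields `t`/`x`/`y`/`p` and `t`/`lat`/`lon` respectively; this layout and the 100 ms window are illustrative assumptions, not the dataset's actual API or settings.

```python
import numpy as np

WINDOW_US = 100_000  # 100 ms event windows (assumed, not the dataset's official setting)

def geotag_windows(events, gps):
    """Assign each event window the GPS fix nearest in time.

    events: structured array with fields 't' (us), 'x', 'y', 'p'  (assumed layout)
    gps:    structured array with fields 't' (us), 'lat', 'lon'   (assumed layout, sorted by 't')
    Returns a list of (window_start_us, lat, lon) tuples.
    """
    t0, t1 = events['t'][0], events['t'][-1]
    starts = np.arange(t0, t1, WINDOW_US)
    tags = []
    for start in starts:
        center = start + WINDOW_US // 2
        # Nearest GPS fix in time (GPS is sampled far more sparsely than events).
        i = int(np.clip(np.searchsorted(gps['t'], center), 0, len(gps) - 1))
        if i > 0 and abs(gps['t'][i - 1] - center) < abs(gps['t'][i] - center):
            i -= 1
        tags.append((int(start), float(gps['lat'][i]), float(gps['lon'][i])))
    return tags
```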

Furthermore, our paper employs the VPR-Bench framework to conduct generalization performance assessments, promoting innovation in event-based VPR and its integration into robotics applications.


Dataset


| Duration (hr) | Data size (GB) | Modality | Distance (km) | Weather | Lighting conditions | Resolution (px) |
|---|---|---|---|---|---|---|
| 13.5 | 466.7 | event, RGB, GPS | 259.95 | rainy, cloudy, sunny | day, night | 1280 x 720 |

NYC-Event-VPR dataset statistics.



NYC-Event-VPR covers New York City, focusing on the Chinatown area in Manhattan with overlapping traversals.



Sensor setup and mounting design: the RGB camera is mounted on top of the event camera, and the sensor suite faces forward behind the vehicle's front windshield.


| Type | Specification |
|---|---|
| Prophesee EVK4 HD | IMX636ES (HD) event vision sensor; resolution (px): 1280x720; latency (µs): 220; dynamic range (dB): >86; power consumption: 500 mW-1.5 W; pixel size (µm): 4.86x4.86; camera max bandwidth (Gbps): 1.6; interface: USB 3.0 |
| ELP USB Camera | CMOS 1080p sensor; resolution (px): 1280x720; interface: USB 2.0; 5-50 mm varifocal lens |
| SparkFun GPS-RTK-SMA | Horizontal accuracy: 2.5 m w/o RTK; max altitude: 50 km; max velocity: 500 m/s; constellations: GPS, GLONASS, Galileo, BeiDou |

Sensor specifications.



Example images from the processed dataset (columns, left to right): naive conversion, E2VID reconstruction, RGB reference. Each row shows the same visual scene; each column comes from the same processed variant of the dataset.
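The naive conversion in the left column can be approximated by simply accumulating event polarities over a short time window into a 1280x720 image. Below is a minimal sketch of that idea, assuming events are already decoded into a NumPy structured array with fields `t`, `x`, `y`, `p`; the window length and intensity scaling are illustrative choices, not the dataset's exact conversion settings.

```python
import numpy as np

HEIGHT, WIDTH = 720, 1280  # IMX636 sensor resolution

def naive_event_frame(events, t_start, duration_us=33_000):
    """Accumulate events in [t_start, t_start + duration_us) into a grayscale frame.

    Positive events brighten a pixel, negative events darken it, starting from
    mid-gray. `events` is assumed to have fields 't' (us), 'x', 'y', 'p' (0/1).
    """
    mask = (events['t'] >= t_start) & (events['t'] < t_start + duration_us)
    ev = events[mask]
    frame = np.full((HEIGHT, WIDTH), 128, dtype=np.int32)
    polarity = np.where(ev['p'] > 0, 1, -1)
    # Sum signed polarities per pixel; repeated events at the same pixel accumulate.
    np.add.at(frame, (ev['y'], ev['x']), polarity * 16)
    return np.clip(frame, 0, 255).astype(np.uint8)
```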



Benchmark


| Datasets | NetVLAD | RegionVLAD | HOG | AMOSNet | HybridNet | CALC |
|---|---|---|---|---|---|---|
| NYC-Event-VPR-Naive-5m | 32.53 | 24.85 | 25.13 | 65.76 | 68.69 | 74.17 |
| NYC-Event-VPR-Naive-2m | 25.68 | 20.35 | 21.02 | 57.74 | 61.84 | 66.87 |
| NYC-Event-VPR-E2VID-5m | 74.06 | 77.56 | 86.03 | 86.22 | 85.51 | 84.69 |
| NYC-Event-VPR-E2VID-2m | 68.08 | 73.44 | 83.84 | 82.99 | 82.12 | 81.05 |
| NYC-Event-VPR-RGB-5m | 92.52 | 92.58 | 92.43 | 94.52 | 94.33 | 92.86 |
| NYC-Event-VPR-RGB-2m | 88.34 | 89.94 | 91.48 | 91.82 | 92.36 | 90.16 |
| Pittsburgh250K | 94.36 | 73.34 | 0.27 | 8.53 | 8.70 | 2.05 |
| Nordland | 8.49 | 13.33 | 2.89 | 30.13 | 17.52 | 12.91 |

Quantitative results: area under the precision-recall curve (AUC-PR), in percent.

The top six datasets are derived from NYC-Event-VPR; each row reports six benchmark tests on that dataset. The bottom two are curated subsets of their respective datasets, provided by VPR-Bench. All tests use pretrained weights provided by VPR-Bench.
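AUC-PR summarizes each method's precision-recall trade-off over its matching scores. VPR-Bench computes this internally; the snippet below is only a minimal illustration of the metric, assuming you already have one matching score and one ground-truth label (true match within the distance threshold or not) per query. The arrays in the usage line are made-up values.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

def auc_pr(scores, labels):
    """Area under the precision-recall curve, as a percentage.

    scores: matching confidence of the top retrieved reference image for each query
    labels: 1 if that retrieval is a true match (within the GPS threshold), else 0
    """
    precision, recall, _ = precision_recall_curve(labels, scores)
    return 100.0 * auc(recall, precision)

# Toy usage with made-up numbers:
print(auc_pr(np.array([0.9, 0.8, 0.4, 0.3]), np.array([1, 1, 0, 1])))
```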


| CCT384+NetVLAD | Naive-5m | E2VID-5m | RGB-5m | Naive-15m | E2VID-15m | RGB-15m | Naive-25m | E2VID-25m | RGB-25m |
|---|---|---|---|---|---|---|---|---|---|
| Recall@1 | 33.2 | 48.7 | 57.7 | 51 | 70.5 | 77.8 | 57.3 | 77.9 | 86 |
| Recall@5 | 45.1 | 54.4 | 61.3 | 64 | 78.2 | 81.4 | 74.1 | 85.4 | 90.1 |
| Recall@10 | 48.7 | 55.8 | 62.4 | 68.6 | 80.3 | 82.4 | 79.5 | 87.1 | 91 |
| Recall@20 | 51.2 | 57.1 | 63.2 | 73.2 | 81.7 | 83.1 | 83.9 | 88.7 | 91.7 |

| ResNet50+NetVLAD | Naive-5m | E2VID-5m | RGB-5m | Naive-15m | E2VID-15m | RGB-15m | Naive-25m | E2VID-25m | RGB-25m |
|---|---|---|---|---|---|---|---|---|---|
| Recall@1 | 39.1 | 51.9 | 59.1 | 54.9 | 73.1 | 79 | 62.3 | 80.5 | 86.9 |
| Recall@5 | 48.9 | 54.7 | 62.2 | 68.6 | 78.4 | 82.4 | 76.4 | 85.8 | 91 |
| Recall@10 | 51.5 | 55.6 | 62.5 | 72.6 | 79.8 | 83.1 | 80.7 | 87.5 | 91.5 |
| Recall@20 | 53.1 | 56.2 | 62.8 | 75.8 | 80.9 | 83.5 | 83.7 | 89 | 92 |

Quantitative results: Recall@K, in percent.

The first table reports results for a model with a CCT384 (Compact Convolutional Transformer) backbone; the second for a model with a ResNet50 (Residual Network) backbone. In both cases, features are aggregated via NetVLAD.

All benchmarks are obtained by training the models on the NYC-Event-VPR dataset using the Deep Visual Geo-localization Benchmark framework.
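Recall@K here counts a query as localized if at least one of its K nearest database descriptors was captured within the GPS distance threshold (5 m, 15 m, or 25 m) of the query location. The sketch below mirrors that computation in spirit (not the framework's exact code), assuming global descriptors and per-image GPS coordinates are already available.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

EARTH_RADIUS_M = 6_371_000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between lat/lon points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def recall_at_k(db_desc, q_desc, db_gps, q_gps, k=5, threshold_m=25.0):
    """Percentage of queries with at least one true match among the top-k retrievals.

    db_desc, q_desc: (N, D) and (M, D) global descriptors
    db_gps, q_gps:   (N, 2) and (M, 2) latitude/longitude per image
    """
    nn = NearestNeighbors(n_neighbors=k).fit(db_desc)
    _, idx = nn.kneighbors(q_desc)  # (M, k) indices of nearest database images
    hits = 0
    for qi, cand in enumerate(idx):
        d = haversine_m(q_gps[qi, 0], q_gps[qi, 1], db_gps[cand, 0], db_gps[cand, 1])
        hits += bool(np.any(d <= threshold_m))
    return 100.0 * hits / len(q_desc)
```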


Acknowledgements

Chen Feng is the corresponding author (cfeng@nyu.edu). This work is supported by NSF Grant 2238968.