Self-supervised Spatial Reasoning on Multi-View Line Drawings

1New York University Tandon School of Engineering, 2University of California, Berkeley
* Equal contribution.
Chen Feng is the corresponding author.


Spatial reasoning on multi-view line drawings by state-of-the-art supervised deep networks has recently been shown to achieve puzzlingly low performance on the SPARE3D dataset. Motivated by the fact that self-supervised learning is helpful when a large amount of data is available, we propose two self-supervised learning approaches to improve the baseline performance on the view-consistency reasoning and camera-pose reasoning tasks in SPARE3D. For the first task, we use a self-supervised binary classification network to contrast the line drawing differences between various views of any two similar 3D objects, enabling the trained networks to learn detail-sensitive yet view-invariant line drawing representations of 3D objects. For the second task, we propose a self-supervised multi-class classification framework that trains a model to select the correct view from which a line drawing is rendered. Our method also benefits downstream tasks with camera poses unseen during training. Experiments show that our method significantly improves the baseline performance on SPARE3D, while several popular self-supervised learning methods do not.

Contrastive learning network for task T2I

Our contrastive learning network. We use the learned representations for the downstream task T2I. Front (F), Right (R), and Top (T) denote the three orthographic line drawings, and I denotes an isometric line drawing. a1, b1, a2, b2 are the encoded feature vectors; C and K are the dimensions of the latent vectors. fθ1, fθ2, fθ3, and fθ4 are CNNs; gφ1, gφ2, and hψ are MLPs; ⊕ denotes concatenation. BCE is the binary cross-entropy loss.
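The pairing objective behind this network can be sketched numerically. Below is a minimal numpy sketch, not the paper's implementation: the CNN and MLP encoders are replaced by fixed random vectors, the latent dimensions C and K are illustrative choices, and a hypothetical one-layer scoring head stands in for hψ. A binary cross-entropy loss is computed on a matching and a mismatched three-view/isometric pair.

```python
import numpy as np

def bce_loss(score, label):
    """Binary cross-entropy on a sigmoid-activated score."""
    p = 1.0 / (1.0 + np.exp(-score))
    return float(-(label * np.log(p) + (1 - label) * np.log(1 - p)))

def pair_score(three_view_feat, iso_feat, W, bias):
    """Hypothetical one-layer stand-in for the scoring head h_psi,
    applied to the concatenation (the ⊕ operation) of both features."""
    z = np.concatenate([three_view_feat, iso_feat])
    return float(W @ z + bias)

rng = np.random.default_rng(0)
C, K = 128, 128                      # illustrative latent dimensions C and K
a1 = rng.standard_normal(C)          # encoding of the F/R/T drawings of object A
b1 = rng.standard_normal(K)          # encoding of an isometric drawing of object A
b2 = rng.standard_normal(K)          # isometric drawing of a different, similar object B
W, bias = rng.standard_normal(C + K), 0.0

loss_pos = bce_loss(pair_score(a1, b1, W, bias), label=1)  # matching pair
loss_neg = bce_loss(pair_score(a1, b2, W, bias), label=0)  # mismatched pair
```

During self-supervised training, label 1 marks a three-view/isometric pair rendered from the same object, and label 0 a pair drawn from two different but similar objects, which is what pushes the encoders toward detail-sensitive yet view-invariant representations.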

Self-supervised learning network architecture for task I2P and P2I

Our self-supervised learning network (left subfigure). We use the learned representations for the downstream tasks I2P and P2I (right subfigure). fη1, fη2, fη3, and fη4 are CNNs; gω1, gω2, and gω3 are MLPs; c1, d1, e1 are the encoded feature vectors.
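The pose-selection objective can be illustrated as a softmax cross-entropy over candidate camera poses. The sketch below is a hedged simplification: the encoder outputs are replaced by random logits, and the number of candidate poses (8) is an assumption for illustration only.

```python
import numpy as np

def softmax_cross_entropy(logits, target):
    """Multi-class loss for selecting the camera pose a drawing was rendered from."""
    logits = logits - logits.max()                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return float(-log_probs[target])

rng = np.random.default_rng(1)
num_poses = 8                        # hypothetical number of candidate poses
logits = rng.standard_normal(num_poses)  # stand-in for the network's pose scores
loss = softmax_cross_entropy(logits, target=3)  # true pose index
pred = int(np.argmax(logits))                   # predicted pose index
```

The same classification head supports both directions: I2P scores poses given a drawing, and P2I scores drawings given a pose; in both cases training reduces to minimizing this cross-entropy over the candidates.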

Comparison of performance on T2I for SL method vs. SSL method.

SL and SSL denote supervised learning and self-supervised learning, respectively. 5K and 14K denote the amount of training data. Fine-tuning means we further use the 5K training samples in SPARE3D to fine-tune the parameters. For supervised learning, we evaluate the network performance with: (1) an early-fusion or late-fusion structure, and (2) with or without ImageNet pre-trained parameters.

SL method                        Accuracy (%)
early-fusion (5K)                55.0
early-fusion (pretrained, 5K)    30.6
late-fusion (pretrained, 5K)     25.2
early-fusion (14K)               63.6
early-fusion (pretrained, 14K)   51.4
late-fusion (pretrained, 14K)    27.4

SSL method                       Accuracy (%)
Jigsaw puzzle                    27.4
Colorization                     23.4
SimCLR                           31.0
RotNet                           30.6
Ours (NT-Xent loss)              48.4
Ours (BCE loss)                  74.9
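For reference, the NT-Xent loss compared against the BCE variant in the table is the normalized temperature-scaled cross-entropy from SimCLR. Below is a minimal numpy implementation for one batch of positive pairs; the temperature value and batch size are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np

def nt_xent(z_a, z_b, tau=0.5):
    """NT-Xent (SimCLR) loss. z_a[i] and z_b[i] are the two embeddings of
    the i-th positive pair; all other in-batch embeddings act as negatives."""
    z = np.vstack([z_a, z_b])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = (z @ z.T) / tau                              # temperature-scaled
    n = len(z_a)
    losses = []
    for i in range(2 * n):
        j = (i + n) % (2 * n)                  # index of the positive partner
        mask = np.ones(2 * n, dtype=bool)
        mask[i] = False                        # exclude self-similarity
        losses.append(-(sim[i, j] - np.log(np.exp(sim[i][mask]).sum())))
    return float(np.mean(losses))

rng = np.random.default_rng(2)
loss = nt_xent(rng.standard_normal((4, 16)), rng.standard_normal((4, 16)))
```

One plausible reading of the table is that the pairwise BCE objective, which directly contrasts two similar objects, fits the T2I task better than in-batch NT-Xent contrast; the source reports the accuracy gap (74.9% vs. 48.4%) but the explanation here is our gloss.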

Comparison of performance on I2P and P2I tasks for SL method vs. SSL method.

As the amount of training data increases, both supervised learning and our self-supervised learning method achieve higher accuracy. For task I2P, the best accuracy reaches 98.0%, and for task P2I, the best accuracy is 83.4%; both are obtained with the 40,000-sample dataset and our self-supervised learning method.

Data amount (K)    5      10     15     20     25     30     35     40
I2P (SL)           83.6   86.4   87.7   88.5   88.7   90.4   90.6   91.1
I2P (SSL)          88.7   93.2   95.1   96.4   96.7   97.7   97.5   98.0
P2I (SL)           65.4   67.1   68.5   67.8   69.8   69.6   68.5   70.4
P2I (SSL)          72.4   80.8   81.9   82.1   82.8   83.1   83.0   83.4

Attention maps for SL vs. SSL method in T2I task.

For each CAD model, the first row shows the line drawings. The second and third rows show the attention maps from supervised learning with early fusion and late fusion, respectively. The fourth row shows the attention maps from our method. N/A indicates no attention map for the corresponding view. Best viewed in color.


This research is supported by the NSF Future Manufacturing program under EEC-2036870. Siyuan Xiang gratefully thanks the IDC Foundation for its scholarship.