Abstract

LiDAR semantic segmentation models are typically trained from random initialization, as universal pre-training is hindered by the lack of large, diverse datasets. Moreover, most point cloud segmentation architectures incorporate custom network layers, limiting the transferability of advances from vision-based architectures. Inspired by recent advances in universal foundation models, we propose BALViT, a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders. Specifically, BALViT incorporates both range-view and bird's-eye-view LiDAR encoding mechanisms, which we combine through a novel 2D-3D adapter. While the range-view features are processed through a frozen image backbone, our bird's-eye-view branch enhances them through multiple cross-attention interactions. Thereby, we continuously improve the vision network with domain-dependent knowledge, resulting in a strong label-efficient LiDAR encoding mechanism. Extensive evaluations of BALViT on the SemanticKITTI and nuScenes benchmarks demonstrate that it outperforms state-of-the-art methods in small-data regimes.

Motivation

Teaser image

Traditional LiDAR methods rely on fully supervised learning, which requires large amounts of labeled data. Self-supervised learning, however, remains challenging for LiDAR due to the lack of diverse pre-training datasets that account for sensor variations. To reduce dependence on labeled data, some approaches leverage models pre-trained on other modalities. However, these methods still train the LiDAR sub-network from scratch and require precisely time-synchronized camera and LiDAR data with accurate calibration. We follow the paradigm of a universal foundation model where only the patch embedding and the decoder are tailored to 3D point clouds. Consequently, we directly leverage frozen vision backbones which we enhance with our 2D-3D adapter.

Technical Approach

Overview of our approach
Figure: Our network BALViT encodes a point cloud in orthogonal range-view (RV) and bird's-eye-view (BEV) branches, which interact during the traversal of the frozen ViT backbone through our 2D-3D adapter. Finally, our two decoders independently obtain pointwise class labels from the respective feature maps.

BALViT is tailored to adapting pre-trained vision transformer (ViT) backbones for LiDAR semantic segmentation. We argue that vision transformer architectures enable amodal feature encoding, which we enhance with our label-efficient 2D-3D adapter. Specifically, we first encode the point cloud in two separate branches, a range-view (RV) encoder and a bird's-eye-view (BEV) encoder. Our spatial prior module converts the BEV features into multi-scale feature maps. We then add our 3D positional embedding to each branch, ensuring that the interactions between the feature maps account for the spatial geometry of the 3D scene. Next, the RV features are processed by a frozen ViT backbone. During this backbone traversal, our novel 2D-3D adapter enables bidirectional feature enhancement between the RV and BEV representations through parallel cross-attention modules. Finally, we decode each feature branch with a separate 3D decoder to combine the strengths of both views, effectively reducing misclassifications. We train our network using multi-class Focal (\(L_{f}\)) and Lovász-Softmax (\(L_{lov}\)) losses, applied separately to each decoder branch's pointwise predictions.
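The snippet below is a minimal PyTorch sketch of the bidirectional cross-attention exchange performed by the 2D-3D adapter. The module names, token counts, embedding dimension, and the single-block setup are illustrative assumptions for exposition and do not reproduce the released BALViT implementation.

# Minimal sketch of the 2D-3D adapter idea: two parallel cross-attention
# modules exchange information between range-view (RV) tokens produced by a
# frozen ViT block and bird's-eye-view (BEV) tokens. All names and sizes are
# illustrative assumptions, not the released BALViT code.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Queries from one view attend to keys/values from the other view."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        q = self.norm_q(queries)
        kv = self.norm_kv(context)
        out, _ = self.attn(q, kv, kv)
        return queries + out  # residual update of the query branch


class TwoDThreeDAdapter(nn.Module):
    """Bidirectional RV <-> BEV enhancement via parallel cross-attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.rv_from_bev = CrossAttentionBlock(dim)  # RV queries, BEV context
        self.bev_from_rv = CrossAttentionBlock(dim)  # BEV queries, RV context

    def forward(self, rv_tokens: torch.Tensor, bev_tokens: torch.Tensor):
        rv_updated = self.rv_from_bev(rv_tokens, bev_tokens)
        bev_updated = self.bev_from_rv(bev_tokens, rv_tokens)
        return rv_updated, bev_updated


if __name__ == "__main__":
    dim = 384
    rv = torch.randn(2, 196, dim)   # RV tokens after a frozen ViT block (example shape)
    bev = torch.randn(2, 256, dim)  # flattened multi-scale BEV tokens (example shape)
    adapter = TwoDThreeDAdapter(dim)
    rv_out, bev_out = adapter(rv, bev)
    print(rv_out.shape, bev_out.shape)

In the full network, such interactions occur at multiple points during the traversal of the frozen backbone, so only the adapter, embedding, and decoder parameters are trained.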

Code

A PyTorch implementation of this project can be found in our GitHub repository for academic use and is released under the GPLv3 license. For any commercial purpose, please contact the authors.

Publications

If you find our work useful, please consider citing our paper:

Julia Hindel, Rohit Mohan, Jelena Bratulić, Daniele Cattaneo, Thomas Brox, Abhinav Valada
Label-Efficient LiDAR Semantic Segmentation with 2D-3D Vision Transformer Adapters
arXiv preprint arXiv:2503.03299

(PDF (arXiv)) (BibTeX)

Authors

Julia Hindel
University of Freiburg

Rohit Mohan
University of Freiburg

Jelena Bratulić
University of Freiburg

Daniele Cattaneo
University of Freiburg

Thomas Brox
University of Freiburg

Abhinav Valada
University of Freiburg

Acknowledgment

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – SFB 1597 – 499552394.