Thesis title: Unified Multimodal 3D Reconstructions
Taking photos with smartphones is now a common habit. However, these 2D photos capture only part of a scene. 3D reconstruction, on the other hand, provides a comprehensive, multi-faceted view of scenes, enriching experiences from personal memories to professional fields like urban planning and archaeology. This technique is fundamental in augmented reality and robotic navigation, where a deep grasp of the world’s 3D structure is essential.
Current advancements, while significant, have yet to realize an “ideal” 3D reconstruction system. Such a system would consistently capture nearly all visible surfaces, be extremely robust in camera tracking, and deliver intricate reconstructions rapidly. It would scale seamlessly from small to large environments without losing accuracy and perform well across diverse environmental and lighting conditions.
Beyond classic cameras, LiDAR has gained significant attention over the past decade as another tool to perceive and reconstruct our surroundings. While cameras provide visually rich data, depth is hard to estimate from them, especially in low-light conditions. In contrast, LiDARs, which rely on their own emitted light, excel in various lighting scenarios and larger environments, but their measurements are sparse and lack color. Hence, an “ideal” 3D reconstruction pipeline should rely on both sensors, so that each mitigates the limitations of the other.
This research exploits similarities between the two sensors, underscoring the value of uniformly integrating data from both to achieve a more comprehensive understanding of the environment, with accurate results and without long waiting times.
However, integrating camera and LiDAR data is challenging because the two kinds of data differ in nature, which necessitates precise calibration, synchronization, and complex multimodal processing pipelines. Technological advances have made LiDAR-generated images increasingly similar to those from passive sensors, opening avenues for visual place recognition. Our exploration in this domain yielded promising results, particularly highlighting LiDAR’s consistent performance under diverse lighting conditions.
Our research then turned to bridging the gap between LiDAR and RGB-D sensors. By devising a Simultaneous Localization and Mapping (SLAM) pipeline adaptable to both sensors and rooted in photometric alignment, we obtained results comparable to those of specialized systems. Turning to Bundle Adjustment, we found our generalized strategy to be remarkably efficient, especially when merging data from both sensors. A further refinement incorporated geometric information, balancing robustness with precision and achieving high accuracy across varied environments.
In addition, we introduce a robotics perception dataset from Rome, encompassing RGB, dense depth, 3D LiDAR point clouds, IMU, and GPS data. Given the limitations of current datasets and the proficiency of contemporary SLAM and 3D reconstruction methods, our dataset offers a fresh challenge that pushes algorithms to their limits. We emphasize precise calibration and synchronization, capturing varied settings from indoor spaces to highways using modern equipment. Collected both manually and with vehicles, it is tailored for a range of robotic uses.
In essence, this thesis encapsulates the pursuit of enhancing SLAM and 3D reconstructions through multimodality. By harnessing the capabilities of diverse depth sensors, we have made significant progress in the domain, paving the way for more integrated, compact, robust and detailed systems in the future.