Research

Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition

We address a novel cross-domain few-shot learning (CD-FSL) task with multimodal input and unlabeled target data for egocentric action recognition. This paper simultaneously tackles two critical challenges associated with egocentric action recognition in CD-FSL settings: (1) the extreme domain gap in egocentric videos (e.g., daily life vs. industrial domains) and (2) the computational cost of real-world applications. We propose MM-CDFSL, a domain-adaptive and computationally efficient approach designed to enhance adaptability to the target domain and improve inference speed. To address the first challenge, we incorporate multimodal distillation into the student RGB model using teacher models, each trained independently on source and target data for its respective modality. Leveraging only unlabeled target data during multimodal distillation enhances the student model’s adaptability to the target domain. To address the second challenge, we introduce ensemble masked inference, a technique that reduces the number of input tokens through masking, with ensemble prediction mitigating the performance degradation caused by masking. Our approach outperformed the state-of-the-art CD-FSL approaches by a substantial margin on multiple egocentric datasets, improving by an average of 6.12/6.10 points in the 1-shot/5-shot settings while achieving 2.2 times faster inference.
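
The ensemble masked inference idea can be illustrated with a short sketch. This is a minimal illustration, assuming a token-based video recognition model; the `model` interface, the 50% keep ratio, and the number of views are hypothetical stand-ins, not the paper's exact configuration.

```python
import torch

@torch.no_grad()
def ensemble_masked_inference(model, tokens, keep_ratio=0.5, num_views=2):
    """Sketch of ensemble masked inference.

    tokens: (B, N, D) patch/token embeddings of an RGB clip. Each view
    keeps a random subset of tokens, so the encoder processes fewer
    tokens per forward pass; averaging the logits over several
    complementary views compensates for the information lost to masking.
    """
    B, N, D = tokens.shape
    n_keep = int(N * keep_ratio)
    logits = []
    for _ in range(num_views):
        # Keep a random subset of token indices for this view.
        idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
        subset = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
        logits.append(model(subset))            # (B, num_classes)
    return torch.stack(logits).mean(dim=0)      # ensemble prediction
```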

ECCV 2024

Masashi Hatano, Ryo Hachiuma, Ryo Fujii, Hideo Saito

[Paper] [Code] [Project Page] [YouTube]

Event-based Blur Kernel Estimation For Blind Motion Deblurring

Overview of the proposed method.

This method uses an event camera, which captures high-temporal-resolution data on pixel luminance changes, along with a conventional camera that captures the blurred input image. By analyzing the event data stream, the proposed method estimates the 2D motion of the blurred image at short intervals during the exposure time and integrates this information to estimate a variety of complex blur motions. With the estimated blur kernel, the blurred input image can be deblurred using deconvolution.
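
As a toy sketch of the final non-blind step, the snippet below rasterizes a sequence of short-interval 2D motions into a blur kernel and deblurs with Wiener deconvolution from scikit-image. It assumes the per-interval motions have already been estimated from the event stream; the kernel size and regularization weight are illustrative.

```python
import numpy as np
from skimage import restoration

def kernel_from_trajectory(displacements, size=31):
    """Rasterize a piecewise 2D motion trajectory into a blur kernel (PSF).

    displacements: list of (dx, dy) motions, one per short sub-interval
    of the exposure (assumed given; the paper derives them from events).
    """
    psf = np.zeros((size, size))
    x, y = size // 2, size // 2
    psf[y, x] = 1.0
    for dx, dy in displacements:
        x, y = int(round(x + dx)), int(round(y + dy))
        if 0 <= x < size and 0 <= y < size:
            psf[y, x] += 1.0
    return psf / psf.sum()

# Non-blind deblurring once the kernel is known (grayscale image in [0, 1]):
# psf = kernel_from_trajectory(estimated_motions)
# sharp = restoration.wiener(blurred, psf, balance=0.1)
```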

CVPR 2023

Takuya Nakabayashi, Kunihiro Hasegawa, Masakazu Matsugu, Hideo Saito

[Paper]

3D Shape Estimation of Wires From 3D X-Ray CT Images of Electrical Cables

Overall flow of 3D shape estimation of wires in a cable.

In this study, we propose a new method to associate wires using particle tracking techniques. The proposed method performs wire tracking in two steps: linking detected wire positions between adjacent frames, and connecting segmented wires across frames.
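
The first step, linking detections between adjacent slices, can be posed as a standard assignment problem from particle tracking. The sketch below is illustrative: the Euclidean cost and the gating threshold are assumptions, not the paper's exact formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_adjacent_slices(pts_a, pts_b, max_dist=5.0):
    """Link wire detections between two adjacent CT slices.

    pts_a, pts_b: (Na, 2) and (Nb, 2) arrays of detected wire centers.
    Solves a minimum-cost assignment on pairwise distances and rejects
    links whose distance exceeds the gating threshold.
    """
    cost = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_dist]
```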

QCAV 2023

Shiori Ueda, Kanon Sato, Hideo Saito, Yutaka Hoshina

[Paper]

Incremental Class Discovery for Semantic Segmentation with RGBD Sensing

Proposed method incrementally discovers new classes (e.g., pictures) in the reconstructed 3D map.

This work addresses the task of open-world semantic segmentation using RGBD sensing to discover new semantic classes over time. Although there are many types of objects in the real world, current semantic segmentation methods make a closed-world assumption and are trained to segment only a limited number of object classes. Toward a more open-world approach, we propose a novel method that incrementally learns new classes for image segmentation.
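
In spirit, the discovery loop resembles the toy sketch below: features of regions that match no known class prototype are buffered, and a sufficiently large group is promoted to a new class. The similarity threshold, buffer size, and prototype update are hypothetical simplifications of the paper's clustering pipeline.

```python
import numpy as np

class IncrementalClassDiscovery:
    """Toy sketch: promote recurring unknown-region features to new classes."""

    def __init__(self, sim_thresh=0.7, min_count=20):
        self.prototypes = []   # one normalized feature prototype per class
        self.unknown = []      # buffered features of unmatched regions
        self.sim_thresh, self.min_count = sim_thresh, min_count

    def observe(self, feat):
        feat = feat / np.linalg.norm(feat)
        sims = [p @ feat for p in self.prototypes]
        if sims and max(sims) >= self.sim_thresh:
            return int(np.argmax(sims))            # matched an existing class
        self.unknown.append(feat)
        if len(self.unknown) >= self.min_count:    # enough evidence: new class
            proto = np.mean(self.unknown, axis=0)
            self.prototypes.append(proto / np.linalg.norm(proto))
            self.unknown.clear()
            return len(self.prototypes) - 1
        return -1                                  # still unknown
```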

ICCV 2019

Yoshikatsu Nakajima, Byeongkeun Kang, Hideo Saito, Kris Kitani

[YouTube] [Paper]

DetectFusion: Detecting and Segmenting Both Known and Unknown Dynamic Objects in Real-time SLAM

Reconstructed SLAM object maps.

We present DetectFusion, an RGB-D SLAM system that runs in real time and can robustly handle semantically known and unknown objects that move dynamically in the scene. Our system detects, segments, and assigns semantic class labels to known objects in the scene, while tracking and reconstructing them even when they move independently in front of the camera.
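
A simplified view of the per-frame object association is sketched below: each newly segmented mask is matched to a tracked object by mask overlap, and unmatched masks spawn new instances. The IoU criterion and threshold are illustrative; the actual system also exploits geometry and semantic labels.

```python
import numpy as np

def associate_masks(object_masks, frame_masks, iou_thresh=0.3):
    """Match per-frame segmentation masks to tracked object instances.

    object_masks: list of boolean masks, one per tracked object (mutated
    in place when a new object is registered).
    frame_masks: boolean masks segmented in the current frame.
    Returns (frame_mask_index, object_index) pairs.
    """
    def iou(a, b):
        union = np.logical_or(a, b).sum()
        return np.logical_and(a, b).sum() / union if union else 0.0

    assignments = []
    for j, m in enumerate(frame_masks):
        ious = [iou(m, o) for o in object_masks]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thresh:
            assignments.append((j, best))              # continue tracking
        else:
            object_masks.append(m)                     # register new object
            assignments.append((j, len(object_masks) - 1))
    return assignments
```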

BMVC 2019

Ryo Hachiuma, Christian Pirchheim, Dieter Schmalstieg, Hideo Saito

[YouTube] [Paper]

EventNet: Asynchronous Recursive Event Processing

Overview of the asynchronous event-based pipeline of EventNet (inference), in contrast to the conventional frame-based pipeline of a CNN.

Event cameras are bio-inspired vision sensors that mimic the retina, asynchronously reporting per-pixel intensity changes rather than outputting an actual intensity image at regular intervals. This new sensing paradigm offers significant potential advantages, namely a sparse and nonredundant data representation. However, most existing artificial neural network architectures, such as CNNs, require dense, synchronous input data and therefore cannot exploit the sparseness of the data. We propose EventNet, a neural network designed for real-time processing of asynchronous event streams in a recursive and event-wise manner.
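
The event-wise recursion can be caricatured in a few lines: instead of rebuilding a dense frame for every inference, a global feature state is updated as each event arrives. The per-event feature and decay factor below are illustrative stand-ins for EventNet's actual temporal coding.

```python
import numpy as np

def eventnet_style_update(state, event, decay=0.98):
    """Recursively fold one event (x, y, t, polarity) into a global state."""
    x, y, t, p = event
    feat = np.tanh(np.array([x, y, p], dtype=np.float32))  # toy per-event feature
    return decay * state + (1.0 - decay) * feat            # recursive aggregation

# Events are processed one by one as they arrive, with O(1) work per event.
events = [(12, 40, 0.001, 1), (13, 40, 0.002, -1)]  # (x, y, t, polarity)
state = np.zeros(3, dtype=np.float32)
for ev in events:
    state = eventnet_style_update(state, ev)
```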

CVPR 2019

Yusuke Sekikawa, Kosuke Hara, Hideo Saito

[YouTube] [Paper]

Fast and Accurate Semantic Mapping through Geometric-based Incremental Segmentation

Flow of the proposed framework.

We propose an efficient and scalable method for incrementally building a dense, semantically annotated 3D map in real time. The proposed method assigns class probabilities to each region, rather than each element (e.g., surfel or voxel), of the 3D map, which is built through a robust SLAM framework and incrementally segmented with a geometry-based segmentation method.
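
The region-level fusion can be illustrated with a minimal sketch: each geometric region carries one class distribution that is updated whenever a new frame prediction covers it. The multiplicative update below is a common Bayesian-style choice and an assumption here, not necessarily the paper's exact rule.

```python
import numpy as np

def update_region_probs(region_probs, frame_probs, eps=1e-12):
    """Fuse a per-frame class prediction into a region's class distribution.

    Keeping one distribution per segmented region, rather than per
    surfel/voxel, keeps the map's memory and update cost low.
    """
    fused = region_probs * frame_probs + eps
    return fused / fused.sum()          # renormalize to a distribution
```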

IROS 2018

Yoshikatsu Nakajima, Keisuke Tateno, Federico Tombari, Hideo Saito

[YouTube] [Paper] [Award]

Constant Velocity 3D Convolution

Decomposed spatiotemporal 3D convolution.

We propose a novel 3-D convolution method, cv3dconv, for extracting spatiotemporal features from videos. By assuming that features move with constant velocity, it reduces the number of sum-of-products operations in 3-D convolution by thousands of times.
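
The savings can be sketched as follows: if a feature moves with a known constant velocity, a spatiotemporal convolution can be approximated by one shared 2D convolution per frame followed by velocity-compensated shifts and a temporal sum. The interpolation and the exact decomposition are simplified here and differ from the paper.

```python
import numpy as np
from scipy.ndimage import convolve, shift

def cv3dconv_sketch(video, k2d, velocity):
    """Approximate a 3D convolution under a constant-velocity assumption.

    video: (T, H, W) array; k2d: shared 2D kernel; velocity: (vx, vy)
    displacement of the feature per frame.
    """
    vx, vy = velocity
    responses = [convolve(frame, k2d) for frame in video]  # shared 2D conv
    aligned = [shift(r, (-t * vy, -t * vx), order=1)       # undo the motion
               for t, r in enumerate(responses)]           # to align with frame 0
    return np.sum(aligned, axis=0)                         # temporal accumulation
```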

3DV 2018

Yusuke Sekikawa, Kohta Ishikawa, Kosuke Hara, Yuuichi Yoshida, Koichiro Suzuki, Ikuro Sato, Hideo Saito