2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10425-10431, Oct. 2022
2022 16th IEEE International Conference on Signal Processing (ICSP), vol. 1, pp. 245-249, Oct. 2022
2022 16th IEEE International Conference on Signal Processing (ICSP), vol. 1, pp. 101-105, Oct. 2022
Gao, Huameng, Xue, Yong, Sun, Jin, Wu, Hongpu, Dong, Yin, and Gao, Kai
2022 International Conference on Data Analytics, Computing and Artificial Intelligence (ICDACAI), pp. 453-455, Aug. 2022
Gao, Kai, Lau, Darren, Huang, Baichuan, Bekris, Kostas E., and Yu, Jingjin
2022 IEEE International Conference on Robotics and Automation (ICRA), pp. 1961-1967, May 2022
Vieira, Ewerton R., Nakhimovich, Daniel, Gao, Kai, Wang, Rui, Yu, Jingjin, and Bekris, Kostas E.
2022 IEEE International Conference on Robotics and Automation (ICRA), pp. 1918-1924, May 2022
Gao, Kai, Gao, Huameng, Sun, Jin, and Zhang, Jiawen
2022 3rd International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), pp. 244-247, May 2022
2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC), pp. 602-613, May 2022
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 86-98, May 2022
Yang, Kaicheng, Zhang, Ruxuan, Xu, Hua, and Gao, Kai
Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Sound, and Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract
Inter-modal interaction plays an indispensable role in multimodal sentiment analysis. Because sequences from different modalities are usually unaligned, integrating the relevant information of each modality to learn fusion representations has been one of the central challenges in multimodal learning. In this paper, a Self-Adjusting Fusion Representation Learning Model (SA-FRLM) is proposed to learn robust crossmodal fusion representations directly from unaligned text and audio sequences. Unlike previous works, our model not only makes full use of the interaction between different modalities but also maximally preserves the unimodal characteristics. Specifically, we first employ a crossmodal alignment module to project the features of different modalities to the same dimension. Crossmodal collaboration attention is then adopted to model the inter-modal interaction between the text and audio sequences and to initialize the fusion representations. After that, as the core unit of the SA-FRLM, a crossmodal adjustment transformer is proposed to preserve the original unimodal characteristics by dynamically adjusting the fusion representations using the single-modal streams. We evaluate our approach on the public multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results show that our model significantly improves performance on all metrics for unaligned text-audio sequences. Comment: 8 pages
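As a loose illustration of the kind of architecture the abstract describes, the sketch below projects text and audio features to a shared dimension, fuses them with cross-attention, and refines the fusion with a transformer layer that keeps a residual unimodal path. The layer choices, feature dimensions, and module names are assumptions for illustration only, not the authors' SA-FRLM implementation.

```python
# Illustrative sketch only: a generic text-audio cross-attention fusion,
# not the authors' SA-FRLM code. Dimensions and layer choices are assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=74, d_model=128, n_heads=4):
        super().__init__()
        # "Crossmodal alignment": project both modalities to a shared dimension.
        self.text_proj = nn.Linear(text_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # Cross-attention initializes the fusion from the text-audio interaction.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # A transformer layer then refines the fusion together with a unimodal residual.
        self.adjust = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, text, audio):
        t = self.text_proj(text)    # (B, T_text, d_model)
        a = self.audio_proj(audio)  # (B, T_audio, d_model); lengths may differ (unaligned)
        fused, _ = self.cross_attn(query=t, key=a, value=a)
        return self.adjust(fused + t)  # keep a residual unimodal path

fusion = CrossModalFusion()
out = fusion(torch.randn(2, 50, 768), torch.randn(2, 400, 74))
print(out.shape)  # torch.Size([2, 50, 128])
```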
Vieira, Ewerton R., Gao, Kai, Nakhimovich, Daniel, Bekris, Kostas E., and Yu, Jingjin
Subjects
Computer Science - Robotics
Abstract
Performing object retrieval tasks in messy real-world workspaces involves the challenges of uncertainty and clutter. One option is to solve retrieval problems via a sequence of prehensile pick-n-place operations, which can be computationally expensive to plan in highly cluttered scenarios and inefficient to execute. The proposed framework instead performs non-prehensile actions, such as pushing, to clear a cluttered workspace so that a robotic arm can retrieve a target object. Non-prehensile actions allow the robot to interact with multiple objects simultaneously, which can speed up execution. At the same time, they can significantly increase uncertainty, since it is not easy to accurately estimate the outcome of a pushing operation in clutter. The proposed framework integrates topological tools and Monte Carlo tree search (MCTS) to achieve effective and robust pushing for object retrieval problems. In particular, it uses persistent homology to automatically identify manageable clusters of blocking objects in the workspace without the need for manually adjusted hyper-parameters. MCTS then uses this information to explore feasible actions that push groups of objects together, aiming to minimize the number of pushing actions needed to clear the path to the target object. Real-world experiments using a Baxter robot, which involve some actuation noise, show that the proposed framework achieves a higher success rate in solving retrieval tasks in dense clutter than state-of-the-art alternatives. Moreover, it produces high-quality solutions with a small number of pushing actions, improving the overall execution time. More critically, it is robust enough that the sequence of actions can be planned offline and then executed reliably online with Baxter.
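The abstract does not specify how persistent homology is applied. One plausible reading is 0-dimensional persistence of the blocking objects' centers, which for a point cloud reduces to single-linkage merge distances and can be cut at the largest persistence gap instead of a hand-tuned distance threshold. The sketch below illustrates that reading only; it is not the paper's implementation, and the helper name and example coordinates are made up.

```python
# Hedged sketch: one way to cluster blocking objects "without manual hyper-parameters"
# via 0-dimensional persistence, which for a point cloud coincides with
# single-linkage merge distances. Illustration only, not the paper's code.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def persistent_clusters(centers):
    """centers: (n, 2) array of object positions in the workspace."""
    Z = linkage(pdist(centers), method="single")
    merge_d = Z[:, 2]                      # death times of 0-dim classes
    gaps = np.diff(merge_d)                # persistence gaps between successive merges
    if len(gaps) == 0:
        return np.ones(len(centers), dtype=int)
    cut = merge_d[np.argmax(gaps)] + 1e-9  # cut just above the largest gap
    return fcluster(Z, t=cut, criterion="distance")

centers = np.array([[0.0, 0.0], [0.05, 0.02], [0.5, 0.5], [0.52, 0.48]])
print(persistent_clusters(centers))  # e.g. [1 1 2 2]
```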
We investigate the utility of employing multiple buffers in solving a class of rearrangement problems with pick-n-swap manipulation primitives. In this problem, objects stored randomly in a lattice are to be sorted using a robot arm with k >= 1 swap spaces or buffers, capable of holding up to k objects on its end-effector simultaneously. On the structural side, we show that each additional buffer brings diminishing returns in saving end-effector travel distance while keeping the total number of pick-n-swap operations at the minimum. This is due to an interesting recursive cycle structure in random m-permutations, rigorously proven, in which the largest cycle covers over 60% of the objects. On the algorithmic side, we propose fast algorithms for 1D and 2D lattice rearrangement problems that can effectively use multiple buffers to boost solution optimality. Numerical experiments demonstrate the efficiency and scalability of our methods, and confirm the diminishing-return structure as more buffers are employed. Comment: Submitted to the 2023 IEEE International Conference on Robotics and Automation (ICRA 2023)
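The structural claim that the largest cycle of a random permutation covers over 60% of the objects is easy to check empirically: the expected fraction of elements in the largest cycle converges to the Golomb-Dickman constant, roughly 0.62. The short simulation below (not the paper's algorithm) estimates that fraction.

```python
# Quick empirical check (not the paper's algorithm): the expected fraction of
# elements in the largest cycle of a uniformly random permutation approaches
# the Golomb-Dickman constant (~0.624), consistent with the "over 60%" claim.
import random

def largest_cycle_fraction(n):
    perm = list(range(n))
    random.shuffle(perm)
    seen, largest = [False] * n, 0
    for start in range(n):
        if not seen[start]:
            length, i = 0, start
            while not seen[i]:          # walk the cycle containing `start`
                seen[i] = True
                i = perm[i]
                length += 1
            largest = max(largest, length)
    return largest / n

trials = [largest_cycle_fraction(1000) for _ in range(2000)]
print(sum(trials) / len(trials))  # ~0.62 on average
```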
Computer Science - Multimedia, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, and Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract
Multimodal sentiment analysis (MSA), which aims to improve text-based sentiment analysis with associated acoustic and visual modalities, is an emerging research area due to its potential applications in Human-Computer Interaction (HCI). However, existing research observes that the acoustic and visual modalities contribute much less than the textual modality, a phenomenon termed text-predominance. Under such circumstances, in this work we emphasize making non-verbal cues matter for the MSA task. Firstly, from the resource perspective, we present the CH-SIMS v2.0 dataset, an extension and enhancement of CH-SIMS. Compared with the original dataset, CH-SIMS v2.0 doubles its size with another 2121 refined video segments with both unimodal and multimodal annotations, and collects 10161 unlabelled raw video segments with rich acoustic and visual emotion-bearing context to highlight non-verbal cues for sentiment prediction. Secondly, from the model perspective, benefiting from the unimodal annotations and the unsupervised data in CH-SIMS v2.0, the Acoustic Visual Mixup Consistent (AV-MC) framework is proposed. The designed modality mixup module can be regarded as an augmentation that mixes the acoustic and visual modalities from different videos. By drawing unobserved multimodal contexts along with the text, the model learns to be aware of different non-verbal contexts for sentiment prediction. Our evaluations demonstrate that both CH-SIMS v2.0 and the AV-MC framework enable further research into discovering emotion-bearing acoustic and visual cues and pave the way to interpretable end-to-end HCI applications for real-world scenarios. Comment: 16 pages, 7 figures, accepted by ICMI 2022
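As a rough illustration of a modality-level mixup augmentation like the one the abstract describes, the sketch below mixes acoustic and visual feature vectors across examples in a batch using a Beta-distributed coefficient, leaving the paired text untouched. The feature shapes, the Beta(0.2, 0.2) prior, and the function name are assumptions; this is not the authors' AV-MC implementation.

```python
# Illustrative sketch of a modality-level mixup augmentation, not the authors'
# AV-MC code. Feature shapes and the Beta(0.2, 0.2) prior are assumptions.
import torch

def modality_mixup(acoustic, visual, alpha=0.2):
    """Mix acoustic and visual features across examples in a batch,
    leaving the paired text features untouched."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(acoustic.size(0))   # pair each sample with another video
    mixed_a = lam * acoustic + (1 - lam) * acoustic[perm]
    mixed_v = lam * visual + (1 - lam) * visual[perm]
    return mixed_a, mixed_v, lam, perm        # lam/perm also weight any label mixing

a, v = torch.randn(8, 74), torch.randn(8, 35)
mixed_a, mixed_v, lam, perm = modality_mixup(a, v)
print(mixed_a.shape, mixed_v.shape, float(lam))
```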