Training and deploying visual agents at scale
- Responsibility
- Linxi Fan
- Publication
- [Stanford, California] : [Stanford University], 2021
- Copyright notice
- ©2021
- Physical description
- 1 online resource
Description
Creators/Contributors
- Author/Creator
- Fan, Linxi, author.
- Contributor
- Li, Fei-Fei, 1976- degree supervisor.
- Niebles Duque, Juan Carlos, 1980- degree committee member.
- Wu, Jiajun (Computer scientist), degree committee member.
- Stanford University. Computer Science Department.
Contents/Summary
- Summary
- Autonomous agents that perceive and interact with the world, such as home robots and self-driving vehicles, hold great promise for a future that automates mundane tasks and improves living standards for billions of people. However, two major obstacles stand in the way of this grand goal. First, modern AI systems require huge amounts of data to learn meaningful behaviors, yet training them directly on physical robots does not scale due to high cost and low efficiency. Second, mobile robot platforms typically have limited onboard computing resources yet demand low reaction latency, which hinders the mass deployment of large-capacity visual models. In this dissertation, we explore an effective recipe for developing algorithms and systems that train and deploy visual agents at scale. The key idea is to train the agents in rich simulation, then overcome the sim-to-real gap, and finally deploy them efficiently on edge devices with lightweight video processing architectures. This dissertation is organized around four primary components in the pipeline. First, we propose an open-source distributed framework that provides a full-stack solution to significantly accelerate reinforcement learning (RL) for complex robotics tasks. Second, we construct an ecologically valid and visually realistic simulator for home robotic tasks. Third, we introduce a novel policy learning method that achieves zero-shot generalization to unseen visual environments with large distributional shifts, which facilitates sim-to-real transfer. Finally, we design a new family of video learning architectures that enables deep video understanding for visual agents on resource-constrained devices. We hope that the techniques and ideas presented in this dissertation will bring us one step closer to a future where intelligent robots are as ubiquitous as smartphones in our lives.
Bibliographic information
- Publication date
- 2021
- Copyright date
- 2021
- Note
- Submitted to the Computer Science Department.
- Note
- Thesis (Ph.D.)--Stanford University, 2021.