Accelerating machine learning with training data management
- Alexander Jason Ratner.
- [Stanford, California] : [Stanford University], 2019.
- Physical description
- 1 online resource.
- One of the biggest bottlenecks in developing machine learning applications today is the need for large hand-labeled training datasets. Even at the world's most sophisticated technology companies, and especially at other organizations across science, medicine, industry, and government, the time and monetary cost of labeling and managing large training datasets is often the blocking factor in using machine learning. In this thesis, we describe work on training data management systems that enable users to programmatically build and manage training datasets, rather than labeling and managing them by hand, and present algorithms and supporting theory for automatically modeling this noisier process of training set specification in order to improve the resulting training set quality. We then describe extensive empirical results and real-world deployments demonstrating that programmatically building, managing, and modeling training sets in this way can lead to radically faster, more flexible, and more accessible ways of developing machine learning applications. We start by describing data programming, a paradigm for labeling training datasets programmatically rather than by hand, and Snorkel, an open source training data management system built around data programming that has been used by major technology companies, academic labs, and government agencies to build machine learning applications in days or weeks rather than months or years. In Snorkel, rather than hand-labeling training data, users write programmatic operators called labeling functions, which label data using various heuristic or weak supervision strategies such as pattern matching, distant supervision, and other models. These labeling functions can have noisy, conflicting, and correlated outputs, which Snorkel models and combines into clean training labels without requiring any ground truth using theoretically consistent modeling approaches we develop. 
We then report on extensive empirical validations, user studies, and real-world applications of Snorkel in industrial, scientific, medical, and other use cases ranging from knowledge base construction from text data to medical monitoring over image and video data. Next, we describe two other approaches for enabling users to programmatically build and manage training datasets, both currently integrated into the Snorkel open source framework: Snorkel MeTaL, an extension of data programming and Snorkel to the setting where users have multiple related classification tasks, in particular focusing on multi-task learning; and TANDA, a system for optimizing and managing strategies for data augmentation, a critical training dataset management technique wherein a labeled dataset is artificially expanded by transforming data points. Finally, we conclude by outlining future research directions for further accelerating and democratizing machine learning workflows, such as higher-level programmatic interfaces and massively multi-task frameworks.
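The labeling-function idea in the summary can be illustrated with a minimal Python sketch. It does not use the Snorkel API, and it does not implement the thesis's modeling approach: the noisy votes are combined here by a plain majority vote, a much simpler stand-in for the model Snorkel uses. The function names, the label constants, the example relation (drug causes side effect), and the tiny knowledge base are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical label constants for a binary task; ABSTAIN means "no opinion".
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_contains_causes(text):
    """Pattern matching: the word 'cause(s)' suggests a positive example."""
    return POSITIVE if re.search(r"\bcauses?\b", text, re.IGNORECASE) else ABSTAIN


def lf_contains_negation(text):
    """Pattern matching: a negation phrase suggests a negative example."""
    return NEGATIVE if re.search(r"\b(not|no evidence)\b", text, re.IGNORECASE) else ABSTAIN


def lf_known_pair(text):
    """Distant supervision: look up pairs in a tiny, made-up knowledge base."""
    known_pairs = {("aspirin", "bleeding")}
    lowered = text.lower()
    hit = any(a in lowered and b in lowered for a, b in known_pairs)
    return POSITIVE if hit else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_causes, lf_contains_negation, lf_known_pair]


def majority_vote(text):
    """Combine labeling-function outputs by majority vote.

    This is an assumption made for the sketch, not the thesis's method.
    Returns ABSTAIN when every function abstains or the vote is tied.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    top = Counter(votes).most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return ABSTAIN  # conflicting functions with no clear winner
    return top[0][0]
```

In this sketch, a sentence on which two functions disagree one-to-one is left unlabeled. A majority vote also treats every function as equally reliable and independent, whereas the summary describes modeling labeling functions whose outputs are noisy, conflicting, and correlated.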
- Submitted to the Computer Science Department.
- Thesis Ph.D. Stanford University 2019.