We train a neural network to predict human gene expression levels from experimental data for rat cells. The network is trained on paired human/rat samples from the Open TG-GATEs database, where paired samples were treated with the same compound at the same dose. When evaluated on a test set of held-out compounds, the network successfully predicts human expression levels. For the majority of the test compounds, the list of differentially expressed genes determined from predicted expression levels agrees well with the list determined from actual human experimental data. Comment: 12 pages, 5 figures
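As a purely illustrative stand-in for this setup, a linear cross-species map can be fit to synthetic paired profiles by least squares. All data and dimensions below are made up, and the linear model is a baseline only; the paper uses a neural network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for paired expression data: rows are matched
# compound/dose conditions, columns are gene expression levels.
n_pairs, n_genes = 200, 50
rat = rng.normal(size=(n_pairs, n_genes))

# Hypothetical "human" response: an unknown linear cross-species map
# plus measurement noise.
true_map = rng.normal(scale=0.1, size=(n_genes, n_genes))
human = rat @ true_map + rng.normal(scale=0.01, size=(n_pairs, n_genes))

# Fit the cross-species map by least squares (linear baseline).
W, *_ = np.linalg.lstsq(rat, human, rcond=None)

# Predict human expression for held-out rat profiles.
rat_test = rng.normal(size=(10, n_genes))
pred = rat_test @ W
```

With enough paired conditions, even this baseline recovers the synthetic map closely, which is the sanity check a learned nonlinear model would also need to pass.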
Computer Science - Machine Learning; Statistics - Machine Learning
Many applications of machine learning in science and medicine, including molecular property and protein function prediction, can be cast as problems of predicting some properties of graphs, where having good graph representations is critical. However, two key challenges in these domains are (1) extreme scarcity of labeled data due to expensive lab experiments, and (2) the need to extrapolate to test graphs that are structurally different from those seen during training. In this paper, we explore pre-training to address both of these challenges. In particular, working with Graph Neural Networks (GNNs) for representation learning of graphs, we wish to obtain node representations that (1) capture similarity of nodes' network neighborhood structure, (2) can be composed to give accurate graph-level representations, and (3) capture domain knowledge. To achieve these goals, we propose a series of methods to pre-train GNNs at both the node level and the graph level, using both unlabeled data and labeled data from related auxiliary supervised tasks. We perform extensive evaluation on two applications, molecular property and protein function prediction. We observe that performing only graph-level supervised pre-training often yields marginal performance gains and can even worsen performance compared to non-pre-trained models. On the other hand, effectively combining both node- and graph-level pre-training techniques significantly improves generalization to out-of-distribution graphs, consistently outperforming non-pre-trained GNNs across 8 datasets in molecular property prediction (resp. 40 tasks in protein function prediction), with an average ROC-AUC improvement of 7.2% (resp. 11.7%).
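As a minimal, purely illustrative sketch of the composition idea in the abstract (node representations pooled into a graph-level representation), using one round of mean-neighbor message passing on a toy graph; the graph and features are made up:

```python
import numpy as np

# Toy graph: adjacency matrix and one-hot node features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)

# One round of mean-neighbor message passing: a minimal stand-in for
# a (pre-trained) GNN layer producing node representations.
deg = A.sum(axis=1, keepdims=True)
H = (A @ X) / deg

# Graph-level representation by mean pooling over nodes: the
# composition step the abstract refers to.
g = H.mean(axis=0)
```

Real GNN pre-training replaces this single averaging step with learned, stacked layers, but the node-then-graph composition has the same shape.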
Feinberg, Evan N., Sheridan, Robert, Joshi, Elizabeth, Pande, Vijay S., and Cheng, Alan C.
Computer Science - Machine Learning; Statistics - Machine Learning
The Absorption, Distribution, Metabolism, Elimination, and Toxicity (ADMET) properties of drug candidates are estimated to account for up to 50% of all clinical trial failures. Predicting ADMET properties has therefore been of great interest to the cheminformatics and medicinal chemistry communities in recent decades. Traditional cheminformatics approaches, whether the learner is a random forest or a deep neural network, leverage fixed fingerprint feature representations of molecules. In contrast, in this paper, we learn the features most relevant to each chemical task at hand by representing each molecule explicitly as a graph, where each node is an atom and each edge is a bond. By applying graph convolutions to this explicit molecular representation, we achieve, to our knowledge, unprecedented accuracy in prediction of ADMET properties. By challenging our methodology with rigorous cross-validation procedures and prospective analyses, we show that deep featurization better enables molecular predictors to not only interpolate but also extrapolate to new regions of chemical space. Comment: 41 pages
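A minimal sketch of the explicit molecular graph the abstract describes (each node an atom, each edge a bond), using a hypothetical ethanol heavy-atom skeleton, atomic number as the sole node feature, and one degree-normalized graph-convolution step; the encoding choices here are illustrative, not the paper's:

```python
import numpy as np

# Hypothetical molecule: ethanol heavy atoms (C-C-O).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

# Build the adjacency matrix from the bond list (undirected edges).
n = len(atoms)
A = np.zeros((n, n))
for i, j in bonds:
    A[i, j] = A[j, i] = 1.0

# One graph-convolution step: aggregate neighbor features into each
# atom's representation (self-loops added, degree-normalized).
X = np.array([[6.0], [6.0], [8.0]])  # atomic number as the only feature
A_hat = A + np.eye(n)
D_inv = np.diag(1.0 / A_hat.sum(axis=1))
H = D_inv @ A_hat @ X
```

Stacking such steps, with learned weights and richer atom/bond features, is what lets the featurization adapt to each ADMET task instead of relying on a fixed fingerprint.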
Physics - Chemical Physics; Physics - Computational Physics
Density functional theory (DFT) is one of the main methods in quantum chemistry, offering an attractive trade-off between the cost and accuracy of quantum chemical computations. The electron density plays a key role in DFT. In this work, we explore whether machine learning - more specifically, deep neural networks (DNNs) - can be trained to predict electron densities faster than DFT. First, we choose a practically efficient combination of a DFT functional and a basis set (PBE0/pcS-3) and use it to generate a database of DFT solutions for more than 133,000 organic molecules from the previously published QM9 database. Next, we train a DNN to predict electron densities and energies of such molecules. The only input to the DNN is an approximate electron density computed with a cheap quantum chemical method in a small basis set (HF/cc-pVDZ). We demonstrate that the DNN successfully learns differences in the electron densities arising both from electron correlation and from small-basis-set artifacts in the HF computations. All qualitative features in the density differences, including local minima on lone pairs, local maxima on nuclei, toroidal shapes around C-H and C-C bonds, and complex shapes around aromatic and cyclopropane rings and the CN group, are captured by the DNN. The accuracy of energy predictions by the DNN is ~1 kcal/mol, on par with other models reported in the literature, although those models do not predict the electron density. Computations with the DNN, including the HF step, take much less time than DFT computations (by a factor of ~20-30 for most QM9 molecules in the current version, with clear room for further improvement).
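The cheap-input, accurate-output pattern above is a form of delta learning: the model only has to learn the correction between the cheap and reference quantities. A toy sketch with synthetic arrays, where the "model" is a trivial per-grid-point average (everything here is made up and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: "cheap" densities (think small-basis HF) and
# "reference" densities (think DFT) sampled on a shared grid.
n_mols, n_grid = 100, 30
cheap = rng.random((n_mols, n_grid))

# Hypothetical systematic correction shared across molecules.
correction = np.sin(np.linspace(0.0, np.pi, n_grid))
ref = cheap + 0.1 * correction

# Delta learning: model only the difference ref - cheap, as the DNN
# in the paper learns correlation/basis-set corrections on top of HF.
delta = ref - cheap
learned = delta.mean(axis=0)  # trivial per-grid-point "model"

pred = cheap + learned
```

Because the synthetic correction is identical for every molecule, the trivial model recovers it exactly; a real DNN instead learns molecule-dependent corrections from the input density itself.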
We train a neural network to predict chemical toxicity based on gene expression data. The input to the network is a full expression profile collected either in vitro from cultured cells or in vivo from live animals. The output is a set of fine-grained predictions for the presence of a variety of pathological effects in treated animals. When trained on the Open TG-GATEs database, it produces good results, outperforming classical models trained on the same data. This is a promising approach for efficiently screening chemicals for toxic effects, and for more accurately evaluating drug candidates based on preclinical data. Comment: 12 pages, 2 figures, 4 tables
In this paper we introduce Curriculum GANs, a curriculum learning strategy for training Generative Adversarial Networks that increases the strength of the discriminator over the course of training, thereby making the learning task progressively more difficult for the generator. We demonstrate that this strategy is key to obtaining state-of-the-art results in image generation. We also show evidence that this strategy may be broadly applicable to improving GAN training in other data modalities.
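A minimal sketch of the curriculum idea, assuming a single scalar knob `lam` that scales discriminator strength and ramps up linearly over training; the schedule and names are illustrative, not the paper's actual mechanism:

```python
# Curriculum schedule: the discriminator's "strength" knob ramps up
# linearly, making the generator's task progressively harder.
def curriculum_weight(step, total_steps, lam_min=0.0, lam_max=1.0):
    frac = min(step / total_steps, 1.0)
    return lam_min + frac * (lam_max - lam_min)

# Sampled values across a hypothetical 100-step training run.
schedule = [curriculum_weight(s, 100) for s in range(0, 101, 25)]
```

In a full GAN training loop, each generator/discriminator update would read the current weight and use it to scale the discriminator's capacity or loss contribution.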
Sharma, Rishi, Farimani, Amir Barati, Gomes, Joe, Eastman, Peter, and Pande, Vijay
Statistics - Machine Learning; Computer Science - Machine Learning
In typical machine learning tasks and applications, it is necessary to obtain or create large labeled datasets in order to achieve high performance. Unfortunately, large labeled datasets are not always available and can be expensive to source, creating a bottleneck for more widely applicable machine learning. The paradigm of weak supervision offers an alternative that allows for integration of domain-specific knowledge by enforcing constraints, defined over the output space, that a correct solution to the learning problem must obey. In this work, we explore the application of this paradigm to 2-D physical systems governed by non-linear differential equations. We demonstrate that knowledge of the partial differential equations governing a system can be encoded into the loss function of a neural network via an appropriately chosen convolutional kernel. We demonstrate this by showing that the steady-state solution to the 2-D heat equation can be learned directly from initial conditions by a convolutional neural network, in the absence of labeled training data. We also extend recent work in the progressive growing of fully convolutional networks to achieve high accuracy (< 1.5% error) at multiple scales of the heat-flow problem, including at the very large scale (1024×1024). Finally, we demonstrate that this method can be used to speed up the exact finite-difference calculation of the solution to the differential equations.
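The convolutional-kernel encoding of the PDE can be sketched directly: applying the discrete 5-point Laplacian to a candidate solution yields a label-free residual loss for the steady-state 2-D heat equation. The implementation below uses array slicing instead of a learned network and is a sketch of the loss only:

```python
import numpy as np

# Discrete 5-point Laplacian: the convolutional kernel that encodes
# the steady-state 2-D heat equation (Laplacian of u equal to zero).
def laplacian(u):
    return (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
            - 4.0 * u[1:-1, 1:-1])

def pde_loss(u):
    # Label-free loss: penalize violation of the PDE at interior points.
    return float(np.mean(laplacian(u) ** 2))

# A linear temperature ramp satisfies the steady-state equation exactly,
# so its loss vanishes; a random field does not.
n = 16
ramp = np.tile(np.linspace(0.0, 1.0, n), (n, 1))
noisy = np.random.default_rng(0).random((n, n))
```

Training then amounts to minimizing `pde_loss` over a network's output field (with boundary conditions held fixed), with no labeled solutions required.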
Farimani, Amir Barati, Feinberg, Evan N., and Pande, Vijay S.
Quantitative Biology - Biomolecules; Quantitative Biology - Quantitative Methods
Many important analgesics relieve pain by binding to the $\mu$-Opioid Receptor ($\mu$OR), which makes the $\mu$OR among the most clinically relevant proteins of the G Protein Coupled Receptor (GPCR) family. Despite previous studies on the activation pathways of the GPCRs, the mechanism of opiate binding and the selectivity of the $\mu$OR are largely unknown. We performed extensive molecular dynamics (MD) simulation and analysis to find the selective allosteric binding sites of the $\mu$OR and the path opiates take to bind to the orthosteric site. In this study, we predicted that the allosteric site is responsible for the attraction and selection of opiates. Using Markov state models and machine learning, we traced the pathway of opiates in binding to the orthosteric site, the main binding pocket. Our results have important implications for designing novel analgesics. Comment: 25 pages, 8 figures
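The Markov-state-model step mentioned above can be illustrated in miniature: discretize a trajectory into states, count lag-1 transitions, and row-normalize to estimate a transition matrix. The state sequence below is hypothetical, and real MSMs involve many more states, longer lag times, and reversibility constraints:

```python
import numpy as np

# Hypothetical discretized trajectory: each entry is a conformational
# state index assigned to one simulation frame.
traj = [0, 0, 1, 1, 2, 2, 2, 1, 0, 0]
n_states = 3

# Count lag-1 transitions between states.
C = np.zeros((n_states, n_states))
for a, b in zip(traj[:-1], traj[1:]):
    C[a, b] += 1.0

# Row-normalize counts into transition probabilities.
T = C / C.sum(axis=1, keepdims=True)
```

The slow eigenvectors of such a transition matrix are what identify binding pathways and metastable states in the full analysis.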
Phase segregation, the process by which the components of a binary mixture spontaneously separate, is a key process in the evolution and design of many chemical, mechanical, and biological systems. In this work, we present a data-driven approach for the learning, modeling, and prediction of phase segregation. A direct mapping between an initially dispersed, immiscible binary fluid and the equilibrium concentration field is learned by conditional generative convolutional neural networks. Concentration field predictions by the deep learning model conserve phase fraction, correctly predict phase transition, and reproduce area, perimeter, and total free energy distributions with up to 98% accuracy. Comment: arXiv admin note: text overlap with arXiv:1709.02432
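The phase-fraction conservation reported above can be checked with a simple diagnostic: threshold the concentration field and compare the occupied fraction before and after prediction. The fields below are hypothetical stand-ins for a dispersed input and a segregated prediction:

```python
import numpy as np

# Fraction of grid cells occupied by one phase, after thresholding
# the concentration field.
def phase_fraction(field, threshold=0.5):
    return float((field > threshold).mean())

# Hypothetical 8x8 fields with identical composition: a dispersed
# checker-like mixture and a fully segregated state.
initial = np.tile([0.0, 1.0], 32).reshape(8, 8)
segregated = np.concatenate([np.zeros((4, 8)), np.ones((4, 8))])
```

A model that respects the physics should leave this quantity unchanged between its input and output fields, which is one of the metrics the abstract cites.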
Physics - Chemical Physics; Computer Science - Learning; Physics - Biological Physics; Statistics - Machine Learning
As deep Variational Auto-Encoder (VAE) frameworks become more widely used for modeling biomolecular simulation data, we emphasize the capability of the VAE architecture to concurrently maximize the timescale of the latent space while inferring a reduced coordinate, which assists in finding slow processes in accordance with the variational approach to conformational dynamics. We additionally provide evidence that the VDE framework (Hern\'andez et al., 2017), which pairs an autocorrelation loss with a time-lagged reconstruction loss, obtains a variationally optimized latent coordinate compared with related loss functions. We thus recommend leveraging the autocorrelation of the latent space when training neural network models of biomolecular simulation data to better represent slow processes.
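The latent autocorrelation that a VDE-style loss maximizes can be computed directly. A sketch on synthetic 1-D signals (names and data are illustrative; in training, this quantity is evaluated on the encoder's latent trajectory and maximized):

```python
import numpy as np

# Lag-time autocorrelation of a 1-D latent trajectory: the quantity
# that, when maximized, makes the latent coordinate track slow processes.
def latent_autocorr(z, lag):
    a, b = z[:-lag], z[lag:]
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).mean() / (a.std() * b.std()))

# A slowly varying signal has high lag-1 autocorrelation;
# white noise does not.
t = np.linspace(0.0, 2.0 * np.pi, 500)
slow = np.sin(t)
noise = np.random.default_rng(0).standard_normal(500)
```

Adding the negative of this quantity to a time-lagged reconstruction loss biases the learned coordinate toward the slowest dynamics in the data.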