- Zaharia, Matei, author.
- First edition. - [New York] : Association for Computing Machinery ; [San Rafael, California] : Morgan & Claypool, 2016.
- Description
- Book — 1 PDF (xii, 128 pages) : illustrations.
- Summary
-
- Preface
- 1. Introduction
- 2. Resilient Distributed Datasets
- 3. Models Built over RDDs
- 4. Discretized Streams
- 5. Generality of RDDs
- 6. Conclusion
- References
- Author's Biography.
- (source: Nielsen Book Data)
- Chambers, Bill (William Andrew), author.
- First edition. - Sebastopol, CA : O'Reilly Media, [2018]
- Description
- Book — 1 online resource (xxvi, 576 pages) : illustrations
- Summary
-
- Part 1. Gentle overview of big data and Spark. What is Apache Spark?
- A gentle introduction to Spark
- A tour of Spark's toolset
- Part 2. Structured APIs : DataFrames, SQL, and datasets. Structured API overview
- Basic structured operations
- Working with different types of data
- Aggregations
- Joins
- Data sources
- Spark SQL
- Datasets
- Part 3. Low-level APIs. Resilient distributed datasets (RDDs)
- Advanced RDDs
- Distributed shared variables
- Part 4. Production applications. How Spark runs on a cluster
- Developing Spark applications
- Deploying Spark
- Monitoring and debugging
- Performance tuning
- Part 5. Streaming. Stream processing fundamentals
- Structured streaming basics
- Event-time and stateful processing
- Structured streaming in production
- Part 6. Advanced analytics and machine learning. Advanced analytics and machine learning overview
- Preprocessing and feature engineering
- Classification
- Regression
- Recommendation
- Unsupervised learning
- Graph analytics
- Deep learning
- Part 7. Ecosystem. Language specifics : Python (PySpark) and R (SparkR and sparklyr)
- Ecosystem and community.
Online 3. Statistical learning under resource constraints [2021]
- Tai, Kai Sheng, author.
- [Stanford, California] : [Stanford University], 2021
- Description
- Book — 1 online resource
- Summary
-
Statistical learning algorithms are an increasingly prevalent component of modern software systems. As such, the design of learning algorithms themselves must take into account the constraints imposed by resource-constrained applications. This dissertation explores resource-constrained learning from two distinct perspectives: learning with limited memory and learning with limited labeled data. In Part I, we consider the challenge of learning with limited memory, a constraint that frequently arises in the context of learning on mobile or embedded devices. First, we describe a randomized sketching algorithm that learns a linear classifier in a compressed, space-efficient form---i.e., by storing far fewer parameters than the dimension of the input features. Unlike typical feature hashing approaches, our method allows for the efficient recovery of the largest magnitude weights in the learned classifier, thus facilitating model interpretation and enabling several memory-efficient stream processing applications. Next, we shift our focus to unsupervised learning, where we study low-rank matrix and tensor factorization on compressed data. In this setting, we establish conditions under which a factorization computed on compressed data can be used to provably recover factors in the original, high-dimensional space. In Part II, we study the statistical constraint of learning with limited labeled data. We first present Equivariant Transformer layers, a family of differentiable image-to-image mappings that improve sample efficiency by directly incorporating prior knowledge on transformation invariances into their architecture. We then discuss a self-training algorithm for semi-supervised learning, where a small number of labeled examples is supplemented by a large collection of unlabeled data. Our method reinterprets the semi-supervised label assignment process as an optimal transportation problem between examples and classes, the solution to which can be efficiently approximated via Sinkhorn iteration. This formulation subsumes several commonly used label assignment heuristics within a single principled optimization framework
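The self-training approach summarized above casts label assignment as an optimal transport problem solved approximately with Sinkhorn iteration. The sketch below is illustrative only (toy sizes, assumed function and variable names, not the dissertation's code):

```python
# Illustrative sketch (not the dissertation's code): Sinkhorn iteration that
# assigns unlabeled examples to classes by approximately solving an
# entropic-regularized optimal transport problem between examples and classes.
import numpy as np

def sinkhorn_label_assignment(scores, class_marginals, eps=0.05, n_iters=50):
    """scores: (n_examples, n_classes) model scores or log-probabilities.
    class_marginals: desired fraction of examples per class (sums to 1)."""
    n, k = scores.shape
    K = np.exp(scores / eps)                 # kernel; transport cost = -score
    u = np.ones(n) / n
    v = np.ones(k) / k
    row_marginals = np.ones(n) / n           # each example carries equal mass
    col_marginals = np.asarray(class_marginals, dtype=float)
    for _ in range(n_iters):
        u = row_marginals / (K @ v)          # scale rows to match example mass
        v = col_marginals / (K.T @ u)        # scale columns to match class mass
    plan = np.diag(u) @ K @ np.diag(v)       # approximate transport plan
    return plan / plan.sum(axis=1, keepdims=True)  # soft pseudo-labels per row

# Toy usage: 6 unlabeled examples, 3 classes assumed equally likely.
rng = np.random.default_rng(0)
soft_labels = sinkhorn_label_assignment(rng.normal(size=(6, 3)), [1/3, 1/3, 1/3])
print(soft_labels.round(2))
```

Each row of the returned matrix is a soft pseudo-label, and the column totals approximately match the prescribed class proportions.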
- Also online at
-
- Rong, Kexin, author.
- [Stanford, California] : [Stanford University], 2021
- Description
- Book — 1 online resource
- Summary
-
Network telemetry, sensor readings, and other machine-generated data are growing exponentially in volume. Meanwhile, the computational resources available for processing this data -- as well as analysts' ability to manually inspect it -- remain limited. As the gap continues to widen, keeping up with the data volumes is challenging for analytic systems and analysts alike. This dissertation introduces systems and algorithms that focus the limited computational resources and analysts' time in modern data analytics on a subset of relevant data. The dissertation comprises two parts that focus on improving the computational and human efficiency in data analytics, respectively. In the first part of this dissertation, we improve the computational efficiency of analytics by combining precomputation and sampling techniques to select a subset of data that contributes the most to query results. We demonstrate this concept with two approximate query processing systems. PS3 approximates aggregate SQL queries with weighted, partition-level samples based on precomputed summary statistics, whereas HBE approximates kernel density estimates using precomputed hash indexes as smart data samplers. Our evaluation shows that both systems outperform uniform sampling, the best-known result for these queries, with practical precomputation overheads. PS3 enables a 3 to 70x speedup under the same accuracy as uniform partition sampling, with less than 100 KB of storage overhead per partition; HBE offers up to a 10x improvement in query time compared to the second-best method with comparable precomputation time. In the second part of this dissertation, we improve the human efficiency of analytics by automatically identifying and summarizing unusual behaviors in large data streams to reduce the burden of manual inspections. We demonstrate this approach through two monitoring applications for machine-generated data. First, ASAP is a visualization operator that automatically smooths time series in monitoring dashboards to highlight large-scale trends and deviations. Compared to presenting the raw time series, ASAP decreases users' response time for identifying anomalies by up to 44.3% in our user study. We subsequently describe FASTer, an end-to-end earthquake detection system that we built in collaboration with seismologists at Stanford University. By pushing down domain-specific filtering and aggregation into the analytics workflows, FASTer significantly improves the speed and quality of earthquake candidate generation, scaling the analysis from three months of data from a single sensor to ten years of data over a network of sensors. The contributions of this dissertation have had real-world impact. ASAP has been incorporated into open-source tools such as Graphite, TimescaleDB Toolkit, and the NPM module downsample. ASAP has also directly inspired an auto smoother for the real-time dashboards at the monitoring service Datadog. FASTer is open-source and has been used by researchers worldwide. Its improved scalability has enabled the discovery of hundreds of new earthquake events near the Diablo Canyon nuclear power plant in California.
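The PS3 idea of answering an aggregate from weighted, partition-level samples can be illustrated with a small Horvitz-Thompson-style estimate; the partitions and inclusion probabilities below are made up for illustration, and this is a sketch of the sampling idea, not the PS3 system:

```python
# Illustrative sketch: approximate a SUM by scanning only some partitions,
# chosen with probabilities derived from precomputed summary statistics, and
# re-weighting each scanned partition by its inclusion probability.
import random

def weighted_partition_sum(partitions, inclusion_probs, seed=0):
    """partitions: list of lists of numeric values (one list per partition).
    inclusion_probs: probability of scanning each partition."""
    rng = random.Random(seed)
    estimate = 0.0
    scanned = 0
    for part, p in zip(partitions, inclusion_probs):
        if rng.random() < p:              # scan this partition?
            estimate += sum(part) / p     # re-weight to keep the estimate unbiased
            scanned += 1
    return estimate, scanned

# Toy usage: 4 partitions; the heavy partition is always scanned.
parts = [[1, 2, 3], [100, 200], [4, 5], [6]]
probs = [0.5, 1.0, 0.5, 0.5]              # e.g., derived from summary statistics
est, n_scanned = weighted_partition_sum(parts, probs)
print(f"estimated sum ~ {est:.1f} from {n_scanned} of {len(parts)} partitions (true sum = 321)")
```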
- Also online at
-
Online 5. Resource-efficient execution of deep learning computations [2021]
- Narayanan, Deepak, author.
- [Stanford, California] : [Stanford University], 2021
- Description
- Book — 1 online resource
- Summary
-
Deep Learning models have enabled state-of-the-art results across a broad range of applications. Training these models, however, is extremely time- and resource-intensive, taking weeks on clusters with thousands of expensive accelerators in the extreme case. As Moore's Law slows down, numerous parallel accelerators have been introduced to meet this new computational demand. This dissertation shows how model- and hardware-aware optimizations in software systems can help intelligently navigate this heterogeneity. In particular, it demonstrates how careful automated scheduling of computation across levels of the software stack can be used to perform distributed training and resource allocation more efficiently. In the first part of this dissertation, we study pipelining, a technique commonly used as a performance optimization in various systems, as a way to perform more efficient distributed model training for both models with small training footprints and those with training footprints larger than the memory capacity of a single GPU. For certain types of models, pipeline parallelism can facilitate model training with lower communication overhead than previous methods. We introduce new strategies for pipeline parallelism, with different tradeoffs between training throughput, memory footprint, and weight update semantics; these outperform existing methods in certain settings. Pipeline parallelism can also be used in conjunction with other forms of parallelism, helping create a richer search space of parallelization strategies. By partitioning the training graph across accelerators in a model-aware way, pipeline parallelism combined with data parallelism can be up to 5x faster than data parallelism in isolation. We also use a principled combination of pipeline parallelism, tensor model parallelism, and data parallelism to efficiently scale training to language models with a trillion parameters on 3072 A100 GPUs (aggregate throughput of 502 petaFLOP/s, which is 52% of peak device throughput). In the second part of this dissertation, we show how heterogeneous compute resources (e.g., different GPU generations like NVIDIA K80 and V100 GPUs) in a shared cluster (either in a private deployment or in the public cloud) should be partitioned among multiple users to optimize objectives specified over one or more training jobs. By formulating existing policies as optimization problems over the allocation, and then using a concept we call effective throughput, policies can be extended to be heterogeneity-aware. A policy-agnostic scheduling mechanism then helps realize the heterogeneity-aware allocations returned by these policies in practice. We can improve various scheduling objectives, such as average completion time, makespan, or cloud computing resource cost, by up to 3.5x, using these heterogeneity-aware policies. Towards the end of this dissertation, we also touch on how the dynamic pricing information of spot instances can be plugged into this heterogeneity-aware policy framework to optimize cost objectives in the public cloud. This can help reduce cost compared to using more expensive on-demand instances alone
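The "effective throughput" concept can be illustrated with a toy calculation; the numbers, job names, and function below are assumptions for illustration, not the dissertation's scheduler:

```python
# Illustrative sketch: a job's effective throughput under an allocation is its
# per-accelerator throughput weighted by the fraction of time it is given each
# accelerator type.

def effective_throughput(per_gpu_throughput, time_fractions):
    """per_gpu_throughput: {gpu_type: samples/sec for this job on that GPU}.
    time_fractions: {gpu_type: fraction of wall-clock time on that GPU}."""
    return sum(per_gpu_throughput[g] * time_fractions.get(g, 0.0)
               for g in per_gpu_throughput)

# Two jobs sharing one K80 and one V100: job A barely benefits from the V100,
# job B speeds up 4x on it, so a heterogeneity-aware split raises total throughput.
jobs = {
    "A": {"K80": 100.0, "V100": 120.0},
    "B": {"K80": 100.0, "V100": 400.0},
}
naive = {"A": {"K80": 0.5, "V100": 0.5}, "B": {"K80": 0.5, "V100": 0.5}}
aware = {"A": {"K80": 1.0, "V100": 0.0}, "B": {"K80": 0.0, "V100": 1.0}}
for name, alloc in [("time-sliced evenly", naive), ("heterogeneity-aware", aware)]:
    eff = {j: effective_throughput(jobs[j], alloc[j]) for j in jobs}
    print(name, eff, "total =", sum(eff.values()))
```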
- Also online at
-
- Fouladighaleh, Sadjad, author.
- [Stanford, California] : [Stanford University], 2021
- Description
- Book — 1 online resource
- Summary
-
Interactive computing has redefined the way that humans approach and solve complex problems. However, the slowdown of Moore's law, coupled with the massive increase in data volume and sophistication, has prevented some applications from running interactively on microcomputers. Tasks such as video processing, software compilation & testing, 3D rendering, simulations, and data analytics have turned into batch jobs running on large clusters, limiting our ability to tinker, iterate, and collaborate. Meanwhile, by offering an ocean of heterogeneous computing resources, cloud computing has provided us with a unique opportunity to bring interactivity to such applications. This dissertation presents my experiences in creating new systems and abstractions for large-scale interactive computing, where users can execute a wide range of resource-intensive tasks with low latency. My thesis is that commodity cloud platforms can be utilized as an accessible supercomputer-by-the-second for interactive execution of large jobs. By leveraging granular cloud services, users can burst to tens of thousands of parallel computations on demand for short periods of time. I will discuss my experience building such applications for massively burst-parallel video processing, 3D path tracing, software builds, and other tasks. First, I describe ExCamera, a system for low-latency video processing using thousands of tiny threads. ExCamera's core contribution is a video encoder intended for fine-grained parallelism that allows computation to be split into thousands of tiny tasks without harming compression efficiency. Next, I discuss R2E2, a highly scalable path tracer for 3D scenes with high complexity. R2E2's main contribution is an architecture for performing low-latency path-tracing of terabyte-scale scenes using serverless computing nodes in the cloud. This design allows R2E2 to leverage the unique strengths of hyper-elastic cloud platforms (e.g., availability of many CPUs/memory in aggregate) and mitigates their limitations (e.g., low per-node memory capacity and high latency inter-node communication). Finally, drawing on the experience of building burst-parallel applications like ExCamera, I describe gg, a framework designed to facilitate the implementation of burst-parallel algorithms on serverless platforms. gg specifies an intermediate representation that allows a diverse class of applications to be abstracted from the computing and storage platform, and leverage common services for dependency management, straggler mitigation, and scheduling. Using gg IR, the developers express their applications in terms of the relationships between code and data, while the framework carries the burden of efficiently executing them
- Also online at
-
Online 7. Data summaries for scalable, high-cardinality analytics [2020]
- Gan, Edward R., author.
- [Stanford, California] : [Stanford University], 2020
- Description
- Book — 1 online resource
- Summary
-
Sensor data, network requests, and other machine generated data continue to grow in both volume and dimensionality. Users now commonly aggregate over billions of records segmented by fine-grained time windows or a long tail of device types. To support this growth, an emerging class of data systems including Apache Druid and Apache Kylin precompute approximate summaries for different segments of a dataset, partitioned by contextual dimensions. These systems can avoid expensive full data scans by operating directly over segment summaries. However, existing streaming and mergeable summarization techniques do not scale to settings where summaries must be aggregated over hundreds or millions of segments. In this thesis, we develop data summaries tailored for high-cardinality aggregation. We find that an end-to-end approach to algorithm design, where data summaries and aggregation are designed for specific high-cardinality workloads, can yield dramatic improvements in accuracy, space usage, and runtime. We demonstrate this approach with three summarization techniques. First, we address runtime bottlenecks for quantile queries by introducing the Moments sketch, which uses compact moment statistics and efficient distribution solvers to reduce aggregation and query overhead compared to existing mergeable summaries. Second, we address accuracy and space bottlenecks for item frequency and quantile queries by developing Cooperative summaries optimized for aggregate (rather than individual) summary error, improving upon the error guarantees provided by mergeable summaries. Third, we address runtime bottlenecks for thresholded kernel density estimation queries by extending k-d tree summaries with a new aggregation procedure, TKDC, and show that it achieves asymptotic improvements in runtime compared with direct computation
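A minimal sketch of what "mergeable" means for such summaries, using simple power sums as a stand-in for the Moments sketch's statistics (the real sketch also stores log-moments and answers quantile queries through a maximum-entropy solver):

```python
# Illustrative sketch: each segment keeps count, min, max, and the first k
# power sums; merging two summaries is element-wise addition, so aggregating
# summaries across many segments is cheap.

def build_summary(values, k=4):
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "power_sums": [sum(v ** i for v in values) for i in range(1, k + 1)],
    }

def merge(a, b):
    return {
        "count": a["count"] + b["count"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
        "power_sums": [x + y for x, y in zip(a["power_sums"], b["power_sums"])],
    }

# Toy usage: two segments merged; the mean recovered from the merged summary
# matches the underlying data (36 / 5 = 7.2).
s1, s2 = build_summary([1.0, 2.0, 3.0]), build_summary([10.0, 20.0])
m = merge(s1, s2)
print(m["power_sums"][0] / m["count"])
```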
- Also online at
-
Online 8. Strong privacy for communication, browsing, and storage [2021]
- Eskandarian, Saba, author.
- [Stanford, California] : [Stanford University], 2021
- Description
- Book — 1 online resource
- Summary
-
As news of large-scale privacy compromises becomes increasingly frequent, there is a need for new tools to protect user privacy across all the technologies we use in our daily lives. This thesis develops new approaches to protecting user data as it is typed into a web browser, when it is transmitted in messaging conversations, and when it resides in cloud storage. The techniques used will vary from new system designs using hardware enclaves to novel applications of zero-knowledge proof systems, but for each application the thesis will show how to achieve order of magnitude performance improvements that enable previously out of reach use cases
- Also online at
-
Online 9. Automated discovery of machine learning optimizations [2020]
- Jia, Zhihao, author.
- [Stanford, California] : [Stanford University], 2020
- Description
- Book — 1 online resource
- Summary
-
The increasing complexity of machine learning (ML) models and ML-specific hardware architectures makes it increasingly challenging to build efficient and scalable ML systems. Today's ML systems heavily rely on human effort to optimize the deployment of ML models on modern hardware platforms, which requires a tremendous amount of engineering effort but only provides suboptimal runtime performance. Moreover, the rapid evolution of ML models and ML-specific hardware makes it infeasible to manually optimize performance for all model and hardware combinations. In this dissertation, we propose a search-based methodology to build performant ML systems by automatically discovering performance optimizations for ML computations. Instead of only considering the limited set of manually designed performance optimizations in current ML systems, our approach introduces a significantly more comprehensive search space of possible strategies to optimize the deployment of an ML model on a hardware platform. In addition, we design efficient search algorithms to explore the search space and discover highly-optimized strategies. The search is guided by a cost model for evaluating the performance of different strategies. We also propose a number of techniques to accelerate the search procedure by leveraging the topology of the search space. This dissertation presents three ML systems that apply this methodology to optimize different tasks in ML deployment. Compared to current ML systems relying on manually designed optimizations, our ML systems enable better runtime performance by automatically discovering novel performance optimizations that are missing in current ML systems. Moreover, the performance improvement is achieved with less engineering effort, since the code needed for discovering these optimizations is much less than manual implementation of these optimizations. First, we developed TASO, the first ML graph optimizer that automatically generates graph optimizations. TASO formally verifies the correctness of the generated graph optimizations using an automated theorem prover, and uses cost-based backtracking search to discover how to apply the verified optimizations. In addition to improving runtime performance and reducing engineering effort, TASO also provides correctness guarantees using formal methods. Second, to generalize and go beyond today's manually designed parallelization strategies for distributed ML computations, we introduce the SOAP search space, which contains a comprehensive set of possible strategies to parallelize ML computations by identifying parallelization opportunities across different Samples, Operators, Attributes, and Parameters. We developed FlexFlow, a deep learning engine that automatically searches over strategies in the SOAP search space. FlexFlow includes a novel execution simulator to evaluate the runtime performance of different strategies, and uses a Markov Chain Monte Carlo (MCMC) search algorithm to find performant strategies. FlexFlow discovers strategies that significantly outperform existing strategies, while requiring no manual effort during the search procedure. Finally, we developed Roc, which automates data placement optimizations and minimizes data transfers in the memory hierarchy for large-scale graph neural network (GNN) computations. Roc formulates the task of optimizing data placement as a cost minimization problem and uses a dynamic programming algorithm to discover a globally optimal data management plan that minimizes data transfers between memories
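The cost-model-guided search described above can be illustrated with a generic MCMC loop over a toy strategy space; the proposal and cost functions below are placeholders, not FlexFlow's execution simulator:

```python
# Illustrative sketch: randomly perturb a parallelization strategy and accept
# the change with a probability that depends on the simulated cost difference
# (Metropolis rule), so the search can escape local minima.
import math
import random

def mcmc_search(initial, propose, cost, steps=1000, beta=0.5, seed=0):
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    for _ in range(steps):
        candidate = propose(current, rng)
        c = cost(candidate)
        # Always accept improvements; sometimes accept regressions.
        if c < current_cost or rng.random() < math.exp(-beta * (c - current_cost)):
            current, current_cost = candidate, c
            if c < best_cost:
                best, best_cost = candidate, c
    return best, best_cost

# Toy strategy space: a data-parallel degree per layer for 4 layers.
def propose(strategy, rng):
    s = list(strategy)
    s[rng.randrange(len(s))] = rng.choice([1, 2, 4, 8])
    return tuple(s)

def toy_cost(strategy):           # stand-in compute + communication model
    return sum(16 / d + 0.5 * (d - 1) for d in strategy)

print(mcmc_search((1, 1, 1, 1), propose, toy_cost))
```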
- Also online at
-
Online 10. Interfaces for efficient software composition on modern hardware [2020]
- Palkar, Shoumik Prasad, author.
- [Stanford, California] : [Stanford University], 2020
- Description
- Book — 1 online resource
- Summary
-
For decades, developers have been productive writing software by composing optimized libraries and functions written by other developers. Though hardware trends have evolved significantly over this time---with the ending of Moore's law, the increasing ubiquity of parallelism, and the emergence of new accelerators---many of the common interfaces for composing software have nevertheless remained unchanged since their original design. This lack of evolution is causing serious performance consequences in modern applications. For example, the growing gap between memory and processing speeds means that applications that compose even hand-tuned libraries can spend more time transferring data through main memory between individual function calls than they do performing computations. This problem is even worse for applications that interface with new hardware accelerators such as GPUs. Though application writers can circumvent these bottlenecks manually, these optimizations come at the expense of programmability. In short, the interfaces for composing even optimized software modules are no longer sufficient to best use the resources of modern hardware. This dissertation proposes designing new interfaces for efficient software composition on modern hardware by leveraging algebraic properties intrinsic to software APIs to unlock new optimizations. We demonstrate this idea with three new composition interfaces. The first interface, Weld, uses a functional intermediate representation (IR) to capture the parallel structure of data analytics workloads underneath existing APIs, and enables powerful data movement optimizations over this IR to optimize applications end-to-end. The second, called split annotations (SAs), also focuses on data movement optimization and parallelization, but uses annotations on top of existing functions to define an algebra for specifying how data passed between functions can be partitioned and recombined to enable cross-function pipelining. The third, called raw filtering, optimizes data loading in data-intensive systems by redefining the interface between data parsers and query engines to improve CPU efficiency. Our implementations of these interfaces have shown substantial performance benefits in rethinking the interface between software modules. More importantly, they have also shown the limitations of existing established interfaces. Weld and SAs show that a new interface can accelerate data science pipelines by over 100x in some cases in multicore environments, by enabling data movement optimizations such as pipelining on top of existing libraries such as NumPy and Pandas. We also show that Weld can be used to target new parallel accelerators, such as vector processors and GPUs, and that SAs can enable these speedups even on black-box libraries without any library code modification. Finally, the I/O optimizations in raw filtering show over 9x improvements in end-to-end query execution time in distributed systems such as Spark SQL when processing semi-structured data such as JSON
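A rough sketch of the split-annotation idea, pipelining unmodified functions over chunks of their input so intermediates stay small; the splitter, merger, and tiny runtime below are illustrative assumptions, not the SAs library:

```python
# Illustrative sketch: a data type declares how it can be split and merged,
# and a small runtime pipelines a sequence of unmodified (element-wise)
# functions chunk-at-a-time instead of materializing whole intermediates.

def split_list(xs, chunk_size):
    for i in range(0, len(xs), chunk_size):
        yield xs[i:i + chunk_size]

def merge_lists(chunks):
    out = []
    for c in chunks:
        out.extend(c)
    return out

def _apply_all(funcs, chunk):
    for f in funcs:
        chunk = f(chunk)
    return chunk

def pipelined(funcs, data, chunk_size=2):
    """Apply funcs (each list -> list) chunk-at-a-time, then merge."""
    return merge_lists([_apply_all(funcs, chunk) for chunk in split_list(data, chunk_size)])

# Toy usage: two "library" functions composed without modifying either one.
add_one = lambda xs: [x + 1 for x in xs]
double = lambda xs: [x * 2 for x in xs]
print(pipelined([add_one, double], [1, 2, 3, 4, 5]))   # [4, 6, 8, 10, 12]
```

The sketch only works because both functions operate element-wise; that is exactly the kind of property a split annotation makes explicit so the runtime knows pipelining is safe.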
- Also online at
-
Online 11. NanoLog : a nanosecond scale logging system [2020]
- Yang, Stephen, author.
- [Stanford, California] : [Stanford University], 2020
- Description
- Book — 1 online resource
- Summary
-
Instrumentation is fundamental to developing and maintaining applications in the modern datacenter. It affords visibility into what an application is doing at runtime, and it helps pin-point bugs in a system by exposing the steps that lead to an error. The most common method of instrumentation today is logging, or printing out human-readable messages during an application's execution. Unfortunately, as applications have evolved to become increasingly more performant with tighter latency requirements, traditional logging systems have not kept up. As a result, the cost of producing human-readable log messages is becoming prohibitively expensive. NanoLog is a nanosecond scale logging system that's 1-2 orders of magnitude faster than existing logging systems such as Log4j2, spdlog, Boost log, or Event Tracing for Windows. The system achieves a throughput of up to 82 million log messages per second for simple log messages and has a typical log invocation overhead of 8 nanoseconds. For comparison, other modern logging applications today can only achieve up to a few million log messages per second at log latencies of hundreds of nanoseconds to several microseconds. NanoLog achieves its ultra-low latency and high throughput by shifting work out of the runtime hot-path and into the compilation and post-execution phases of the application. More specifically, it performs compile-time extraction of static information from the log messages to reduce I/O and decouples formatting of the log messages from the runtime application by deferring it until after execution. The result is an optimized runtime that only outputs the minimal amount of data and produces a compact, binary log file. The binary log file is also amenable to log analytics engines; it is small relative to full, human-readable log messages and contains all the data in a binary format, saving the engine from parsing ASCII text. With these enhancements, NanoLog enables nanosecond scale logging and hopes to fill the performance gap left between traditional logging systems of today and the next generation applications of tomorrow
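The core NanoLog idea, recording only a format-string identifier plus raw arguments on the hot path and deferring formatting to post-processing, can be sketched as follows (a Python illustration of the concept; NanoLog itself performs the extraction at C++ compile time):

```python
# Illustrative sketch: the hot path appends a small (format_id, args) record,
# and a separate post-processing step expands records into readable text.

FORMAT_TABLE = []          # static format strings, registered once
BINARY_LOG = []            # compact runtime records: (format_id, args)

def register_format(fmt):
    FORMAT_TABLE.append(fmt)
    return len(FORMAT_TABLE) - 1

def fast_log(format_id, *args):
    BINARY_LOG.append((format_id, args))     # hot path: no string formatting

def post_process(binary_log):
    return [FORMAT_TABLE[fid].format(*args) for fid, args in binary_log]

# Usage: formats are registered ahead of time, formatting is deferred.
REQ_DONE = register_format("request {} finished in {} us")
fast_log(REQ_DONE, 42, 17)
fast_log(REQ_DONE, 43, 23)
print("\n".join(post_process(BINARY_LOG)))
```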
- Also online at
-
- Qin, Henry, author.
- [Stanford, California] : [Stanford University], 2019.
- Description
- Book — 1 online resource.
- Summary
-
Arachne is a new user-level implementation of threads that provides both low latency and high throughput for applications with extremely short-lived threads (only a few microseconds). Arachne is core-aware: each application determines how many cores it needs, based on its load; it always knows exactly which cores it has been allocated, and it controls the placement of its threads on those cores. A central core arbiter allocates cores between applications. Adding Arachne to memcached improved SLO-compliant throughput by 37%, reduced tail latency by more than 10x, and allowed memcached to coexist with background applications with almost no performance impact. Adding Arachne to the RAMCloud storage system increased its write throughput by more than 2.5x. The Arachne threading library is optimized to minimize cache misses; it can initiate a new user thread on a different core (with load balancing) in 320 ns. Arachne is implemented entirely at user level on Linux; no kernel modifications are needed.
- Also online at
-
Online 13. Protecting privacy by splitting trust [2019]
- Corrigan-Gibbs, Henry Nathaniel, author.
- [Stanford, California] : [Stanford University], 2019.
- Description
- Book — 1 online resource.
- Summary
-
In this dissertation, we construct two systems that protect privacy by splitting trust among multiple parties, so that the failure of any one, whether benign or malicious, does not cause a catastrophic privacy failure for the system as a whole. The first system, called Prio, allows a company to collect aggregate statistical data about its users without learning any individual user's personal information. The second, called Riposte, is a system for metadata-hiding communication that allows its users to communicate over an insecure network without revealing who is sending messages to whom. Both systems defend against malicious behavior using zero-knowledge proofs on distributed data, a cryptographic tool that we develop from a new type of probabilistically checkable proof. The two systems that we construct maintain their security properties in the face of an attacker who can control the entire network, an unlimited number of participating users, and any proper subset of the servers that comprise the system. These systems split trust in the sense that, as long as an attacker cannot compromise all of the participating servers, the system provides "best-possible" protection of the confidentiality of user data. Through the design, implementation, and deployment of these systems, we show that it is possible for us to enjoy the benefits of modern computing while protecting the privacy of our data.
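The trust-splitting idea behind Prio-style aggregation can be sketched with additive secret sharing; this illustration omits the zero-knowledge validity proofs that the dissertation develops:

```python
# Illustrative sketch: each client splits its private value into random
# additive shares, one per server, so no single server learns the value,
# yet the servers can still compute the aggregate by summing their shares.
import random

MODULUS = 2 ** 61 - 1   # arithmetic is done modulo a public prime

def share(value, n_servers, rng):
    shares = [rng.randrange(MODULUS) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def aggregate(per_server_shares):
    # Each server sums the shares it received; publishing only these sums
    # reveals the total but not any individual contribution.
    server_sums = [sum(col) % MODULUS for col in zip(*per_server_shares)]
    return sum(server_sums) % MODULUS

rng = random.Random(0)
client_values = [3, 0, 1, 1]                       # private per-user statistics
shares = [share(v, n_servers=2, rng=rng) for v in client_values]
print(aggregate(shares))                           # 5, with no server seeing a value
```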
- Also online at
-
Online 14. A relational architecture for graph, linear algebra, and business intelligence querying [2018]
- Aberger, Christopher R., author.
- [Stanford, California] : [Stanford University], 2018.
- Description
- Book — 1 online resource.
- Summary
-
Modern analytics workloads extend far beyond the SQL-style business intelligence queries that relational database management systems (RDBMS) were designed to process efficiently. As a result, RDBMSs often incur orders of magnitude performance gaps with the best known implementations on modern analytics workloads such as graph analysis and linear algebra queries. The relational model is therefore largely forsaken on such workloads, resulting in a flurry of activity around designing specialized (low-level) graph and linear algebra packages. In this dissertation we present a new type of relational query processing architecture that overcomes these shortcomings of traditional relational architectures. To do this we present a new in-memory query processing engine called EmptyHeaded. EmptyHeaded uses a new, worst-case optimal (multiway) join algorithm as its core execution mechanism, making it fundamentally different from nearly every other relational architecture. With EmptyHeaded, we show how the crucial optimizations for graph analysis, linear algebra, and business intelligence workloads can be captured in such a novel relational architecture. The work presented in this dissertation shows that, unlike traditional RDBMSs, this new type of relational query processing architecture is capable of delivering efficient performance in multiple application domains.
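The worst-case optimal, attribute-at-a-time join strategy can be illustrated on the classic triangle query; this is a plain-Python sketch of the join idea, not EmptyHeaded's compiled engine:

```python
# Illustrative sketch: the triangle query R(a,b), S(b,c), T(a,c) is evaluated
# by binding one attribute at a time and intersecting candidate sets, rather
# than by a sequence of pairwise joins.

def triangles(edges):
    """edges: set of directed pairs; returns triangles (a, b, c)."""
    by_src = {}
    for u, v in edges:
        by_src.setdefault(u, set()).add(v)
    out = []
    for a in by_src:                        # bind attribute a
        for b in by_src[a]:                 # bind b from R(a, b)
            if b not in by_src:
                continue
            # bind c from the intersection of S(b, *) and T(a, *)
            for c in by_src[b] & by_src[a]:
                out.append((a, b, c))
    return out

edges = {(1, 2), (2, 3), (1, 3), (3, 4)}
print(triangles(edges))                     # [(1, 2, 3)]
```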
- Also online at
-
Special Collections
Special Collections | Status |
---|---|
University Archives | Request on-site access |
3781 2018 A | In-library use |
Online 15. Resource and data efficient deep learning [2021]
- Coleman, Cody Austun, author.
- [Stanford, California] : [Stanford University], 2021
- Description
- Book — 1 online resource
- Summary
-
Using massive computation, deep learning allows machines to translate large amounts of data into models that accurately predict the real world, enabling powerful applications like virtual assistants and autonomous vehicles. As datasets and computer systems have continued to grow in scale, so has the quality of machine learning models, creating an expensive appetite in practitioners and researchers for data and computation. To address this demand, this dissertation discusses ways to measure and improve both the computational and data efficiency of deep learning. First, we introduce DAWNBench and MLPerf as a systematic way to measure end-to-end machine learning system performance. Researchers have proposed numerous hardware, software, and algorithmic optimizations to improve the computational efficiency of deep learning. While some of these optimizations perform the same operations faster (e.g., increasing GPU clock speed), many others modify the semantics of the training procedure (e.g., reduced precision) and can even impact the final model's accuracy on unseen data. Because of these trade-offs between accuracy and computational efficiency, it has been difficult to compare and understand the impact of these optimizations. We propose and evaluate a new metric, time-to-accuracy, that can be used to compare different system designs and use it to evaluate high performing systems by organizing two public benchmark competitions, DAWNBench and MLPerf. MLPerf has now grown into an industry standard benchmark co-organized by over 70 organizations. Second, we present ways to perform data selection on large-scale datasets efficiently. Data selection methods, such as active learning and core-set selection, improve the data efficiency of machine learning by identifying the most informative data points to label or train on. Across the data selection literature, there are many ways to identify these training examples. However, classical data selection methods are prohibitively expensive to apply in deep learning because of the larger datasets and models. To make these methods tractable, we propose (1) "selection via proxy" (SVP) to avoid expensive training and reduce the computation per example and (2) "similarity search for efficient active learning and search" (SEALS) to reduce the number of examples processed. Both methods lead to order of magnitude performance improvements, making techniques like active learning on billions of unlabeled images practical for the first time
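The time-to-accuracy metric can be illustrated in a few lines; the training log below is fabricated for illustration, and the DAWNBench/MLPerf rules add details (reference implementations, multiple runs) that this sketch omits:

```python
# Illustrative sketch: given a training log of (wall-clock seconds, validation
# accuracy) pairs, time-to-accuracy is the first time the target is reached.

def time_to_accuracy(log, target):
    for seconds, accuracy in log:
        if accuracy >= target:
            return seconds
    return None   # target never reached

training_log = [(600, 0.71), (1200, 0.88), (1800, 0.93), (2400, 0.94)]
print(time_to_accuracy(training_log, target=0.93))     # 1800 seconds
```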
- Also online at
-
Online 16. Scaling a reconfigurable dataflow accelerator [2020]
- Zhang, Yaqi, author.
- [Stanford, California] : [Stanford University], 2020
- Description
- Book — 1 online resource
- Summary
-
With the slowdown of Moore's Law, specialized hardware accelerators are gaining traction for delivering 100-1000x performance improvement over general-purpose processors in a variety of application domains, such as cloud computing, biocomputing, artificial intelligence, etc. As the performance scaling in multicores is coming to a limit, a new class of accelerators---reconfigurable dataflow architectures (RDAs)---offers high-throughput and energy-efficient acceleration that keeps up with the performance demand. Instead of dynamically fetching instructions as traditional processors do, RDAs have flexible data paths that can be statically configured to spatially parallelize and pipeline programs across distributed on-chip resources. The pipelined execution model and explicitly-managed scratchpad in RDAs eliminate the performance, area, and energy overhead from dynamic scheduling and conventional memory hierarchy. To adapt to the compute intensity in modern data-analytic workloads, particularly in the deep learning domain, RDAs have increased to an unprecedented scale. With an area footprint of 133mm^2 at 28nm, Plasticine is a previously proposed large-scale RDA supplying 12.3 TFLOPs of computing power. Prior work has shown an up to 76x performance/watt benefit from Plasticine over a Stratix V FPGA due to an advantage in clock frequency and resource density. The increase in scale introduces new challenges in network-on-chip design to maintain the throughput and energy efficiency of an RDA. Furthermore, targeting and managing RDAs at this scale require new strategies in mapping, memory management, and flexible control to fully utilize their compute power. In this work, we focus on two aspects of the software-hardware co-design that impact the usability and scalability of the Plasticine accelerator. Although RDAs are flexible enough to support a wide range of applications, the biggest challenge that hinders the adoption of these accelerators is the low-level knowledge of microarchitecture design and hardware constraints required to map a new application efficiently. To address this challenge, we introduce a compiler stack---SARA---that raises the programming abstraction of Plasticine to an imperative-style domain-specific language with nested control flow for general spatial architectures. The abstraction is architecture-agnostic and contains explicit loop constructs that enable cross-kernel optimizations often not exploited on RDAs. SARA efficiently translates imperative control constructs to a streaming dataflow graph that scales performance with distributed on-chip resources. By virtualizing resources, SARA systematically handles hardware constraints, hiding the low-level architecture-specific restrictions from programmers. To address the scalability challenge with increasing chip size, we present a comprehensive study on the network-on-chip design space for RDAs. We found that network performance highly correlates with bandwidth instead of latency for RDAs with a streaming dataflow execution model. Lastly, we show that a static-dynamic hybrid network design can sustain performance in a scalable fashion with high energy efficiency.
- Also online at
-
- Koeplinger, David Alan, author.
- [Stanford, California] : [Stanford University], 2019.
- Description
- Book — 1 online resource.
- Summary
-
In recent years, the computing landscape has seen an increasing shift towards specialized accelerators. Reconfigurable architectures like field programmable gate arrays (FPGAs) are particularly promising for accelerator implementation as they can offer performance and energy efficiency improvements over CPUs and GPUs while offering more flexibility than fixed-function ASICs. Unfortunately, adoption of reconfigurable hardware has been limited by its associated tools and programming models. The conventional languages for programming FPGAs, hardware description languages (HDLs), lack abstractions for productivity and are difficult to target directly from higher-level languages. Commercial high-level synthesis (HLS) tools offer a more productive programming solution, but their mix of software and hardware abstractions is ad-hoc, making both manual and automated performance optimizations difficult. As demand for customized accelerators has grown, so too has the demand by software engineers and domain experts for domain-specific languages (DSLs) which provide higher levels of abstraction and hence improved programmer productivity. Unfortunately, in the domains of machine learning and data analytics, most domain-specific methods for generating accelerators are focused on library-based approaches which generate hardware on a per-kernel basis, resulting in excessive memory transfers and missing critical cross-kernel optimizations. As DSLs become more ubiquitous, this approach will not scale. This dissertation describes a new system for compiling high-level applications in domain-specific languages to hardware accelerator designs that addresses these productivity, generality, and optimization challenges. To improve results above kernel-based approaches when programming in domain-specific languages, we introduce a waypoint between DSLs and HDLs: a new intermediate abstraction dedicated to representing parameterized accelerator designs targeting reconfigurable architectures. Starting from a common intermediate representation for high-level DSLs based on parallel patterns, we first describe the cross-kernel optimizations the system performs and the methods used to lower the entire application graph into a parameterized design in the intermediate hardware abstraction. We then describe our implementation of this intermediate abstraction, a language and compiler called Spatial. We discuss some of the compiler optimizations Spatial enables, including rapid design parameter tuning, pipeline scheduling, and memory banking and partitioning. The end result is a compiler stack which can take as input a high-level program in a domain-specific language and translate it into an optimized, synthesizable hardware design coupled with runtime administration code for the host CPU.
- Also online at
-
Online 18. Fast, elastic storage for the cloud [2019]
- Klimovic, Ana, author.
- [Stanford, California] : [Stanford University], 2019.
- Description
- Book — 1 online resource.
- Summary
-
Cloud computing promises high performance, cost-efficiency, and elasticity --- three essential goals when processing exponentially growing datasets. To meet these goals, cloud platforms must provide each application with the right amount and balance of compute and fast storage resources (e.g., NVMe Flash storage). However, providing the right resources to applications is challenging today because server machines have a fixed ratio of compute and storage resources, remote access to fast storage leads to significant performance and cost overheads, and storage requirements vary significantly over time and across applications. This dissertation focuses on how to build high performance, cost-effective, and easy-to-use cloud storage systems. First, we discuss how to provide fast access to remote Flash storage so that the balance of compute and storage allocation is not limited by the physical characteristics of server hardware. We present ReFlex, a custom network-storage operating system that provides fast access to modern Flash storage over commodity cloud networks. ReFlex enables storage devices to be shared among multiple tenants with predictable performance. Second, we discuss how to implement intelligent allocation of storage resources. We present Pocket, a distributed storage service that combines the fast remote data access in ReFlex with automatic resource allocation for serverless analytics workloads. Finally, we discuss the potential of using machine learning to automate resource allocation decisions with a system called Selecta that recommends a near-optimal cloud resource configuration for a job based on sparse training data collected across multiple jobs and configurations.
- Also online at
-
- Thomas, James Joe, author.
- [Stanford, California] : [Stanford University], 2022
- Description
- Book — 1 online resource
- Summary
-
FPGAs have grown in popularity as compute accelerators in recent years, being deployed in the clouds of Amazon, Alibaba, and Microsoft for either internal tasks like networking acceleration or directly for customers to rent. As their use has grown, it has become increasingly clear that they are fairly unproductive for developers compared to competing platforms like GPUs and CPUs. We aim to close this gap in this dissertation by taking a domain-specific approach. We argue that general-purpose FPGA development tools are fundamentally limited by the complexity of the platform, and therefore focus on building faster and simpler tools for the specific case of streaming data-intensive applications. We first present Fleet, a system for accelerating massively parallel streaming workloads on FPGAs. Fleet provides a simple language for users to define a compute unit that processes a single stream of data, and then automatically replicates the compute unit many times into a memory controller fabric so that the final design can process many independent streams at once. We next present a fast compilation system for Fleet-like applications. Our system leverages the fact that the memory controller design is fixed across applications, and therefore compiles it ahead of time, leaving empty slots for copies of the user's processing unit. The user-visible compile time is thus reduced to only the time required to compile a few copies of the processing unit and replicate them into the prebuilt memory controller. Finally, we leverage the design patterns of identical compute units and streaming DRAM access to design an FPGA accelerator for the problem of finding interesting subgroups in large tabular datasets. This accelerator is able to outperform GPUs and CPUs on a cost-per-throughput basis due to its customized partitioning of SRAM resources across compute units.
- Also online at
-
20. 初めてのSpark [2015]
- Karau, Holden, author.
- 1st edition. - O'Reilly Japan, Inc., 2015.
- Description
- Book — 1 online resource (312 pages) Digital: text file.