Catalog search results

54 catalog results

Book
xvi, 145 p.
SAL1&2 (on-campus shelving), SAL3 (off-campus storage), Special Collections
Book
xx, 97 p.
SAL3 (off-campus storage), Special Collections
Book
xvi, 132 p.
SAL3 (off-campus storage), Special Collections
Book
xix, 153 leaves, bound.
SAL3 (off-campus storage), Special Collections
Book
xix, 170 leaves, bound.
SAL3 (off-campus storage), Special Collections
Book
xxiii, 152 p.
SAL3 (off-campus storage), Special Collections
Book
viii, 191 p. : ill. ; 24 cm.
This book presents the thoroughly refereed post-proceedings of the Second International Workshop on Intelligent Memory Systems, IMS 2000, held in Cambridge, MA, USA, in November 2000. The nine revised full papers and six poster papers presented were carefully reviewed and selected from 28 submissions. The papers cover a wide range of topics in intelligent memory computing; they are organized in topical sections on memory technology, processor and memory architecture, applications and operating systems, and compiler technology.
(source: Nielsen Book Data)
SAL3 (off-campus storage)
Book
1 online resource.
High performance in modern computing platforms requires programs to be parallel, distributed, and run on heterogeneous hardware. However, programming such architectures is extremely difficult due to the need to implement the application using multiple programming models and combine them in ad hoc ways. High-level programming frameworks based on parallel patterns have recently become a popular solution to raise the level of abstraction and provide implicitly parallel execution on a variety of architectures. Portable performance is often still difficult to achieve, however, due to the system's inability to optimize programs across data structure abstractions and nested parallelism. In this dissertation, I introduce the Delite Multiloop Language (DMLL), a new intermediate language based on common parallel patterns that captures the necessary semantic knowledge to efficiently target distributed heterogeneous architectures. Combined with a straightforward array-based data structure model, the language semantics naturally capture a set of powerful transformations over nested parallel patterns that restructure computation to enable distribution and optimize for heterogeneous devices. These transformations enable improved single-threaded performance, greater parallel scalability, smaller memory footprints, transparent targeting of distributed-memory architectures, and automated data movement and distribution. I also present experimental results for a range of applications spanning multiple domains and demonstrate highly efficient execution compared to manually-optimized counterparts in alternative systems.
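The kind of pattern-level restructuring described above can be illustrated with a small sketch; the Map/Reduce node classes and the single fusion rule below are simplified assumptions for illustration, not DMLL's actual intermediate representation.

```python
# Minimal sketch of pattern-based restructuring in the spirit of DMLL.
# The Map/Reduce nodes and the fusion rule are simplified assumptions,
# not DMLL's actual IR.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Map:                      # apply f to every element of a source collection
    f: Callable[[float], float]
    source: List[float]

@dataclass
class Reduce:                   # combine elements of a source collection or pattern
    combine: Callable[[float, float], float]
    init: float
    source: object              # either a raw list or another pattern (nesting)

def fuse(node):
    """Fuse a Map feeding a Reduce into a single-pass Reduce (vertical fusion)."""
    if isinstance(node, Reduce) and isinstance(node.source, Map):
        inner = node.source
        return Reduce(
            combine=lambda acc, x, f=inner.f, c=node.combine: c(acc, f(x)),
            init=node.init,
            source=inner.source,
        )
    return node

def evaluate(node):
    if isinstance(node, Map):
        return [node.f(x) for x in node.source]
    acc = node.init
    src = node.source if isinstance(node.source, list) else evaluate(node.source)
    for x in src:
        acc = node.combine(acc, x)
    return acc

data = [1.0, 2.0, 3.0, 4.0]
nested = Reduce(combine=lambda a, b: a + b, init=0.0, source=Map(lambda x: x * x, data))
assert evaluate(nested) == evaluate(fuse(nested)) == 30.0
```

Fusing the Map into the Reduce turns two traversals into a single pass, which is the flavor of transformation that yields smaller memory footprints and better locality.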
Book
1 online resource.
Approximate computing (AC) is a very promising design paradigm for crossing the CPU power wall, primarily driven by the potential to sacrifice application output quality for significant gains in performance, energy, and fault tolerance. AC exploits the tolerance of many application domains (e.g., multimedia processing, data mining, and scientific computing) to errors and/or low-precision in their computations. However, existing solutions on the software side have not thoroughly explored the compilation and runtime stages, which have a critical impact on system performance. This work introduces a software-only, general-purpose acceleration framework that utilizes neural networks (NNs) and denoising autoencoders to restructure an application's data flow into a hybrid of exact and approximate computations. Specifically, this paper proposes advanced compilation and runtime techniques that solve the very difficult challenge of utilizing multiple interacting subtask approximators. Unlike previous work, our system exploits the hierarchical structure of an application to introduce significant flexibility in how the approximations are performed. Additionally, the framework is able to automatically generate an application's subroutine structure in order to minimize design costs. By restructuring algorithms to have a mixed exact-NN data flow, EMEURO is able to achieve significant speedup across several domains, achieving 7x-109x maximum speedup over the original algorithm, with 0.1%-10% approximation error. Design costs are significantly reduced by allowing the NNs to learn the application's functionality rather than being explicitly programmed and optimized by a human. NNs have also been shown to be very fault tolerant, which is particularly important in high-performance systems, where sophisticated designs can yield complex and hard-to-trace bugs. Although this work focuses on CPU-only acceleration, the very data-parallel nature of NNs makes them amenable to running efficiently on many different acceleration platforms.
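As a hedged sketch of the approach, the snippet below swaps one exact subtask in a toy pipeline for a small learned regressor; `slow_kernel`, the network size, and the pipeline stages are hypothetical stand-ins, not the framework's actual approximators.

```python
# Hedged sketch: replace an exact subtask with a learned approximator in a pipeline.
# `slow_kernel` and the pipeline stages are hypothetical; the real system trains
# NN/denoising-autoencoder approximators per subtask of the application.
import numpy as np
from sklearn.neural_network import MLPRegressor

def slow_kernel(x: np.ndarray) -> np.ndarray:
    """Exact but expensive subtask (placeholder for a real hotspot)."""
    return np.sin(3 * x) + 0.5 * x * x

# Train the approximator offline from input/output samples of the exact kernel.
X_train = np.random.uniform(-2, 2, size=(5000, 1))
y_train = slow_kernel(X_train[:, 0])
approx = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000).fit(X_train, y_train)

def pipeline(x: np.ndarray, use_approx: bool) -> np.ndarray:
    stage1 = x * 0.5                                      # exact stage
    stage2 = (approx.predict(stage1.reshape(-1, 1))       # approximate hotspot
              if use_approx else slow_kernel(stage1))
    return stage2 + 1.0                                   # exact stage

x = np.linspace(-2, 2, 256)
err = np.mean(np.abs(pipeline(x, True) - pipeline(x, False)))
print(f"mean absolute approximation error: {err:.4f}")
```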
Book
1 online resource.
Technology trends and architectural developments continue to enable an increase in the number of cores available on multicore chips. Certain types of applications easily take advantage of increased concurrency, but there are important application domains that struggle to do so. Such irregular applications, often related to sparse datasets or graph processing, have time-variable, hard-to-predict parallelism. They tend to resist static parallelization, requiring instead dynamic load-balancing layers. These task-management layers by necessity rely more strongly on communication and synchronization than regular applications do, a problem that is only accentuated by increasing on-chip communication delays. As core counts reach into the hundreds, more careful handling of this synchronization becomes necessary to avoid bottlenecks and load imbalances. We explore solutions to these issues in two dimensions: On the software front, we design a task-stealing runtime layer for irregular parallel applications on hundred-core chips. By setting up fast and efficient ways to share information, the runtime is able to embrace the varying relationship between available parallelism and core count, adapting dynamically to both abundance and scarcity of work. When tested with representative sparse-data and graph-processing applications on 128 cores, runtime overhead is reduced by 60%, simultaneously achieving 15% faster execution and 29% lower system utilization. This is done without hardware assistance, unlike prior solutions. On the hardware front, we address the 'latency multiplying' effects of relying on coherent caches for inter-core communication and synchronization. Our proposed techniques enable efficient implementation of synchronization and data-sharing by reducing the number and complexity of cache-coherence transactions required. Using ISA hints to express preferred placement of both data access and atomic operations, faster and more efficient communication patterns can result. Our solution is compatible with existing architectures, requires only minimal code changes, and presents few practical implementation barriers. We show experimentally on 128 cores that these tools yield many-fold speedups to benchmarks and cut the optimized task-stealing runtime's remaining overhead, for a 70% speedup to runtime stealing and 18% to actual applications, along with increased energy efficiency. The overall result is to significantly improve the viability of irregular parallelism on many-core chips. We minimize waste and enable faster on-chip communication by implementing smarter, more context-aware software task management, and by enabling more efficient communication patterns at the cache coherence level.
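The scheduling policy at the heart of such a task-stealing runtime can be sketched as follows; the deque-per-worker layout and the crude termination condition are illustrative simplifications, and the dissertation's runtime adds the fast information-sharing and adaptation described above.

```python
# Minimal sketch of a task-stealing runtime: each worker keeps a local deque and,
# when it runs dry, steals from the opposite end of a random victim's deque.
# Illustrates only the scheduling policy, not the dissertation's full runtime.
import random
import threading
from collections import deque

NUM_WORKERS = 4
queues = [deque() for _ in range(NUM_WORKERS)]
results = []
results_lock = threading.Lock()

def worker(wid: int) -> None:
    rng = random.Random(wid)
    idle_spins = 0
    while idle_spins < 1000:                 # crude termination for the sketch
        try:
            task = queues[wid].pop()         # LIFO from own tail: good locality
        except IndexError:
            victim = rng.randrange(NUM_WORKERS)
            try:
                task = queues[victim].popleft()   # FIFO steal from victim's head
            except IndexError:
                idle_spins += 1
                continue
        idle_spins = 0
        value = task * task                  # stand-in for a unit of irregular work
        with results_lock:
            results.append(value)

# Seed work unevenly to mimic irregular parallelism, then run the workers.
for i in range(10_000):
    queues[i % 2].append(i)                  # only two queues start with work
threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))                          # 10000 once all tasks are processed
```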
Special Collections
Book
1 online resource.
Graphs are powerful data representations favored in many computational domains. Analytics and knowledge extraction on these data structures has become an area of great interest, particularly as large data sets become commonplace in data centers. Due to the scale of these operations, energy efficiency on frequent tasks such as graph analysis will be very important as the domain and the data sizes continue to grow. Dedicated hardware accelerators are one high-performing yet energy-efficient approach to this problem. Unfortunately, they are notoriously labor-intensive to design and verify while meeting stringent time-to-market goals. This thesis presents GraphOps, a modular hardware library for quickly and easily constructing energy-efficient accelerators for graph analytics algorithms. GraphOps provides a hardware designer with a set of composable graph-specific building blocks broad enough to target a wide array of graph analytics algorithms. The system is built upon a streaming execution platform and targets FPGAs, allowing a vendor to use the same hardware to accelerate different types of analytics computation. Stubborn hardware implementation details such as flow control, input buffering, rate throttling, and host/interrupt interaction are automatically handled and built into the design of the GraphOps, greatly reducing design time. As an enabling contribution, this thesis presents a novel locality-optimized graph data structure that improves the efficiency of memory access to the graph.
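A baseline for the locality-optimized structure mentioned above is the compressed-sparse-row (CSR) layout sketched below, which stores each vertex's neighbors contiguously so edge scans become sequential memory streams; the actual data structure in the thesis refines this further, so the sketch is illustrative only.

```python
# Illustrative compressed-sparse-row (CSR) layout: neighbor lists are stored
# contiguously so scanning a vertex's edges is one sequential memory stream,
# the access pattern a streaming FPGA accelerator prefers.
from typing import List, Tuple

def build_csr(num_vertices: int, edges: List[Tuple[int, int]]):
    """Return (row_ptr, col_idx) arrays for a directed graph."""
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1
    row_ptr = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row_ptr[v + 1] = row_ptr[v] + degree[v]
    col_idx = [0] * len(edges)
    fill = row_ptr[:-1].copy()               # next free slot per vertex
    for src, dst in edges:
        col_idx[fill[src]] = dst
        fill[src] += 1
    return row_ptr, col_idx

def neighbors(row_ptr, col_idx, v: int):
    return col_idx[row_ptr[v]:row_ptr[v + 1]]    # one contiguous slice per vertex

row_ptr, col_idx = build_csr(4, [(0, 1), (0, 2), (1, 2), (2, 3), (3, 0)])
print(neighbors(row_ptr, col_idx, 0))        # [1, 2]
```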
Special Collections
Book
1 online resource.
Datacenter workloads have demanding performance requirements, including the simultaneous need for high throughput, low tail latency, and high server utilization. While modern hardware is compatible with these goals, contemporary operating systems are not. The conventional wisdom is that aggressive networking requirements, such as high packet rates for small messages and microsecond-scale tail latency, are best addressed outside the kernel in a user-level networking stack. In this dissertation, I will first discuss IX, a dataplane operating system that provides high I/O performance while maintaining the key advantage of strong protection offered by existing kernels. IX separates the management and scheduling functions of the kernel (the control plane) from the network processing (the dataplane). The dataplane architecture builds upon a native, zero-copy API and optimizes for both bandwidth and latency by dedicating hardware threads and networking queues to dataplane instances. Each dataplane instance is designed to process bounded batches of packets to completion without depending on coherence traffic or multi-core synchronization. IX outperforms Linux significantly in both throughput and end-to-end latency. For example, IX can improve Memcached's TCP throughput by up to 3.6x. While IX demonstrates that better operating system abstractions can significantly improve performance, deploying these abstractions has become intractable given the size and complexity of today's systems. To address this second challenge, I will present Dune, a kernel extension that allows operating system developers to sidestep software and hardware complexity by running an operating system within an ordinary Linux process. With Dune, developers can build custom library operating systems that safely access the full capabilities of raw hardware while also falling back on the convenience and functionality of a Linux environment. Dune uses CPU virtualization extensions to expose access to privileged instructions without sacrificing existing process security and isolation properties.
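The run-to-completion, bounded-batch processing model that IX's dataplane uses can be sketched roughly as below; `poll_rx_queue`, `handle`, and the batch size are illustrative placeholders rather than IX's real API.

```python
# Sketch of a run-to-completion dataplane loop in the spirit of IX: each instance
# repeatedly pulls a bounded batch of packets from its dedicated queue and processes
# every packet to completion before returning to the NIC. Names are placeholders.
from collections import deque

MAX_BATCH = 64
rx_queue = deque(f"pkt{i}" for i in range(200))   # stands in for a NIC RX ring
tx_queue = deque()

def poll_rx_queue(limit: int):
    batch = []
    while rx_queue and len(batch) < limit:
        batch.append(rx_queue.popleft())
    return batch

def handle(pkt: str) -> str:
    return pkt.upper()                        # stands in for app-level processing

def dataplane_loop():
    while True:
        batch = poll_rx_queue(MAX_BATCH)      # bounded batch keeps tail latency low
        if not batch:
            break                             # a real dataplane would keep polling
        for pkt in batch:                     # run to completion, no cross-core sync
            tx_queue.append(handle(pkt))

dataplane_loop()
print(len(tx_queue))                          # 200
```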
Special Collections
Book
1 online resource.
Application-specific processors exploit the structure of algorithms to reduce energy costs and increase performance. These kinds of optimizations have become more and more important as the historical trends in technology scaling and energy scaling have slowed or stopped. Image processing and computer image understanding algorithms contain the kinds of embarrassingly parallel structures that application-specific processors can exploit. Further, these algorithms have very high compute demands, which makes efficient computation critical. So these specialized processors are found on many SoCs today. Yet, these image processors are hard to design and program, which slows architectural innovation. To address this issue, we leverage the fact that most image applications can be composed as a set of "stencil" kernels and then provide a virtual machine model for stencil computation onto which many applications in the domains of image signal processing, computational photography, and computer vision can be mapped. Stencil kernels are a class of functions (e.g., convolution) in which a given pixel within an output image is calculated from a fixed-size sliding window of pixels in its corresponding input image. This fixed window in the input data, where each data element is reused between concurrent computations, allows for a significant reduction in memory traffic through buffering and provides much of the efficiency in specialized image processors. Additionally, the predictable data flow of stencil kernels allows the producer-consumer relationships between stencil kernels in large applications to be statically determined and exploited, further reducing memory traffic. Finally, the functional nature of the computation and the significant number of times it is invoked allows for the implementation of the computation to be highly optimized. Stencil kernels play a recurring role in image signal processing, computer vision, and computational photography. Any process that creates a filter, constructs low-level image features, evaluates relationships of nearby pixels or features, etc. is implementable as a stencil kernel. Many applications in the domain of image processing and understanding are built by cascading these operations (e.g. filtering noise, looking for local features and local segments, then localizing regions and objects from those segments and features). These applications also play a significant role in society, whether it is to automate the home, car, or factory or to improve the capabilities of our mobile devices in capturing and understanding the world around us. While the computation model may seem restrictive and domain-specific, any improvement in the efficiency of this computation would permeate many fields, increasing the capability and decreasing the cost of innovation and progress. When an application is written in a domain-specific language restricted to stencil computation, it can be compiled to the stencil virtual machine model proposed in this thesis. This model allows for an application's behavior to be specified without knowledge of the underlying system implementation. Conversely, such a model allows for a great degree of flexibility in the implementation of that underlying system, which provides opportunity for optimization. The input to this virtual machine model is an intermediate language called Data Path Description Assembler (DPDA), which represents a compiler target for high-level languages.
While many hardware-software systems implement the virtual machine and execute DPDA, this thesis presents a method to generate fixed-function hardware from DPDA code. The resulting hardware is two orders of magnitude more efficient than a comparable CPU or GPU implementation. This hardware generator greatly reduces the cost of designing customized engines for new imaging applications, and also serves as a critical reference for research exploring the overheads of more flexible compute engines.
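A minimal example of a stencil kernel in the sense defined above, written as plain Python/NumPy rather than DPDA: each output pixel is computed from a fixed 3x3 sliding window of the input, and the comments note where line buffering provides the data reuse that makes hardware implementations efficient.

```python
# A stencil kernel: each output pixel is a function of a fixed-size sliding window
# of input pixels (here, a 3x3 convolution). Plain NumPy illustration of the
# computation model, not DPDA or generated hardware.
import numpy as np

def stencil_3x3(image: np.ndarray, weights: np.ndarray) -> np.ndarray:
    h, w = image.shape
    out = np.zeros((h - 2, w - 2), dtype=image.dtype)
    for y in range(h - 2):
        # In hardware, rows y..y+2 sit in line buffers, so each input pixel is
        # read from DRAM once and reused by every window that overlaps it.
        for x in range(w - 2):
            window = image[y:y + 3, x:x + 3]
            out[y, x] = np.sum(window * weights)
    return out

image = np.arange(36, dtype=np.float32).reshape(6, 6)
blur = np.full((3, 3), 1.0 / 9.0, dtype=np.float32)   # simple box filter
print(stencil_3x3(image, blur).shape)                  # (4, 4)
```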
Book
1 online resource.
The performance of microprocessors has grown by three orders of magnitude since their beginnings in the 1970s; however, this exponential growth in performance has not been achieved without overcoming substantial obstacles. These obstacles were overcome due in large part to the exponential increases in the number of transistors available to architects as transistor technology scaled. Many today call the largest of the hurdles impeding performance gain "walls". Such walls include the Memory Wall, which is memory bandwidth and latency not scaling with processor performance; the Power Wall, which is the processor generating too much heat to be feasibly cooled; and the ILP wall, which is the diminishing return seen when making processor pipelines deeper due to the lack of available instruction level parallelism. Today, computer architects continually overcome new walls to extend this exponential growth in performance. Many of these walls have been circumvented by moving from large monolithic architectures to multi-core architectures. Instead of using more transistors on bigger, more complicated single processors, transistors are partitioned into separate processing cores. These multi-core processors require less power and are better able to exploit data-level parallelism, leading to increased performance for a wide range of applications. However, as the number of transistors available continues to increase, the current trend of increasing the number of homogeneous cores will soon run into a "Capability Wall" where increasing the core count will not increase the capability of a processor as much as it has in the past. Amdahl's law limits the scalability of many applications, and power constraints will make it infeasible to power all the transistors available at the same time. Thus, the capability of a single processor chip to compute more in a given amount of time will stop improving unless new techniques are developed. In this work, we study how to build hardware components that provide new capabilities by performing specific tasks more quickly and with less power than general-purpose processors. We explore two broad classes of such domain-specific hardware accelerators: those that require fine-grained communication and tight coupling with the general-purpose computation and those that require a much looser coupling with the rest of the computation. To drive the study, we examine a representative example in each class. For fine-grained accelerators, we present a transactional memory accelerator. We see that dealing with the latency and lack of ordering in the communication channel between the processor and accelerator presents significant challenges to efficiently accelerating transactional memory. We then present multiple techniques that overcome these problems, resulting in an accelerator that improves the performance of transactional memory applications by an average of 69%. For coarse-grained, loosely coupled accelerators, we turn to accelerating database operations. We observe that, because these accelerators often deal with large amounts of data, one of the key attributes of a useful database accelerator is the ability to fully saturate the bandwidth available to the system's memory. We provide insight into how to design an accelerator that does so by looking at designs to perform selection, sorting, and joining of database tables and how they are able to make the most efficient use of memory bandwidth.
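The bandwidth-saturation argument for database accelerators can be made concrete with a columnar selection sketch: the operator streams a column sequentially once and emits qualifying row IDs, so a well-designed engine is limited only by memory bandwidth. The column layout and predicate below are illustrative.

```python
# Sketch of a streaming selection operator over a columnar table: the column is
# scanned once, sequentially, and only qualifying row IDs are emitted, which is
# why an accelerator for it can be designed to run at full memory bandwidth.
import numpy as np

def select(column: np.ndarray, low: float, high: float) -> np.ndarray:
    """Return row IDs whose value falls in [low, high)."""
    mask = (column >= low) & (column < high)      # one sequential pass
    return np.nonzero(mask)[0]

prices = np.random.uniform(0, 100, size=1_000_000).astype(np.float32)
row_ids = select(prices, 10.0, 20.0)
print(row_ids.size, "matching rows")
```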
Special Collections
Book
1 online resource.
Over the past couple of decades GPUs have enjoyed tremendous scaling in both functionality and performance by focusing on area efficient processing. However, the slowdown in supply voltage scaling has created a new hurdle to continued scaling of GPU performance. This slowdown in voltage scaling has caused power consumption to limit the achievable GPU performance. Since GPUs currently use many of the well-known hardware techniques for reduced power consumption, GPU designers need to start looking at architectural techniques to improve energy efficiency. To enable this exploration, we create an accurate power model of GPU architectures and apply this model to explore a couple of methods to save power. As part of these studies we will look at overdraw (which occurs when a given pixel's value is computed more than once) and thread level redundancy in the shader processor of the GPU. Through the use of our model and GPU performance data, we will show that significant opportunities exist for improving energy efficiency. These studies demonstrate both the utility of our power model, and the potential of architectural changes to make GPUs more energy efficient.
Special Collections
Book
1 online resource.
Cloud computing is at a critical juncture. An increasing amount of computation is now hosted in private and public clouds. At the same time, datacenter resource efficiency, i.e., the effective utility we extract from system resources, has remained notoriously low, with utilization rarely exceeding 20-30%. Low utilization coupled with the lack of scaling in hardware due to technology limitations poses threatening scalability roadblocks for cloud computing. At a high level, two main reasons hinder efficient scalability in datacenters. First, the reservation-based interface through which resources are currently allocated is fundamentally flawed. Users must determine how many resources a new application requires to meet its quality of service (QoS) constraints. Unfortunately, this is extremely difficult, and users tend to overprovision their reservations, resulting in mostly-allocated but lightly-utilized systems. Second, underutilization is aggravated by performance unpredictability, the result of heterogeneity in hardware platforms, interference between applications contending for shared resources, and spikes in input load. Unpredictability results in further resource overprovisioning by users. The focus of this dissertation is to enable efficient, scalable, and performance-aware datacenters with tens to hundreds of thousands of machines by improving cluster management. To this end, we present contributions that address both the system-user interface and the complexity of resource management at scale. These techniques are directly applicable to current systems, with modest design alterations. We first present a new declarative interface between users and the cluster manager that centers around performance instead of resource reservations. This enables users to focus on the high-level performance objectives an application must meet, as opposed to the specifics of how these objectives should be achieved using low-level resources. On the system side, we make two fundamental contributions. First, we design a practical system that leverages data mining to quickly understand the resource requirements of incoming applications in an online manner. We establish that resource management at this scale cannot be solved with the traditional trial-and-error approach of conventional architecture and system design. We show that instead we can introduce data mining principles which leverage the knowledge the system accumulates over time from incoming applications, to significantly benefit both performance and efficiency. We first use this approach in Paragon to tackle the platform heterogeneity and workload interference challenges in datacenter management. The cluster manager relies on collaborative filtering to identify the most suitable hardware platform for a new, unknown application and its sensitivity to interference in various shared resources. We then extend a similar approach to address the larger problem of resource assignment and resource allocation with Quasar. To ensure minimal management overheads, we decompose the problem into four dimensions: platform heterogeneity, application interference, resource scale-up, and scale-out. This enables the majority of applications to meet their QoS targets while operating at 70% utilization on a cluster with several hundred servers. In contrast, a reservation-based system rarely exceeds 15-20% utilization, with worse per-application performance.
Our second contribution pertains to designing scalable scheduling techniques that use the information from Paragon and Quasar to perform efficient and QoS-aware resource allocations. We develop Tarcil, a scalable scheduler that reconciles the high quality of sophisticated centralized schedulers with the low latency of distributed sampling-based systems. Tarcil relies on a simple analytical framework to sample resources in a way that provides statistical guarantees on a job meeting its QoS constraints. It incurs a few milliseconds of scheduling overhead, making it appropriate for highly-loaded clusters, servicing both short- and long-running applications. Finally, we design HCloud, a resource provisioning system for public cloud providers. HCloud leverages the information on the resource preferences of applications to determine the type (e.g., reserved versus on-demand) and size of required instances. The system guarantees high application performance, while securing significant cost savings.
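The sampling idea behind Tarcil can be sketched as scoring a small random sample of servers and keeping the best; the per-server quality scores and fixed sample size below are illustrative, whereas Tarcil sizes the sample analytically to bound the probability of a poor match.

```python
# Sketch of sampling-based placement in the spirit of Tarcil: score a small random
# sample of servers for an incoming job and pick the best, instead of scanning the
# whole cluster. Quality scores and the sample size are illustrative placeholders.
import random

random.seed(0)
NUM_SERVERS = 10_000
# Hypothetical per-server "quality" for this job (e.g., right platform, low interference).
quality = [random.random() for _ in range(NUM_SERVERS)]

def place(sample_size: int = 8) -> int:
    candidates = random.sample(range(NUM_SERVERS), sample_size)
    return max(candidates, key=lambda s: quality[s])

chosen = place()
print(f"chose server {chosen} with quality {quality[chosen]:.3f}")
# With a sample of size k, the probability that the chosen server falls below the
# cluster's q-quantile of quality is q**k, which is how a sampling scheduler can
# offer statistical guarantees on allocation quality.
```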
Book
1 online resource.
Web services are an integral part of today's society, with billions of people using the Internet regularly. The Internet's popularity is in no small part due to the near-instantaneous access to large amounts of personalized information. Online services such as web search (Bing, Google), social networking (Facebook, Twitter, LinkedIn), online maps (Bing Maps, Google Maps, NavQuest), machine translation (Bing Translate, Google Translate), and webmail (GMail, Outlook.com) are all portals to vast amounts of information, filtered to only show relevant results with sub-second response times. The new capabilities enabled by these services are also responsible for huge economic growth. Online services are typically hosted in warehouse-scale computers located in large datacenters. These datacenters are run at a massive scale in order to take advantage of economies of scale. A single datacenter can comprise 50,000 servers, draw tens of megawatts of power, and cost hundreds of millions to billions of dollars to construct. Considering the large number of datacenters worldwide, their total impact is quite significant. For instance, the electricity consumed by all datacenters is equivalent to the output of 30 large nuclear power plants. At the same time, demand for additional compute capacity of datacenters is on the rise because of the rapid growth in Internet users and the increase in computational complexity of online services. This dissertation focuses on improving datacenter efficiency in the face of latency-critical online services. There are two major components of this effort. The first is to improve the energy efficiency of datacenters, which will lower the operational expenses of the datacenter and help mitigate the growing environmental footprint of operating datacenters. The first two systems we introduce, autoturbo and PEGASUS, fall under this category. The second efficiency opportunity we pursue is to increase the resource efficiency of datacenters by enabling higher utilization. Higher resource efficiency leads to significantly increased capabilities without increasing the capital expenses of owning a datacenter and is critical to future scaling of datacenter capacity. The third system we describe, Heracles, targets the resource efficiency opportunity for current and future datacenters. There are two avenues of improving energy efficiency that we investigate. We examine methods of improving energy efficiency of servers when they are running at peak load and when they are not. Both cases are important because of diurnal load variations on latency-critical online services that can cause the utilization of servers to vary from idle to full load in a 24-hour period. Latency-critical workloads present a unique set of challenges that have made improving their energy efficiency difficult. Previous approaches in power management have run afoul of the performance sensitivity of latency-critical workloads. Furthermore, latency-critical workloads do not contain sufficient periods of idleness, complicating efforts to reduce their power footprint via deep-sleep states. In addition to improving energy efficiency, this dissertation also studies the improvement of resource efficiency. This opportunity takes advantage of the fact that datacenters are chronically run at low utilizations, with an industry average of 10%-50% utilization.
Ironically, the low utilization of datacenters is not caused by a lack of work, but rather because of fears of performance interference between different workloads. Large-scale latency-critical workloads exacerbate this problem, as they are typically run on dedicated servers or with greatly exaggerated resource reservations. Thus, high resource efficiency through high utilization is obtained by enabling workloads to co-exist with each other on the same server without causing performance degradation. In this dissertation, we describe three practical systems to improve the efficiency of datacenters. Autoturbo uses machine learning to improve the efficiency of servers running at peak load for a variety of energy efficiency metrics. By intelligently selecting the proper power mode on modern CPUs, autoturbo can improve Energy Delay Product by up to 47%. PEGASUS improves energy efficiency for large-scale latency-critical workloads by using a feedback loop to safely reduce the power consumed by servers at low utilizations. An evaluation of PEGASUS on production Google websearch yields power savings of up to 20% on a full-sized production cluster. Finally, Heracles improves datacenter utilization by performing coordinated resource isolation on servers to ensure that latency-critical workloads will still meet their latency guarantees, enabling other jobs to be co-located on the same server. We tested Heracles on several production Google workloads and demonstrated an average server utilization of 90%, opening up the potential for integer multiple increases in resource and cost efficiency.
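A rough sketch of a PEGASUS-style feedback loop is shown below: measured tail latency is compared against its target and a per-server power cap is nudged up or down accordingly. The SLO, step sizes, and the `measure_tail_latency_ms`/`set_power_cap_watts` hooks are hypothetical placeholders, not the production interface.

```python
# Sketch of a feedback controller in the spirit of PEGASUS: keep measured tail
# latency just under its target by adjusting a per-server power cap.
# All names, thresholds, and step sizes here are hypothetical placeholders.
import random
import time

SLO_MS = 5.0
MIN_CAP, MAX_CAP = 60.0, 120.0

def measure_tail_latency_ms() -> float:
    return random.uniform(3.0, 7.0)          # stand-in for real latency telemetry

def set_power_cap_watts(cap: float) -> None:
    print(f"power cap -> {cap:.1f} W")       # stand-in for RAPL/firmware control

def control_loop(iterations: int = 10) -> None:
    cap = MAX_CAP
    for _ in range(iterations):
        latency = measure_tail_latency_ms()
        if latency > SLO_MS:
            cap = min(MAX_CAP, cap + 5.0)    # violating SLO: give servers more power
        elif latency < 0.8 * SLO_MS:
            cap = max(MIN_CAP, cap - 2.0)    # comfortable slack: save power slowly
        set_power_cap_watts(cap)
        time.sleep(0.01)                     # a real controller runs every few seconds

control_loop()
```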
Special Collections
Book
1 online resource.
This dissertation challenges widely held assumptions about data replication in cloud storage systems. It demonstrates that existing cloud storage techniques are far from being optimal for guarding against different types of node failure events. The dissertation provides novel methodologies for analyzing node failures and designing non-random replication schemes that offer significantly higher durability than existing techniques, at the same storage cost and cluster performance. Popular cloud storage systems typically replicate their data on random nodes to guard against data loss due to node failures. Node failures can be classified into two categories: independent and correlated node failures. Independent node failures are typically caused by independent server and hardware failures, and occur hundreds of times a year in a cluster of thousands of nodes. Correlated node failures are failures that cause multiple nodes to fail simultaneously and occur a handful of times a year or less. Examples of correlated failures include recovery following a power outage or a large-scale network failure. The conventional wisdom to guard against node failures is to replicate each node's data three times within the same cluster, and also geo-replicate the entire cluster to a separate location to protect against correlated failures. The dissertation shows that random replication within a cluster is almost guaranteed to lose data under common scenarios of correlated node failures. Due to the high fixed cost of each incident of data loss, many data center operators prefer to minimize the frequency of such events at the expense of losing more data in each event. The dissertation introduces Copyset Replication, a novel general-purpose replication technique that significantly reduces the frequency of data loss events within a cluster. It also presents an implementation and evaluation of Copyset Replication on two open source cloud storage systems, HDFS and RAMCloud, and shows it incurs a low overhead on all operations. Such systems require that each node's data be scattered across several nodes for parallel data recovery and access. Copyset Replication presents a near optimal trade-off between the number of nodes on which the data is scattered and the probability of data loss. For example, in a 5000-node RAMCloud cluster under a power outage, Copyset Replication reduces the probability of data loss from 99.99% to 0.15%. For Facebook's HDFS cluster, it reduces the probability from 22.8% to 0.78%. The dissertation also demonstrates that with any replication scheme (including Copyset Replication), using two replicas is sufficient for protecting against independent node failures within a cluster, while using three replicas is inadequate for protecting against correlated node failures. Given that in many storage systems the third or n-th replica was introduced for durability and not for performance, storage systems can change the placement of the last replica to address correlated failures, which are the main vulnerability of cloud storage systems. The dissertation presents Tiered Replication, a replication scheme that splits the cluster into a primary and backup tier. The first two replicas are stored on the primary tier and are used to recover data in the case of independent node failures, while the third replica is stored on the backup tier and is used for correlated failures. 
The key insight behind Tiered Replication is that, since the third replicas are rarely read, we can place the backup tier on separate physical infrastructure or a remote location without affecting performance. This separation significantly increases the resilience of the storage system to correlated failures and presents a low-cost alternative to geo-replication of an entire cluster. In addition, the Tiered Replication algorithm optimally minimizes the probability of data loss under correlated failures. Tiered Replication can be executed incrementally for each cluster change, which allows it to support dynamic environments where nodes join and leave the cluster, and it facilitates additional data placement constraints required by the storage designer, such as network and rack awareness. Tiered Replication was implemented on HyperDex, an open-source cloud storage system, and the dissertation demonstrates that it incurs a small performance overhead. Tiered Replication improves the cluster-wide MTTF by a factor of 100,000 in comparison to random replication, without increasing the amount of storage.
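The placement discipline behind Copyset Replication can be sketched as follows: precompute a small number of copysets from random permutations of the nodes and constrain every chunk's replicas to a single copyset, so data is lost only if an entire copyset fails together. Scatter-width constraints and the Tiered Replication extension are omitted from this simplified sketch.

```python
# Simplified sketch of copyset-style placement: generate a few random permutations
# of the nodes and chop each into groups of R; every chunk's replicas then go to
# one of these precomputed copysets instead of R independently random nodes.
import random

def build_copysets(num_nodes: int, r: int, permutations: int):
    copysets = []
    nodes = list(range(num_nodes))
    for _ in range(permutations):
        random.shuffle(nodes)
        for i in range(0, num_nodes - num_nodes % r, r):
            copysets.append(tuple(sorted(nodes[i:i + r])))
    return copysets

def place_chunk(chunk_id: int, copysets) -> tuple:
    rng = random.Random(chunk_id)            # deterministic per chunk for the sketch
    return rng.choice(copysets)              # all R replicas land on one copyset

copysets = build_copysets(num_nodes=1000, r=3, permutations=2)
print(len(copysets), "copysets;", "chunk 42 ->", place_chunk(42, copysets))
```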
Special Collections
Book
1 online resource.
Modern mobile devices are marvels of computation. They can encode high-definition video, processing and compressing over 350MB/s of image data in real time. They have no trouble driving displays with as much resolution as a full laptop, and smartphone manufacturers boast of running games with "console quality" graphics. Mobile devices pack all of this computational power into a 1-2W hand-held package by integrating a number of specialized hardware accelerators (IP) along with conventional CPUs and GPUs in a system-on-chip (SoC). Unfortunately, creating these specialized systems is becoming increasingly expensive. Since hardware accelerators come from a number of different sources and design cycles, different accelerator blocks will often contain incompatible hardware interfaces. Therefore, a large portion of SoC design cost comes in the form of designers manually interfacing each accelerator into a system. This work includes everything from building custom logic to wire up a block, to developing the drivers and API needed to take advantage of the hardware. My research focuses on generating these interfaces, including the physical hardware used to tie IP blocks into a system and the associated software collateral. Leveraging recent trends such as High Level Synthesis and other hardware "generator" methodologies, I propose an IP interface abstraction and parameterization designed to describe the interface of most current IP blocks. By encoding this knowledge at a higher level of abstraction, I am able to construct and demonstrate a hardware generator that maps an interface protocol description into synthesizable register transfer language (RTL), and that can automatically create hardware bridges between different interconnect standards. To ease the integration of the next generation of IP blocks, blocks that are automatically generated from a user specification, I propose a set of interface primitives. When integrated into an IP generator, these primitives can automatically generate an interface that my interface system can tie to the rest of the system. I also demonstrate how the information stored in these types of primitives can be used to automatically generate a low-level software driver that manages access to the IP blocks. Finally, I show how the simulation environment provided with an IP generator can be used to provide a domain-appropriate application programming interface (API) to drive the software. Using an image signal processor generator as my platform, I demonstrate the construction of a map between the simulation software and hardware driver that enables a full one-button flow from algorithm development to applications running on specialized hardware within a working system.
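The interface-abstraction idea can be illustrated with a small declarative description from which software collateral is generated; the field names and the emitted register-map header below are assumptions for illustration, not the dissertation's actual parameterization.

```python
# Sketch of a declarative IP interface description: ports and protocol parameters
# are data, so a tool can emit matching bridges and driver stubs from them.
# Field names and the emitted register map are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Register:
    name: str
    offset: int          # byte offset from the block's base address
    writable: bool

@dataclass
class IPInterface:
    name: str
    bus_protocol: str            # e.g. "AXI4-Lite" vs "APB": selects the bridge
    data_width: int
    registers: List[Register] = field(default_factory=list)

def emit_driver_stub(ip: IPInterface) -> str:
    """Generate a tiny C-style register-map header from the description."""
    lines = [f"/* auto-generated for {ip.name} ({ip.bus_protocol}, {ip.data_width}-bit) */"]
    for reg in ip.registers:
        lines.append(f"#define {ip.name.upper()}_{reg.name.upper()}_OFFSET 0x{reg.offset:04x}")
    return "\n".join(lines)

isp = IPInterface("isp", "AXI4-Lite", 32, [Register("ctrl", 0x0, True),
                                           Register("status", 0x4, False)])
print(emit_driver_stub(isp))
```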
Special Collections
Book
1 online resource.
Datacenters are critical assets for today's Internet giants (Google, Facebook, Microsoft, Amazon, etc.). They host extraordinary amounts of data, serve requests for millions of users, generate untold profits for their operators, and have enabled new modes of communication, socialization, and organization for society at large. Their steady growth in scale and capability has allowed datacenter operators to continually expand the reach and benefit of their services. The motivation for the work presented in this dissertation stems from a simple premise: future scaling of datacenter capability depends upon improvements to server power-efficiency. That is, without improvements to power-efficiency, datacenter operators will soon face a limit to the utility and capability of existing facilities that might result in (1) a sudden boom in datacenter construction, (2) a rapid increase in the cost to operate existing datacenters, or (3) stagnation in the growth of datacenter capability. This limit is akin to the "Power Wall" that the CPU industry has been grappling with for nearly a decade now. Like the CPU "Power Wall," the problem is not that we can't build datacenters with more capability in the future per se. Rather, it will become increasingly difficult to do this economically; even now, the cost to provision power and cooling infrastructure is a substantial fraction of datacenter "Total Cost of Ownership." The root cause of this problem is the recent failure of Dennard scaling for semiconductor technologies smaller than 90 nanometers. Even though Moore's Law continues to march on at a steady pace, granting us exponential growth in transistor count in new processors, we can no longer make full use of that transistor count without also increasing the power density of those new processors. Thus, in order for datacenter operators to sustain the rate of growth in capability that they have come to expect, they must either provision new power and cooling infrastructure to support future servers (at exceptional cost), or find other ways to improve the power-efficiency of datacenters that do not depend on semiconductor technology scaling. Indeed, the initial onset of this problem led to rapid improvements to the efficiency of power delivery and cooling within datacenters, reducing non-server power consumption by an order of magnitude. Unfortunately, those improvements were essentially one-time benefits and have now been exhausted. In this dissertation, we show that most of the future opportunity to improve datacenter power-efficiency lies in improving the power-efficiency of the servers themselves, as most of the inefficiency in the rest of a datacenter has largely been eliminated. Then, we explore four compelling opportunities to improve server power-efficiency: two hardware proposals that explicitly reduce the power consumption of servers, and two software proposals that improve the power-efficiency of servers operating as a cluster. First, we present Multicore DIMM (MCDIMM), a modification to the architecture of traditional DDRx main memory modules optimized for energy-efficiency. MCDIMM modules divide the wide, 64-bit rank interface presented by ordinary DIMMs into smaller rank subsets. By accessing rank subsets individually, fewer DRAM chips are activated per column access (i.e. cache-line refill), which greatly reduces dynamic energy consumption.
Additionally, we describe an energy-efficient implementation of error-correction codes for MCDIMMs, as well as "chipkill" reliability, which tolerates the failure of entire DRAM devices. For ordinary server configurations and across a wide range of benchmarks, we estimate more than 20% average savings in memory dynamic power consumption, though the impact on total system power consumption is more modest. We also describe additional, unexpected performance and static-power consumption benefits from rank subsetting. Second, we propose an architecture for per-core power gating (PCPG) of multicore processors, where the power supply for individual CPU cores can be cut entirely. We propose that servers running at low to moderate utilization, as is common in datacenters, could operate with some of their cores gated off. Gating the power to a core eliminates its static power consumption, but requires flushing its caches and precludes using the core to execute workloads until power is restored. In our proposal, we improve the utility of PCPG by coordinating power gating actions with the operating system, migrating workloads off of gated cores onto active cores. This is in contrast to contemporary industry implementations of PCPG that gate cores reactively. We control the gating of cores with a dynamic power manager which continually monitors CPU utilization. Our OS-integrated approach to PCPG maximizes the opportunities available to utilize PCPG relative to OS-agnostic approaches, and protects applications from incurring the latency of waking a sleeping core. We show that PCPG is beneficial for datacenter workloads, and that it can reduce CPU power consumption by up to 40% for underutilized systems with minimal impact on performance. The preceding hardware proposals seek to improve the power-efficiency of individual servers directly. The improvements are modest, however, as there are many factors that contribute to the inefficiency of servers (e.g., cooling fans, spinning disks, power regulator inefficiency). More to the point, techniques like PCPG only address the power-inefficiency of underutilized CPUs, and do little to address the inefficiency of the rest of the components within a server when it is at low utilization. To address this shortcoming, this dissertation then explores a different tack and holistically assesses how utilization across clusters of servers can be manipulated to improve power-efficiency. First, we describe how contemporary distributed storage systems, such as Hadoop's Distributed File System (HDFS), expect the perpetual availability of the vast majority of servers in a cluster. This artificial expectation prevents the use of low-power modes in servers; we cannot trivially turn servers off or put them into a standby mode without the storage system assuming the server has failed. Consequently, even if such a cluster is grossly underutilized, we cannot disable servers in order to reduce its aggregate power consumption. Thus, these clusters tend to be tragically power-inefficient at low utilization. We propose a simple set of modifications to HDFS to rectify this problem, and show that these storage systems can be built to be power-proportional. We find that running Hadoop clusters in fractional configurations can save between 9% and 50% of energy consumption, and that there is a trade-off between performance and energy consumption. Finally, we set out to determine why datacenter operators chronically underutilize servers which host latency-sensitive workloads.
Using memcached as a canonical latency-sensitive workload, we demonstrate that latency-sensitive workloads suffer substantial degradation in quality-of-service (QoS) when co-located with other datacenter workloads. This encourages operators to be cautious when provisioning or co-locating services across large clusters, and this ultimately manifests as the low server utilization we see ubiquitously in datacenters. However, we find that these QoS problems typically manifest in a limited number of ways: as increases in queuing delay, scheduling delay, or load imbalance of the latency-sensitive workload. We evaluate several techniques, including interference-aware provisioning and replacing Linux's CPU scheduler with a scheduler previously proposed in the literature, to ameliorate QoS problems when co-locating memcached with other workloads. We ultimately show that good QoS for latency-sensitive applications can indeed be maintained while still running these servers at high utilization. Judicious application of these techniques can greatly improve server power-efficiency, and raise a datacenter's effective throughput per TCO dollar by up to 53%. All told, we have found that there exists considerable opportunity to improve the power-efficiency of datacenters despite the failure of Dennard scaling. The techniques presented in this dissertation are largely orthogonal, and may be combined. Through judicious focus on server power-efficiency, we can stave off stagnation in the growth of online services or an explosion of datacenter construction, at least for a time.
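An OS-coordinated per-core power-gating policy of the kind described above can be sketched as a loop that watches utilization, decides how many cores should stay powered, and migrates work off a core before gating it; the `read_utilization`, `migrate_tasks_off`, and `gate_core` hooks are hypothetical placeholders.

```python
# Sketch of an OS-coordinated per-core power-gating (PCPG) policy: a manager
# watches utilization, decides how many cores should stay powered, and migrates
# work off any core before gating it. All hooks are hypothetical placeholders.
import random

NUM_CORES = 8
active = set(range(NUM_CORES))

def read_utilization() -> float:
    return random.uniform(0.1, 0.9)          # stand-in for aggregate CPU utilization

def migrate_tasks_off(core: int) -> None:
    print(f"migrating tasks off core {core}")

def gate_core(core: int) -> None:
    print(f"power-gating core {core}")

def ungate_core(core: int) -> None:
    print(f"waking core {core}")

def manage_once(headroom: float = 0.2) -> None:
    util = read_utilization()
    # Keep enough cores awake to serve current load plus some headroom.
    needed = min(NUM_CORES, max(1, int(NUM_CORES * (util + headroom) + 0.999)))
    while len(active) > needed:
        core = max(active)                   # gate the highest-numbered active core
        migrate_tasks_off(core)
        gate_core(core)
        active.remove(core)
    while len(active) < needed:
        core = min(set(range(NUM_CORES)) - active)
        ungate_core(core)
        active.add(core)

for _ in range(3):
    manage_once()
```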
Special Collections
