Design and optimization of a stencil engine [electronic resource]
- John S. Brunhaver, II.
- Physical description: 1 online resource.
- Brunhaver, John S., II.
- Horowitz, Mark, primary advisor.
- Kozyrakis, Christoforos, 1974- advisor.
- Olukotun, Oyekunle Ayinde, advisor.
- Stanford University. Department of Electrical Engineering.
- Application-specific processors exploit the structure of algorithms to reduce energy costs and increase performance. These kinds of optimizations have become more and more important as the historical trends in technology scaling and energy scaling have slowed or stopped. Image processing and computer image understanding algorithms contain the kinds of embarrassingly parallel structures that application-specific processors can exploit. Further, these algorithms have very high compute demands, which makes efficient computation critical, so these specialized processors are found on many SoCs today. Yet these image processors are hard to design and program, which slows architectural innovation. To address this issue, we leverage the fact that most image applications can be composed as a set of "stencil" kernels and then provide a virtual machine model for stencil computation onto which many applications in the domains of image signal processing, computational photography, and computer vision can be mapped. Stencil kernels are a class of functions (e.g. convolution) in which a given pixel within an output image is calculated from a fixed-size sliding window of pixels in its corresponding input image. This fixed window in the input data, where each data element is reused between concurrent computations, allows for a significant reduction in memory traffic through buffering and provides much of the efficiency in specialized image processors. Additionally, the predictable data flow of stencil kernels allows the producer-consumer relationships between stencil kernels in large applications to be statically determined and exploited, further reducing memory traffic. Finally, the functional nature of the computation and the significant number of times it is invoked allow the implementation of the computation to be highly optimized. Stencil kernels play a recurring role in image signal processing, computer vision, and computational photography.
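The sliding-window pattern described above can be sketched in a few lines of Python (a hypothetical illustration only, not the thesis's DPDA representation): every output pixel is the result of applying a function to a fixed 3x3 window of the input, and each input pixel is reused by up to nine overlapping windows — the reuse that on-chip buffering exploits.

```python
def stencil_3x3(image, f):
    """Apply f to every 3x3 sliding window of a 2D list-of-lists image.

    Each input pixel is read by up to nine overlapping windows; this
    data reuse is what buffering exploits in specialized hardware.
    """
    h, w = len(image), len(image[0])
    out = [[0] * (w - 2) for _ in range(h - 2)]
    for y in range(h - 2):
        for x in range(w - 2):
            window = [row[x:x + 3] for row in image[y:y + 3]]
            out[y][x] = f(window)
    return out

def box_filter(window):
    """A simple stencil function: the mean of the 3x3 window."""
    return sum(sum(row) for row in window) // 9

img = [[x for x in range(6)] for _ in range(4)]  # 4x6 test image
blurred = stencil_3x3(img, box_filter)           # 2x4 output
```

Any function of the window — a convolution, a gradient, a local feature detector — can be substituted for `box_filter` without changing the traversal structure, which is why the pattern maps so cleanly onto fixed-function hardware.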
Any process that creates a filter, constructs low-level image features, evaluates relationships of nearby pixels or features, etc. is implementable as a stencil kernel. Many applications in the domain of image processing and understanding are built by cascading these operations (e.g. filtering noise, looking for local features and local segments, then localizing regions and objects from those segments and features). These applications also play a significant role in society, whether it is to automate the home, car, or factory, or to improve the capabilities of our mobile devices in capturing and understanding the world around us. While the computation model may seem restrictive and domain-specific, any improvement in the efficiency of this computation would permeate many fields and society, increasing the capability and decreasing the cost of innovation and progress. When an application is written in a domain-specific language restricted to stencil computation, it can be compiled to the stencil virtual machine model proposed in this thesis. This model allows an application's behavior to be specified without knowledge of the underlying system implementation. Conversely, such a model allows a great degree of flexibility in the implementation of that underlying system, which provides opportunities for optimization. The input to this virtual machine model is an intermediate language called Data Path Description Assembler (DPDA), which represents a compiler target for high-level languages. While many hardware-software systems can implement the virtual machine and execute DPDA, this thesis presents a method to generate fixed-function hardware from DPDA code. The resulting hardware is two orders of magnitude more efficient than a comparable CPU or GPU implementation.
This hardware generator greatly reduces the cost of designing a customized engine for new imaging applications, and it also serves as a critical reference for research exploring the overheads of more flexible compute engines.
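The buffering and producer-consumer ideas in the abstract can be illustrated with a small Python sketch (hypothetical; the thesis's actual mechanism is DPDA compiled to hardware): a streaming 3x3 stencil keeps only the last three input rows in a line buffer, and two stages chained as generators consume each other's rows as they are produced, so the intermediate image is never written back to main memory.

```python
from collections import deque

def stream_stencil_3x3(rows, f):
    """Streaming 3x3 stencil: consume input rows one at a time, hold
    only the last three in a line buffer, and yield each output row as
    soon as its three window rows are available."""
    buf = deque(maxlen=3)  # line buffer: the only state the stencil keeps
    for row in rows:
        buf.append(row)
        if len(buf) == 3:
            w = len(row)
            yield [f([r[x:x + 3] for r in buf]) for x in range(w - 2)]

def box3(window):
    """Mean of a 3x3 window (stand-in for any stencil function)."""
    return sum(sum(r) for r in window) // 9

# Producer-consumer chaining: stage2 pulls rows from stage1 on demand,
# so the once-blurred intermediate image is never fully materialized.
img = [[9] * 8 for _ in range(6)]            # 6x8 constant test image
stage1 = stream_stencil_3x3(iter(img), box3)
stage2 = stream_stencil_3x3(stage1, box3)
result = list(stage2)                        # 2 rows x 4 columns
```

In software this chaining saves memory traffic via lazy evaluation; the thesis's hardware generator achieves the analogous effect by statically wiring producer stages to consumer stages through line buffers.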
- Submitted to the Department of Electrical Engineering.
- Thesis (Ph.D.)--Stanford University, 2015.