Apan Qasem (2007)
Automatic Tuning of Scientific Applications
PhD thesis, Rice University, Department of Computer Science.
Over the last several decades we have witnessed tremendous change in the landscape of computer architecture. New architectures have emerged at a rapid pace with computing capabilities that have often exceeded our expectations. However, the rapid rate of architectural innovations has also been a source of major concern for the high performance computing community. Each new architecture or even a new model of a given architecture has brought with it new features that have added to the complexity of the target platform. As a result, it has become increasingly difficult to exploit the full potential of modern architectures for complex scientific applications. The gap between the theoretical peak and the actual achievable performance has increased with every step of architectural innovation. As multi-core platforms become more pervasive, this performance gap is likely to increase. To deal with the changing nature of computer architecture and its ever increasing complexity, application developers laboriously retarget code, by hand, which often costs many person-months even for a single application. To address this problem, we developed a software-based strategy that can automatically tune applications to different architectures to deliver portable high-performance.
This dissertation describes our automatic tuning strategy. Our strategy combines architecture-aware cost models with heuristic search to find the most suitable optimization parameters for the target platform. The key contribution of this work is a novel strategy for pruning the search space of transformation parameters. By focusing on architecture-dependent model parameters instead of transformation parameters themselves, we show that we can dramatically reduce the size of the search iii space and yet still achieve most of the benefits of the best tuning possible with exhaustive search. We present an evaluation of our strategy on a set of scientific applications and kernels on several different platforms. The experimental results presented in this dissertation suggest that our approach can produce significant performance improvement on a range of architectures at a cost that is not overly demanding.