Wei-Yu Chen (2007)
Optimizing Partitioned Global Address Space Programs for Cluster Architectures
PhD thesis, University of California, Berkeley, Computer Science Division.
Unified Parallel C (UPC) is an example of a partitioned global address space language for high performance parallel computing. This programming model enables applications to be written in a shared memory style while still allowing the programmer to control data layout and the assignment of work to processors. An open question is whether programs written in a simple style, with fine-grained accesses and blocking communication, can achieve performance approaching that of hand-optimized code, especially in cluster environments with high network latencies.
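To make the fine-grained style concrete, the following is a minimal sketch (illustrative only, not taken from the dissertation; the array names, the BLK blocking factor, and the scale routine are all assumptions) in which each thread reads its neighbor's block of a shared array element by element. On a cluster, every such dereference of a remote element becomes a separate small, blocking network message:

    #include <upc_relaxed.h>

    #define BLK 256
    shared [BLK] double A[BLK*THREADS];  /* one contiguous block per thread */
    shared [BLK] double B[BLK*THREADS];

    /* Fine-grained, blocking style: each reference to the neighbor's
       element costs one network round trip on a cluster. */
    void scale(double alpha)
    {
        int i, next = (MYTHREAD + 1) % THREADS;
        for (i = 0; i < BLK; i++)
            B[MYTHREAD*BLK + i] = alpha * A[next*BLK + i];
    }

Written this way, the loop is as simple as shared-memory code, but it pays one message latency per element transferred.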
This dissertation proposes an optimization framework for UPC that automates the transformations currently performed manually by programmers. The goals of the optimizations are twofold: to aggregate fine-grained remote accesses, reducing both the number of messages and the total communication volume, and to overlap communication with computation in order to hide network latency. The framework first applies communication vectorization and strip-mining to optimize regular array accesses in loops. For irregular fine-grained accesses, it applies a partial redundancy elimination framework that also generates split-phase communication. The last phase targets the blocking bulk transfers in the program, using runtime support to schedule them automatically and achieve overlap; message aggregation is performed as part of the scheduling to further reduce communication overhead. Finally, we present several techniques for optimizing the serial performance of a UPC program, in order to reduce the overhead of both UPC-specific constructs and our source-to-source translation.
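As a hedged illustration of the kind of code such a framework would produce, the sketch below hand-applies two of these transformations to the fine-grained loop above: the per-element remote reads are vectorized into a single bulk get, and the transfer is issued split-phase so that independent computation can overlap it. It is written against the non-blocking bupc_memget_async/bupc_waitsync bulk-transfer extensions of the Berkeley UPC runtime; the buffer and function names are again assumptions made for the example:

    #include <upc_relaxed.h>
    #include <bupc_extensions.h>  /* Berkeley UPC split-phase extensions */

    #define BLK 256
    shared [BLK] double A[BLK*THREADS];
    shared [BLK] double B[BLK*THREADS];

    void scale_opt(double alpha)
    {
        double buf[BLK];
        int i, next = (MYTHREAD + 1) % THREADS;

        /* Initiate: one aggregated, non-blocking bulk get of the
           neighbor's entire block (communication vectorization). */
        bupc_handle_t h =
            bupc_memget_async(buf, &A[next*BLK], BLK * sizeof(double));

        /* ...independent local computation can overlap the transfer... */

        bupc_waitsync(h);  /* complete the transfer before using the data */
        for (i = 0; i < BLK; i++)
            B[MYTHREAD*BLK + i] = alpha * buf[i];
    }

The initiate/wait split is what creates the overlap window: between bupc_memget_async and bupc_waitsync, the network moves the whole block in one message while the processor remains free for other work.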
The optimization framework has been implemented in the Berkeley UPC compiler and tested on a number of supercomputer clusters. The optimizations are validated on a variety of benchmarks exhibiting different communication patterns, from bulk-synchronous kernels to code with dynamic shared data structures. Experimental results reveal that the framework delivers performance comparable to aggressive manual optimization, and achieves significant speedups over the fine-grained, blocking-communication code that programmers find much easier to write. The framework is completely transparent to the user, and therefore improves productivity by freeing programmers from the details of communication management.