CScADS Research Overview
- Conduct research leading to the design and construction of software tools and systems to help applications scale to the petascale and beyond
- Focus on DOE Leadership Class Facilities and parallel systems composed of multicore processors
- Promote application-driven software systems research
- Promote research collaborations with DOE (Oak Ridge, Lawrence Berkeley, Argonne, PERI, APDEC, TASCS, and others), NSF (Teragrid, Cyberinfrastructure), and industry (systems and software vendors)
- Catalyze activities within the computer science community that will lead to visionary new ideas for application development support software
- Focus on interactions with systems vendors, application developers, and library designers
- Promote community vision building through summer workshops
Foster development of new tools by the computer science community through support of common software infrastructure and standards
Research and Development of Software for Leadership Computing
The CScADS research program focuses on strategies for improving the productivity of application developers for developing high-performance codes for leadership computing platforms. Rather than attacking a narrow range of problems within this space, CScADS is exploring a vertically-integrated strategy that spans the entire software stack for petascale systems. Work within the center includes system software for leadership computing platforms, communication libraries, math libraries, open source compilers, performance tool infrastructure, and performance tools. In addition to the software research here, CScADS is involved in application outreach through the "Tiger Teams" of the Performance Engineering Research Institute in addition direct involvement with application teams.
Here, we briefly describe the work in each of these areas.
System Software for Leadership Computing
Our goal is to provide a scalable open-source system software stack for massively parallel supercomputers such as the IBM Blue Gene, Cray XT, and SiCortex, so that these machines can become more responsive to the needs of applications. The system software we focus on includes both the compute node operating system and the I/O communication layer. We intend to demonstrate that a completely open-source system software stack can be provided without impacting, and even improving, application performance, compared to proprietary system software typically provided by hardware vendors. Another benefit of having open-source system software is that it facilitates the infusion of system software research into production systems. Finally, applications benefit when system software problems can be fixed locally and quickly and cooperatively with the involvement of both users and support personnel with access to the source code of the system.
So far we have successfully replaced the Blue Gene/P compute node operating system and I/O layer with our software stack. One of our collaborators, the Falkon group at the University of Chicago, successfully ported their job submission framework to our software stack. The IBM system software stack does not support Falkon well, so our software stack has in effect made it possible for Blue Gene to be employed as a High Throughput Computing (HTC) system. Using Falkon, the University of Chicago team ran DOCK5 (a high-throughput application that can help screen for potentially beneficial drugs) and demonstrated scalability of their system out to 32K nodes. In one experiment, almost one million individual DOCK tasks were completed in just over two hours on 118K CPU cores of IBM Blue Gene/P at Argonne, with a sustained resource utilization of 99.6%. This was achieved thanks to advances in I/O forwarding, caching, and aggregation, which would not have been possible without our flexible and scalable system software stack.
In addition to support for HTC on Blue Gene/P, we have been working on support for High Performance Computing (HPC) as well. We developed support for using Blue Gene/P's collective and torus networks. The DMA device for the torus network, which is used for point to point communication, lacks a gather/scatter capability and thus requires a physically contiguous memory region. To address this issue, we developed a new memory subsystem for Linux, which we call Big Memory. It transparently provides user applications with a physically contiguous memory region as needed by the DMA engine. That region is also covered by extremely large, semi-static TLB entries, greatly reducing the overheads normally associated with paged memory architecture. Big Memory not only enables the network to function, but it also brings application performance to the level normally only achieved by the proprietary IBM compute node kernel. Big Memory has proven attractive beyond its intended use on the compute nodes: it is now in active use on the I/O nodes at ASTRON's Blue Gene/P in the Netherlands, used by the LOFAR radio telescope. It reduced the CPU resources required for online processing on I/O nodes by 500-600%.
We have successfully run the open-source MPICH implementation on our system software stack and the NAS parallel benchmarks showed that there was negligible performance penalty compared to IBM's dedicated compute node operating system. We have also been able to run Berkeley's Unified Parallel C (UPC) on this system software stack.
Partitioned Global Address Space Languages
The Partitioned Global Address Space (PGAS) model, exemplified by the UPC, Co-Array Fortran and Titanium Languages allows programmers to easily express parallelism on complex shared data structures. The language allows such structures to be accessed through global pointers and distributed array expressions, as well as bulk operations based on either high level array copies or (in UPC) explicit memory copies. PGAS programs that are designed and optimized for clusters do most of their communication using bulk operations, but programs written for shared memory hardware often have a fine-grained style. Fine-grained accesses that occur in loops may be amenable to message vectorization, where accesses are combined across iterations, but more irregular communication patterns are usually not amenable to such loop-based optimizations since they either use pointer dereferences or have dynamic access patterns (e.g., table lookup). Instead, compiler algorithms that decrease the number, reduce the volume, and hide the latencies of the message traffic for irregular applications can be very beneficial.
PGAS Languages for Multicore Systems to Petascale Systems
Dual and quad core processors are currently the dominant building block for high end systems, and the number of cores is likely to double with chip density over the next few generations. At the same time, both memory (commodity DRAM) density and off-chip bandwidth may grow at a slower pace, making it desirable to allow sharing of user level data structures between cores on a chip. PGAS languages take advantage of the shared memory and avoid some of the memory footprint costs associated with partitioned address space (message passing) programming models. The Berkeley UPC compiler currently runs on multicore systems and clusters of multicore nodes but the group is exploring a number of extensions to the language, compiler, and runtime system to make effective use of multicore nodes. Under the CScADS project, the group has applied autotuning techniques to the problem of building a highly optimized collective communication library for PGAS languages. Collective communication is critical to the performance of many bulk-synchronous algorithms, whether they are programmed in MPI, UPC, CAF, or one of the languages emerging from the HPCS program. The Berkeley group specifically looked at optimization techniques for the UPC collectives and studied optimizations for fairly different multicore architectures, the Intel Clovertown and Sun Niagra2. We developed highly optimized and scalable collective implementations for shared memory and found that distributed memory algorithms such as trees are often useful as the core count grows. The choice of tree structure and communication is highly dependent on the machine size, collective routine, and data size, so they developed a prototype autotuning framework to automatically select optimized implementations.
UPC has a novel set of synchronization modes associate with its collective operations to allow a set of collectives to be pipelined and overlapped with other work if desired. This can result in some semantic surprises as users may modify the address space locally and remotely while a collective is in progress, and this includes data involved in the collective. To enable performance for sophisticated programmers while giving the simplest behavior for beginning programmers, UPC offers relaxed and strict synchronization modes on collectives. The performance implications of these were not well understood, so the Berkeley team did a number of experiments comparing multiple implementations of the two different semantics. Experiments showed that relaxed tree-based collectives give the best performance on a large multicore machine, although flat (one to many or many to one) implementations are sometimes better on smaller systems.
The Berkeley group is also developing an optimized implementation of the basic GASNet communication layer for Petascale systems such as the BlueGene architecture, which has previously been supported only by an MPI implementation of GASnet, which is not very efficient, and by an IBM Research prototype, which is not available outside IBM. GASnet underlies multiple PGAS language compilers (Berkeley UPC, Intrepid gcc/upc, Rice CAF, Berkeley Titanium, and Cray Chapel). Work to understand the performance and semantic differences between MPI and GASNet is an ongoing area of activity across the CScADS project, and one that helps improve both MPI and GASNet.
Co-array Fortran Language and Implementation
In a partnership with the Center for Programming Models for Scalable Parallel Computing, we are working to develop a second-generation compiler for Coarray Fortran. To enable CScADS to deliver a more robust source-to-source Coarray Fortran compiler that could be used by application scientists, we are working with LLNL’s Rose compiler team to develop software infrastructure to support a source-to-source Coarray Fortran compiler based on Rose. The focus of CScADS efforts in this area have been to work with LLNL to complete Rose’s language support for Fortran. As part of this work, we have added support for Fortran modules to support separate compilation of Fortran programs. This work included synthesis of module interface specifications, adding support to import interfaces from external modules, and creating regression tests for Rose’s Fortran support.
At present, we are in the process of incrementally adding Coarray Features to LANL’s Java-based OpenFortran parser, which is used as the Fortran front-end by Rose, along with support for source-to-code generation in Fortran 90.
Numerical Libraries for Multicore Parallel Systems
Tennessee and UC Berkeley have been collaborating on re-engineering numerical libraries for parallel systems. In this work, a case study of matrix factorizations is studied to explore a new software model for future HPC systems. This work explores the use of multithreading to tolerate synchronization latency in the context of matrix factorization. The model relies on dynamic, dataflow-driven execution models and avoids both global synchronization and the implicit point-to-point synchronization of send/receive style message passing. Highly asynchronous codes are a good fit for the massive amount of concurrency present in multicore machines. Our prototype codes successfully managed to hide algorithmic and communication latencies and so deliver high performance. They are especially advantageous on smaller problem sizes and larger degree parallelism, because they avoid some of the overheads of the traditional bulk-synchronous models. We intend to further explore this programming paradigm for two-sided linear algebra algorithms (e.g., eigenvalue problems) and sparse matrix algorithms, where scalability is even more challenging and the avoidance of synchronization costs should have an even higher payoff.
The Berkeley team also made progress in delivering self-tuning libraries to the user community in the form of a multicore/SMP extension of the OSKI sparse matrix library called pOSKI. Whereas OSKi tunes for register, caches, and some SIMD accelerators, pOSKI also tunes for the number of threads and adds thread count and blocking as well as explicit software prefetch. The ideas build on work by Sam Williams on optimizations for multicore, which was funded in part by the PERI SciDAC project; CScADS has supported the implementation of these optimizations in the OSKI autotuning framework to make it easier for users to benefit from the ideas.
Compiler Technology for High Performance Computing
Within CScADS, research on compiler technology includes activities that range from improvements to compiler optimization for loop-based scientific computations, source-to-source compilation for tuning memory hierarchy performance in scientific codes, to analysis and optimization of scripting languages. We briefly summarize work on these activities
Source-to-source optimization for improving memory hierarchy performance. Over the last several years, Rice has been developing LoopTool - a compiler-based tool that helps expert human programmers improve the performance of Fortran loop nests by applying complex patterns of transformations to tailor the loop nests for a target microprocessor. LoopTool automates the application of well-known source-to-source transformations that improve data reuse at various levels of the memory hierarchy, adjust instruction mix, and generate code that can be scheduled more efficiently by a conventional Fortran compiler. LoopTool has been used to improve the performance of memory intensive loop nests in S3D. Looptool-generated code is now part of the production version of S3D.
To use LoopTool, one takes a Fortran procedure and annotates the code with directives that specify a transformation recipe. LoopTool then applies the recipe to perform both the explicitly-specified transformations along with other supporting transformations that need not be specified explicitly. In FY08, Texas State University has partnered with Rice to create a hardened Linux version of LoopTool that can be distributed as open source software that can be used by application teams to help address performance problems identified in their codes. This work is nearing completion and a version of LoopTool should distributable as open source software in 2009.
Other efforts include:
- Dynamic optimization of complex applications. Sophisticated scientific applications are often assembled out of a diverse set of components constructed by different software teams. The Common Component Architecture (CCA) was devised to aid in assembling such applications. A problem with component-based approaches to software is that abstraction boundaries between components can be costly. Today, avoiding excessive costs at component boundaries forces developers to write coarser-grain components than they might like. CScADS has been exploring the use of dynamic program optimization for CCA applications, namely analyzing and optimizing software at run time when all component interfaces are bound and full information about the bindings is available. To explore the potential for optimizations at this level, we have been experimenting with interprocedural optimization of SIDL and Babel code, which serves as the interface between components in CCA applications, using the LLVM compiler infrastructure. Tests using the TTSTT mesh benchmarks from the Center for Component Technologies in Terascale Simulation Software have shown the promise of this approach. With interprocedural optimization, we have been able to reduce the overhead of fine-grain operations from 5.5x to 3.5x. Much of the remaining overhead is due to the cost of allocating temporary objects. Experiments with a fast allocator dropped the overhead to 2.5x. Further work is needed to improve upon our preliminary work and turn out a tool that is capable of performing such optimization for production codes.
- Compiler optimizations for stencil calculations Rice has developed a novel technique for detecting redundant computations where the redundancy occurs on different iterations of a loop under different names. These redundancies arise routinely in array address calculations and computations on array-element values (such as stencil or wave front techniques). We developed a code transformation based on algebraic re-association, which using commutativity and associativity to rearrange expressions in ways that expose additional opportunities for optimization.
- Analysis and optimization of scripting languages. The difficulty of developing sophisticated high performance parallel applications is well known. This DOE Office of Science, the NNSA, and the NSA have been struggling with aspects of this problem for years. Better software technologies to accelerate development of high performance applications for leadership computing platforms would be welcomed by the application teams. At the FY07 CScADS Summer Workshop on Libraries and Algorithms for Petascale Applications, application teams indicated that they would prefer to program in high-level scripting languages. The popularity of scripting languages among developers of scientific application has been growing year after year. To support this goal, CScADS is exploring compiler and run-time technology that will enable scripting languages to be used for high performance computing.
Performance Tools for Scalable Parallel Systems
CScADS work on performance tools consists of three parts: partial support for Rice’s work on the HPCToolkit performance tools for leadership-class machines, joint work by Rice and Wisconsin on development of a common infrastructure for performance tools, and partial support of the Jumpshot performance visualization system. Here, we provide an overview of work on these efforts.
As part of CScADS research and development, Rice and Wisconsin have been leading the development of a series of performance-tool components. These components will become part of a community infrastructure for performance tools. By creating a shared tools infrastructure, we hope to accelerate the development of sophisticated performance tools for leadership computing platforms. At Wisconsin, a key focus of this work has been on the “deconstruction of Dyninst” – that is, the identification of key abstract components, design of interfaces for these components, producing libraries that isolate the new components, and then restructuring Dyninst to use these new libraries. The result of such an effort is to allow tool builders more flexibility in reusing and sharing tool technology and reducing re-implementations. At Rice, the focus of this work has included componentization of existing aspects of the HPCToolkit performance tools, as well as development of new tool capabilities with an eye toward components. To date, efforts focused on components for several purposes including:
- InstructionAPI. Wisconsin has completed the first version of the InstructionAPI, the part of binary code parsing that disassembles machine code and provides an abstract representation of each instruction. This interface provides a foundation for performance tools that interact with programs at the binary level.
- Stack unwinding. To understand the context in which applications incur performance costs, both Rice’s HPCToolkit and Wisconsin’s Dyninst unwind the call stack of an executing program. Currently, unwinders for both Cray XT and Blue Gene/P architectures are operational with high accuracy. Eventually, this work will be incorporated into the StackUnwindAPI – an application programming interface and library for call stack unwinding, which is under development at Wisconsin.
- Dyninst. Wisconsin ported Dyninst (and all related libraries) to add 64-bit support for Power/AIX and is in the last stages of the PowerPC/Linux port. Wisconsin also developed a general framework for integrated support of 32 and 64 bit platforms. This integrated support includes the ability to run on 32-bit kernels with 32-bit applications and on 64-bit kernels with 32- and 64-bit applications.
- libmonitor. libmonitor is a library that provides a layer upon which to build debugging and performance tools. libmonitor intercepts key operations in a program execution, including begin and end of processes, begin and end of threads, fork, exec, MPI initialization and finalization, dynamic library loading, and signal handling.
Sampling-based Performance Tools for DOE Computing Platforms
The HPCToolkit project at Rice University is focused on the design and implementation of techniques for providing accurate fine-grain measurements of production applications running at scale.
Call path profiling
For tools to be useful on production applications at scale, large measurement overhead is unacceptable. For measurements to be accurate, performance tools must avoid introducing measurement error.
To address these challenges, HPCToolkit avoids instrumentation and favors the use of statistical sampling to measure and attribute performance metrics. During a program execution, sample events are triggered by periodic interrupts induced by an interval timer or overflow of hardware performance counters. One can sample metrics that reflect work (e.g., instructions, floating-point operations), consumption of resources (e.g., cycles, memory bus transactions), or inefficiency (e.g., stall cycles).
For all but the most trivially structured programs, it is important to associate the costs incurred by each procedure with the contexts in which the procedure is called. Knowing the context in which costs are incurred is essential for understanding performance. This is particularly important for code based on application frameworks and libraries. For instance, costs incurred for calls to communication primitives (e.g., MPI_Wait) or code that results from instantiating C++ templates for data structures can vary widely depending how they are used in a particular context. Because there are often layered implementations within applications and libraries, it is insufficient either to insert instrumentation at any one level or to distinguish costs based only upon the immediate caller. For this reason, HPCToolkit supports call path profiling to attribute costs to the full calling contexts in which they are incurred.
Delivering sampling-based performance tools for all DOE computing platforms, especially the leadership-class systems, has been a principal goal of Rice’s HPCToolkit project. During FY07 and FY08, CScADS worked to engage the tools community (including developers of hardware performance counter device drivers, kernel developers at Cray and IBM, and the PAPI team at Tennessee) to bring these capabilities to life. As of the spring of 2009, sampling based performance monitoring is available on both Cray's Compute Node Linux, and IBM's Compute Node Kernel.
Recovering static program structure
To enable effective analysis, measurements of fully optimized programs must be correlated with important source code abstractions. Since measurements are made with reference to executables, it is necessary to map measurements back to the program source. To associate sample-based performance measurements with the static structure of fully-optimized executables, we need a mapping between object code and its associated source code structure. HPCToolkit constructs this mapping using binary analysis; we call this process "recovering program structure".
HPCToolkit focuses its efforts on recovering procedures and loop nests, the most important elements of source code structure. To recover program structure, HPCToolkit's hpcstruct utility parses a load module's machine instructions, reconstructs a control flow graph, combines line map information with interval analysis on the control flow graph in a way that enables it to identify transformations to procedures such as inlining and account for transformations to loops.
Two important benefits naturally accrue from this approach. First, HPCToolkit can expose the structure of and assign metrics to what is actually executed, even if source code is unavailable. For example, hpcstruct's program structure naturally reveals transformations such as loop fusion and scalarized loops that arise from compilation of Fortran 90 array notation. Similarly, it exposes calls to compiler support routines and wait loops in communication libraries of which one would otherwise be unaware. Second, we combine (post-mortem) the recovered static program structure with dynamic call paths to expose inlined frames and loop nests. This enables us to attribute the performance of samples in their full static and dynamic context and correlate it with source code.
Pinpointing scaling losses
To pinpoint and quantify scalability bottlenecks in context, we use HPCToolkit to compute a metric that quantifies scaling loss by scaling and differencing call path profiles from a pair of executions. Consider two parallel executions of an application, one executed on p processors and the second executed on q > p processors. In a weak scaling scenario, processors in each execution compute on the same size data. If the application exhibits perfect weak scaling, then the execution times should be identical on both q and p processors. In fact, if every part of the application scales uniformly, then this equality should hold in each scope of the application.
By differencing a pair of call path profiles, we can use this technique to pinpoint and quantify where parallel overhead has arisen when the execution is scaled. This analysis technique can be applied to weak scaling or strong scaling, within or across nodes in a cluster.
Our analysis is able to distinguish between different causes for performance losses. For example, an analysis using standard time-based sampling is sufficient to precisely distinguish MPI communication bottlenecks from computational bottlenecks. With hardware performance counters, one can distinguish between different architectural bottlenecks such as floating point pipeline stalls, memory bandwidth, or memory latency.
A User Interface for analyzing performance data
Rice built the hpcviewer user interface for the HPCToolkit performance tools to support analysis of call path profiles of parallel programs. Call path profiles are particularly useful for identifying bottlenecks in large, multi-lingual, layered, modular codes.
Hpcviewer's browser window is divided into three panes: the navigation pane, the source pane, and the metrics pane. The navigation pane enables one to rapidly explore performance metrics attributed to lines, loops, procedures, calling contexts, and inlined code. hpcviewer supports three principal views of an application's performance data: a top-down calling context view, a bottom-up callers view, and a flat view. The calling context view associates costs with the dynamic calling contexts (call paths) in which they were incurred. The caller’s view enables one to look upward along call paths. It is particularly useful for understanding how costs are incurred by software components or procedures that are used in more than one context, e.g. MPI_Wait. Last but not least, the flat view organizes performance measurement data according to the static structure of an application. All costs incurred in any calling context by a procedure are aggregated together. The flat view enables one to compare the distribution of costs among scopes program wide and examine the attribution of costs to individual source lines and loops within a scope. hpcviewer also supports a variety of features for aiding analysis, such as support for synthesizing new metrics, ranking scopes by metrics, and automatically identifying important call paths through code.
Understanding the temporal evolution of parallel program executions
In FY08, Rice developed hpctraceview - a prototype visualization tool for HPCToolkit for examining space-time diagrams of execution traces of parallel programs. Although space-time visualizations of traces have been used by other tools, the nature of hpctraceview's visualizations and the data upon which they are based is rather different.
Other tools for rendering execution traces of parallel programs rely on embedded program instrumentation that synchronously records information about the entry and exit of program procedures, communication operations, and/or program phase markers. Each time line in hpctraceview represents a sequence of asynchronous samples taken over the life of a thread or process. Each sample for a thread represents the entire stack of the thread's active procedures at the instant when the sample event occurred. Each color on the time lines represents a different procedure. Unlike other trace visualizers, hpctraceview's visualizations are hierarchical. Since each sample in each thread timeline represents a call stack, we can view the thread timelines at different call stack depths. The prototype implementation of hpctraceview has shown that visualizing traces of call stack samples for parallel programs yields insight into temporal behavior of programs. We will explore how to scale this tool out to handle thousands of processes for long running executions. This involves addressing two challenges: managing out of core trace data (full traces for thousands of processes will be too large to represent in memory at once), and developing a useful compressive mappings so that one can analyze data from more processes than lines in a user’s display. Since the data in the time dimension is already sampled, it seems reasonable to consider displaying sampled process information along the vertical axis as well.
Trace-based Monitoring and Analysis of MPI Programs
Argonne enhanced their MPI performance analysis tools to better meet the challenges of leadership computing platforms based on multicore processors. While initial ports of DOE applications to these machines typically use these cores to run separate MPI processes on each core (a situation that a number of performance analysis tools can handle), obtaining a substantial fraction peak performance will require explicit multithreading by applications either with the Pthreads API or with OpenMP compilers.
Argonne modified MPI’s MPE logging mechanisms to capture logs from separate threads and display them in Jumpshot, Argonne’s parallel timeline display tool. Jumpshot has also been incorporated into the TAU performance analysis toolkit. Argonne ported the MPE logging library to the IBM BG/P leadership computing platform and tested it with both OpenMP and Pthreads. Beyond the leadership computing platforms, MPE logging and Jumpshot have been tested on the NEC SX-8, Cray X1E, IBM AIX platforms, and Argonne’s SiCortex platform.
Argonne also developed a library (FPMPI2) for collecting statistics on MPI functions, including wait time due to delayed sends and opposed to wait time due to communication latency and bandwidth.