
Abstracts for Performance Tools for Petascale Computing

by John Mellor-Crummey last modified 2009-07-10 07:55
Monday, July 20, 2009
9:00 AM
The Latest and Greatest in the Dyninst Binary Code Toolkit
  Matt Legendre, Bill Williams, Madhavi Krishnan, and Drew Bernat,
University of Wisconsin

We will present a brief overview of deconstructing Dyninst and building independent components. We will talk briefly about the components we have already developed, such as SymtabAPI, InstructionAPI, DepgraphAPI, and StackwalkerAPI, and about new components on the horizon, such as ParsingAPI. We will discuss the current status and goals of the binary rewriting tool, a new addition to the toolkit, and describe our initial thoughts on components we intend to produce, namely a process control library and an interface for evaluating instruction semantics.

In the second part of the talk, we will describe a new idiom called group file operations for performing file operations on groups of files on distributed hosts. We will discuss the design of TBON-FS (Tree-Based Overlay Network File System), a scalable distributed file system that supports group file operations on thousands of distributed files. We will demonstrate how group file operations were used to create several parallel tools, such as parallel top (ptop) and parallel rsync, for the management of distributed systems.
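The group-file-operation idiom can be sketched in a few lines: the tool writer issues one logical operation and the group fans it out to every member file. This is only a local Python illustration of the idea, not the TBON-FS API; ordinary local files stand in for files on distributed hosts, and the `FileGroup` class and toy "ptop" are hypothetical.

```python
import os
import tempfile

class FileGroup:
    """A minimal sketch of group file operations: one logical
    read/write applied to every member file, with results aggregated."""

    def __init__(self, paths):
        self.paths = list(paths)

    def read_all(self):
        # One logical read -> one physical read per member file.
        results = {}
        for p in self.paths:
            with open(p) as f:
                results[p] = f.read()
        return results

    def write_all(self, data):
        # One logical write broadcast to every member file.
        for p in self.paths:
            with open(p, "w") as f:
                f.write(data)

# Toy "ptop": read a load-average file from each simulated "host"
# and report the busiest one.
d = tempfile.mkdtemp()
paths = []
for i, load in enumerate(["0.10", "0.95"]):
    p = os.path.join(d, f"host{i}_loadavg")
    with open(p, "w") as f:
        f.write(load)
    paths.append(p)

group = FileGroup(paths)
loads = {p: float(v) for p, v in group.read_all().items()}
busiest = max(loads, key=loads.get)
print(os.path.basename(busiest))  # → host1_loadavg
```

In TBON-FS the fan-out happens over a tree-based overlay network, so the single logical operation scales to thousands of files rather than looping on one node as this sketch does.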


10:00 AM New Developments around Scalasca
  Felix Wolf,
Juelich Supercomputing Centre

Scalasca is an open-source toolset that can be used to analyze the performance behavior of parallel applications and to identify opportunities for optimization. It has been specifically designed for use on large-scale systems including IBM Blue Gene and Cray XT, but is also well-suited for small- and medium-scale HPC platforms. Scalasca integrates runtime summaries with in-depth studies of concurrent behavior via event tracing, adopting a strategy of successively refined measurement configurations. In this talk, we will discuss new developments around Scalasca including a semantic runtime compression method for time-series call-path profiles and will give updates on SIONlib, a library for efficient parallel task-local I/O, and CUBE, a generic profile browser.


10:30 AM Memory Subsystem Profiling with the Sun Studio Performance Analyzer (with demo)
  Marty Itzkowitz, Sun Microsystems

In this talk, I will explain why memory subsystem performance is crucial to application performance, and why it is hard to understand. I will briefly review hardware counter profiling and dataspace profiling. Then I will describe new techniques for slicing and dicing the data differently to directly address issues related to the memory subsystem, and give an example of detecting false sharing of a cache line. Finally, I will talk about potential future development in the tools.
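The false-sharing signature the talk mentions can be illustrated with a small sketch: different threads writing distinct addresses that fall on the same cache line. This is not the Analyzer's algorithm; the 64-byte line size and the `(thread_id, address)` sample format are assumptions for illustration.

```python
def false_sharing_candidates(writes, line_size=64):
    """Flag cache lines written by multiple threads at distinct offsets.

    writes: list of (thread_id, address) pairs, e.g. from dataspace
    profiling samples. Distinct addresses on one line touched by
    different threads are the classic false-sharing signature; the
    same address touched by several threads is true sharing instead.
    """
    lines = {}
    for tid, addr in writes:
        lines.setdefault(addr // line_size, set()).add((tid, addr))
    flagged = []
    for line, hits in lines.items():
        threads = {t for t, _ in hits}
        addrs = {a for _, a in hits}
        if len(threads) > 1 and len(addrs) > 1:
            flagged.append(line * line_size)
    return flagged

# Threads 0 and 1 write adjacent words on one line (false sharing),
# and both write the same word at 0x2000 (true sharing, not flagged).
samples = [(0, 0x1000), (1, 0x1008), (0, 0x2000), (1, 0x2000)]
print([hex(a) for a in false_sharing_candidates(samples)])  # → ['0x1000']
```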


11:15 AM Linux Has a Generic Performance Monitoring API! (with demo)
    Stephane Eranian, HP

For many years, people have been asking for a generic Linux performance monitoring API to access the hardware performance counters of modern processors. First there was OProfile, then perfctr, then perfmon. We designed perfmon, but late last year it was eventually rejected by the top-level x86 kernel maintainers. They instead developed a brand-new interface using a different approach: an event-oriented API. This interface, called Performance Counters for Linux (PCL), has been under heavy development for the last six months. The ABI is fairly stable, but only a few processors are currently supported. Nonetheless, the patch was accepted into the upstream kernel for 2.6.31. In this talk, we give an overview of this new interface and compare its pros and cons with perfmon.


2:00 PM Combining static and dynamic analysis for debugging and performance tuning
  Martin Schulz, LLNL

Traditional tools often rely on either static or dynamic analysis alone. However, to exploit all available information about a program, tools must combine both of these basic techniques. This talk will discuss the opportunities of this approach and demonstrate it using two projects from the areas of debugging and performance tuning. In collaboration with the University of Wisconsin, we use static analysis to extract a logical temporal ordering of dynamic execution contexts and statements directly from source code and then map this data to dynamic location information to assist debugging. In collaboration with the Universities of Linz and Munich, we use dynamic pattern detection and static source code analysis to identify and verify opportunities to apply transformations for MPI collective operations.


2:30 PM Performance Measurement and Analysis of Multithreaded Programs
    Nathan Tallent, Rice University

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This talk makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute *parallel idleness*, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute *parallel overhead* --- when a thread is performing miscellaneous work other than executing the user's computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information. Using *idleness* and *overhead* metrics enables one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run-time as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University's HPCToolkit performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.
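The tuning decision rule the abstract describes — increase concurrency when idleness dominates, decrease it when overhead dominates, give up when both are high — can be sketched directly. The 20% threshold and the `classify` function are illustrative assumptions, not HPCToolkit's actual values.

```python
def classify(work, idleness, overhead, threshold=0.2):
    """Classify a program region by its idleness and overhead fractions.

    work, idleness, overhead: time attributed to the region (same units).
    Returns a tuning suggestion following the idleness/overhead rules
    described above; the threshold is illustrative.
    """
    total = work + idleness + overhead
    idle_frac = idleness / total
    over_frac = overhead / total
    if idle_frac > threshold and over_frac > threshold:
        return "hopeless"                 # both high: parallelization ineffective
    if idle_frac > threshold:
        return "increase concurrency"     # threads starved for work
    if over_frac > threshold:
        return "decrease concurrency"     # too much parallel bookkeeping
    return "ok"

print(classify(work=80, idleness=15, overhead=5))   # → ok
print(classify(work=50, idleness=40, overhead=10))  # → increase concurrency
```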


3:30 PM What It Takes To Assign Blame
  Jeff Hollingsworth and Nick Rutar,
University of Maryland

We have been developing a new performance mapping tool called Blame that tracks performance problems through levels of abstraction in complex parallel programming frameworks. In this talk we explain some of the needs we have for tool components. Our tool requires the integration of data from both static and dynamic analysis. In addition, we use both source code and binary analysis components. This talk will review why data is needed from these different sources, and describe our experiences using tool components to gather it.


4:00 PM Performance Strategies for Parallel Mathematical Libraries Based on Historical Knowledgebase
  Eduardo Cesar and Ania Morajko,
Autonomous University of Barcelona

Scientific and mathematical parallel libraries offer a high level of abstraction to programmers. However, it is still difficult to select the proper parameters and algorithms to maximize the application performance. We have proposed strategies for automatically adjusting parameters of applications written with the PETSc library. These strategies are based on historical performance information and data mining techniques. They must be able to automatically identify the structure and significant characteristics of the input data and then compare the output of this recognition process against a comprehensive set of known cases. At the same time, they also have to include a mechanism to distribute the input data to avoid load imbalances.
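The "compare against known cases" step can be sketched as a nearest-neighbour lookup in a historical knowledgebase. This is only an illustrative sketch: the feature names, solver/preconditioner values, and the `suggest_parameters` function are made up, and a real system would mine PETSc-specific structure and use proper data mining rather than a toy distance.

```python
def suggest_parameters(features, knowledgebase):
    """Pick tuning parameters by matching the input's characteristics
    against previously seen cases (a toy 1-nearest-neighbour lookup).

    features: dict of measured characteristics of the input data.
    knowledgebase: list of (features, parameters) pairs from past runs.
    """
    def distance(a, b):
        return sum((a[k] - b[k]) ** 2 for k in a)

    best_feats, best_params = min(
        knowledgebase, key=lambda case: distance(features, case[0]))
    return best_params

# Hypothetical history: past matrices and the parameters that worked.
kb = [
    ({"rows": 1e6, "nnz_per_row": 5,  "symmetric": 1},
     {"ksp": "cg", "pc": "jacobi"}),
    ({"rows": 1e5, "nnz_per_row": 50, "symmetric": 0},
     {"ksp": "gmres", "pc": "ilu"}),
]
params = suggest_parameters(
    {"rows": 9e5, "nnz_per_row": 6, "symmetric": 1}, kb)
print(params["ksp"])  # → cg
```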


4:30 PM A Proposal for a Profiling Data Exchange Format
    Bernd Mohr, JSC
Martin Schulz, LLNL
Dan Gunter, LBL
Kevin Huck, University of Oregon
Xingfu Wu, Texas A&M

The PERI XML effort aims at defining a standardized, XML based performance data exchange format that will allow measurement and performance analysis tools to share structured profile data. The current version, which was born from discussions during last year's CScADS workshop, then further developed in a long series of phone conferences, represents a complete draft of the PERI XML schema. In this talk, we want to present this new schema to the community to get feedback and to understand possible limitations of the current schema. It is our hope that this will lead to the final version of the schema shortly after the workshop, which can then be implemented in a wide variety of tools.


Tuesday, July 21, 2009
  9:00 AM HPCToolkit Update 2009
  Mike Fagan, Rice University

HPCToolkit development has been moving along at a nice pace. Several new features have been added. In addition, HPCToolkit has been installed and used on some leadership-class machines.

The high points are:

1. Refinements to the libmonitor interface
2. Refinements to the unwinder (including validation)
3. Leadership-class machine experience:
3.1 Flash (with demo)
3.2 IBM Blue Gene
4. Leadership-class machine experience has led us to create acceptance tests for various OS features.
5. Future:
5.1 Plan to explore SIONlib
5.2 MPI-hpcprof parallel profile analysis


10:00 AM Performance Measurement and Analysis of Heterogeneous Systems: Tasks and GPU Accelerators
    Allen Malony, University of Oregon



10:30 AM Automatic Profiling Analysis
  Luiz DeRose and Heidi Poxon, Cray



11:00 AM Obtaining Extremely Detailed Information at Scale
  Jesus Labarta, BSC

We will describe two approaches to obtaining very detailed information about the behavior of large parallel programs. The first approach aims at merging information obtained through instrumentation and coarse-grained sampling to achieve very detailed information on the evolution of performance metrics over time while keeping the instrumentation overhead to a minimum. The second technique applies clustering at run time to derive very detailed CPI stack models of a running application at large scale. Both techniques could be combined to extract, in a very scalable way, a huge amount of information about the behavior of a program while minimizing the actual amount of data emitted.
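The run-time clustering idea — grouping computation bursts with similar metric vectors so that only one representative per cluster needs detailed analysis — can be sketched with a simple greedy scheme. This is a stand-in for illustration only; the real technique, the `(instructions, cycles)` sample format, and the radius parameter are assumptions, not the BSC implementation.

```python
def cluster_samples(samples, radius):
    """Greedy online clustering of per-burst metric vectors.

    samples: list of (instructions, cycles) tuples from computation
    bursts. Each sample joins the first cluster whose centroid lies
    within `radius` (Euclidean distance); otherwise it starts a new
    cluster. The centroid is updated as members arrive.
    """
    clusters = []  # each entry: [centroid, members]
    for s in samples:
        for c in clusters:
            dist = sum((a - b) ** 2 for a, b in zip(s, c[0])) ** 0.5
            if dist <= radius:
                c[1].append(s)
                n = len(c[1])
                c[0] = tuple(sum(m[i] for m in c[1]) / n for i in range(2))
                break
        else:
            clusters.append([s, [s]])
    return clusters

# Three similar bursts collapse into one cluster; the outlier stands alone.
bursts = [(100, 200), (102, 198), (500, 2000), (99, 201)]
groups = cluster_samples(bursts, radius=20)
print(len(groups))  # → 2
```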


11:30 AM Building a Community Infrastructure for Scalable On-Line Performance Analysis Tools
    Jim Galarowicz, Krell Institute
David Montoya, LANL

In this talk we will give an overview of a project recently jointly funded by OASCR and the NNSA to create scalable open-source performance tool components. The goal of this project is to design, implement, and evaluate a general, flexible tool infrastructure supporting the construction of performance tools as "pipelines" of high-quality interchangeable tool building blocks. These tool building blocks provide common performance tool functionality, and are designed for scalability, lightweight data acquisition and analysis, and interoperability. The key benefit of this project will be the ability to use the highly scalable infrastructure to quickly create tools that match a machine architecture and a performance problem that needs to be understood. For this project, we will start with the existing Open|SpeedShop code base and decompose its current components into a new, more modular and scalable set of components. When combined, these components will provide functionality equivalent to Open|SpeedShop, but the individual components can also be used in other tool sets without requiring the Open|SpeedShop structure or design philosophy.

In order for this project to be successful, we need to develop a mechanism that can recognize which components are compatible with others, which have dependencies, and which have some sort of constraints. We will present our initial research into modeling component interfaces to resolve component constraints and dependencies. The ultimate goal of this research is a component resolution system that allows a core driver component, directed by the user, to assemble a set of components that fit the user's criteria, creating a scalable performance tool. We have been brainstorming with a frameworks researcher from Carnegie Mellon University about the best approach to these and other issues.


2:00 PM Autonomous Tool Infrastructure
  Dorian Arnold, University of New Mexico

This talk will present current and future research in autonomous overlay networks to support high-performance, scalable, robust tools. I will briefly overview recent extensions to the MRNet infrastructure made to support fault tolerance and dynamic topology configurations. I will then discuss our current work in run-time monitoring, modeling, and re-configuration, which will eventually lead to an infrastructure that autonomously and dynamically adapts itself to address functional failures, performance failures, and changing workloads.


2:30 PM Scalable Tool Infrastructure for the Cray XT Using Tree-Based Overlay Networks
    Philip Roth, ORNL

Performance, debugging, and administration tools are critical for the effective use of parallel computing platforms, but traditional tools have failed to overcome several problems that limit their scalability, such as communication among a large number of tool processes and the management and processing of the volume of data generated on a large number of compute nodes. A tree-based overlay network has proven effective for overcoming these challenges. In this talk, we present our experiences in bringing the MRNet tree-based overlay network infrastructure to the Cray XT platform, including a description of new MRNet capabilities and the integration of MRNet into an existing performance tool on the Cray XT.
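Why a tree-based overlay network scales can be shown with a small sketch: internal nodes combine the data of a bounded number of children, so the front end receives one value after logarithmically many levels instead of one message per compute node. This is a toy model of the general TBON idea, not MRNet's API; `tbon_reduce` and its parameters are hypothetical.

```python
def tbon_reduce(leaf_values, fanout, combine):
    """Aggregate tool data up a tree-based overlay network.

    Each internal node combines the values of up to `fanout` children,
    level by level, until one value remains. `combine` is the tool's
    reduction: any function over a list (here it must be associative
    for the result to be independent of the tree shape).
    Returns (final value, number of tree levels).
    """
    level, depth = list(leaf_values), 0
    while len(level) > 1:
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
        depth += 1
    return level[0], depth

# 10,000 "compute node" samples reduced with fanout 16: four levels
# of aggregation rather than 10,000 point-to-point messages.
total, levels = tbon_reduce([1] * 10000, fanout=16, combine=sum)
print(total, levels)  # → 10000 4
```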


  3:30 PM ROSE Open Compiler Infrastructure supporting Custom Tools for Software Analysis, Transformation, and Optimization
    Dan Quinlan, LLNL

ROSE is a tool for building source-to-source transformation tools for the custom analysis, optimization, and transformation of large-scale C, UPC, C++, and Fortran (F2003, F95, F90, F77, F66) applications, including OpenMP. Work over the last two years has added binary analysis support to ROSE, specifically support for the x86, ARM, and PowerPC instruction sets and for both Windows (PE, NE, LE, MS-DOS) and Linux (ELF) binary formats. More recent work has added dynamic analysis support using Intel Pin to mix both static and dynamic analysis. ROSE has an external community of users, developers, and collaborators and has been used in a number of research and industry tools. It has been the basis for a number of external collaborations in custom analysis and program optimization, and we invite new ones.

Specifically, ROSE is packaged to allow the construction of custom tools by a non-compiler audience. Central to ROSE is the analysis and transformation of the Abstract Syntax Tree (as well as other graphs generated from it) and its transformation to generate new code. Research work has addressed a range of topics from optimization (loop optimization, MPI optimization, data structure optimization, automated parallelization using OpenMP, etc.) to the details of macro handling in CPP preprocessing and advanced token handling for extremely precise levels of source-to-source code regeneration and analysis of subtle details typically not possible within conventional compiler infrastructures.

Ongoing research using ROSE has been the basis of our development of an end-to-end tool chain focused on empirical tuning, using a wide range of tools from academic and other DOE labs and demonstrating how these work together. Recent work has also demonstrated static analysis tools built with ROSE that check OpenMP usage. Still other work has demonstrated the run-time detection of code whose results are undefined, or compiler-implementation dependent, within the C and C++ language standards. This talk will present a range of uses of ROSE for performance optimization, along with a few selected areas that demonstrate the breadth of tools that can be built using ROSE.


  3:30 PM Performance Tuning using PMU Features of Core i7
    Ramesh Peri, Intel



  3:30 PM Semi-Automatic Models of Communication Volume and Frequency for SPMD Applications
    Gabriel Marin, ORNL

Network simulators are typically used to understand an application's communication overhead on different distributed-memory architectures. While simulators can provide accurate timing predictions, they involve high time and space overheads, and they require profiling applications at scale. In this talk I will describe a statistical approach that semi-automatically synthesizes analytical models of communication volume and frequency, parameterized by the number of processors or the problem size, which can provide insight into scalability bottlenecks.
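The core of such model synthesis can be sketched as curve fitting: observe communication volume at a few small scales, fit a parameterized form, and extrapolate. This toy fits a power law `volume ≈ c * p**alpha` by least squares in log-log space; the model family and the `fit_power_law` function are illustrative assumptions, not the actual statistical machinery of the talk.

```python
import math

def fit_power_law(points):
    """Fit volume ≈ c * p**alpha by least squares in log-log space.

    points: list of (p, volume) observations from small-scale profiles,
    where p is the processor count. Returns (c, alpha).
    """
    xs = [math.log(p) for p, _ in points]
    ys = [math.log(v) for _, v in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    c = math.exp(my - alpha * mx)
    return c, alpha

# Synthetic observations of volume = 3 * p**0.5 at small scales...
obs = [(p, 3 * p ** 0.5) for p in (4, 16, 64, 256)]
c, alpha = fit_power_law(obs)
# ...extrapolated to a much larger run to expose a scaling trend:
print(round(c * 65536 ** alpha))  # → 768
```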



This workshop is sponsored by the Center for Scalable Application Development Software, with funding from the Scientific Discovery through Advanced Computing (SciDAC) program.


CScADS Collaborators include:

Rice University ANL UCB UTK WISC