Improving the Scalability of the TotalView Debugger using TBON-FS and proc++
Michael Brim (University of Wisconsin)
John DelSignore (Rogue Wave Software)
ABSTRACT
Many tools and middleware must perform process control and inspection on a distributed group of processes. In prior work, we introduced group file operations, a simple, intuitive interface to scalable group operations on distributed files that existing tools and middleware can easily adopt and that can also be used to create new scalable tools. Group file operations avoid the linear costs typically associated with accessing a large number of files by extending existing file system abstractions and operations with group semantics that eliminate iterative access. We also developed TBON-FS, a distributed file system that leverages a tree-based overlay network to provide scalable group operations on files from thousands of independent file servers. Recently, we developed proc++, a new synthetic file system that provides control and inspection of groups of processes and threads. In this talk, we report on our ongoing effort to use group file operations, TBON-FS, and proc++ to improve the scalability of TotalView, the most widely used commercial debugger for HPC systems. We report the performance benefits achieved when using the modified debugger on parallel applications with up to 49,152 processes on a Cray XT5 system, and we discuss our observations on building tools that can scale to upcoming systems containing millions of processor cores and (potentially) billions of debugging targets.
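To illustrate the contrast between iterative access and a group operation, the following is a minimal single-machine sketch in Python. It is not the TBON-FS or proc++ interface: the function names (write_demo_files, iterative_states, group_states), the one-status-file-per-process layout, and the fanout parameter are all assumptions made for illustration. The sketch only simulates the idea that partial results can be merged level by level in a tree, so that no single node handles more than a small number of inputs at once.

```python
import os
import tempfile
from collections import Counter
from functools import reduce


def write_demo_files(directory, states):
    """Create one small file per simulated process, holding its state string."""
    paths = []
    for rank, state in enumerate(states):
        path = os.path.join(directory, f"proc_{rank}.status")
        with open(path, "w") as f:
            f.write(state)
        paths.append(path)
    return paths


def iterative_states(paths):
    """Baseline: the tool opens and reads every file itself, one at a time."""
    counts = Counter()
    for path in paths:
        with open(path) as f:
            counts[f.read().strip()] += 1
    return counts


def group_states(paths, fanout=4):
    """Hypothetical group read (illustration only, not a real TBON-FS call).

    Each leaf reads one file; each tree level then merges at most `fanout`
    partial summaries. Returns the merged summary and the number of levels.
    """
    if not paths:
        raise ValueError("group must contain at least one file")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")

    # Leaf level: in a distributed setting each read would happen on the
    # server that owns the file; here everything runs locally.
    level = []
    for path in paths:
        with open(path) as f:
            level.append(Counter([f.read().strip()]))

    # Internal levels: merge groups of `fanout` partial results until one remains.
    depth = 0
    while len(level) > 1:
        level = [
            reduce(lambda a, b: a + b, level[i:i + fanout])
            for i in range(0, len(level), fanout)
        ]
        depth += 1
    return level[0], depth


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_paths = write_demo_files(tmp, ["running"] * 12 + ["stopped"] * 4)
        summary, levels = group_states(demo_paths, fanout=4)
        print(dict(summary), "merged in", levels, "levels")
```

In this simulation the baseline touches every file from one place, while the group version needs a number of merge levels that grows logarithmically with the group size for a fixed fanout. A real deployment would run the leaf reads and merges on separate nodes, which this sketch does not attempt.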