One particularly frustrating aspect of parallel programming is that when your program crashes or hangs, it’s difficult or impossible to determine how far execution got. System messages are notoriously cryptic about program failures.
Even if you’ve inserted print statements to keep track of progress, that final (and most important!) line may be stuck in a buffer and never appear as output.
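The buffering problem can be reproduced in a few lines. The following is a minimal Python sketch, not part of LCB: the helper name `log_progress` and the in-memory stream are assumptions made for illustration. It shows that a progress message written to a buffered stream does not reach its destination until the buffer is flushed, so a program killed before the flush would lose the message.

```python
import io

def log_progress(stream, message):
    """Write one progress line and flush it out of the buffers immediately."""
    stream.write(message + "\n")
    stream.flush()

# An in-memory "file" stands in for the terminal or log file (assumption).
raw = io.BytesIO()
buffered = io.TextIOWrapper(io.BufferedWriter(raw), encoding="utf-8", newline="\n")

# A plain write stays in the buffer; nothing has reached the destination yet.
buffered.write("entering solve()\n")
pending = raw.getvalue()

# Flushing after the message delivers everything written so far.
log_progress(buffered, "leaving solve()")
delivered = raw.getvalue()
```

Flushing after every message narrows the window in which output can be lost, but it does not say where the other processes of a parallel job were when the program died.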
All parallel computers and workstations support some kind of corefile mechanism, but this is not a particularly helpful solution. Off-loading the core image of a program executing on hundreds of processors can delay everyone’s work — and may well fill up all available disk space.
The corefile information is at too low a level to interpret directly. A debugger or some other tool is needed just to determine the line at which the program failed. If the program wasn’t compiled with debugging options enabled, it may be impossible to get any useful information. Even if the debugger can report the crucial information, you have to invoke, wait for, and interact with a complex tool, all in the interest of acquiring some very basic information about execution!
The Ptools Lightweight Corefile Browser project was formed in response to this need. The goal was to create a tool which quickly and easily provides a high-level view of where the program was when it terminated. Both graphical and command-line versions are available. The tool automatically assimilates the details from tens or hundreds of processes and presents them in a consolidated summary form representing the dynamic call graph of a parallel application.
What The Lightweight Corefile Browser Does
The Lightweight Corefile Browser project (LCB) provides a mechanism for capturing and representing the dynamic state of a parallel application that potentially involves hundreds of processes. There are three main components to the project.
All components are defined in a flexible way to permit parallel computer and workstation vendors to implement them by taking advantage of existing facilities for corefile generation.
How the Browsers Work
LCB is a simple tool with one goal: to provide a high-level view of the dynamic calling structure of the program at the moment it terminated. The command-line browser reads the corefile (or accepts information on the fly from the operating system or some other tool), extracts the data concerning the culprit process and the reason it failed, and presents the results in a simple traceback format. This offers a quick-and-dirty way to find out “what happened” to your program.
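The idea behind the command-line view can be sketched in Python. This is an illustration under stated assumptions, not LCB's implementation or its file format: the `ProcessRecord` type, its fields, and the output layout are all hypothetical. The sketch picks out the process that recorded a failure reason and prints its call stack, innermost routine first.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ProcessRecord:
    """Hypothetical per-process record; not the real lightweight corefile layout."""
    rank: int
    stack: List[str]               # routine names, outermost call first
    reason: Optional[str] = None   # set only for a process that failed

def format_traceback(records):
    """Return a traceback for every process that recorded a failure reason."""
    culprits = [rec for rec in records if rec.reason is not None]
    if not culprits:
        return "no failing process recorded"
    lines = []
    for rec in culprits:
        lines.append(f"process {rec.rank} terminated: {rec.reason}")
        # Walk the stack from the innermost routine back out to main.
        for depth, routine in enumerate(reversed(rec.stack)):
            prefix = "  in" if depth == 0 else "  called from"
            lines.append(f"{prefix} {routine}")
    return "\n".join(lines)

records = [
    ProcessRecord(0, ["main", "solve", "exchange"]),
    ProcessRecord(3, ["main", "solve", "update"], "SIGSEGV"),
]
report = format_traceback(records)
print(report)
```

Processes that did not fail are ignored here, which is what keeps the command-line view short.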
To view the dynamic state of the entire parallel application, the graphical browser is invoked.
Initial LCB display — Overview Graph
The initial display shows the current location of the program in the form of a call graph, where each node represents a routine in the call stack of one or more processes. The coloring of the nodes indicates how many processes were active in each routine; black nodes represent routines that were suspended when calls to other routines were made. Any routine where program failure occurred is colored red. A highlighted (or white) node is the one currently selected. A message line at the bottom of the display changes as the cursor is moved across the graph, indicating the routine name associated with a node as well as the number of processes executing that routine.
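The consolidation behind such a graph can be sketched as follows. This is a simplified Python illustration, not the browser's actual algorithm: the function name `build_call_graph`, the input layout (a mapping from process rank to its call stack), and the state labels are assumptions. Each routine becomes one node that records how many processes were active in it and how many were suspended in it while a callee ran.

```python
from collections import defaultdict

def build_call_graph(stacks, failed_ranks=()):
    """Merge per-process call stacks (outermost routine first) into one graph."""
    active = defaultdict(set)      # routine -> ranks whose innermost frame it is
    suspended = defaultdict(set)   # routine -> ranks waiting in it on a callee
    failed = set()                 # routines in which a failing process stopped
    edges = set()                  # (caller, callee) pairs
    for rank, stack in stacks.items():
        if not stack:
            continue
        for caller, callee in zip(stack, stack[1:]):
            edges.add((caller, callee))
            suspended[caller].add(rank)
        active[stack[-1]].add(rank)
        if rank in failed_ranks:
            failed.add(stack[-1])
    nodes = {}
    for routine in set(active) | set(suspended):
        n_active = len(active.get(routine, ()))
        if routine in failed:
            state = "failed"       # drawn in red in the overview
        elif n_active:
            state = "active"       # shaded by the number of active processes
        else:
            state = "suspended"    # drawn in black in the overview
        nodes[routine] = {
            "state": state,
            "active": n_active,
            "suspended": len(suspended.get(routine, ())),
        }
    return nodes, sorted(edges)

stacks = {
    0: ["main", "solve", "exchange"],
    1: ["main", "solve", "exchange"],
    2: ["main", "solve", "update"],
    3: ["main", "io_write"],
}
nodes, edges = build_call_graph(stacks, failed_ranks={2})
```

Keying nodes by routine name keeps the sketch short; a tool that has to distinguish the same routine reached along different call paths would key nodes by the full path instead.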
To view the names of all routines, the view is changed to the “Call Graph” by selecting the appropriate button from the controls at the bottom of the screen. This brings up a more explicit call graph in which the routine names are visible.
LCB Call Graph display
By clicking on a node, the user can bring up a window showing the location of all processes currently executing the node. It is also possible to search the graph for a specific routine by name.
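Both lookups amount to simple queries over the per-process stacks. The Python sketch below is illustrative only: the function names, the frame layout (routine name paired with a `file:line` string), and the sample file names are hypothetical and do not come from LCB.

```python
def locate_processes(stacks, routine):
    """Return {rank: location} for processes whose innermost frame is `routine`."""
    return {
        rank: frames[-1][1]
        for rank, frames in sorted(stacks.items())
        if frames and frames[-1][0] == routine
    }

def find_routines(stacks, text):
    """Return the sorted routine names that contain `text`, ignoring case."""
    names = {name for frames in stacks.values() for name, _ in frames}
    return sorted(name for name in names if text.lower() in name.lower())

# Each frame is (routine, "file:line"); the values are made up for the example.
stacks = {
    0: [("main", "app.c:40"), ("solve", "solve.c:112"), ("exchange", "comm.c:57")],
    1: [("main", "app.c:40"), ("solve", "solve.c:112"), ("exchange", "comm.c:61")],
    2: [("main", "app.c:40"), ("solve", "solve.c:118"), ("update", "update.c:23")],
}
where = locate_processes(stacks, "exchange")
matches = find_routines(stacks, "sol")
```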
The graphical browser can also be incorporated into other tools – such as interactive debuggers or performance analysis tools – where users might want to view a snapshot presentation of parallel program execution. For this mode, we recommend that vendors integrate LCB into their tools such that when the user clicks on a routine name, another window is brought up, displaying the source code associated with that routine.