Use cases for performance information
- identify the critical path in asynchronous programs (for example, where are the synchronization or wait points in a program?)
- identify places where a user could reduce GPU power without hurting application performance (for example, if the CPU is never waiting for the GPU, can the GPU be run at lower power?)
- get more event information dynamically as part of a trace
- provide high level job utilization information (what is useful to show?)
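
As a rough illustration of the first two use cases, the sketch below totals the time the CPU spends blocked at each synchronization point and reports that time as a fraction of wall time. The record layout (name, start, end), the function name summarize_waits and the sample values are assumptions made for this example; they are not a CUPTI format.

```python
from collections import defaultdict

def summarize_waits(sync_records, wall_start, wall_end):
    """Rank synchronization points by total CPU wait time.

    sync_records: iterable of (name, start_ns, end_ns) tuples, one per
    blocking call on the CPU (hypothetical layout, not a CUPTI record).
    Assumes a single CPU thread, so the waits do not overlap in time.
    Returns (ranked list of (name, total_ns), fraction of wall time waiting).
    """
    totals = defaultdict(int)
    for name, start, end in sync_records:
        if end < start:
            raise ValueError(f"record {name!r} ends before it starts")
        totals[name] += end - start
    # Longest total wait first: these are the candidates for the critical path.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    wall = wall_end - wall_start
    waiting = sum(totals.values())
    fraction = waiting / wall if wall > 0 else 0.0
    return ranked, fraction

# Made-up sample data for illustration only.
records = [
    ("cudaDeviceSynchronize", 100, 400),
    ("cudaMemcpy", 500, 550),
    ("cudaDeviceSynchronize", 700, 900),
]
ranked, fraction = summarize_waits(records, wall_start=0, wall_end=1000)
print(ranked)
print(f"CPU waited {fraction:.0%} of wall time")
```

A wait fraction near zero would suggest the CPU is rarely waiting for the GPU, which is the situation where running the GPU at lower power might be worth investigating.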

CUPTI capabilities
- discussed how to use CUPTI in a sampling environment when information from both the CPU and the GPU is needed to understand overlap
- use getTimeStamp (cuptiGetTimestamp in the CUPTI API) to get a global timestamp (NVIDIA will look into the overhead of this call)
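
Once CPU and GPU activity are stamped against the same global clock, overlap can be computed by intersecting the two sets of busy intervals. The following is a minimal sketch under that assumption; the interval values and the helper names merge and overlap_time are hypothetical and do not come from CUPTI.

```python
def merge(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def overlap_time(cpu_busy, gpu_busy):
    """Total time during which both the CPU and the GPU are busy.

    Both arguments are lists of (start, end) timestamps taken from the
    same global clock (an assumption of this sketch).
    """
    a, b = merge(cpu_busy), merge(gpu_busy)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

# Made-up sample intervals for illustration only.
cpu = [(0, 50), (40, 100), (200, 300)]
gpu = [(80, 220), (290, 400)]
print(overlap_time(cpu, gpu))
```

Merging first keeps the intersection pass linear in the number of intervals and avoids double counting when sampled intervals on one side overlap each other.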

NVIDIA terms:
Tracing - information collected through the activity API
Profiling - collecting counts (HW counters, SW counters), more expensive, serializes code

- counters can only be enabled through the event API

- events serialize execution. There is a synchronization at the beginning and end of each kernel to ensure that only one kernel is executing on the device, which makes this API useful for more accurate counter collection.

- there is one set of counters per device. Counters are context switched between processes, but they count everything that is running on the GPU at the time

- there was a request to snapshot counts, along with the time, at the beginning and end of a kernel through the activity API. This will not provide accurate per-kernel counter information, but it will give a higher level view of how much activity is taking place on the device throughout application execution (NVIDIA will look into this, possibly using the patch mechanism)

- there are 4 counters; some events take more than 1 counter, so you can't always count 4 events at a time
- there is a new replay feature coming in 5.5/6.0 that seamlessly replays a kernel execution until information for all requested events has been collected. Tools should offer a way for users to select specific kernels for event collection, as the overhead of rerunning every kernel in a program many times would be too high. Default event collection behavior will not change, and there will be an option to enable replay in CUPTI.

- there are a couple of new activity records in 5.0 that return event information at the source line level.  These should also only be used on select kernels within an application due to their overhead (the implementation uses patches).
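
To make the counter limit concrete, the sketch below groups a set of requested events into passes so that no pass needs more than 4 counters; the number of passes is then the number of times a kernel would have to be run (or replayed) to collect everything. The event names, their counter costs and the function plan_passes are hypothetical, and real hardware may impose further restrictions on which events can be counted together.

```python
def plan_passes(event_costs, num_counters=4):
    """Group events into passes so each pass needs at most num_counters.

    event_costs: dict mapping event name -> counters that event occupies
    (hypothetical names and costs). Uses first-fit decreasing, a heuristic
    that is not guaranteed to minimize the number of passes.
    """
    passes = []  # each entry: [free counters remaining, [event names]]
    # Place the most expensive events first; break ties by name for stable output.
    for name, cost in sorted(event_costs.items(), key=lambda kv: (-kv[1], kv[0])):
        if cost > num_counters:
            raise ValueError(f"{name} needs {cost} counters, only {num_counters} exist")
        for slot in passes:
            if slot[0] >= cost:
                slot[0] -= cost
                slot[1].append(name)
                break
        else:
            # No existing pass has room, so open a new one.
            passes.append([num_counters - cost, [name]])
    return [names for _, names in passes]

# Made-up events for illustration only.
events = {"ev_a": 2, "ev_b": 1, "ev_c": 3, "ev_d": 1, "ev_e": 2}
for n, names in enumerate(plan_passes(events), start=1):
    print(f"pass {n}: {names}")
```

Because every extra pass means another execution of the kernel, a plan like this also gives a tool a quick estimate of the overhead before the user commits to collecting events on a particular kernel.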

GPU Direct - there are no performance metrics specific to this feature yet. A CUDA memcpy can occur between GPUs, but you'd want to see both ends of the memcpy, and there is currently no way to do so. Not sure yet what sort of metrics would be useful.

High level performance information
- what sort of information is useful at a high level to direct users where to focus further analysis and optimization efforts?

- one idea is to show how close the job is to bottlenecks associated with compute, memory and PCIe traffic.  NVIDIA will look into providing these types of metrics. 

- FLOPS was another metric discussed that NVIDIA could possibly add

- Cray is proposing the following metrics for high level per-job GPU information (more for a site analyst tool), and wanted feedback on whether they are useful: number of GPU contexts opened (how many processes were run on the GPU), number of GPU kernels run, amount of time accrued while running the kernels, and high water mark of memory used on the GPU. These should all be possible to collect.
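
The four proposed per-job metrics could be reduced from per-kernel records roughly as follows. This is only a sketch: the KernelRecord fields, the summarize_job function and the sample values are assumptions for illustration, not an existing Cray or CUPTI record format.

```python
from dataclasses import dataclass

@dataclass
class KernelRecord:
    # Hypothetical per-kernel record; field names are assumptions.
    context_id: int   # GPU context that launched the kernel
    start_ns: int     # kernel start time
    end_ns: int       # kernel end time
    mem_in_use: int   # bytes allocated on the GPU when the kernel ran

def summarize_job(records):
    """Reduce per-kernel records to the four proposed per-job metrics."""
    return {
        # Only contexts that launched at least one kernel are visible here.
        "contexts_opened": len({r.context_id for r in records}),
        "kernels_run": len(records),
        # Accrued time: concurrent kernels each contribute their full duration.
        "kernel_time_ns": sum(r.end_ns - r.start_ns for r in records),
        "mem_high_water": max((r.mem_in_use for r in records), default=0),
    }

# Made-up sample data for illustration only.
job = [
    KernelRecord(context_id=1, start_ns=0, end_ns=100, mem_in_use=4096),
    KernelRecord(context_id=1, start_ns=150, end_ns=400, mem_in_use=8192),
    KernelRecord(context_id=2, start_ns=120, end_ns=300, mem_in_use=2048),
]
summary = summarize_job(job)
print(summary)
```

In this sketch the context count is derived from kernel launches, so a context that was opened but never ran a kernel would be missed; a real collector would presumably count context creation directly.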