Lightning Talk: Profiling and Memory Debugging Tools for Distributed ML Workloads on GPUs - Aaron Shi

Published: 08 October 2024
on channel: PyTorch


Lightning Talk: Profiling and Memory Debugging Tools for Distributed ML Workloads on GPUs - Aaron Shi, Meta

An overview of PyTorch profiling tools and features (Profiler and Kineto), followed by a practical dive into our extensive GPU memory debugging tools. The PyTorch Profiler portion introduces the Memory Profiler for a better understanding of GPU memory, along with newly released OSS repos: Holistic Trace Analysis (used to understand distributed profiler traces and provide useful views) and Dynolog (used for triggering on-demand traces). The talk then looks at new GPU memory debugging tools for PyTorch, Memory Snapshot and Reference Cycle Detector, and may take a practical approach to understanding memory leaks, fragmentation, and reference cycles.
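
To make the reference-cycle problem concrete, here is a minimal sketch in plain Python. It is not the Reference Cycle Detector from the talk and does not use PyTorch; the `Node` class is a hypothetical stand-in for an object that owns a large buffer such as a GPU tensor. It only illustrates why memory held in a cycle is not released until the garbage collector runs.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for an object that owns a large buffer."""

    def __init__(self, name):
        self.name = name
        self.other = None


def make_cycle():
    a, b = Node("a"), Node("b")
    a.other, b.other = b, a   # a -> b -> a forms a reference cycle
    return weakref.ref(a)     # weak reference lets us observe a's lifetime


gc.disable()                  # keep the demo deterministic
probe = make_cycle()

# Both local names are gone, but the reference counts never reach zero.
alive_before_gc = probe() is not None

# The cycle collector finds the unreachable pair and frees it.
collected = gc.collect()
alive_after_gc = probe() is not None
gc.enable()

print("alive before gc.collect():", alive_before_gc)
print("alive after gc.collect(): ", alive_after_gc)
```

If such objects held GPU tensors, that memory would stay allocated until a collection pass happened to run, which is the kind of delayed release a cycle detector is meant to surface.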