TASC Software for HPC Performance Analysis: Current State and Latest Developments

Vadim V. Voevodin, Denis I. Shaikhislamov, Vladimir A. Serov

Abstract

To ensure the high operating efficiency of modern supercomputers, it is necessary to constantly analyze and control various aspects of their behavior, paying special attention to the flow of applications running on these machines. To solve this problem, the TASC (Tuning Applications for SuperComputers) software suite was previously developed. It automatically detects performance issues in HPC applications, evaluates the efficiency of supercomputer resource usage, provides supercomputer administrators with a flexible reporting tool for analyzing different aspects of supercomputer functioning at the desired level of detail, and estimates the noise level on compute nodes. This paper provides a full-scale description of the current TASC structure and capabilities, including the stages of data processing and storage, as well as the different types of analysis performed. It also describes new results obtained and methods developed within one of the main TASC components: the assessment system for quick and accurate evaluation of HPC resource usage efficiency.

Keywords

high-performance computing; supercomputer; performance analysis; workload analysis; operational data analytics; monitoring

Full Text: PDF (English)


DOI: http://dx.doi.org/10.14529/cmse240304