Distributed Computational Experiments in the MLOps Platform of HSE University

Anton S. Khritankov, Valentin A. Polezhaev, Georgiy A. Zhulikov, Maksim S. Halynchik, Nikita A. Klimin, Kirill E. Sakharov, Viktor O. Minchenkov, Ivan V. Spirin, Ivan I. Krupnov, Sofia F. Yakusheva, Aleksandra S. Maratkanova, Vyacheslav I. Kozyrev, Pavel S. Kostenetskiy, Hadi M. Salekh

Abstract


Despite the widespread and successful application of data mining and data processing tools to individual applied problems, no established technology yet exists for developing such software tools. In the context of a unified MLOps process for creating machine learning technologies, this paper considers the problems that arise in automating and executing distributed computational experiments on a hybrid cloud computing platform. The MLOps platform being developed at HSE University is designed to deploy intelligent services and data analysis software. The platform must manage heterogeneous resources available locally and in the cloud and combine them with the resources of the HSE cHARISMa computing cluster, which is managed with Slurm. This makes it important to integrate these resources for conducting computational experiments, implementing pipelines for setting up machine learning models, and solving data processing and analysis problems. The distinguishing features of the problem are that computation is treated as an integral part of the technology for creating intelligent services, that this technology requires heterogeneous resources, and that computations are executed on a hybrid platform. The paper proposes a solution to the problem of integrating computations and presents the results of testing the solution on intelligent services. We show that heterogeneous resources can be integrated within a single computational experiment on the basis of a user-extensible object model of the experiment and a domain-specific language for its specification. We also resolve issues of dynamic deployment management for intelligent applications and of integrating data processing pipelines, services, and data sets to perform distributed computational experiments.

Keywords


distributed computing experiments; machine learning; cloud technologies; MLOps




DOI: http://dx.doi.org/10.14529/cmse250203