Ensuring Fault Tolerance of High-Performance Computing Using Local Checkpoints

Aleksey Alekseevich Bondarenko; Mikhail Vladimirovich Yakobovskiy

doi:10.14529/cmse140302

Authors

Aleksey Alekseevich Bondarenko, Keldysh Institute of Applied Mathematics, Russian Academy of Sciences (Moscow, Russian Federation)
Mikhail Vladimirovich Yakobovskiy, Keldysh Institute of Applied Mathematics, Russian Academy of Sciences (Moscow, Russian Federation)

Keywords:

parallel computing, fault tolerance, checkpoints, MPI

Abstract

The paper addresses the problem of running computations on distributed computing systems whose components are subject to failures. It provides definitions of a system, a fault, an error, a failure, and a fault model; the most important results of studies of failures in parallel computing systems, including systems with large groups of disks; and the main existing recovery methods together with widely used software implementations of fault tolerance. The paper develops a user-level approach to fault tolerance. This approach requires the application developer to take a direct part in implementing the fault tolerance method, in particular in forming checkpoints and recovery procedures. A scheme is proposed for storing, in the memory of the compute nodes, the application data that form a consistent global checkpoint. Within this scheme the local checkpoints are duplicated, which makes it possible to restore the computation as long as the number of failures does not exceed the level the scheme tolerates. The scheme can be used in various recovery protocols and their modifications.
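The duplication idea in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the scheme from the paper: the abstract does not say where the copies are placed, so the ring placement used here (each local checkpoint kept on its own node and on the next `copies - 1` nodes) and all function names are hypothetical. Node memory is modelled with plain dictionaries rather than MPI processes.

```python
from typing import Dict, List, Set


def replica_holders(rank: int, n_nodes: int, copies: int) -> List[int]:
    """Nodes holding the checkpoint of `rank`: itself plus ring neighbours (assumed placement)."""
    if not 1 <= copies <= n_nodes:
        raise ValueError("copies must be between 1 and n_nodes")
    return [(rank + shift) % n_nodes for shift in range(copies)]


def store_global_checkpoint(states: Dict[int, bytes], copies: int) -> Dict[int, Dict[int, bytes]]:
    """Duplicate every local checkpoint; returns node -> {owner rank -> data}.

    Assumes ranks are numbered 0 .. len(states) - 1.
    """
    n_nodes = len(states)
    memory: Dict[int, Dict[int, bytes]] = {node: {} for node in range(n_nodes)}
    for rank, data in states.items():
        for holder in replica_holders(rank, n_nodes, copies):
            memory[holder][rank] = data
    return memory


def recover(memory: Dict[int, Dict[int, bytes]], failed: Set[int]) -> Dict[int, bytes]:
    """Rebuild all local checkpoints from surviving nodes, or raise if any is lost."""
    recovered: Dict[int, bytes] = {}
    for node, held in memory.items():
        if node in failed:
            continue  # the memory of a failed node is gone
        for owner, data in held.items():
            recovered.setdefault(owner, data)
    missing = sorted(set(memory) - set(recovered))
    if missing:
        raise RuntimeError(f"checkpoints lost for ranks {missing}")
    return recovered


if __name__ == "__main__":
    states = {rank: f"state-{rank}".encode() for rank in range(6)}
    memory = store_global_checkpoint(states, copies=2)
    # Two non-adjacent failures: every checkpoint still has a surviving copy.
    print(recover(memory, failed={1, 4}) == states)
```

With this assumed placement and two copies, any single failure is tolerated, while the loss of two adjacent nodes (for example 1 and 2) destroys both copies of one checkpoint and `recover` raises an error. That is the sense in which recovery depends on the number of failures staying within the level the scheme tolerates.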
