Optimization Methods for Generalized Tensor Contractions

Roman Albertovich Gareev

Abstract


Tensor contraction is one of the basic operations of tensor calculus, a branch of mathematics that has become the principal language for describing fundamental laws in fields such as relativity theory, mechanics, electrodynamics, and solid-state physics. The efficiency of tensor contraction and its generalizations is of substantial practical importance in areas such as mathematical physics, machine learning, spectral methods, quantum chemistry, data mining, and high-performance computing on multiprocessor systems. Over the last twenty years, the number of optimization methods for tensor contractions has grown considerably and continues to grow. This paper surveys the actively used approaches to optimizing tensor contractions that are applied to solve practical problems on single-processor systems and on distributed-memory multiprocessor systems. The paper also presents optimization methods for two important special cases of tensor contraction, matrix-matrix and matrix-vector multiplication, on which most tensor-contraction optimizations rely. The described optimizations can be applied during program compilation performed by production compilers. The presented information may help to systematize existing knowledge in this area.
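
For illustration, a tensor contraction multiplies two tensors and sums over their shared (contracted) indices; matrix multiplication, C(i,j) = sum_k A(i,k)*B(k,j), and matrix-vector multiplication, y(i) = sum_k A(i,k)*x(k), are the special cases the abstract refers to. The following naive C sketch (an illustrative baseline, not code from the article; the function name contract_mm is hypothetical) shows the matrix-product case whose loop nest the surveyed optimizations, such as cache blocking, operand packing, and vectorization, restructure:

    #include <stddef.h>

    /* Naive contraction over one shared index k:
       C[i][j] += A[i][k] * B[k][j], i.e. the GEMM special case.
       A is M x K, B is K x N, C is M x N, all row-major;
       C is assumed to be zero-initialized by the caller. */
    void contract_mm(size_t M, size_t N, size_t K,
                     const double *A, const double *B, double *C)
    {
        for (size_t i = 0; i < M; ++i)
            for (size_t j = 0; j < N; ++j)
                for (size_t k = 0; k < K; ++k)
                    C[i * N + j] += A[i * K + k] * B[k * N + j];
    }

A generalized contraction such as C(a,b,c) = sum_d A(a,d,b)*B(d,c) reduces to the same loop structure once its indices are grouped into row, column, and contracted dimensions, which is why GEMM-style optimizations carry over to tensors.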

Keywords


tensor contraction; linear algebra; high-performance computing

DOI: http://dx.doi.org/10.14529/cmse200202