SOLVING GRID EQUATIONS USING THE ALTERNATING-TRIANGULAR METHOD ON A GRAPHICS ACCELERATOR

© 2023 A.I. Sukhinov1, V.N. Litvinov1,2, A.E. Chistyakov1, A.V. Nikitina1,3, N.N. Gracheva1,2, N.B. Rudenko1,2
1Don State Technical University (Gagarin Sq. 1, Rostov-on-Don, 344003 Russia)
2Azov-Black Sea Engineering Institute of Don State Agrarian University (Lenina 21, Zernograd, 347740 Russia)
3Southern Federal University (Bolshaya Sadovaya 105/42, Rostov-on-Don, 344006 Russia)
E-mail: sukhinov@gmail.com, litvinovvn@rambler.ru, cheese_05@mail.ru, nikitina.vm@gmail.com, 79286051374@yandex.ru, nelli-rud@yandex.ru
Received: 15.03.2023


Introduction
Mathematical modeling of physical processes occurring in the environment leads to the need to solve partial differential equations. To study dynamic processes in hydrophysics and hydrodynamics, the diffusion-convection-reaction equation is used [1, 2].
The solution of the equations of mathematical physics is based on approximating them with finite-difference schemes. When an implicit difference scheme is used, the solution of the equation reduces to solving a system of linear algebraic equations (SLAE) of large dimension. The largest computational costs in solving differential equations fall on the solution of this SLAE; therefore, various iterative methods and algorithms for its solution are being developed.

Method of Solving Grid Equations
Solving the equations of mathematical physics can be reduced to solving a system of linear algebraic equations of the form

Ax = f, (1)

where A is a linear, positive definite operator. To solve the grid equation (1), iterative methods are used, which in canonical form can be represented by the equation [1, 2]

B (x^{m+1} - x^m)/\tau_{m+1} + A x^m = f,

where m is the iteration number; \tau_{m+1} > 0 is the iteration parameter; B is the preconditioner. The resulting grid equations are solved using the modified alternating-triangular method (MATM) of variational type. The preconditioner is formed as follows:

B = (D + \omega R_1) D^{-1} (D + \omega R_2), \omega > 0,

where D is the diagonal operator; R_1, R_2 are the lower- and upper-triangular operators, respectively. One step of the MATM of variational type is written in the form

r^m = A x^m - f, B w^m = r^m, x^{m+1} = x^m - \tau_{m+1} w^m,

where r^m is the residual vector; w^m is the correction vector; the parameter s_m describes the convergence rate of the method; k_m describes the ratio of the norm of the skew-symmetric part of the operator to the norm of its symmetric part. The most labor-intensive part of the algorithm is the calculation of the correction vector w^m, which reduces to the solution of two SLAEs with the lower-triangular and upper-triangular matrices:

(D + \omega R_1) y^m = r^m, (D + \omega R_2) w^m = D y^m.

A fragment of the algorithm for solving the SLAE with the lower-triangular matrix is given in Algorithm 1.

Algorithm 1 matm(IN: n_1, n_2, n_3, a_0, a_2, a_4, a_6, \omega; IN/OUT: r)
  ...
  if a_0[p_0] > 0 then
    p_2 ← p_0 - 1; p_4 ← p_0 - n_1; p_6 ← p_0 - n_1 · n_2
  ...

The residual vector is calculated in 14N arithmetic operations. The total number of arithmetic operations required to solve the SLAE with the seven-diagonal matrix using the MATM, in the case of known iterative parameters \tau_{m+1}, \omega_{m+1}, is 35N, where N = n_1 n_2 n_3 is the SLAE dimension.
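The forward sweep for the lower-triangular SLAE can be sketched sequentially on the CPU as follows. This is a minimal illustration, not the CUDA implementation from the paper, and the convention that a_2, a_4, a_6 store the magnitudes of the lower co-diagonal coefficients (so the matrix entries are their negatives) is an assumption:

```cpp
#include <vector>
#include <cstddef>

// Sequential sketch of the lower-triangular sweep for a seven-diagonal
// matrix on an n1 x n2 x n3 grid. Names follow Algorithm 1: a0 is the
// diagonal, a2/a4/a6 the lower co-diagonal coefficients; nodes with
// a0[p0] <= 0 are treated as fictitious (outside the domain) and skipped.
void lowerTriangularSweep(int n1, int n2, int n3,
                          const std::vector<double>& a0,
                          const std::vector<double>& a2,
                          const std::vector<double>& a4,
                          const std::vector<double>& a6,
                          double w, std::vector<double>& r) {
    for (int k = 0; k < n3; ++k)
        for (int j = 0; j < n2; ++j)
            for (int i = 0; i < n1; ++i) {
                std::size_t p0 = i + (std::size_t)n1 * j
                               + (std::size_t)n1 * n2 * k;
                if (a0[p0] <= 0.0) continue;
                double s = r[p0];
                // Neighbours p2 = p0-1, p4 = p0-n1, p6 = p0-n1*n2 have
                // already been computed by the time p0 is visited.
                if (i > 0) s += w * a2[p0] * r[p0 - 1];
                if (j > 0) s += w * a4[p0] * r[p0 - n1];
                if (k > 0) s += w * a6[p0] * r[p0 - (std::size_t)n1 * n2];
                r[p0] = s / a0[p0];
            }
}
```

The skip of nodes with a_0[p_0] ≤ 0 mirrors the `if a_0[p_0] > 0` check in Algorithm 1; the loop order guarantees that all three lower neighbours are final before a node is updated.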

Decomposition Model of Computational Domain
Let Q be the set of technical characteristics of the video adapter; we represent these characteristics as a tuple

Q = ⟨q_1, q_2⟩,

where q_1 is the amount of video memory of the video adapter, GB; q_2 is the number of streaming multiprocessors. If S is the set of program threads involved in the computational process, then

S = {s_k}, k = 1, ..., q_2,

where s_k is a CUDA streaming block that implements the calculation process on the GPU streaming multiprocessor with index k.
Let us take a computational domain with the following parameters: l_x is the characteristic size along the axis Ox, l_y along the axis Oy, and l_z along the axis Oz.
We associate the specified domain with a uniform computational grid of the following type:

w̄_h = {x_i = i h_x, y_j = j h_y, z_k = k h_z; i = 0, ..., n_x - 1; j = 0, ..., n_y - 1; k = 0, ..., n_z - 1},

where h_x, h_y, h_z are the steps of the computational grid along the corresponding spatial directions; n_x, n_y, n_z are the numbers of grid nodes along the corresponding spatial directions. Then the set of nodes of the computational grid can be represented as

G = {g_{i,j,k}},

where g_{i,j,k} is a grid node. The number of nodes of the computational grid N_G is calculated by the formula

N_G = n_x · n_y · n_z.

By a block G_{k_1} ⊂ G of the computational grid (hereafter, a block) we mean a subset of the nodes of the computational grid G.
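The node numbering used throughout the paper (p_0 = i + n_1·j + n_1·n_2·k in Algorithm 1) and the node count N_G can be expressed as small helpers; the function names are illustrative:

```cpp
#include <cstddef>

// Linear index of grid node g_{i,j,k} in an (nx x ny x nz) grid stored
// x-fastest, matching p0 = i + n1*j + n1*n2*k in Algorithm 1.
std::size_t nodeIndex(int i, int j, int k, int nx, int ny) {
    return (std::size_t)i + (std::size_t)nx * j + (std::size_t)nx * ny * k;
}

// Total number of grid nodes N_G = nx * ny * nz.
std::size_t nodeCount(int nx, int ny, int nz) {
    return (std::size_t)nx * (std::size_t)ny * (std::size_t)nz;
}
```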
where K_{k_1} = {1, ..., N_{k_1}} is the set of indices of blocks G_{k_1} of the computational grid G; N_{k_1} is the number of blocks G_{k_1}, N_{k_1} = q_2; K_{k_1}, N_{k_1} ⊂ N; N is the set of natural numbers; k_1 is the index of block G_{k_1}. Since G_{k_1} ⊂ G,

G_{k_1} = {g^{k_1}_{i, j̃, k}},

where g^{k_1}_{i, j̃, k} is a node of block k_1; the ∼ sign denotes belonging to the block; j̃ is the node index within block k_1 along the coordinate y; n^{k_1}_y is the number of nodes in block k_1 along the coordinate y.
where n^b_y is the number of nodes along the coordinate y of the b-th block. By a fragment G_{k_1,k_2} of the computational grid (hereafter, a fragment) we mean a subset of the nodes of block G_{k_1}:

G_{k_1,k_2} = {ğ_{ĭ, j̃, k̆}},

where ğ_{ĭ, j̃, k̆} is a fragment node; the ⌣ sign denotes belonging to a fragment; ĭ, k̆ are the indices of a fragment node along the coordinates x, z; n̆_x, n̆_z are the numbers of nodes of the computational grid in the fragment along the coordinates x, z.
Each index k_2 of fragment G_{k_1,k_2} is associated with a tuple of indices ⟨k_3, k_4⟩ that stores the fragment coordinates in the plane xOz, where k_3 is the fragment index along the coordinate x and k_4 is the fragment index along the coordinate z.
where k_3 is the fragment index along the coordinate x; k_4 is the fragment index along the coordinate z; K_{k_3} is the number of fragments along the Ox axis. The number of fragments G_{k_1,k_2} of block G_{k_1} is calculated by the formula

N_{k_2} = K_{k_3} · K_{k_4},

where K_{k_4} is the number of fragments along the coordinate z.
where n̆_b is the number of nodes in the b-th fragment. Let us introduce a set M of correspondences between computational grid blocks and program threads:

M = {⟨G_{k_1}, s_{k_1}⟩},

where s_{k_1} ∈ S is the program thread that calculates block G_{k_1}.
Solving Grid Equations Using the Alternating-triangular Method on GPU

Вестник ЮУрГУ. Серия «Вычислительная математика и информатика»

For the domain decomposition, it is necessary to take into account the computing performance of the devices involved in the calculations. Performance here refers to the number of nodes of the computational grid processed by a given algorithm per unit of time.
To calculate the number of nodes along the coordinate y in the blocks of the computational grid processed by the GPU streaming multiprocessors, we use formulas in which n_{yGT} is the number of computational grid nodes along the coordinate y in all such blocks except the last one, and n_{yGTL} is the number of nodes along the coordinate y in the last block processed by the GPU streaming multiprocessors.
The number of computational grid fragments along the coordinate y is equal to N_{fy}. Let the numbers of fragments along the coordinates x and z be N_{fx} and N_{fz}, respectively. Then the number of nodes of the computational grid along the coordinate x is calculated by formulas in which n_{fx} is the number of nodes along the coordinate x in all fragments except the last one, and n_{fLx} is the number of nodes along the coordinate x in the last fragment.
Similarly, the number of nodes of the computational grid along the coordinate z is calculated, where n_{fz} is the number of nodes along the coordinate z in all fragments except the last one, and n_{fLz} is the number of nodes along the coordinate z in the last fragment.
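The size split described above can be sketched as a small helper, assuming the common rule that every part except the last receives the ceiling of the average size and the last part takes the remainder; the exact rounding rule is an assumption:

```cpp
#include <utility>

// Split n grid nodes into nParts consecutive parts: every part except the
// last gets ceil(n / nParts) nodes, the last gets whatever remains.
// Returns {regularSize, lastSize}, mirroring the pairs (n_fx, n_fLx) and
// (n_fz, n_fLz) in the text.
std::pair<int, int> splitNodes(int n, int nParts) {
    int regular = (n + nParts - 1) / nParts;   // ceiling division
    int last = n - regular * (nParts - 1);     // remainder for the last part
    return {regular, last};
}
```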
Suppose that on M it is necessary to organize a parallel process for calculating some function F, and the calculations in each fragment G_{k_1,k_2} depend on the values in neighboring fragments, each of which has at least one of the indices along the coordinates x, y, and z one less than the current one.
To organize the parallel-pipeline method, let us introduce a set of tuples A that defines correspondences a between program threads s_k processing fragments G_{k_1,k_2} and the numbers of steps r of the parallel-pipeline method.
where r = 1, ..., N_r is the step number of the parallel-pipeline method, and N_r is the number of steps of the parallel-pipeline method, calculated by the formula

N_r = N_{fx} · N_{fz} + N_{fy} - 1.

Full loading of all calculators in the proposed parallel-pipeline method starts at step r_START = N_{fy} and ends at step r_STOP = N_{fx} · N_{fz}. The calculation time of some function F by the parallel-pipeline method can be written in a form where T_a is a vector of the times of fragment processing in parallel mode.
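The total step count admits the closed form below; this formula is a reconstruction consistent with the start and stop steps r_START = N_{fy} and r_STOP = N_{fx}·N_{fz} given above, not a quotation from the source:

```cpp
// Number of steps of the parallel-pipeline method: the pipeline fills over
// N_fy steps and the last fragment finishes N_fy - 1 steps after the last
// wavefront position, giving N_r = N_fx * N_fz + N_fy - 1.
int pipelineSteps(int nfx, int nfy, int nfz) {
    return nfx * nfz + nfy - 1;
}
```

For the experimental configuration in this paper (4 × 1 × 7 fragments per block), the pipeline degenerates to 28 steps, since each block has a single fragment layer along y.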

Parallel Implementation
The numerical implementation of the MATM for solving SLAEs of high dimension is based on the developed parallel algorithms that implement a pipeline computing process. These algorithms make it possible to fully utilize all available streaming multiprocessors of the graphics accelerator.
A class library was developed in C++ to describe the domain decomposition. The class library contains the following classes:
• Grid3D describes the parameters of the computational grid (the number of nodes n_x, n_y, n_z and the step sizes h_x, h_y, h_z along the spatial coordinates) and contains an array of objects of the GridBlock3D class.
• GridBlock3D describes the parameters of a computational grid block and contains an array of objects of the GridFragment3D class.
• GridFragment3D describes the parameters of a computational grid fragment and contains the data arrays.
The organization of the calculations is performed by an algorithm that controls all available streaming multiprocessors of the GPU (calculators). Each calculator performs calculations only for its own block of the computational domain. For this, the computational domain is divided into blocks that are assigned to individual calculators (Fig. 1); each block is then divided into fragments. Notation in Fig. 1: SM 1, SM 2, SM 3 are streaming multiprocessors of the GPU.
A graph model was used to describe the relationships between adjacent fragments of the computational grid and the organization of the pipeline calculation process (Fig. 2). Each graph node is an object of a class GridFragment3D that describes a fragment of the computational domain. This class contains the following fields: the dimensions of the fragment along the Ox, Oy, and Oz axes; the index of the zero node of the fragment in the global computational domain; pointers to adjacent fragments of the computational grid; pointers to objects that describe the parameters of calculators. The computational process is a graph traversal from the root node with parallel launch of calculators that process the graph nodes in accordance with the value of the calculation step counter r.
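A minimal sketch of how such a class hierarchy might look is given below. All field names are illustrative assumptions; only the structure (grid → blocks → fragments, neighbour pointers for the pipeline graph, and the zero-node index) follows the description in the text:

```cpp
#include <array>
#include <vector>
#include <cstddef>

// Sketch of the decomposition class library described in the text.
struct GridFragment3D {
    std::array<int, 3> size{};         // fragment extent along Ox, Oy, Oz
    std::size_t zeroNodeIndex = 0;     // index of the fragment's zero node
                                       // in the global computational domain
    std::array<GridFragment3D*, 3> prev{}; // neighbours with index-1 in x, y, z
    std::vector<double> data;          // per-node data arrays
};

struct GridBlock3D {
    std::vector<GridFragment3D> fragments; // fragments of this block
};

struct Grid3D {
    std::array<int, 3> nodes{};        // n_x, n_y, n_z
    std::array<double, 3> step{};      // h_x, h_y, h_z
    std::vector<GridBlock3D> blocks;   // one block per calculator
};
```

In this layout the pipeline traversal of the paper's graph model reduces to following the `prev` pointers from the root fragment, launching a calculator whenever all of a node's predecessors have completed the current step r.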
An algorithm and its program implementation in the CUDA C language were developed to improve the calculation efficiency for the computational grid fragments assigned to the graphics accelerator [13-17].
We present an algorithm for finding the solution of the system of equations with the lower-triangular matrix (the forward sweep) in CUDA C.
Fig. 1. Decomposition of the three-dimensional computational domain

The input parameters of the algorithm are the vectors of coefficients of the grid equations a_0, a_2, a_4, a_6 and the constant ω. The output parameter is the vector of the water flow velocity v. Before running the algorithm, it is necessary to programmatically set the dimensions of the CUDA computing block blockDim.x, blockDim.z along the spatial coordinates x, z, respectively. The CUDA framework runs this algorithm for each thread, and the variables threadIdx.x, threadIdx.z, blockIdx.x, blockIdx.z are automatically initialized with the indices of the corresponding threads and blocks. Global thread indices are calculated in lines 1 and 2. The row index i and the layer index k processed by the current thread are calculated in line 3, and a variable j is initialized that serves as a counter along the coordinate y. The calculation pipeline is organized as a loop in line 4. The indices of the central node of the grid stencil p_0 and of the surrounding nodes p_2, p_4, p_6 are calculated in line 8. The two-dimensional array cache is located in the GPU shared memory and is designed to store the calculation results on the current layer along the coordinate y. This reduces the number of reads from slow global memory and accelerates the calculation process by up to 30%.
The research performed shows a significant dependence of the preconditioner calculation time on the ratio of threads along the spatial coordinates. A series of experiments was performed in advance to calculate the performance of the calculators, defined as the 95th percentile of the calculation time per 1000 nodes of the computational grid.
A GeForce GTX 1650 video adapter was used in the experimental research. It has 4 GB of video memory, core and memory clock frequencies of 1485 MHz and 1665 MHz, respectively, and a 128-bit video memory bus. The computing part consists of 14 streaming multiprocessors (SM).
The purpose of the experiment is to determine the distribution of threads along the Ox and Oz axes. A regression equation was obtained as a result of processing the experimental data (Fig. 3), where T_GPU is the implementation time of one MATM step on the GPU per 1000 nodes of the computational grid, ms. The coefficient of determination was 0.86; a = 0.026; b = 2 · 10^-7; c = 16 · 10^-5; d = 77 · 10^-5.
To evaluate the effectiveness of the parallel-pipeline method for solving the SLAE with a lower-triangular matrix, a numerical experiment was performed. The dimensions of the three-dimensional uniform computational grid along the spatial coordinates x, y, and z were set to 640, 224, and 448, respectively. The amount of video memory used was 3.8 GB.
In the course of the experimental research, we varied the number of streaming multiprocessors N_{k_1} involved in the calculations and recorded the computation time T_M. For each experiment, the computational grid was divided into three-dimensional blocks and fragments. The number of blocks was set equal to the number of streaming multiprocessors. The numbers of fragments in the blocks along the spatial coordinates x, y, and z were set to 4, 1, and 7, respectively. The sizes of fragments along the spatial coordinates x (n^{k_1}_x) and z (n^{k_1}_z) were set to 160 and 64, respectively. The speedup S_p = T_M(1)/T_M(i) and the efficiency E_p = S_p(i)/N_{k_1}(i) were calculated from the experimental data. The results of the numerical experiments are shown in Tab. 1.
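The speedup and efficiency definitions used for Tab. 1 translate directly to code; the function names are illustrative:

```cpp
// Speedup relative to a single streaming multiprocessor:
// S_p(i) = T_M(1) / T_M(i).
double speedup(double t1, double ti) { return t1 / ti; }

// Parallel efficiency: E_p(i) = S_p(i) / N_k1(i),
// where nBlocks is the number of streaming multiprocessors used.
double efficiency(double sp, int nBlocks) { return sp / nBlocks; }
```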

Conclusions
To solve grid equations using the MATM on the graphics accelerator, a decomposition model of the computational domain has been developed. The computational domain is divided into blocks along the spatial coordinate y, and the blocks are then divided into fragments along the spatial coordinates x and z. This model makes it possible to map a block of the computational domain to each GPU streaming multiprocessor and to organize a parallel-pipeline computational process. A graph model was proposed that describes the relationships between adjacent fragments of the computational grid and the pipeline calculation process. The algorithm for solving the system of equations with a lower-triangular matrix in the CUDA C language was described.
As a result of the experiment, a regression model was obtained that describes the dependence of the time for calculating one MATM step on the GPU. According to the regression model, at k < 10 and Y < 1000 the calculation slows down, which is explained by inefficient use of the shared memory of the graphics accelerator.
The results of calculations of the SLAE with the lower-triangular matrix by the parallel-pipeline method on the GPU using different numbers of streaming multiprocessors are presented. At N_{k_1} = 14, the speedup S_p was 9.5 and the efficiency E_p was 0.668.
The reported study was funded by the Russian Science Foundation (project No. 21-71-20050).