Modern and next-generation high-performance computer systems with reconfigurable architecture

Ilya I. Levin, Alexey I. Dordopulo, Yuri I. Doronchenko, Maxim K. Raskladkin

The paper considers characteristics of reconfigurable computer systems (RCS): the computational modules 24V7-750 and Taygeta, placed into a computational rack, and the desktop reconfigurable computational block "Celaeno". These systems are based on field programmable gate arrays (FPGAs) of the Xilinx Virtex-7 family. We also consider the architecture and assembly of next-generation RCSs with a liquid cooling system and give results of calculations and prototyping of the principal technical solutions. We consider technologies for solving applied tasks by means of a complex of application software development tools. Next-generation RCSs with a liquid cooling subsystem provide a performance of 1 PFlops for a standard 47U computational rack with a power of 150 kW. Such systems have a considerable advantage in engineering and economic parameters such as real and specific performance, power efficiency, mass and dimensions, etc., in comparison with similar systems.


I. INTRODUCTION
One of the promising approaches to achieving high real performance of a computer system is to adapt its architecture to the structure of the task being solved and to create a computing device which performs structural and procedural fragments of calculations with the same efficiency. That is why domestic [1] and foreign computer vendors use field programmable gate arrays (FPGAs) more and more frequently. FPGAs speed up calculations of computationally laborious fragments. It is possible to create stand-alone accelerators which contain one or two FPGAs, or computing complexes. Such corporations as Nallatech [2] and Pico Computing, Inc. [3] produce accelerators and base boards with a small number (up to 4) of FPGAs which are used as components of servers and heterogeneous clusters.

Figure 1-a shows the RCB Celaeno. Figure 1-b shows the open block (no top cover) and its printed circuit board.
The RCB Celaeno is produced in two modifications: Celaeno-K, based on Kintex-7 XC7K160T FPGAs, and Celaeno-V, based on Virtex-7 XC7VX485T FPGAs. Specifications of the two modifications of the RCB Celaeno are given in Table 1. The RCB Celaeno contains 6 FPGAs of the computational field, an embedded host-computer, a power supply system, a control system, a cooling system and other subsystems. All FPGAs of the computational field are connected in a lattice-like structure by LVDS-channels, and each FPGA is connected to its own units of dynamic memory of 256 MByte each.
An embedded computer (a computer-on-module of the Kontron COM-Express family) is used to control and configure the computational field of the RCB. It is placed on the printed circuit board of the computational module. It provides connection with peripheral devices, development and debugging of parallel applications for computationally laborious tasks, and generation of initial data files, which, together with the executable file of the application, are loaded into the computational field via the PCI-Express bus and an LVDS-channel. When the task is done, its results are transferred to the COM-Express processor unit.
Possible areas of application of the RCB Celaeno are symbolic processing, mathematical physics, simulation and computational experiment, digital signal processing, linear algebra, etc.

B. Reconfigurable
Under state contract No. 14.527.12.0004 of 03.10.2011, the scientific team of SRI MCS SFU designed a reconfigurable computer system RCS-7 based on Virtex-7 FPGAs, which contains a computational field of 576 Virtex-7 XC7V585T-FFG1761 FPGAs (58 million equivalent gates each), assembled into one 47U computational rack with a peak performance of 10^15 fixed-point operations per second. The principal structural component of the RCS-7, intended for placement into a standard 19" computational rack, is the computational module (CM) 24V7-750 (CM Pleiad), which contains 4 boards of the computational module (BCM) 6V7-180 (see Fig. 2), a control unit CU-7, a power supply subsystem, a cooling subsystem, and other subsystems. Fig. 2 shows the CM 24V7-750.
Each board of the CM 24V7-750 contains 6 Virtex-7 XC7V585T-1FFG1761 FPGAs of the computational field, connected sequentially, and 12 MT47H128M16HR-25E chips of distributed dynamic memory, organized as 128 M × 16 with a read/write frequency of up to 400 MHz. The total size of distributed dynamic memory is 12 GByte. Data can be transferred between the FPGAs via 144 LVDS differential lines at a frequency of 800 MHz. The performance of one board is 645.9 GFlops for processing of 32-bit floating point data, and the performance of the CM 24V7-750 is 2.58 TFlops for processing of 32-bit floating point data.
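The memory and performance figures of the CM 24V7-750 can be cross-checked with simple arithmetic. The following Python sketch is illustrative only: the constant and function names are hypothetical, and it assumes that "128 M × 16" means 128 × 2^20 words of 16 bits per chip and that the quoted 12 GByte refers to all four boards of one CM.

```python
# Figures taken from the text; names are illustrative, not part of any RCS tool.
CHIPS_PER_BOARD = 12
WORDS_PER_CHIP = 128 * 2**20   # "128 M" words per chip (assumed binary mega)
BITS_PER_WORD = 16
BOARDS_PER_CM = 4
BOARD_GFLOPS_FP32 = 645.9


def chip_bytes() -> int:
    """Capacity of one memory chip in bytes."""
    return WORDS_PER_CHIP * BITS_PER_WORD // 8


def board_memory_gbyte() -> float:
    """Distributed memory of one board in GByte (2**30 bytes)."""
    return CHIPS_PER_BOARD * chip_bytes() / 2**30


def cm_memory_gbyte() -> float:
    """Distributed memory of one computational module (4 boards)."""
    return BOARDS_PER_CM * board_memory_gbyte()


def cm_tflops_fp32() -> float:
    """32-bit floating point performance of one computational module."""
    return BOARDS_PER_CM * BOARD_GFLOPS_FP32 / 1000.0


print(board_memory_gbyte(), cm_memory_gbyte(), round(cm_tflops_fp32(), 2))
```

Under these assumptions one board carries 3 GByte, so the 12 GByte total matches four boards, and four boards at 645.9 GFlops give the quoted 2.58 TFlops per module.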

C. Reconfigurable computational module Taygeta
The scientific team of SRC SC & NC has designed a 19" 2U computational module Taygeta, based on Virtex-7 FPGAs and intended for high-performance multirack RCSs. Fig. 3-a shows the CM Taygeta, which contains 4 boards 8V7-200, an embedded host-computer, a power supply system, a control system, a cooling system, and other subsystems. The boards of the CM Taygeta are connected by LVDS-channels running at frequencies of up to 1000 MHz. Fig. 3-b shows the board 8V7-200.
The board of the computational module (BCM) 8V7-200 is a 20-layer printed circuit board with double-sided mounting of components. It contains 8 XC7VX485T-1FFG1761 FPGAs (48.5 million equivalent gates each), 16 chips of distributed DDR2 SDRAM memory with a total capacity of 2 GByte, LVDS and Ethernet interfaces, and other components.
The performance of one BCM 8V7-200 is 667 GFlops for processing of 32-bit floating point data, and the performance of the CM Taygeta is 2.66 TFlops.

D. RCS based on CM Pleiad and CM Taygeta
On the basis of the CM Pleiad described above, in 2013 we designed the reconfigurable computer system RCS-7 (Fig. 4-a), which contains 24 computational modules and can be extended up to 36 computational modules. With 24 to 36 CMs 24V7-750, the performance of the RCS-7 is 62 to 93 TFlops for processing of 32-bit floating point data and 19.4 to 29.4 TFlops for processing of 64-bit floating point data.
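The scaling of the RCS-7 with the number of modules can be reproduced from the per-module figures. The sketch below is a minimal illustration with hypothetical names; it assumes that performance scales linearly with the number of CMs and uses the 2.58 TFlops and "4 boards of 6 FPGAs" values given for the CM 24V7-750.

```python
# Per-module figures from the text; linear scaling is an assumption.
CM_TFLOPS_FP32 = 2.58
FPGAS_PER_CM = 4 * 6  # 4 boards with 6 FPGAs each


def rcs_summary(num_cms: int) -> dict:
    """FPGA count and 32-bit floating point performance for num_cms modules."""
    if num_cms <= 0:
        raise ValueError("num_cms must be positive")
    return {
        "cms": num_cms,
        "fpgas": num_cms * FPGAS_PER_CM,
        "tflops_fp32": round(num_cms * CM_TFLOPS_FP32, 1),
    }


for n in (24, 36):
    print(rcs_summary(n))
```

For 24 modules this gives 576 FPGAs, the size of the computational field stated for the RCS-7, and roughly 62 TFlops; for 36 modules it gives roughly 93 TFlops.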

A. Reconfigurable computational block based on UltraScale FPGAs
The RCB Celaeno-U now being designed will also be produced as a 1U block, but in contrast to its predecessors Celaeno-K and Celaeno-V, it will contain 4 Xilinx UltraScale XCVU095 FPGAs (95 million equivalent gates each), which will form a computational field of 380 million equivalent gates in total. Fig. 5 shows the structure chart of the RCB Celaeno-U and the assembly outline of the board.
In comparison with the previous version, the RCB Celaeno-V, the performance of the RCB Celaeno-U will increase by a factor of 1.7 to 1.8, while its power consumption will grow by no more than a factor of 1.3.
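The two growth factors imply a gain in performance per watt, which can be estimated as their ratio. The sketch below is only an illustration with hypothetical names; because power is stated to grow by no more than a factor of 1.3, the computed values are lower bounds on the gain.

```python
def efficiency_gain(perf_factor: float, power_factor: float) -> float:
    """Ratio of performance growth to power growth (performance per watt)."""
    if perf_factor <= 0 or power_factor <= 0:
        raise ValueError("factors must be positive")
    return perf_factor / power_factor


# Factors quoted in the text for Celaeno-U relative to Celaeno-V.
low = efficiency_gain(1.7, 1.3)
high = efficiency_gain(1.8, 1.3)
print(f"performance per watt grows by at least {low:.2f} to {high:.2f} times")
```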

B. RCS with liquid cooling based on UltraScale FPGAs
The era of air cooling in modern high-performance computer systems and the supercomputers built on them, including reconfigurable supercomputers, is practically over. Most computer designers are turning to liquid cooling systems, which help to solve the cooling problems of the computer complexes being designed. It is reasonable to use liquid cooling, particularly immersion of the boards of computational modules into a liquid cooling agent (mineral oil), for computational modules of RCSs designed on the basis of next-generation FPGA families.
The design of next-generation RCSs based on liquid cooling is being actively developed at SRC SC & NC. New designs of printed circuit boards and computational modules with high board density are being developed. Specifically, the next-generation computational modules Scate-8 for multirack RCSs of super-high performance are being designed at present.
The board of the next-generation computational module contains 8 Virtex UltraScale FPGAs (not less than 100 million equivalent gates each). The computational module consists of two sections. The first section contains 16 boards of the computational module with a power of up to 800 W each, completely submerged into an electrically neutral liquid cooling agent. The second section contains a pump system and a heat-transfer device, which provide the flow and cooling of the cooling agent. Fig. 6-a shows the outline of the 3U CM.
According to the analysis performed, the use of liquid cooling and the creation of computer systems on the basis of the CM Scate-8 provide a performance of more than one petaflops for a single computational rack of the RCS. The 19" computational rack of the supercomputer can contain up to 12 CMs Scate-8 with liquid cooling. Fig. 6-b shows the outline of the rack. Table 2 contains the performance and the power of the next-generation RCS. In 2015-2016, on the basis of the described design, we will create super-high-performance computer complexes with effective cooling of computational FPGAs both of the UltraScale family and of the next-generation FPGA family.
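The rack-level figures follow from the per-board numbers given above. The sketch below is a rough budget with hypothetical names; it assumes every board draws its maximum of 800 W and ignores the pump and heat-transfer section, so the power value is an estimate for the boards only.

```python
# Per-board and per-module figures from the text.
BOARDS_PER_CM = 16
FPGAS_PER_BOARD = 8
MAX_BOARD_POWER_W = 800
CMS_PER_RACK = 12


def rack_budget(cms: int = CMS_PER_RACK) -> dict:
    """Board count, FPGA count and maximum board power for one rack."""
    if cms <= 0:
        raise ValueError("cms must be positive")
    boards = cms * BOARDS_PER_CM
    return {
        "boards": boards,
        "fpgas": boards * FPGAS_PER_BOARD,
        "board_power_kw": boards * MAX_BOARD_POWER_W / 1000,
    }


print(rack_budget())
```

With 12 modules the rack holds 192 boards and 1536 FPGAs, and the boards draw up to about 154 kW, which is of the same order as the 150 kW rack power quoted for the next-generation RCS.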

IV. RCS SOFTWARE
At present there are many development suites for creating structural solutions of applied tasks for FPGAs. The most popular suites, which can be used as separate development tools and as parts of larger complexes, are the synthesizers developed by FPGA vendors: ISE and Vivado (Xilinx, Inc.) [8], Quartus II (Altera Corporation) [9] and Actel Libero IDE (Actel Corporation) [10]. Besides the development environment for digital devices, these software tools contain a number of utilities: timing analyzers, placement editors, FPGA programming units, digital device simulators, etc. Owing to this wide range of tools, these development suites provide a complete cycle of digital device development within a single FPGA: development of the initial description of the project, synthesis, simulation, placement, routing, and chip programming.
Continuous growth of FPGA capacity makes the design of applied task solutions for FPGAs by means of hardware description languages (VHDL, AHDL, Verilog, etc.) [11] and the design of digital devices by means of graphic editors more and more laborious. That is why the leading vendors of FPGAs and reconfigurable computers are now turning to high-level languages. As a result, the new development environment Vivado by Xilinx, Inc. contains a new design tool, Vivado HLS, based on a high-level language. The development kit Altera SDK [12], used for Altera FPGA design, contains tools for OpenCL, a new standard for parallel programming of heterogeneous systems. These solutions use translators of C-like languages, which generate code in hardware description languages at the register transfer level (RTL, C-to-RTL) from a program in some C-like high-level programming language.
Although the syntax of these C-like languages is similar to that of the C language, this approach does not mean that initial C code developed for a PC or a cluster computer system will be correctly interpreted by C-to-RTL translators. The C language was chosen as the basis because of its wide popularity, which makes mastering the new FPGA application development and design tools much easier.
In addition, when C-to-RTL translators are used, the whole application or its explicitly selected procedures are translated into RTL-descriptions for single FPGAs. Such development suites have no tools for automatic decomposition of the parallel program into fragments for a set of interconnected FPGAs.
With Vivado HLS, the project is designed within one FPGA. If the application needs more hardware resource than one FPGA provides, the developer must distribute the calculations between several projects, one for each FPGA, and synchronize the control and data streams between them.
The OpenCL standard is used by the company Nallatech (a vendor of reconfigurable computers) and allows the use of several FPGAs in one project. In this case solutions in FPGAs are programmed by means of functions called from the tool library of the Altera SDK. Each FPGA involved in the computational process performs the calculations described by a certain fragment of code. So, a program written according to the OpenCL standard consists of basic code, written for traditional processors, and some separate fragments of code, written for FPGAs, which are involved in the computational process as co-processors. In this case data synchronization is the responsibility of the programmer.
Another well-known FPGA programming tool is a complex created by the company Mitrionics Inc., which contains the Mitrion Virtual Processor (MVP), programmed in the high-level programming language Mitrion-C, and a library of functions MithalAPI, included in the development kit Mitrion SDK [13] for development of host-programs. A Mitrion-C program must be completely implemented on a single virtual processor MVP. Multichip RCSs cannot be programmed directly, which considerably reduces the effectiveness of the Mitrionics software complex. To program multichip RCSs which consist of interconnected FPGAs, the programmer must implement an interface (protocol) of data exchange between FPGAs and solve the problems of data flow synchronization. In this case the RCS program degenerates into a program for a cluster (a set of MVPs) implemented in FPGAs, which considerably reduces the effectiveness of tasks implemented on multichip RCSs.

V. LANGUAGE COLAMO AND SOFTWARE COMPLEX FOR MULTICHIP RCS
An alternative approach to RCS programming is suggested by SRI MCS SFU, which has been designing multichip reconfigurable computer systems of various architectures and configurations for more than 15 years.
The experience of SRI MCS SFU in solving problems of various types has shown that effective solving of modern laborious problems requires programming tools which provide:
- programming in a high-level programming language;
- support of multichip programming;
- high operating frequency of FPGAs;
- high density of placement in FPGAs;
- support of pipeline and macropipeline organization of calculations.
Specialists of SRI MCS SFU have developed and widely use a software complex, which consists of:
- a translator of the programming language COLAMO, which translates the source code written in COLAMO into an information graph of a parallel application;
- the synthesizer Fire!Constructor of scalable circuit solutions at the level of FPGA logic gates, which maps the information graph generated by the COLAMO translator onto an RCS architecture, places the mapped solution into FPGAs, and provides automatic synchronization of the fragments of the information graph in different FPGAs;
- a library of IP-cores, which correspond to operators of the COLAMO language (self-contained structurally implemented hardware devices) for various problem domains, and of interfaces which match data rates and connect all components into a single computing structure;
- debugging tools, access tools, and tools for monitoring the RCS condition.
The high-level language COLAMO is intended for description of the parallel algorithm and creation of a special-purpose computing structure, generated according to the principles of structural procedural organization of calculations [1,14,15], within the RCS architecture. Such a computing structure implies sequential change of fragments of the information graph of the task that are implemented structurally (in hardware). Each fragment is a computational data flow pipeline. So, the application (applied task) for the RCS consists of a structural component, represented as a set of calculation fragments implemented in hardware, and a procedural component, represented as a control program which sequentially changes the computing structures and organizes the data flows. The control component is one and the same for all structural fragments. To provide such organization of calculations, the programming language contains a construct called a "cadr". A cadr is a program-indivisible component: a set of operators implemented as arithmetic-logic instructions and read/write instructions, performed on various functional devices that are interconnected according to the information structure of the algorithm.
The language COLAMO has no explicit forms of parallelism description. Parallelization is provided by declaration of the types of access to variables and by indexing of array items, which is typical for data flow languages. Data can be accessed by two principal methods: parallel access (declared by the Vector type) and sequential access (declared by the Stream type). The degree of parallelism is defined by the minimal value of the parallelization parameter. For the Stream type the degree of parallelism is 1, and for the Vector type it is defined by the minimal Vector size among the arrays involved in the computing process. With parallel access it is possible to process concurrently all dimensions of arrays declared as Vector. In this case the hardware resource needed for calculations grows, but the processing time drops.
Multidimensional data arrays can have many dimensions. Each dimension can have a sequential or a parallel access type, declared by the keywords Stream or Vector, respectively. Changing the access type gives very simple control over the degree of parallelism of calculations, the processing rate, and the occupied resource at the level of data structure description. Owing to this, the programmer can describe various types of parallelism in a rather short form.
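The relation between access types and the degree of parallelism can be modelled in a few lines of Python. This is not COLAMO syntax and not the translator's algorithm: `ArrayDecl`, `parallel_width` and `degree_of_parallelism` are hypothetical names, and the sketch assumes one reading of the rule above, namely that the width of an array is the product of the sizes of its Vector dimensions and that the degree of parallelism is the minimum width over all arrays involved.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Tuple

STREAM, VECTOR = "Stream", "Vector"


@dataclass(frozen=True)
class ArrayDecl:
    """An array described as a tuple of (size, access type) per dimension."""
    name: str
    dims: Tuple[Tuple[int, str], ...]

    def parallel_width(self) -> int:
        # Only Vector dimensions are processed concurrently;
        # an array with Stream dimensions only has width 1.
        return prod(size for size, access in self.dims if access == VECTOR)


def degree_of_parallelism(arrays: List[ArrayDecl]) -> int:
    """Minimum parallel width over all arrays used in the computation."""
    if not arrays:
        raise ValueError("at least one array is required")
    return min(a.parallel_width() for a in arrays)


a = ArrayDecl("a", ((100, STREAM), (8, VECTOR)))
b = ArrayDecl("b", ((100, STREAM), (4, VECTOR)))
s = ArrayDecl("s", ((100, STREAM),))

print(degree_of_parallelism([a, b]))  # limited by the narrower array b
print(degree_of_parallelism([a, s]))  # a purely sequential array gives 1
```

Switching a dimension from Stream to Vector in this model raises the width of that array, which mirrors the trade-off described above: more hardware resource, less processing time.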
Besides the access type, a variable in the language COLAMO also has a storage type: memory (Mem), register (Reg) or commutation (Com).
A memory variable is stored in a cell of distributed memory, and hence it keeps its value until the next reassignment. Only one process (reading or writing) at a time can be performed on a memory variable. That is why, according to the semantics of the COLAMO language, in any cadr any memory variable complies with two rules: the single-assignment rule and the rule of single substitution. The single-assignment rule means that the memory variable changes its value only once in the cadr. The rule of single substitution means that the variable can be used in the cadr for only one process of reading or writing.
To describe connections between the elements of the information graph of the task, the COLAMO language has switching (commutation) variables. Since a switching variable describes information connections, it requires no computational hardware resource of its own. The value of a switching variable cannot be accessed once the cadr is done. The translator needs switching variables to define information dependencies during generation of the computing structure of the task. Like memory variables, switching variables comply with the single-assignment rule, but not with the rule of single substitution. Owing to switching variables, data flows can be easily forked and duplicated, but it is impossible to create recursion.
To implement recursion, the COLAMO language has register variables. A register variable is a hardware register used to store intermediate data obtained during the computational process. The single-assignment rule is the only restriction for register variables in the cadr.
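The rules for the three storage types can be summarised as a small checker. The Python sketch below is an illustrative model, not part of the COLAMO translator: a cadr is reduced to a list of (variable, "read" or "write") accesses, and `check_cadr` is a hypothetical helper that reports violations of the single-assignment rule (all storage types) and of the rule of single substitution (memory variables only).

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

MEM, REG, COM = "Mem", "Reg", "Com"


def check_cadr(storage: Dict[str, str],
               accesses: Iterable[Tuple[str, str]]) -> List[str]:
    """Return a list of rule violations for one cadr.

    storage maps a variable name to its storage type;
    accesses is a sequence of (variable, "read" | "write") pairs.
    """
    writes: Counter = Counter()
    uses: Counter = Counter()
    for var, kind in accesses:
        if kind not in ("read", "write"):
            raise ValueError(f"unknown access kind: {kind}")
        uses[var] += 1
        if kind == "write":
            writes[var] += 1

    errors = []
    for var in sorted(uses):
        # Single assignment: at most one write per cadr, for every storage type.
        if writes[var] > 1:
            errors.append(f"{var}: single-assignment rule violated")
        # Single substitution: a memory variable is used for one access only.
        if storage[var] == MEM and uses[var] > 1:
            errors.append(f"{var}: single-substitution rule violated")
    return errors


storage = {"a": MEM, "c": COM, "r": REG}
ok = [("a", "read"), ("c", "write"), ("c", "read"), ("c", "read"),
      ("r", "read"), ("r", "write")]
bad = [("a", "read"), ("a", "write")]

print(check_cadr(storage, ok))
print(check_cadr(storage, bad))
```

In the first access list the switching variable `c` is read twice (a forked data flow) and the register `r` is both read and written (a recursive dependency), and neither is reported; in the second list the memory variable `a` is both read and written in one cadr, which the model flags.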
Translating a program written in the high-level language COLAMO means generating a circuit configuration of the computer system (a structural component) and a parallel program which controls data flows (a stream component and a procedural component) [1,14,15]. Generating the structural component means creating a computational graph which corresponds to the information dependencies between the results of calculations. For each operation used in the program, a specialized computing unit is substituted according to data access, data types, their capacity, etc. The synthesized information graph of the task is transferred to the synthesizer Fire!Constructor for mapping onto the hardware resource of the multichip RCS [16].
Automatic mapping of the parallel program onto the hardware resource of the multichip RCS consists of three steps: partitioning of the information graph into disjoint subgraphs, placement of the subgraphs into the FPGAs of the RCS, and routing of the external connections of the placed subgraphs within the RCS communication system.
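The three steps can be illustrated on a toy graph. The Python sketch below is not the algorithm of Fire!Constructor: it swaps in a simple greedy first-fit partitioning, a one-to-one placement, and a listing of the edges that cross FPGA boundaries in place of real routing. All names, node costs and the capacity value are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def partition(nodes: Dict[str, int], capacity: int) -> List[Set[str]]:
    """Step 1: split nodes into disjoint subgraphs that fit one FPGA each."""
    parts: List[Set[str]] = []
    loads: List[int] = []
    for name, cost in nodes.items():
        if cost > capacity:
            raise ValueError(f"node {name} does not fit into one FPGA")
        for i, load in enumerate(loads):
            if load + cost <= capacity:  # first subgraph with enough room
                parts[i].add(name)
                loads[i] += cost
                break
        else:
            parts.append({name})
            loads.append(cost)
    return parts


def place(parts: List[Set[str]], fpgas: List[str]) -> Dict[str, str]:
    """Step 2: assign the i-th subgraph to the i-th FPGA."""
    if len(parts) > len(fpgas):
        raise ValueError("not enough FPGAs for the partition")
    return {node: fpgas[i] for i, part in enumerate(parts) for node in part}


def external_links(edges: List[Edge], placement: Dict[str, str]) -> List[Edge]:
    """Step 3 (simplified): edges that must use inter-FPGA channels."""
    return [(u, v) for u, v in edges if placement[u] != placement[v]]


nodes = {"mul1": 3, "mul2": 3, "add": 2, "acc": 2}  # resource cost per node
edges = [("mul1", "add"), ("mul2", "add"), ("add", "acc")]

parts = partition(nodes, capacity=6)
placement = place(parts, ["FPGA0", "FPGA1"])
links = external_links(edges, placement)
print(parts, placement, links)
```

A production tool would also try to minimise the number of external links and respect the actual topology of the LVDS channels between FPGAs; the sketch only shows how the three steps feed into one another.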
The output of the synthesizer Fire!Constructor is a set of files with VHDL-descriptions, timing constraints, and user constraints. The VHDL-files describe the structural implementation of the fragments of the parallel program. These files and the library of circuit components are the basis for the projects created by the synthesizer ISE for each single FPGA. The synthesizer ISE then generates FPGA bitstream files, which are loaded into the RCS.
A COLAMO application is developed within a single project and can be translated for any RCS which has a description and all required libraries included in the RCS software suite. In contrast to other existing RCS application development suites, the programmer does not need to define in the text of the program which fragments will be performed in which FPGAs. The synthesizer Fire!Constructor splits the computing structure of the COLAMO program into several projects for the synthesizer Xilinx ISE and, in addition, provides synchronization of data flows both inside each FPGA and between them.

VI. CONCLUSION
According to Table 3, FPGAs as the principal components of reconfigurable supercomputers provide a steady, practically linear growth of RCS performance and open new prospects for the creation of supercomputers with petaflops performance. The design solutions used for the next-generation computational modules based on Xilinx Virtex UltraScale FPGAs will make it possible to concentrate a powerful computational resource in a single 47U computational rack and to bring the specific performance of such RCSs to the level of the best cluster supercomputers in the world. Owing to this, UltraScale-based RCSs can be considered as a basis for next-generation high-performance computer complexes, which provide high efficiency of calculations and practically linear growth of performance as the computational resource is extended.

Table 1. Specifications of the RCB Celaeno-K and Celaeno-

Fig. 2. Computational module (CM) 24V7-750 (a - boards of the CM Pleiad, b - CM Pleiad with no top cover/with a top cover)

Fields of application of RCS-7 and RCS-7-based computer complexes are digital signal processing and multichannel digital filtering (Ali M. Reza, 2013; Mazher et al., 2013).

Fig. 4-b shows an RCS designed on the basis of the CM Taygeta. The performance of one rack, which contains 18 CMs Taygeta, is 48 TFlops for processing of single-precision floating point data and 23 TFlops for processing of 64-bit floating point data. High-performance RCSs based on the CM Taygeta are intended for solving computationally laborious problems of science and industry, drug design and symbolic processing. For such problems they provide a significant advantage in most technical and economic parameters, such as specific performance, power efficiency, etc., in comparison with cluster-type multiprocessor computer systems.

III. NEXT-GENERATION RECONFIGURABLE SYSTEMS BASED ON XILINX ULTRASCALE FPGAS
Further development of the open scalable architecture (Levin, 2010), used for the design of RCSs based on Xilinx Virtex-7 FPGAs, relies on next-generation components for newly designed products: Xilinx FPGAs of the new UltraScale family, based on 20 nanometer technology. In comparison with FPGAs of the Virtex-7 family they have lower power consumption and higher performance.

Fig. 3. The CM Taygeta (a - the CM Taygeta without top cover, b - BCM 8V7-200)

Fig. 5. RCB Celaeno-U (a - structure chart, b - assembly outline of the RCB board)

Fig. 6. The outline of the computer system based on liquid cooling (a - the outline of the CM Scate-8, b - the outline of the Scate-8 based computational rack)

Table 2. The performance and power of the next-generation RCS based on Xilinx UltraScale FPGAs

Table 3. Performance of reconfigurable supercomputers

The reconfigurable systems produced at SRC SC & NC make it possible to track how RCS performance grows when the FPGA family is changed.