The Fastest and Energy-Efficient Breadth-First Search Algorithm on a Single Node with Various Parallel Architectures According to Graph500

Alexander S. Kolganov

Abstract


The breadth-first search algorithm is one of the basic algorithms for graph traversing and is used in many other algorithms of high-level graph analysis. BFS is characterized with intensive irregular memory accesses and a random data dependency, which greatly complicates its parallelization to all existing architectures. The paper considers the implementation of the BFS algorithm (the core benchmark of the Graph500 rating) for processing large graphs on different architectures: Intel x86, IBM Power8+, Intel KNL and NVidia GPU. Algorithms for implementing breadth-first search will be considered, such as top-down traverse, bottom-up traverse, and a hybrid traverse that includes both top-down and bottom-up traverses. The features of the algorithm implementation on shared memory will be shown, as well as graph transformations (local sorting of graph vertices, global sorting of graph vertices, renumbering of all graph vertexes, compressed representation of graph vertices), which allow achieving the best performance and energy-efficiency in this algorithm among all single-node systems of Graph500 and GreenGraph500 ratings.

Keywords


parallel graph processing; BFS; CUDA; Power8; KNL; Graph500

References


Cherkassky B.V., Goldberg A.V., Radzik T. Shortest Paths Algorithms: Theory and Experimental Evaluation. Math. Program. 1996. vol. 73. pp. 129–174. DOI: 10.1007/BF02592101.

Moore E.F. The Shortest Path Through a Maze. In Int. Symp. on Th. of Switching. 1959 pp. 285–292.

Chazelle B.A., Minimum Spanning Tree Algorithm with Inverse-Ackermann Type Complexity. Journal of the ACM. 2000. vol. 47, no. 6. pp. 1028–1047.

Kolganov A.S. Parallel Implementation of Minimum Spanning Tree Algorithm on CPU and GPU. Vestnik Yuzho-Uralskogo gosudarstvennogo universiteta. Seriya: Vychislitel’naja matematika i informatika [Bulletin of South Ural State University. Series: Computational Mathematics and Software Engineering]. 2016, vol. 5, no. 3. pp. 5–19. DOI: 10.14529/cmse160301 (in Russian).

Graph500. Available at: http://graph500.org/ (accessed 01.12.2017)

GreenGraph500. Available at: http://green.graph500.org/ (accessed 01.12.2017)

Cormen T., Leiserson C., Rivest R. Introduction to Algorithms. MIT Press, Cambridge. 1990.

Edmonds J., Karp R.M. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems. Journal of the ACM. 1972. vol. 19, no. 2. pp. 248–264. DOI: 10.1007/3-540-36478-1_4.

Brandes U. A Faster Algorithm for Betweenness Centrality. J. Math. Sociol. 2001. vol. 25, no. 2. pp. 163–177. DOI: 10.1080/0022250X.2001.9990249.

Frasca M., Madduri K., Raghavan P. NUMA-Aware Graph Mining Techniques for Performance and Energy Efficiency. Proc. ACM/IEEE Int. Conf. High Performance Computing, Networking, Storage and Analysis (Salt Lake City, Utah, USA, November 10–16, 2012). pp. 1–11. DOI: 110.1109/SC.2012.81.

Girvan M., Newman M.E. Community Structure in Social and Biological Networks. Proc. Natl. Acad. Sci. (USA, June 11, 2002). vol. 99, no. 12. pp. 7821 7826. DOI: 10.1073/pnas.122653799.

Dijkstra E.W. A Note on Two Problems in Connexion with Graphs. Numerische Mathematik. 1959. vol. 1, no. 1. pp. 269–271. DOI: 10.1007/BF01386390.

Top500. Available at: https://www.top500.org/ (accessed 01.12.2017)

Bader D.A., Madduri K. Designing Multithreaded Algorithms For Breadth-First Search and St-Connectivity on the Cray MTA-2. 2006. pp. 523–530. DOI: 10.1109/ICPP.2006.34.

Korf R.E., Schultze P. Large-Scale Parallel Breadth-First Search. AAAI. 2005. pp. 1380–1385.

Yoo A., Chow E., Henderson K., McLendon W., Hendrickson B., Catalyurek U. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. Proceedings of the 2005 ACM/IEEE Conference on Supercomputing (Seattle, Washington, USA, November 12–18, 2005). DOI: 10.1109/SC.2005.4.

Zhang Y., Hansen E.A. Parallel Breadth-First Heuristic Search On A Shared-Memory Architecture. In AAAI Workshop on Heuristic Search, Memory-Based Heuristics and Their Applications. 2006.

Yasui Y., Fujisawa K., Sato Y. Fast and Energy-efficient Breadth-First Search on a Single NUMA System. Lecture Notes in Computer Science. 2014. vol. 8488. pp. 365–381. DOI: 10.1007/978-3-319-07518-1_23.

Hiragushi T., Takahashi D. Efficient Hybrid Breadth-First Search on GPUs. Lecture Notes in Computer Science. 2013. vol. 8286. pp. 40–50. DOI: 10.1007/978-3-319-03889-6_5.

Merrill D., Garland M., Grimshaw A. Scalable GPU graph traversal. Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (New Orleans, Louisiana, USA, February 25–29, 2012). pp. 117–128. DOI: 10.1145/2370036.2145832.

Chakrabarti D., Zhan Y., Faloutsos C. R-MAT: A Recursive Model for Graph Mining. Proceedings of the 2004 SIAM International Conference on Data Mining (Florida, USA, April 22–24, 2004). pp. 442–446. DOI: 10.1137/1.9781611972740.43.

Pissanetzky S. Sparse Matrix Technology. Academic Press. 1984.

Posledovatel’naja realizacija BFS. Available at: https://en.wikipedia.org/wiki/ Breadth-first_search (accessed 01.12.2017)

CUDA Dynamic Parallelism. Available at: https://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/ (accessed 01.12.2017)

Intel Xeon Phi 7250. Available at: https://ark.intel.com/ru/products/94035/Intel-Xeon-Phi-Processor-7250-16GB-1_40-GHz-68-core (accessed 01.12.2017)

Intel Xeon E5 2699 v3. Available at: https://ark.intel.com/ru/products/81061/Intel-Xeon-Processor-E5-2699-v3-45M-Cache-2_30-GHz (accessed 01.12.2017)

IBM Power8 and NVidia Tesla P100. Available at: https://www-03.ibm.com/systems/ru/power/hardware/s822lc-hpc/ (accessed 01.12.2017)

Nvidia NVLink. Available at: http://www.nvidia.com/object/nvlink.html (accessed 01.12.2017)




DOI: http://dx.doi.org/10.14529/cmse180201