Principle Investigator:

Dr. Sheldon Tan (PI)

Department of Electrical and Computer Engineering,

University of California – Riverside

Email: stan@ece.ucr.edu

Web: www.ece.ucr.edu/~stan

Graduate Students:

· Shaoyi Peng,

· Kai He,

· Hengyang Zhao

Funding:

· National Science Foundation, “SHF: Small: GPU-Based Many-Core Parallel Simulation of Interconnect and High-Frequency Circuits”, (CCF-1017090), 9/1/2010-8/30/2014, PI: Sheldon Tan

We thank National Science Foundations for supports of this research work. Any opinions, findings, and conclusions or recommendations expressed in those materials below are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Project Descriptions

Transforming a sparse matrix into its LU form is of crucial importance in linear algebra as it plays an important role in many numerical and scientific computing applications such as finite difference and finite element based methods. LU factorization operation represents the dominant computing cost in those problems and it is very important to improve the efficiency of the LU factorization algorithms. LU factorization for sparse matrices is the most important computing step for general circuit simulation problems for circuit designs. But parallelizing LU factorization on the popular many-core platforms such as Graphic Processing Units (GPU) turns out to be a difficult problem due to intrinsic data dependency and irregular memory access, which diminish GPU computing power.

Modern computer architecture has shifted towards the multi-core processor and many-core architectures. The family of GPU is among the most powerful many-core computing systems in mass-market use. For instance, the start-of-the-art GPU provides massive and fine-grain parallelism with orders of magnitude higher throughput than the CPUs. For instance, the state-of-the-art NVIDIA Tesla V100 GPU with 5120 cores has a peak performance of over 15~TFLOPS versus about 200--400~GFLOPS of Intel i9 series 8 core CPUs. In addition to the primary use of GPUs in accelerating graphics rendering operations, there has been considerable interest in exploiting GPUs for general purpose computation (GPGPU).

GLU was GPU-accelerated parallel sparse LU factorization solver developed by VSCLAB at UC Riverside. GLU is based on a hybrid right-looking LU factorization algorithm. We show that more concurrency can be exploited in the right-looking method than the left-looking method, especially on GPU platforms. GLU has the following features and advantages over existing sparse LU solvers:

· It is based on the hybrid column-based right-looking LU factorization method, which is shown to be more amenable for exploiting the concurrency of LU factorization. The new method preserves the benefit of column-level concurrency and symbolic analysis in the left-looking method meanwhile, it allows more parallelism to be exploited in GPUs.

· GLU solver allows the parallelization of all three loops in the LU factorization on GPUs. In contrast, the existing GPU-based left- looking LU factorization approach can only allow two-level parallelization.

For GLU 1.0, We conduct comprehensive studies on the new GPU LU solver on a number of published matrices, circuit matrices, self-made large circuit matrices against some existing LU solvers such as PARDISO, KLU, UMFPACK to demonstrate the advantage of the proposed parallel LU solver. In the tables, GPU-LL is a recently published GPU-accelerated left-looking sparse LU solver.

Numerical results show that the proposed GLU 1.0 solver can deliver 5.71X and 1.46X speedup over single-threaded and the 16-threaded PARDISO solver respectively, 19.56X speedup over the KLU solver, 47.13X over the UMFPACK solver and 1.47X speedup over a recently proposed GPU-based left-looking LU solver on the set of typical circuit matrices from University of Florida Sparse Matrix Collection (UFL). Furthermore, we also compared the proposed GLU solver on a set of general matrices from UFL, GLU achieves 6.38X and 1.12X speedup over the single-threaded and the 16-threaded PARDISO solver respectively, 39.39X speedup over the KLU solver, 24.04X over the UMFPACK solver and 2.35X speedup over the same GPU-based left-looking LU solver. Also comparison on self-generated RLC mesh networks shows a similar trend, which further validates the advantage of the proposed method over the existing sparse LU solvers.

In 2017, GLU 2.0 was released, GLU 2.0 fixed a data dependency bug (we called double-U column dependency, which is unique to the hybrid right-looking LU factorization used in the GLU solver. However this bug fix (along with a few bug fixes) degraded the performance of GLU solver.

Recently, Lee et al. proposed an enhanced GLU2.0 solver [J2], which considers the column count difference in different level so that some advanced GPU features such

as dynamic kernel launch, pipeline based kernel computing was exploited to further improve the GLU kernel efficiency. But it still uses the fixed GPU resource allocation method from GLU2.0 for each kernel launch.

Recently, we launched the GLU 3.0, which significant enhanced version of GLU 2.0. It has following improvements [J3]: (Please see J3 for more details).

· First, to mitigate the slow process to detect the new double-U data dependency in existing GLU2.0 solver, the new GLU introduces a much more efficient double-U dependency detection algorithm. The new detection method is based on the observation that we just need to find sufficient data dependencies , while necessity is not a must. Such relaxed double-U dependency bows down to find the extra data dependency in L matrix, which is much more efficient than the previous solution with little impacts on the GLU performance.

· Second, we observed that number of columns and its associated subcolumn (updates) of each column, which are important parallel computing task units, are reversely correlated based on circuit matrices we analyzed. As a result, we can use the number of columns as a good metric for resource allocation. We develop three different modes of GPU kernel to adapt to different computing stages to accommodate the computing task changes in the factorization. As a result, the new GLU can dynamically allocate GPU block and wraps based the number of columns in a level to better balance the computing demands and resources in GLU during the LU factorization process.

As a result, GLU 3.0 can deliver 2-3 orders of magnitude speedup over GLU2.0 for the data dependency detection. Furthermore, GLU3.0 consistently outperforms both GLU 2.0 and the recently proposed enhanced GLU2.0 sparse LU solver on the same set of circuit matrices. Furthermore, GLU3.0 achieve 13.0X (arithmetic mean) and 6.7X (geometric mean) speedup over GLU 2.0 and 7.1X (arithmetic mean) and 4.8X (geometric mean) over recent proposed enhanced GLU2.0 sparse LU solver on the same set of circuit matrices.

Software Download

GLU solver version 3.0 (with dynamic resource allocation kernel), which includes all the sources codes, user manual and some examples, are available at Github now.

Please send any problem, bug and comment regarding GLU to Prof. Sheldon Tan at stan@ece.ucr.edu.

Relevant Publications

J1. K. He, S. X.-D. Tan, H. Wang and G. Shi, “GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis”, IEEE Transactions on Very Large Scale Integrated Systems (TVLSI), vol. 24, no.3, pp.1140-1150, March 2016.

J2. Wai-Kong Lee, Ramachandra Achar, and Michel S Nakhla, “Dynamic GPU parallel sparse LU factorization for fast circuit simulation”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, (99):1–12, 2018.

J3. S. Peng and S. X.-D. Tan, “GLU3.0: Fast GPU-based Parallel Sparse LU Factorization for Circuit Simulation”, available at http://arxiv.org/abs/1908.00204