GLU – GPU-Accelerated Sparse Parallel LU Factorization Solver
Version 3.0 (with dynamic resource allocation kernel)
[Last update: Friday, October 2, 2020]
Dr. Sheldon Tan (PI)
Department of Electrical and Computer Engineering,
University of California – Riverside
Email: stan@ece.ucr.edu
· Shaoyi Peng
· Kai He
· Hengyang Zhao
· National Science Foundation, “SHF: Small: GPU-Based Many-Core Parallel Simulation of Interconnect and High-Frequency Circuits” (CCF-1017090), 9/1/2010-8/30/2014, PI: Sheldon Tan
We thank the National Science Foundation for its support of this research work. Any opinions, findings, and conclusions or recommendations expressed in the materials below are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Transforming a sparse matrix into its LU form is of crucial importance in linear algebra, as it plays a central role in many numerical and scientific computing applications such as finite-difference and finite-element based methods. LU factorization represents the dominant computing cost in these problems, so improving the efficiency of LU factorization algorithms is essential. For general circuit simulation in circuit design, sparse LU factorization is the most important computing step. However, parallelizing LU factorization on popular many-core platforms such as Graphics Processing Units (GPUs) is difficult due to intrinsic data dependencies and irregular memory access, which diminish GPU computing power.
Modern computer architecture has shifted towards multi-core and many-core designs. GPUs are among the most powerful many-core computing systems in mass-market use, providing massive, fine-grained parallelism with orders of magnitude higher throughput than CPUs. For instance, the state-of-the-art NVIDIA Tesla V100 GPU with 5120 cores has a peak performance of over 15 TFLOPS, versus about 200-400 GFLOPS for 8-core Intel i9 series CPUs. In addition to their primary use in accelerating graphics rendering, GPUs have attracted considerable interest for general-purpose computation (GPGPU).
GLU is a GPU-accelerated parallel sparse LU factorization solver developed by VSCLAB at UC Riverside. It is based on a hybrid right-looking LU factorization algorithm. We show that more concurrency can be exploited in the right-looking method than in the left-looking method, especially on GPU platforms. GLU has the following features and advantages over existing sparse LU solvers:
· It is based on the hybrid column-based right-looking LU factorization method, which is shown to be more amenable to exploiting the concurrency of LU factorization. The new method preserves the benefits of column-level concurrency and symbolic analysis from the left-looking method while allowing more parallelism to be exploited on GPUs.
· The GLU solver allows parallelization of all three loops of LU factorization on GPUs. In contrast, the existing GPU-based left-looking LU factorization approach allows only two-level parallelization (see the sketch below).
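To make the "three loops" concrete, below is a minimal dense right-looking LU sketch (no pivoting) in plain C. It is only an illustration of the loop structure, not GLU's actual sparse kernel: GLU works on compressed sparse matrices after symbolic analysis and levelization, and roughly maps independent columns within a level (loop 1), the sub-column updates each column triggers (loop 2), and the element updates inside each sub-column (loop 3) onto GPU blocks, warps, and threads.

    /* Minimal dense right-looking LU sketch (no pivoting), for illustration only.
     * GLU's real kernels operate on sparse matrices in compressed form. */
    void right_looking_lu(double *A, int n)      /* A is an n x n row-major matrix */
    {
        for (int k = 0; k < n; k++) {            /* loop 1: process column k       */
            for (int i = k + 1; i < n; i++)
                A[i * n + k] /= A[k * n + k];    /* normalize column k of L        */

            for (int j = k + 1; j < n; j++)      /* loop 2: columns updated by k   */
                for (int i = k + 1; i < n; i++)  /* loop 3: element updates        */
                    A[i * n + j] -= A[i * n + k] * A[k * n + j];
        }
    }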
For GLU 1.0 [J1], we conducted comprehensive studies of the new GPU LU solver on a number of published matrices, circuit matrices, and self-generated large circuit matrices, comparing it against existing LU solvers such as PARDISO, KLU, and UMFPACK to demonstrate the advantage of the proposed parallel LU solver. In the tables, GPU-LL refers to a recently published GPU-accelerated left-looking sparse LU solver.
Numerical results show that the GLU 1.0 solver delivers 5.71X and 1.46X speedup over the single-threaded and 16-threaded PARDISO solvers respectively, 19.56X speedup over the KLU solver, 47.13X over the UMFPACK solver, and 1.47X speedup over a recently proposed GPU-based left-looking LU solver on a set of typical circuit matrices from the University of Florida Sparse Matrix Collection (UFL). Furthermore, on a set of general matrices from UFL, GLU achieves 6.38X and 1.12X speedup over the single-threaded and 16-threaded PARDISO solvers respectively, 39.39X speedup over the KLU solver, 24.04X over the UMFPACK solver, and 2.35X speedup over the same GPU-based left-looking LU solver. A comparison on self-generated RLC mesh networks shows a similar trend, which further validates the advantage of the proposed method over existing sparse LU solvers.
In 2017, GLU 2.0 was released. GLU 2.0 fixed a data dependency bug (which we call the double-U column dependency) that is unique to the hybrid right-looking LU factorization used in the GLU solver. However, this fix (along with a few other bug fixes) degraded the performance of the GLU solver.
Recently, Lee et al. proposed an enhanced GLU 2.0 solver [J2], which considers the difference in column counts across levels so that advanced GPU features such as dynamic kernel launch and pipelined kernel computing can be exploited to further improve GLU kernel efficiency. However, it still uses the fixed GPU resource allocation method from GLU 2.0 for each kernel launch.
Recently, we released GLU 3.0, a significantly enhanced version of GLU 2.0. It has the following improvements (please see [J3] for more details):
· First, to mitigate the slow double-U data dependency detection in the existing GLU 2.0 solver, the new GLU introduces a much more efficient double-U dependency detection algorithm. The new detection method is based on the observation that we only need to find sufficient data dependencies; necessity is not required. Detecting this relaxed double-U dependency boils down to finding the extra data dependencies in the L matrix, which is much more efficient than the previous solution and has little impact on GLU performance.
· Second, we observed that the number of columns in a level and the number of associated sub-column updates per column, which are the important parallel computing task units, are inversely correlated for the circuit matrices we analyzed. As a result, the number of columns can serve as a good metric for resource allocation. We developed three different GPU kernel modes to adapt to different computing stages and accommodate the changing computing tasks during the factorization. Consequently, the new GLU can dynamically allocate GPU blocks and warps based on the number of columns in a level, better balancing computing demands and resources during the LU factorization process (a sketch of this idea follows below).
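The dynamic resource allocation idea can be sketched on the host side as follows. This is only an illustrative sketch under our own assumptions: the mode names, thresholds, and launch configurations are hypothetical and do not reflect GLU 3.0's actual code or parameters (see [J3] for the real kernel modes).

    /* Hypothetical host-side sketch of per-level resource allocation.
     * Mode names, thresholds, and launch shapes are illustrative only. */
    typedef enum { MANY_COLUMNS, MEDIUM_COLUMNS, FEW_COLUMNS } KernelMode;

    KernelMode pick_mode(int cols_in_level)
    {
        /* Observation from the text: levels with many columns tend to have few
         * sub-column updates per column, and vice versa, so the column count
         * alone decides how much of the GPU each column should receive. */
        if (cols_in_level >= 4096) return MANY_COLUMNS;    /* columns share a block      */
        if (cols_in_level >= 64)   return MEDIUM_COLUMNS;  /* one block per column       */
        return FEW_COLUMNS;                                /* several blocks per column  */
    }

    void factorize_level(int cols_in_level /* , device pointers ... */)
    {
        switch (pick_mode(cols_in_level)) {
        case MANY_COLUMNS:
            /* e.g. kernel<<<cols_in_level / 8, 256>>>(...): 8 columns per block,
             * roughly one warp per column */
            break;
        case MEDIUM_COLUMNS:
            /* e.g. kernel<<<cols_in_level, 256>>>(...): one block per column */
            break;
        case FEW_COLUMNS:
            /* e.g. kernel<<<cols_in_level * 8, 256>>>(...): each column spread
             * over several blocks to keep the GPU busy */
            break;
        }
    }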
As a result, GLU 3.0 delivers 2-3 orders of magnitude speedup over GLU 2.0 for data dependency detection. Furthermore, GLU 3.0 consistently outperforms both GLU 2.0 and the recently proposed enhanced GLU 2.0 sparse LU solver on the same set of circuit matrices: it achieves 13.0X (arithmetic mean) and 6.7X (geometric mean) speedup over GLU 2.0, and 7.1X (arithmetic mean) and 4.8X (geometric mean) speedup over the recently proposed enhanced GLU 2.0 sparse LU solver.
GLU solver version 3.0 (with dynamic resource allocation kernel), which includes all the source code, a user manual, and some examples, is now available on GitHub.
Please send any problems, bug reports, or comments regarding GLU to Prof. Sheldon Tan at stan@ece.ucr.edu.
J1. K. He, S. X.-D. Tan, H. Wang and G. Shi, “GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 3, pp. 1140-1150, March 2016.
J2. Wai-Kong Lee, Ramachandra Achar, and Michel S. Nakhla, “Dynamic GPU parallel sparse LU factorization for fast circuit simulation”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, (99):1-12, 2018.
J3. S. Peng and S. X.-D. Tan, “GLU3.0: Fast GPU-based Parallel Sparse LU Factorization for Circuit Simulation”, available at http://arxiv.org/abs/1908.00204