There is no L16 anymore -- there used to be -- because there is a sequential bottleneck at the top of the tree(s), which essentially consists of updates of the form:
for descendant d of supernode s:
    L(:, s) -= L(:, d) * L(d, s)
The updates (potentially) overlap for each descendant, so any parallelism beyond what is available inside the two BLAS calls of each update would require forming the updates out-of-place. At that point, one might as well use a multifrontal method.
One also loses most of the tree parallelism, as the left-looking algorithm is 'lazier' and delays much more work up the tree.
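
To make the overlap concrete, here is a minimal dense sketch of a left-looking factorization. It is not Catamari's code: the function name `LeftLookingCholesky` is made up for illustration, each "supernode" is a single column, and every earlier column is treated as a descendant (a sparse code would visit only the columns with a nonzero in row s). The multiplier is read from row s of column d.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative dense left-looking Cholesky (A = L L^T) on a column-major
// n x n buffer. Only the lower triangle is read and overwritten with L.
// Assumption: single columns stand in for supernodes.
// Returns false if a non-positive pivot is encountered.
bool LeftLookingCholesky(std::vector<double>& a, std::size_t n) {
  assert(a.size() == n * n);
  for (std::size_t s = 0; s < n; ++s) {
    // Pull the updates from all earlier columns into column s:
    //   L(s:n, s) -= L(s:n, d) * L(s, d)
    // Every iteration of the d loop writes to the same column s, which is
    // why the descendants cannot be applied concurrently in place.
    for (std::size_t d = 0; d < s; ++d) {
      const double l_sd = a[s + d * n];
      for (std::size_t i = s; i < n; ++i) {
        a[i + s * n] -= a[i + d * n] * l_sd;
      }
    }
    // Factor the now fully updated column.
    const double pivot = a[s + s * n];
    if (!(pivot > 0.0)) return false;
    const double diag = std::sqrt(pivot);
    a[s + s * n] = diag;
    for (std::size_t i = s + 1; i < n; ++i) a[i + s * n] /= diag;
  }
  return true;
}
```

Running the d loop in parallel would need a private copy of column s per thread plus a reduction, which is the out-of-place formation mentioned above.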
On the subject of Nested Dissection: it is much better for regular grids, and even for the test matrices nd12k and nd24k, but Catamari currently only has external dependencies on BLAS/LAPACK. I do not want to introduce a dependency on METIS or SCOTCH, for various reasons, but may end up writing my own version.
The typical approach -- for example, that of CHOLMOD -- is to run both AMD and ND and pick the one which has the least fill-in.
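
As a rough illustration of that selection step, the sketch below counts the off-diagonal nonzeros of the factor for each candidate ordering by naive graph elimination and keeps the smallest. The names `Graph`, `FactorNonzeros`, and `PickOrdering` are hypothetical and are not CHOLMOD's or Catamari's API; the candidate orderings are assumed to come from elsewhere (e.g. an AMD and an ND routine), and a real code would use a symbolic factorization over the elimination tree rather than this quadratic-cost loop.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Symmetric adjacency structure of the matrix, without self-loops.
using Graph = std::vector<std::set<std::size_t>>;

// Counts the strictly lower-triangular nonzeros of the Cholesky factor when
// the vertices are eliminated in the given order. The graph is taken by
// value because elimination modifies it.
std::size_t FactorNonzeros(Graph graph, const std::vector<std::size_t>& order) {
  assert(order.size() == graph.size());
  std::size_t nonzeros = 0;
  for (std::size_t v : order) {
    // The remaining neighbors of v are the off-diagonal entries of its column.
    const std::set<std::size_t> neighbors = graph[v];
    nonzeros += neighbors.size();
    // Remove v and connect its remaining neighbors pairwise (the fill).
    for (std::size_t u : neighbors) {
      graph[u].erase(v);
      for (std::size_t w : neighbors) {
        if (u != w) graph[u].insert(w);
      }
    }
    graph[v].clear();
  }
  return nonzeros;
}

// Returns the index of the candidate ordering with the fewest factor
// nonzeros; ties go to the earliest candidate.
std::size_t PickOrdering(
    const Graph& graph,
    const std::vector<std::vector<std::size_t>>& candidates) {
  assert(!candidates.empty());
  std::size_t best = 0;
  std::size_t best_nonzeros = FactorNonzeros(graph, candidates[0]);
  for (std::size_t k = 1; k < candidates.size(); ++k) {
    const std::size_t nonzeros = FactorNonzeros(graph, candidates[k]);
    if (nonzeros < best_nonzeros) {
      best = k;
      best_nonzeros = nonzeros;
    }
  }
  return best;
}
```

For a star graph with four vertices, eliminating the hub first fills in the whole factor, while eliminating it last adds no fill, so the second ordering would be chosen. Since both candidates are applied to the same matrix, comparing factor nonzeros is equivalent to comparing fill-in.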
example/helmholtz_3d_pml.cc file
tmt_sym is 1.91 seconds instead of 2.03
All examples below use AMD orderings.
matrix, cat32_L1, cat64_L1, cat32_R1, cat64_R1, cat32_R16, cat64_R16, cmod32_L1, cmod64_L1
tmt_sym, 1.89088, 2.25474, 1.92324, 2.25634, 1.00763, 1.30379, 2.1012, 3.0680
thermal2, 3.29861, 3.89342, 3.26115, 3.82871, 1.89834, 2.32425, 3.6633, 5.2152
gearbox, 1.36939, 1.43045, 1.55155, 1.61873, 0.40941, 0.45008, 1.3675, 2.2096
m_t1, 0.825436, 0.87577, 0.89618, 0.93711, 0.24694, 0.27950, 0.8566, 1.3072
pwtk, 1.37823, 1.45212, 1.6224, 1.56311, 0.38553, 0.44058, 1.4260, 2.7740
pkustk13, 1.07858, 1.11687, 1.24504, 1.21407, 0.31112, 0.37154, 1.0744, 1.6934
crankseg_1, 1.25972, 1.36413, 1.43748, 1.45669, 0.51927, 0.55034, 1.2410, 1.9591
cfd2, 2.65119, 2.70171, 3.16048, 3.16373, 0.78199, 0.82342, 2.6067, 4.3814
thread, 1.012, 1.02845, 1.19575, 1.17295, 0.33895, 0.36102, 0.9934, 1.5794
shipsec8, 1.7185, 1.75125, 2.07534, 2.09093, 0.58437, 0.59755, 1.8039, 2.8644
shipsec1, 1.12697, 1.1789, 1.28238, 1.31926, 0.36340, 0.39544, 1.1583, 1.8472
crankseg_2, 2.08119, 2.12663, 2.37047, 2.38609, 0.75082, 0.81076, 2.0107, 3.3047
fcondp2, 1.28993, 1.3568, 1.45696, 1.50828, 0.41335, 0.48961, 1.3592, 2.1532
af_shell3, 2.36675, 2.39461, 2.39382, 2.53698, 0.65958, 0.77397, 2.3770, 3.7261
troll, 3.25808, 3.38021, 3.80914, 3.81427, 0.86283, 0.89280, 3.3416, 5.5126
G3_circuit, 8.03093, 8.86872, 8.50468, 9.18864, 2.92317, 3.51574, 8.5418, 13.956
bmwcra_1, 2.97213, 3.03298, 3.45179, 3.4018, 0.92248, 0.97016, 2.9110, 4.7945
halfb, 2.00813, 2.10552, 2.27037, 2.29716, 0.68246, 0.73887, 2.0614, 3.3739
2cubes_sphere, 4.7711, 4.88986, 5.60801, 5.53781, 1.38084, 1.39393, 4.7257, 8.0171
ldoor, 4.37883, 4.76662, 4.76595, 4.91449, 1.29061, 1.52734, 4.6214, 7.1847
ship_003, 2.63432, 2.69341, 3.08127, 3.07252, 0.74378, 0.76606, 2.6143, 4.3821
fullb, 4.00514, 4.13158, 4.74068, 4.72948, 1.12001, 1.13249, 3.9892, 6.8104
inline_1, 6.6045, 6.91225, 7.24294, 7.43746, 1.94105, 2.19538, 6.4410, 10.620
pkustk14, 3.57993, 3.67562, 4.10412, 4.11408, 1.43396, 1.48762, 3.5130, 5.9537
apache2, 6.17809, 6.59958, 6.99521, 7.05247, 1.73449, 2.05187, 6.3980, 10.829
F1, 9.17289, 9.42292, 10.6162, 10.3759, 2.25953, 2.46324, 9.3014, 15.243
boneS10, 13.4802, 14.0628, 15.1039, 15.4846, 3.85127, 4.13259, 13.493, 22.148
nd12k, 14.4359, 14.5389, 16.3701, 16.5663, 3.54422, 3.61284, 15.032, 26.056
Trefethen_20000, 13.7265, 13.9185, 19.8924, 19.5728, 6.14171, 6.24273, 11.854, 19.051
nd24k, 72.1729, 72.882, 79.0641, 78.0581, 13.3898, 13.6881, 74.454, 135.64
bone010, TOO_LARGE, 232.102, TOO_LARGE, 249.502, TOO_LARGE, 41.4383, TOO_LARGE, 421.04
audikw_1, TOO_LARGE, 264.649, TOO_LARGE, 284.311, TOO_LARGE, 45.7641, TOO_LARGE, 493.04