    Jack Poulson
    @jack_poulson_gitlab
    Here are the new timings with default options:
    All examples below use AMD orderings.
    
    matrix,          cat32_L1,  cat64_L1,  cat32_R1,  cat64_R1, cat32_R16, cat64_R16, cmod32_L1, cmod64_L1
    tmt_sym,         2.03752,   2.42344,   1.97694,   2.25002,  1.04546,   1.32892,   2.1012,    3.0680
    thermal2,        3.3677,    3.94325,   3.31478,   3.85482,  1.88106,   2.43595,   3.6633,    5.2152
    gearbox,         1.38586,   1.45658,   1.51962,   1.56940,  0.41316,   0.44123,   1.3675,    2.2096
    m_t1,            0.832103,  0.906501,  0.88886,   0.92574,  0.25411,   0.27249,   0.8566,    1.3072
    pwtk,            1.38888,   1.47697,   1.53955,   1.53603,  0.41134,   0.44548,   1.4260,    2.7740
    pkustk13,        1.0899,    1.13996,   1.18976,   1.19235,  0.33919,   0.34683,   1.0744,    1.6934
    crankseg_1,      1.27666,   1.32425,   1.41365,   1.45032,  0.52208,   0.55907,   1.2410,    1.9591
    cfd2,            2.72988,   2.76722,   3.06238,   3.06373,  0.82206,   0.83498,   2.6067,    4.3814
    thread,          1.03611,   1.04332,   1.19036,   1.17896,  0.36539,   0.36379,   0.9934,    1.5794
    shipsec8,        1.82365,   1.86772,   2.09901,   2.10686,  0.61486,   0.61018,   1.8039,    2.8644
    shipsec1,        1.14417,   1.22583,   1.30899,   1.32467,  0.36981,   0.39406,   1.1583,    1.8472
    crankseg_2,      2.09077,   2.22596,   2.41171,   2.39615,  0.77854,   0.78057,   2.0107,    3.3047
    fcondp2,         1.31768,   1.38832,   1.45988,   1.48378,  0.45648,   0.46541,   1.3592,    2.1532
    af_shell3,       2.26952,   2.41800,   2.41595,   2.52492,  0.66360,   0.75055,   2.3770,    3.7261
    troll,           3.30425,   3.39069,   3.76782,   3.77045,  0.90118,   0.92410,   3.3416,    5.5126
    G3_circuit,      8.16057,   8.95602,   8.47928,   9.20349,  2.95142,   3.59392,   8.5418,    13.956
    bmwcra_1,        2.97643,   3.02556,   3.21428,   3.32056,  0.95471,   0.97435,   2.9110,    4.7945
    halfb,           2.03219,   2.16028,   2.24219,   2.28853,  0.66141,   0.73061,   2.0614,    3.3739
    2cubes_sphere,   4.88601,   4.85487,   5.42715,   5.38627,  1.35472,   1.41172,   4.7257,    8.0171
    ldoor,           4.49532,   4.81082,   4.65843,   4.92040,  1.29690,   1.56836,   4.6214,    7.1847
    ship_003,        2.74721,   2.68653,   3.11311,   3.14731,  0.76997,   0.76221,   2.6143,    4.3821
    fullb,           4.13693,   4.13209,   4.79544,   4.76538,  1.16138,   1.15465,   3.9892,    6.8104
    inline_1,        6.73062,   6.95649,   7.33840,   7.38035,  2.24602,   2.21311,   6.4410,    10.620
    pkustk14,        3.61127,   3.70063,   4.12102,   4.12207,  1.47044,   1.47456,   3.5130,    5.9537
    apache2,         6.29419,   6.57763,   6.83748,   7.04460,  3.94046,   2.08516,   6.3980,    10.829
    F1,              9.26643,   9.34752,   10.3139,   10.5131,  2.36122,   2.45677,   9.3014,    15.243
    boneS10,         13.7249,   14.0227,   15.1543,   15.2642,  3.82731,   4.20886,   13.493,    22.148
    nd12k,           14.396,    14.5373,   16.3341,   16.2279,  3.51779,   3.51430,   15.032,    26.056
    Trefethen_20000, 14.1647,   14.4623,   18.8452,   18.6835,  5.96055,   5.97875,   11.854,    19.051
    nd24k,           72.5213,   72.4657,   77.7104,   78.0643,  13.2667,   13.5306,   74.454,    135.64
    bone010,         TOO_LARGE, 231.815,   TOO_LARGE, 251.036,  TOO_LARGE, 40.2652,   TOO_LARGE, 421.04
    audikw_1,        TOO_LARGE, 264.776,   TOO_LARGE, 285.410,  TOO_LARGE, 44.8276,   TOO_LARGE, 493.04
    Jed Brown
    @jedbrown
    :clap: cat64 penalty is quite small while cmod64 is huge. I take it there is no L16 because the granularity for parallelism is too small to pay off for many of these matrices?
    Is there a rule of thumb for when ND ordering is better? (I've rarely seen AMD be faster, but I'm usually looking at PDE problems.) Will Catamari choose automatically (if built appropriately)?
    Sameer Agarwal
    @sandwichmaker
    amd is always faster for me.. since I am not looking at pde problems :)
    Jack Poulson
    @jack_poulson_gitlab

    There is no L16 anymore -- there used to be -- because there is a sequential bottleneck at the top of the tree(s), which essentially looks like updates of the form:

    for descendant d of supernode s:
      L(:,s) -= L(:, d) * L(d, s)

    The updates (potentially) overlap for each descendant, and so parallelism (beyond that within the two BLAS calls of each update) would require forming the updates out-of-place. At that point, one might as well use a multifrontal method.

    One also loses most of the tree parallelism, as the left-looking algorithm is 'lazier' and delays much more work up the tree.

    On the subject of Nested Dissection: it is much better for regular grids, and even for the test matrices nd12k and nd24k, but Catamari currently only has external dependencies on BLAS/LAPACK. I do not want to introduce a dependency on METIS or SCOTCH, for various reasons, but may end up writing my own version.

    The typical approach -- for example, that of CHOLMOD -- is to run both AMD and ND and pick the one which has the least fill-in.
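    That try-both-and-keep-the-best heuristic can be sketched with naive set-based symbolic elimination (only practical for tiny graphs; production codes use elimination-tree machinery to count fill):

```python
# Sketch of the "run several orderings, keep the least fill" heuristic.
def fill_in(adjacency, order):
    """Count fill edges created by eliminating vertices in `order`."""
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    fill = 0
    for v in order:
        live = [u for u in adj[v] if u in adj]   # uneliminated neighbors
        for i, a in enumerate(live):
            for b in live[i + 1:]:
                if b not in adj[a]:  # connecting v's neighbors adds fill
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        del adj[v]  # eliminate v
    return fill

def best_ordering(adjacency, candidate_orders):
    return min(candidate_orders, key=lambda order: fill_in(adjacency, order))

# A path graph 0-1-2-3: eliminating the endpoints first is fill-free,
# while eliminating the middle vertices first creates two fill edges.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(fill_in(path, [0, 3, 1, 2]))                        # 0
print(fill_in(path, [1, 2, 0, 3]))                        # 2
print(best_ordering(path, [[1, 2, 0, 3], [0, 3, 1, 2]]))  # [0, 3, 1, 2]
```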

    Jack Poulson
    @jack_poulson_gitlab
    Now that performance is competitive, I am going back and doing some refactoring and will add more unit tests.
    Jed Brown
    @jedbrown
    The software quality (and licensing) of partitioners holds back so much, but nobody has the will to change it. One could submit a CSSI proposal, but I'd be surprised if such a proposal would be funded because most panelists would think of it as merely replicating an existing product.
    Would you say the public interface is sufficiently stable that it's ready to add to PETSc?
    Jack Poulson
    @jack_poulson_gitlab
    Thankfully I am fantastic at spending time on things no one wants to fund :-D
    I would say the interface is stable
    It is a C++ interface though...
    There is also an interface that allows one to pass in their own ordering. So you can pass in an ND ordering if you want
    Jed Brown
    @jedbrown
    :laughing: We can handle C++ in implementations. If PETSc computes an ordering (ND), is there an interface to provide it to Catamari?
    Jack Poulson
    @jack_poulson_gitlab
    Yes
    Jed Brown
    @jedbrown
    Preempting my question.
    Jack Poulson
    @jack_poulson_gitlab
    Through the SymmetricOrdering class
    You have to do a little legwork
    But I have an example in the example/helmholtz_3d_pml.cc file
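    That legwork is mostly packaging a permutation together with its inverse. A generic sketch follows -- the class and field names here are made up for illustration and are not Catamari's actual SymmetricOrdering layout; see example/helmholtz_3d_pml.cc for the real usage:

```python
from dataclasses import dataclass, field

# Hypothetical container for a user-supplied (e.g. nested-dissection)
# ordering: a permutation plus its inverse, computed on demand.
@dataclass
class UserOrdering:
    permutation: list            # permutation[old_index] = new_index
    inverse_permutation: list = field(default_factory=list)

    def __post_init__(self):
        if not self.inverse_permutation:
            inv = [0] * len(self.permutation)
            for old, new in enumerate(self.permutation):
                inv[new] = old
            self.inverse_permutation = inv

ordering = UserOrdering(permutation=[2, 0, 3, 1])
print(ordering.inverse_permutation)  # [1, 3, 0, 2]
```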
    Jed Brown
    @jedbrown
    Thanks for the pointer.
    Jack Poulson
    @jack_poulson_gitlab
    It might be a good idea to do some legwork on potential extensibility before such a CSSI proposal
    Jed Brown
    @jedbrown
    Yeah, I'd expect there is a lot of potential for shared-mem parallelism. I tend to be thinking about distributed parallelism, and ParMETIS has serious scalability problems (that everyone who runs unstructured problems at large scale writes one-off hacks to work around).
    "large scale" here means >~20k ranks.
    Jack Poulson
    @jack_poulson_gitlab
    I am loathe to get sucked into optimization for lab machines without any personal connection.
    Been there, done that, didn't sell the t-shirt
    Jed Brown
    @jedbrown
    I wouldn't ask you to, especially not without substantial funding. But it'd be a way to sell a new implementation via ECP or ASCR, because it's an obstruction to being able to make a press release of App X doing a hero run on machine Y (actual production science be damned).
    For me, the value of a clean implementation would be, say, 50% single-node, 40% small/medium-scale distributed, 10% hero scale. But if funders are willing to support that same work 80% due to the hero scale, I think the same work can be done.
    Jack Poulson
    @jack_poulson_gitlab
    I fully agree that readable, easily extendable, performant implementations are valuable at extreme scale. But there is a no man's land in the middle that I don't want to be the one to personally subsidize.
    My happy middle ground is to write permissively licensed software for machines I can own and operate.
    If there is an application that has positive benefit for humanity that I can get deeply involved in that this would aid, the situation would be different.
    Jed Brown
    @jedbrown
    Agreed; I was thinking that thinking about larger scale would be valuable "legwork on potential extensibility" in any case, even if it was out of scope for your implementation. And DOE might still be a way to get you money as part of a collaborative proposal, even if you didn't work on the distributed parts.
    Fusion :laughing:
    Jack Poulson
    @jack_poulson_gitlab
    One thing I've found invaluable in industry that doesn't seem to exist as much in academia is working closely with someone who is solely focused on the impact of a technology as opposed to centering the team on those building tools.
    Happy to take conversation on this offline.
    Jed Brown
    @jedbrown
    SciDAC kind of tries to do that; the apps proposals are joint between apps developers (scientists) and the institutes (tool builders). I think it's a good model, but has become too fragmented by sprawling disjointed institutes (like FASTMath).
    Me too; offline is fine.
    Jack Poulson
    @jack_poulson_gitlab
    I just simplified the postordering code in quotient and got another small speedup
    in most cases, about 20 milliseconds
    in one case, about 100 milliseconds
    I can probably play a similar trick to avoid constructing child lists in catamari
    The speedups get bigger for the bigger matrices
    Jack Poulson
    @jack_poulson_gitlab
    I just modified the relaxation to break ties based upon the number of introduced zeros in the children and got another modest speedup.
    e.g., tmt_sym is 1.91 seconds instead of 2.03
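    Breaking ties on a secondary criterion like this amounts to a lexicographic key; a generic sketch (the candidate fields are illustrative, not Catamari's actual relaxation internals):

```python
# Break ties in a primary cost using a secondary criterion, as in
# choosing which supernodes to merge during relaxation.
candidates = [
    {"name": "merge_a", "cost": 5, "introduced_zeros": 12},
    {"name": "merge_b", "cost": 5, "introduced_zeros": 7},
    {"name": "merge_c", "cost": 6, "introduced_zeros": 0},
]

# Minimize cost first; among equal costs, prefer fewer introduced zeros.
best = min(candidates, key=lambda c: (c["cost"], c["introduced_zeros"]))
print(best["name"])  # merge_b
```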
    Jack Poulson
    @jack_poulson_gitlab
    I just shaved a bit more off of the supernode and structure formation.
    Jack Poulson
    @jack_poulson_gitlab
    Here are some updated timings. There are still things to be tuned, but this seems to be a good checkpoint:
    All examples below use AMD orderings.
    
    matrix,          cat32_L1,  cat64_L1,  cat32_R1,  cat64_R1, cat32_R16, cat64_R16, cmod32_L1, cmod64_L1
    tmt_sym,         1.89088,   2.25474,   1.92324,   2.25634,  1.00763,   1.30379,   2.1012,    3.0680
    thermal2,        3.29861,   3.89342,   3.26115,   3.82871,  1.89834,   2.32425,   3.6633,    5.2152
    gearbox,         1.36939,   1.43045,   1.55155,   1.61873,  0.40941,   0.45008,   1.3675,    2.2096
    m_t1,            0.825436,  0.87577,   0.89618,   0.93711,  0.24694,   0.27950,   0.8566,    1.3072
    pwtk,            1.37823,   1.45212,   1.6224,    1.56311,  0.38553,   0.44058,   1.4260,    2.7740
    pkustk13,        1.07858,   1.11687,   1.24504,   1.21407,  0.31112,   0.37154,   1.0744,    1.6934
    crankseg_1,      1.25972,   1.36413,   1.43748,   1.45669,  0.51927,   0.55034,   1.2410,    1.9591
    cfd2,            2.65119,   2.70171,   3.16048,   3.16373,  0.78199,   0.82342,   2.6067,    4.3814
    thread,          1.012,     1.02845,   1.19575,   1.17295,  0.33895,   0.36102,   0.9934,    1.5794
    shipsec8,        1.7185,    1.75125,   2.07534,   2.09093,  0.58437,   0.59755,   1.8039,    2.8644
    shipsec1,        1.12697,   1.1789,    1.28238,   1.31926,  0.36340,   0.39544,   1.1583,    1.8472
    crankseg_2,      2.08119,   2.12663,   2.37047,   2.38609,  0.75082,   0.81076,   2.0107,    3.3047
    fcondp2,         1.28993,   1.3568,    1.45696,   1.50828,  0.41335,   0.48961,   1.3592,    2.1532
    af_shell3,       2.36675,   2.39461,   2.39382,   2.53698,  0.65958,   0.77397,   2.3770,    3.7261
    troll,           3.25808,   3.38021,   3.80914,   3.81427,  0.86283,   0.89280,   3.3416,    5.5126
    G3_circuit,      8.03093,   8.86872,   8.50468,   9.18864,  2.92317,   3.51574,   8.5418,    13.956
    bmwcra_1,        2.97213,   3.03298,   3.45179,   3.4018,   0.92248,   0.97016,   2.9110,    4.7945
    halfb,           2.00813,   2.10552,   2.27037,   2.29716,  0.68246,   0.73887,   2.0614,    3.3739
    2cubes_sphere,   4.7711,    4.88986,   5.60801,   5.53781,  1.38084,   1.39393,   4.7257,    8.0171
    ldoor,           4.37883,   4.76662,   4.76595,   4.91449,  1.29061,   1.52734,   4.6214,    7.1847
    ship_003,        2.63432,   2.69341,   3.08127,   3.07252,  0.74378,   0.76606,   2.6143,    4.3821
    fullb,           4.00514,   4.13158,   4.74068,   4.72948,  1.12001,   1.13249,   3.9892,    6.8104
    inline_1,        6.6045,    6.91225,   7.24294,   7.43746,  1.94105,   2.19538,   6.4410,    10.620
    pkustk14,        3.57993,   3.67562,   4.10412,   4.11408,  1.43396,   1.48762,   3.5130,    5.9537
    apache2,         6.17809,   6.59958,   6.99521,   7.05247,  1.73449,   2.05187,   6.3980,    10.829
    F1,              9.17289,   9.42292,   10.6162,   10.3759,  2.25953,   2.46324,   9.3014,    15.243
    boneS10,         13.4802,   14.0628,   15.1039,   15.4846,  3.85127,   4.13259,   13.493,    22.148
    nd12k,           14.4359,   14.5389,   16.3701,   16.5663,  3.54422,   3.61284,   15.032,    26.056
    Trefethen_20000, 13.7265,   13.9185,   19.8924,   19.5728,  6.14171,   6.24273,   11.854,    19.051
    nd24k,           72.1729,   72.882,    79.0641,   78.0581,  13.3898,   13.6881,   74.454,    135.64
    bone010,         TOO_LARGE, 232.102,   TOO_LARGE, 249.502,  TOO_LARGE, 41.4383,   TOO_LARGE, 421.04
    audikw_1,        TOO_LARGE, 264.649,   TOO_LARGE, 284.311,  TOO_LARGE, 45.7641,   TOO_LARGE, 493.04
    I think the release notes' claims of being competitive with CHOLMOD are now pretty defensible.
    Jack Poulson
    @jack_poulson_gitlab
    The prototype linear programming interior point method in https://gitlab.com/hodge_star/conic now passes the netlib LP test suite in modest numbers of iterations
    it also works in double-double precision
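    Double-double arithmetic represents a value as an unevaluated sum of two doubles, built on error-free transformations such as Knuth's two-sum; a minimal sketch:

```python
def two_sum(a, b):
    """Knuth's error-free transformation: s + e equals a + b exactly."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def dd_add(x, y):
    """Simplified addition of double-double (hi, lo) pairs."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalize

# The rounding error of 0.1 + 0.2 is captured exactly in the low word.
s, e = two_sum(0.1, 0.2)
print(s)         # 0.30000000000000004
print(e != 0.0)  # True: the discarded rounding error is recovered
```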
    Jack Poulson
    @jack_poulson_gitlab
    err, the greenbea netlib LP test problem seems to be nondeterministic in its double-precision accuracy. My guess is that this has to do with extended-precision rounding states -- which are known to be nondeterministic
    Jack Poulson
    @jack_poulson_gitlab
    The problem seems to only occur when multithreading is enabled in the linear solver, so it's possible something else is at play.
    Jack Poulson
    @jack_poulson_gitlab
    I believe the issue is resolved as being due to floating point's slight non-associativity: the OpenMP task parallelization leads to different orders of operations in the sparse-direct solver, and for matrices on the border of instability, the resulting perturbations can lead to a tolerance being slightly met vs. slightly missed.
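    The non-associativity in question is easy to demonstrate:

```python
# Floating-point addition is not associative, so a task schedule that
# reorders the accumulation of sparse updates can change the last few
# bits of a result -- enough to flip a borderline tolerance check.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)  # False: 0.6000000000000001 vs 0.6

# The same terms summed in different orders can differ far more:
terms_a = [1e16, 1.0, -1e16]
terms_b = [1e16, -1e16, 1.0]
print(sum(terms_a), sum(terms_b))  # 0.0 1.0
```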