These are chat archives for elemental/chat

25th
Aug 2016
Ryan H. Lewis
@rhl-
Aug 25 2016 03:12
@poulson: you there?
try this: /usr/bin/LeastSquares_mpich --height 1000000 --width 1000
Ryan H. Lewis
@rhl-
Aug 25 2016 03:18
I realized since I packaged the code, I can just yum install it :)

on CentOS 6 you can add this to /etc/yum.repos.d/rhl-elemental.repo

[rhl-elemental]
name=Copr repo for elemental owned by rhl
baseurl=https://copr-be.cloud.fedoraproject.org/results/rhl/elemental/epel-6-$basearch/
skip_if_unavailable=True
gpgcheck=1
gpgkey=https://copr-be.cloud.fedoraproject.org/results/rhl/elemental/pubkey.gpg
enabled=1
enabled_metadata=1

and you can then yum install elemental-mpich or yum install elemental-openmpi

there are builds for CentOS 7 and most Fedora releases as well
for anyone else listening :)
that LSQR run I just showed is taking a very long time on one CPU and is using like 30-40 GB of RAM
Ryan H. Lewis
@rhl-
Aug 25 2016 03:31
$ LeastSquares_mpich --height 10000 --width 100
Starting LeastSquares... 0.366974 seconds.
$ LeastSquares_mpich --height 100000 --width 100
Starting LeastSquares... 5.69573 seconds.
$ LeastSquares_mpich --height 1000000 --width 100
66.9531 seconds.
Jack Poulson
@poulson
Aug 25 2016 03:39
seems pretty reasonable
Ryan H. Lewis
@rhl-
Aug 25 2016 03:40
so I guess 1 million x 1000 is slower than we think?
Jack Poulson
@poulson
Aug 25 2016 03:40
have you tried the QR driver?
and the TSQR driver?
Ryan H. Lewis
@rhl-
Aug 25 2016 03:40
let me find that
which one is LSQR a wrapper around, QR or TSQR?
Jack Poulson
@poulson
Aug 25 2016 03:40
tests/lapack_like/QR
QR, but it could easily call TSQR
Ryan H. Lewis
@rhl-
Aug 25 2016 03:41
QR_mpich --height 1000000 --width 1000
how do you make elemental use threads?
Jack Poulson
@poulson
Aug 25 2016 03:41
it's possible it is the application of the Householder transformations to the right-hand sides that is taking so long
Ryan H. Lewis
@rhl-
Aug 25 2016 03:41
i'm trying to export the OPENBLAS_NUM_THREADS environment variable
Jack Poulson
@poulson
Aug 25 2016 03:41
that would happen automatically
Ryan H. Lewis
@rhl-
Aug 25 2016 03:42
hm?
OPENBLAS_NUM_THREADS is seemingly not doing anything
I have a different machine: /usr/bin/LeastSquares_mpich --height 1000000 --width 1000 ~ 800.433 seconds.
that seems reasonable
so should OPENBLAS_NUM_THREADS make things use threads?
Jack Poulson
@poulson
Aug 25 2016 03:45
if OpenBLAS was built with threading support that will cause OpenBLAS to use threads
Ryan H. Lewis
@rhl-
Aug 25 2016 03:46
I think there is an openblas with threads
Jack Poulson
@poulson
Aug 25 2016 03:46
all openblas does if you compile it with that option
Ryan H. Lewis
@rhl-
Aug 25 2016 03:46
i'm using OpenBLAS provided by RHEL
so it's more of what they provide..
Jack Poulson
@poulson
Aug 25 2016 03:47
it sounds like they built it without threads
which is the right default, IMO
Ryan H. Lewis
@rhl-
Aug 25 2016 03:47
great, there is also an 'openblas-threads' package
[rlewis@skynet03 yum.repos.d]$ rpm -ql openblas-threads
/usr/lib64/libopenblasp-r0.2.18.so
/usr/lib64/libopenblasp.so.0
Jack Poulson
@poulson
Aug 25 2016 03:48
have you run the QR driver?
Ryan H. Lewis
@rhl-
Aug 25 2016 03:48
Can I just point openblas at it
Jack Poulson
@poulson
Aug 25 2016 03:48
it tells you the GFlop/s
Ryan H. Lewis
@rhl-
Aug 25 2016 03:48
the QR driver is still running
i did the 1 million by 1000
Jack Poulson
@poulson
Aug 25 2016 03:48
did you try it on a modest problem size first?
...
Ryan H. Lewis
@rhl-
Aug 25 2016 03:48
$ QR_mpich --height 1000000 --width 1000
The QR driver is not outputting GFlops

$ QR_mpich --height 1000 --width 100
Optional arguments:
--height [int,100,1000,found]
height of matrix

--width [int,100,100,found]
width of matrix

--nb [int,96,96,NOT found]
blocksize

--print [bool,0,0,NOT found]
print matrices?

Out of 0 required arguments, 0 were not specified.
Out of 4 optional arguments, 2 were not specified.

|| A ||_F = 182.869
|| A - Q R ||_F / || A ||_F = 8.06207e-16
|| I - Q^H Q ||_F / || A ||_F = 2.85771e-17

Jack Poulson
@poulson
Aug 25 2016 03:50
tests/lapack_like/QR?
that is not what the output looks like
Ryan H. Lewis
@rhl-
Aug 25 2016 03:50
uh, no clue
I've installed all the binaries
there is no directory structure
Jack Poulson
@poulson
Aug 25 2016 03:51
it sounds like two drivers named QR overwrote each other
I don't like the idea of the directory structure being collapsed
not without prefixing the output names with their original paths
Ryan H. Lewis
@rhl-
Aug 25 2016 03:52
we can fix up the package/build system to name the binaries like the targets in CMake
amusingly if we do that, I can just yum update after the rebuild and get the new package
I can get GFlops from other things
like Trmm
@poulson we should also cut a prerelease version of Elemental
Jack Poulson
@poulson
Aug 25 2016 03:56
testing the QR factorization is the relevant thing for LeastSquares
Ryan H. Lewis
@rhl-
Aug 25 2016 03:56
I understand
I can't easily compile code on this machine
Jack Poulson
@poulson
Aug 25 2016 03:56
ok
Ryan H. Lewis
@rhl-
Aug 25 2016 03:56
I can install onto it
Jack Poulson
@poulson
Aug 25 2016 03:57
is anyone else using that machine?
Ryan H. Lewis
@rhl-
Aug 25 2016 03:57
yeah..
Jack Poulson
@poulson
Aug 25 2016 03:57
well, it's possible the machine is overloaded
or you're being backgrounded
Ryan H. Lewis
@rhl-
Aug 25 2016 03:57
oh, you mean, right this minute?
Jack Poulson
@poulson
Aug 25 2016 03:57
running the Gemm test would say a lot
Ryan H. Lewis
@rhl-
Aug 25 2016 03:57
yeah, there are some other processes
Jack Poulson
@poulson
Aug 25 2016 03:58
whenever you tried running the million by 1000 case
it's worth knowing what type of floating point performance you see in the best case
Ryan H. Lewis
@rhl-
Aug 25 2016 03:58
here is a smaller run:
$ Gemm_mpich --height 100000 --width 1000
grid is 1 x 1
Sequential: 0.124429 secs (16.0734 GFlop/s)
Root waited for 5.098e-06 seconds
Populate root node: 0.0149701 secs
Spread from root: 0.0169712 secs
[MC,* ] AllGather: 0.0012681 secs (807.506 MB/s) for 1000 x 128 local matrix
[* ,MR] AllGather: 0.0019307 secs (530.378 MB/s) for 128 x 1000 local matrix
Local gemm: 0.0152383 secs (16.7998 GFlop/s) for 1000 x 1000 x 128 product
[MC,* ] AllGather: 0.000698979 secs (1464.99 MB/s) for 1000 x 128 local matrix
[* ,MR] AllGather: 0.00117335 secs (872.712 MB/s) for 128 x 1000 local matrix
Local gemm: 0.0151498 secs (16.898 GFlop/s) for 1000 x 1000 x 128 product
[MC,* ] AllGather: 0.000699261 secs (1464.4 MB/s) for 1000 x 128 local matrix
[* ,MR] AllGather: 0.000874965 secs (1170.33 MB/s) for 128 x 1000 local matrix
Local gemm: 0.0155041 secs (16.5118 GFlop/s) for 1000 x 1000 x 128 product
[MC,* ] AllGather: 0.000723784 secs (1414.79 MB/s) for 1000 x 128 local matrix
[* ,MR] AllGather: 0.00095122 secs (1076.51 MB/s) for 128 x 1000 local matrix
Local gemm: 0.015244 secs (16.7935 GFlop/s) for 1000 x 1000 x 128 product
[MC,* ] AllGather: 0.00104509 secs (979.825 MB/s) for 1000 x 128 local matrix
[* ,MR] AllGather: 0.000856856 secs (1195.07 MB/s) for 128 x 1000 local matrix
Local gemm: 0.0148819 secs (17.2021 GFlop/s) for 1000 x 1000 x 128 product
[MC,* ] AllGather: 0.000716161 secs (1429.85 MB/s) for 1000 x 128 local matrix
[* ,MR] AllGather: 0.000910746 secs (1124.35 MB/s) for 128 x 1000 local matrix
Local gemm: 0.0148516 secs (17.2372 GFlop/s) for 1000 x 1000 x 128 product
[MC,* ] AllGather: 0.000747893 secs (1369.18 MB/s) for 1000 x 128 local matrix
[* ,MR] AllGather: 0.000904801 secs (1131.74 MB/s) for 128 x 1000 local matrix
Local gemm: 0.0150077 secs (17.0579 GFlop/s) for 1000 x 1000 x 128 product
[MC,* ] AllGather: 0.00058561 secs (1420.74 MB/s) for 1000 x 104 local matrix
[* ,MR] AllGather: 0.00068596 secs (1212.9 MB/s) for 104 x 1000 local matrix
Local gemm: 0.0122148 secs (17.0286 GFlop/s) for 1000 x 1000 x 104 product
Distributed Gemm: 0.134058 secs
Gathered to root: 0.00654163 secs
Jack Poulson
@poulson
Aug 25 2016 03:59
16 GFlop/s is pretty good
everything looks fine there
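The GFlop/s figures in the Gemm output above are consistent with the usual 2 m n k flop count for a matrix product. A minimal sketch of that arithmetic, assuming real double-precision data; the function name `gemm_gflops` is hypothetical and not part of Elemental:

```python
def gemm_gflops(m, n, k, seconds):
    """Rate in GFlop/s for an (m x k) times (k x n) product, counting 2*m*n*k flops."""
    return 2.0 * m * n * k / seconds / 1e9


# Figures taken from one "Local gemm" line of the log:
# 0.0152383 secs for a 1000 x 1000 x 128 product.
rate = gemm_gflops(1000, 1000, 128, 0.0152383)
print(f"{rate:.4f} GFlop/s")
```

The result should land close to the roughly 16.8 GFlop/s the driver reported for that step.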
Ryan H. Lewis
@rhl-
Aug 25 2016 04:00
so our conclusion is that Elemental is working fine
Jack Poulson
@poulson
Aug 25 2016 04:00
well, Gemm is working fine
I would try to figure out the next bit by running the QR test driver
Ryan H. Lewis
@rhl-
Aug 25 2016 04:02
the LSQR driver works as I expect it.
so, my 3.5 hour run is clearly not due to some horrible bug in elemental (unless there was one from our slightly older build)
we are deploying "some version" of elemental
perhaps from like 6-7 months ago
Jack Poulson
@poulson
Aug 25 2016 04:03
I can't think of any version that would cause such a thing
Ryan H. Lewis
@rhl-
Aug 25 2016 04:03
indeed
probably something else.
Jack Poulson
@poulson
Aug 25 2016 04:04
so Least Squares is working on the current version?
Ryan H. Lewis
@rhl-
Aug 25 2016 04:04
I showed you the driver output
~13 minutes for 1 million x 1000
Jack Poulson
@poulson
Aug 25 2016 04:04
OK
Ryan H. Lewis
@rhl-
Aug 25 2016 04:04
that is consistent with all our numbers thus far, no?
it certainly is in line with O(mn^2)
Jack Poulson
@poulson
Aug 25 2016 04:04
13 minutes is a little slow
Ryan H. Lewis
@rhl-
Aug 25 2016 04:05
what would you expect?
Jack Poulson
@poulson
Aug 25 2016 04:05
4
Ryan H. Lewis
@rhl-
Aug 25 2016 04:05
on a single thread?
Jack Poulson
@poulson
Aug 25 2016 04:05
at least if your DGEMM can hit 16 GFlop/s
Ryan H. Lewis
@rhl-
Aug 25 2016 04:05
well, then something is wrong with LeastSquares.
Jack Poulson
@poulson
Aug 25 2016 04:06
hmm, my 4/3 coefficient from memory was wrong
it is 2 m n^2 when m >> n
I have 2 m n^2 - (2/3) n^3 recorded in Elemental
which reduces to 4/3 n^3 when m = n
I think it is the ApplyPackedReflectors routine taking the extra time
Ryan H. Lewis
@rhl-
Aug 25 2016 04:07
i remember that from T&B lol
2 m n^2 - (2/3) n^3
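The flop count quoted above can be turned into a rough best-case time. This is only a sketch: it assumes the whole factorization runs at the 16 GFlop/s seen in the Gemm test (optimistic), it counts the QR factorization alone and not the application to the right-hand sides, and the function names are hypothetical.

```python
def qr_flops(m, n):
    """Approximate flop count of Householder QR of a real m x n matrix, m >= n."""
    return 2.0 * m * n**2 - (2.0 / 3.0) * n**3


def estimated_seconds(m, n, gflops):
    """Best-case wall time if the factorization sustained `gflops` GFlop/s."""
    return qr_flops(m, n) / (gflops * 1e9)


secs = estimated_seconds(1_000_000, 1000, 16.0)
print(f"{secs:.0f} s, about {secs / 60:.1f} minutes")

# Sanity check: for m = n the count reduces to (4/3) n^3.
print(qr_flops(1000, 1000) / 1000**3)
```

For the 1 million x 1000 case this comes out at roughly two minutes, well under the ~13 minutes measured for LeastSquares.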
Jack Poulson
@poulson
Aug 25 2016 04:07
in qr::SolveAfter
there is a GitHub issue for this
Ryan H. Lewis
@rhl-
Aug 25 2016 04:10
I get that LSQR should take 2 minutes
where is the issue?
Jack Poulson
@poulson
Aug 25 2016 04:14
the qr::SolveAfter routine is not currently optimized for the case where there is only a small number of right-hand sides
Ryan H. Lewis
@rhl-
Aug 25 2016 04:14
Oh, like, 1 RHS
Jack Poulson
@poulson
Aug 25 2016 04:14
it is doing far more work than it needs to
yes
Ryan H. Lewis
@rhl-
Aug 25 2016 04:15
is it an easy fix?
Jack Poulson
@poulson
Aug 25 2016 04:15
it isn't much code to just apply the Householder reflectors one by one
yes, reasonably so, there are just a ton of different special cases
Ryan H. Lewis
@rhl-
Aug 25 2016 04:15
i think numRHS == 1 is an important special case :)
Ryan H. Lewis
@rhl-
Aug 25 2016 04:17
oh, that's what you mean
what are all these cases?
Jack Poulson
@poulson
Aug 25 2016 04:17
yes, there are 32 different routines to implement
Ryan H. Lewis
@rhl-
Aug 25 2016 04:18
is there really nothing in common between these 32 different routines?
Jack Poulson
@poulson
Aug 25 2016 04:18
there are essentially three different options for how to pack Householder vectors into a matrix
and then you can either apply from the left or right
and they can be either sequential or distributed
Ryan H. Lewis
@rhl-
Aug 25 2016 04:19
why do the first three options even matter?
Jack Poulson
@poulson
Aug 25 2016 04:19
the three options are LOWER/UPPER, HORIZONTAL/VERTICAL, and FORWARDS/BACKWARDS
because they all show up?
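The count of 32 follows from five independent two-way choices. A small sketch that simply enumerates them; the labels are copied from the messages above, and the list is an illustration, not Elemental's actual dispatch code:

```python
from itertools import product

# Three packing options for the Householder vectors ...
uplos = ["LOWER", "UPPER"]
directions = ["HORIZONTAL", "VERTICAL"]
orders = ["FORWARDS", "BACKWARDS"]
# ... times the side of application and the storage type.
sides = ["LEFT", "RIGHT"]
storage = ["sequential", "distributed"]

variants = list(product(sides, uplos, directions, orders, storage))
print(len(variants))  # 2**5 = 32
```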
Ryan H. Lewis
@rhl-
Aug 25 2016 04:19
1 question: why even do this?
why not just allocate more memory?
if you ever want to reuse your matrix you have to copy it anyways!
Jack Poulson
@poulson
Aug 25 2016 04:20
just yesterday I answered a long sequence of questions from someone who cared very much about giving up a factor of 2 of their memory
Ryan H. Lewis
@rhl-
Aug 25 2016 04:20
lol right
ok
point taken
Jack Poulson
@poulson
Aug 25 2016 04:21
the right approach is to move the current implementations into a Block subroutine, implement an Unblocked version, and switch if there is a significant number of right-hand sides
Ryan H. Lewis
@rhl-
Aug 25 2016 04:22
what is a good heuristic for "significant"?
Jack Poulson
@poulson
Aug 25 2016 04:22
the overhead from the blocked algorithm is proportional to the blocksize
Ryan H. Lewis
@rhl-
Aug 25 2016 04:22
does Block vs Unblock mean 32 more implementations?
Jack Poulson
@poulson
Aug 25 2016 04:23
relative to an application to one vector
so having twice as many vectors as the blocksize is a decent cutoff
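The suggested cutoff can be written down as a one-line predicate. This is a sketch of the heuristic only: the name `use_blocked` is hypothetical, the default of 96 is borrowed from the `--nb` default printed by the driver earlier in the log, and whether the boundary is inclusive is an assumption.

```python
def use_blocked(num_rhs, blocksize=96):
    """Pick the blocked application once there are at least 2*blocksize right-hand sides.

    Below that, the overhead of forming block Householder transformations
    (proportional to the blocksize) is assumed to dominate, so the
    reflectors would be applied one by one instead.
    """
    return num_rhs >= 2 * blocksize


for num_rhs in (1, 100, 192, 1000):
    print(num_rhs, "blocked" if use_blocked(num_rhs) else "unblocked")
```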
Ryan H. Lewis
@rhl-
Aug 25 2016 04:23
what is a block?
Ryan H. Lewis
@rhl-
Aug 25 2016 04:29
when you say unblocked, you mean, that instead of just sort of hanging onto the reflectors and doing a massive GEMM that you do little Gemv's each time ?
Jack Poulson
@poulson
Aug 25 2016 04:31
yes
rather than forming block Householder transformations by accumulating the Householder vectors
Ryan H. Lewis
@rhl-
Aug 25 2016 04:31
I see
Jack Poulson
@poulson
Aug 25 2016 04:32
the application should be much faster than the QR factorization itself though
where are the RHS:
template<typename F>
void ApplyPackedReflectors
( LeftOrRight side, UpperOrLower uplo,
VerticalOrHorizontal dir, ForwardOrBackward order,
Conjugation conjugation,
Int offset, const Matrix<F>& H, const Matrix<F>& t, Matrix<F>& A )
is that 'A' ?
Jack Poulson
@poulson
Aug 25 2016 04:33
yes
A is overwritten
H is the matrix packed with Householder transformations
(and possibly other things)
Ryan H. Lewis
@rhl-
Aug 25 2016 04:34
so, here is a case where I feel like some abstraction should make way less code
if this wasn't distributed memory, I would say, "iterators"
oh wait, they aren't distmatrices!
When you apply one at a time you are just doing A = A - v(v'A) so if you can represent v with a view shouldn't this be like an Axpy?
i probably forgot a scalar?
here the householder is I - vv'?
Ryan H. Lewis
@rhl-
Aug 25 2016 04:41
alright, well, this seems like a change I can think about making
Ryan H. Lewis
@rhl-
Aug 25 2016 04:48
@poulson i'm closing a few issues on github
Jack Poulson
@poulson
Aug 25 2016 04:50
when applied from the left, one does (I - tau v v') A = A - tau v (v' A), where tau depends upon the normalization of v
if || v ||_2 = 1, it is tau = 2
so, more generally, tau = 2 / v' v
which implies one computes the projection onto the span of v and negates the sign
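The formula above can be sketched in a few lines for the sequential, real-valued case. This is not Elemental's implementation: it ignores conjugation, packed storage and the distributed case, it computes tau from v directly rather than reading a stored scalar, and the function name is hypothetical.

```python
def apply_reflector_left(v, A):
    """Overwrite A (a list of rows) with (I - tau v v') A, where tau = 2 / (v' v)."""
    m, n = len(A), len(A[0])
    tau = 2.0 / sum(vi * vi for vi in v)
    # w = v' A, a row vector of length n (one Gemv-like pass over A).
    w = [sum(v[i] * A[i][j] for i in range(m)) for j in range(n)]
    # A := A - tau v w, a rank-one update.
    for i in range(m):
        for j in range(n):
            A[i][j] -= tau * v[i] * w[j]
    return A


# With x = (3, 4) and v = x - ||x|| e_1, the reflector maps x onto ||x|| e_1.
x = [[3.0], [4.0]]
v = [3.0 - 5.0, 4.0]
apply_reflector_left(v, x)
print(x)  # close to [[5.0], [0.0]]
```

For a single right-hand side each reflector costs one dot product and one scaled vector update, which is why applying them one by one avoids the block overhead.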
Ryan H. Lewis
@rhl-
Aug 25 2016 04:52
I mean, all I am saying is that if you abstract over where one finds v then you can write one algorithm, and then have a 32 case if block calling it in different ways, right?
Jack Poulson
@poulson
Aug 25 2016 04:52
not in the distributed cases
but, otherwise, sure
though the left and right applications are different
Ryan H. Lewis
@rhl-
Aug 25 2016 04:53
you can't use Attach in a distributed fashion?
Jack Poulson
@poulson
Aug 25 2016 04:53
and the storage differences mean you apply to different parts of the right-hand side matrix
making it work and making it work efficiently are very different!
having the distributed case be fast requires some care
I'll just say that the devil is in the details for those routines
Ryan H. Lewis
@rhl-
Aug 25 2016 04:54
I could try and tackle this, but, might be something you should look at
you seem to know what you are looking for.
@poulson: just went through the issues, and created a label for 'build-enhancement' and tagged all the things related to the build
I closed out a couple dupes and things that are clearly done
Jack Poulson
@poulson
Aug 25 2016 04:55
thanks!
I'll play around with an implementation for small numbers of right-hand sides for the case used by QR
which is LLVF
Ryan H. Lewis
@rhl-
Aug 25 2016 04:56
is there any hope for this issue: elemental/Elemental#8 ?
@poulson thanks for looking at LLVF :)
gonna head out for the night
Jack Poulson
@poulson
Aug 25 2016 05:38
there is hope for autotools if someone who is an expert in it spends the time on it
Ryan H. Lewis
@rhl-
Aug 25 2016 05:39
so should we just close it for now? the likelihood of this happening, and of that person wanting to maintain autotools, is low
if such a person appears, they could reopen
Jack Poulson
@poulson
Aug 25 2016 05:40
it's okay to leave it open as an enhancement request
given that the person that opened it self-assigned
I would say it is rude to close an issue assigned to someone else
Ryan H. Lewis
@rhl-
Aug 25 2016 05:40
ah, didn't notice that
Ryan H. Lewis
@rhl-
Aug 25 2016 06:17
@poulson: you still there?
fix-installed-names
elemental/Elemental#167
Jack Poulson
@poulson
Aug 25 2016 13:46
I commented on it; the change makes perfect sense if the output directories are flattened, but it seems redundant otherwise
Jack Poulson
@poulson
Aug 25 2016 14:15
I'm going to put together an implementation of the ideas discussed in LAWN 153 into Elemental so that we can drop the ScaLAPACK dependence without much pain: http://www.netlib.org/lapack/lawnspdf/lawn153.pdf
this should kill two birds with one stone and help with the Debian packaging
Ryan H. Lewis
@rhl-
Aug 25 2016 14:36
@poulson you there?
I replied..
Jack Poulson
@poulson
Aug 25 2016 14:38
yes
Ryan H. Lewis
@rhl-
Aug 25 2016 14:38
essentially I'm leaving the build unchanged
Jack Poulson
@poulson
Aug 25 2016 14:38
but not the usual installation
Ryan H. Lewis
@rhl-
Aug 25 2016 14:38
but if you type make install it renames all the examples
and all the tests
I'm testing it now..
Jack Poulson
@poulson
Aug 25 2016 14:39
the issue I was referring to applies equally well to the installation
Ryan H. Lewis
@rhl-
Aug 25 2016 14:39
you mean after make ?
Jack Poulson
@poulson
Aug 25 2016 14:39
after make install
Ryan H. Lewis
@rhl-
Aug 25 2016 14:39
right, so after make install all the binaries are renamed in their installed locations so that they are named uniquely.
so tests/lapack_like/QR goes to /usr/bin/tests-lapack_like-QR
hm. odd. it seemed to do this with a directory structure..
do you want me to rename the binaries in the directories as well?
Jack Poulson
@poulson
Aug 25 2016 14:43
for most users, the binary would be installed into /usr/bin/tests/lapack_like/
Ryan H. Lewis
@rhl-
Aug 25 2016 14:43
yeah, that's what it's doing somehow, not what I expected or desired
because those directories are not on the path
Jack Poulson
@poulson
Aug 25 2016 14:44
sure, but they're example/test drivers
Ryan H. Lewis
@rhl-
Aug 25 2016 14:44
yeah
i wanted to see /usr/bin/tests-lapack_like-FOO
Jack Poulson
@poulson
Aug 25 2016 14:44
I think choosing between the two should be a CMake option
Ryan H. Lewis
@rhl-
Aug 25 2016 14:45
if you install it should just not clobber
why have an option that allows clobbering?
Jack Poulson
@poulson
Aug 25 2016 14:45
what do you mean by "clobbering"?
Ryan H. Lewis
@rhl-
Aug 25 2016 14:46
like tests/lapack_like/QR --> /usr/bin/QR and /examples/ll/QR --> /usr/bin/QR
Jack Poulson
@poulson
Aug 25 2016 14:46
I agree that neither of those should be possible
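The clobbering problem and the flattened naming scheme can be illustrated outside of CMake. The sketch below is plain Python rather than the build system's logic; the helper names are hypothetical, and the two paths are copied as written in the message above.

```python
from collections import defaultdict


def flattened_name(source_path):
    """Squash the directory structure into the name: tests/lapack_like/QR -> tests-lapack_like-QR."""
    return source_path.strip("/").replace("/", "-")


def find_clobbers(source_paths, flatten):
    """Return installed names that more than one source path would map to."""
    installed = defaultdict(list)
    for path in source_paths:
        name = flattened_name(path) if flatten else path.rsplit("/", 1)[-1]
        installed[name].append(path)
    return {name: paths for name, paths in installed.items() if len(paths) > 1}


paths = ["tests/lapack_like/QR", "/examples/ll/QR"]
print(find_clobbers(paths, flatten=False))  # both map to "QR"
print(find_clobbers(paths, flatten=True))   # no collisions
```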
Ryan H. Lewis
@rhl-
Aug 25 2016 14:46
why the heck does: RENAME tests-${TYPE}-${TESTNAME} add a directory structure?
Jack Poulson
@poulson
Aug 25 2016 14:46
either /usr/bin/tests-lapack_like-QR or /usr/bin/tests/lapack_like/QR
it doesn't, it's already there
Ryan H. Lewis
@rhl-
Aug 25 2016 14:47
where is that directory structure?
Jack Poulson
@poulson
Aug 25 2016 14:47
are you testing the Fedora spec file or a vanilla install?
Ryan H. Lewis
@rhl-
Aug 25 2016 14:47
no i just typed make install
ohhh
Jack Poulson
@poulson
Aug 25 2016 14:47
look for the OUTPUT_DIR variable
Ryan H. Lewis
@rhl-
Aug 25 2016 14:47
i need my cmake option :)
it doesn't seem to work: -- Installing: /usr/local/bin/Circulant
Jack Poulson
@poulson
Aug 25 2016 14:48
the more I think about it, the more I agree that installing with the directory structure squashed into the name seems better
what doesn't work?
Ryan H. Lewis
@rhl-
Aug 25 2016 14:49
my RENAME flag
Jack Poulson
@poulson
Aug 25 2016 14:49
do you see that OUTPUT_NAME is already specified?
Ryan H. Lewis
@rhl-
Aug 25 2016 14:50
yeah, but doesn't RENAME, you know, RENAME
Ryan H. Lewis
@rhl-
Aug 25 2016 14:54
ok, so just removing it does rename things, but, then it would be renamed even when not squashed..
rather, removing OUTPUT_NAME makes it work
Jack Poulson
@poulson
Aug 25 2016 15:00
OK
Ryan H. Lewis
@rhl-
Aug 25 2016 15:16
should I just remove OUTPUT_NAME?
actually, i know, i'll do that only if you ask to flatten
Jack Poulson
@poulson
Aug 25 2016 15:18
whatever leads to consistent, non-redundant output names is fine by me
Ryan H. Lewis
@rhl-
Aug 25 2016 15:21
ok, I have a decent way of doing this
check it out now
essentially when you don't flatten directories nothing changes
and when you flatten it does the rename
want to merge it? then the copr repo will rebuild :)