These are chat archives for deeplearning4j/deeplearning4j/earlyadopters

24th
Apr 2016
Alex Black
@AlexDBlack
Apr 24 2016 00:44
@treo "I'll try to rearrage the lstm helpers tomorrow and see what this can bring"
some optimization work there is on my to-do list. ping me before you dive in to that
Adam Gibson
@agibsonccc
Apr 24 2016 00:46
@AlexDBlack a lot of the current optimization work is going to revolve around making operations less sequential, basically kinda like how you precompute a computation graph and then run all the operations at once
most of the problems from the gpu optimizations involve the fact that operations are sequential with no way of hiding latency
Alex Black
@AlexDBlack
Apr 24 2016 00:47
right, we've discussed that elsewhere a bit
even so, still some stuff we can do for LSTMs apart from that
raver119
@raver119
Apr 24 2016 06:59
@AlexDBlack yea, for runtime/incomplete dep analysis, rearrangement would obviously help
as i wrote yesterday, the basic implementation for the runtime thing is working now.
however, some polishing is needed there
deeplearning4j/nd4j#848
Alex Black
@AlexDBlack
Apr 24 2016 07:06
ok, great
Paul Dubs
@treo
Apr 24 2016 09:42
@AlexDBlack as @raver119 asked me to wait with that anyway, I'm not going to do anything about it just yet.
Paul Dubs
@treo
Apr 24 2016 10:39
@raver119 pushed the easy backend change for synthetic tests
raver119
@raver119
Apr 24 2016 11:28
awesome
i’ve just come back home, and will add configuration to the cuda backend
i’ve had enough manual modifications there. time to bring that stuff to user level
Adam Gibson
@agibsonccc
Apr 24 2016 11:30
@raver119 if you get ambitious feel free to look at tad
I reduced allocations a lot
It was using allocations for counting tads
raver119
@raver119
Apr 24 2016 11:30
oh, that’s definitely good news
Adam Gibson
@agibsonccc
Apr 24 2016 11:31
That's gone now :smile:
raver119
@raver119
Apr 24 2016 11:31
have you also checked my update on tad use within broadcasts and similar things?
Adam Gibson
@agibsonccc
Apr 24 2016 11:31
Need to work on cpu reduce tonight
Then I'll check broadcast and Co
raver119
@raver119
Apr 24 2016 11:32
for cuda, reduce is actually fine. at least for the syntheticRNN tests i don’t see what else we could do there. 0 allocations, 0 tads used, all going blockwise.
Adam Gibson
@agibsonccc
Apr 24 2016 11:32
Fwiw I haven't added the new tad in broadcast yet
No, I guarantee it's not
raver119
@raver119
Apr 24 2016 11:33
yesterday before going to sleep i’ve manually traced that
Adam Gibson
@agibsonccc
Apr 24 2016 11:33
There are still edge cases on all reduces till I get the new tad in there
raver119
@raver119
Apr 24 2016 11:33
not a single call to dimensional stuff from lstms
Adam Gibson
@agibsonccc
Apr 24 2016 11:33
@AlexDBlack had stuff in there if I remember
raver119
@raver119
Apr 24 2016 11:33
all was routed through blockwise operations
Adam Gibson
@agibsonccc
Apr 24 2016 11:34
You can't tell me there isn't a sum(2,3) or something in there...
raver119
@raver119
Apr 24 2016 11:34
ok, i’ll check once again then. but i’m pretty sure - stuff that got my attention was in broadcast and pairwise
also i have a few more ideas on async, will try to push the async hit ratio from 20 to 25% through backward blocks
but besides that, the java side should be considered safe. i have almost no ideas what else we could do there without a precompiler.
and that’s bad :(
raver119
@raver119
Apr 24 2016 11:43
i mean: we already have direct memory management, direct kernel execution, and even do that asynchronously if/when that’s possible. without excessive calls to free/malloc/memset between launches
not much else we can do there
Paul Dubs
@treo
Apr 24 2016 12:18
@agibsonccc you broke windows compilation again :P https://gist.github.com/treo/f953b14d2d9f5a92af702cc555d18f7e
Adam Gibson
@agibsonccc
Apr 24 2016 12:26
Null ptr doesn't work on windows?
What?
Alex Black
@AlexDBlack
Apr 24 2016 12:26
the only sums in LSTMs are along 1 dimension... it's mostly mmuls (well, gemm), axpy, ops and sum(0)s there
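(For context, the op types Alex lists map to ND4J calls roughly like the sketch below; the shapes, names and the 0.5 factor are illustrative and not taken from the actual LSTM implementation.)

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class LstmOpsSketch {
    public static void main(String[] args) {
        INDArray w = Nd4j.rand(10, 4);    // weight matrix
        INDArray x = Nd4j.rand(32, 10);   // minibatch of activations
        INDArray z = x.mmul(w);           // "mmul (well, gemm)": matrix multiply -> shape [32, 4]
        INDArray colSum = z.sum(0);       // "sum(0)": reduction along a single dimension -> shape [1, 4]
        INDArray g = Nd4j.rand(32, 4);
        z.addi(g.mul(0.5));               // axpy-style in-place update: z = z + 0.5 * g
        System.out.println(colSum);
    }
}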
Adam Gibson
@agibsonccc
Apr 24 2016 12:26
That's a c++ standard..
Paul Dubs
@treo
Apr 24 2016 12:26
wait a sec, I'll gist the full log
it looks like maybe something broke the cmake files, as it complains about a lot of options
Adam Gibson
@agibsonccc
Apr 24 2016 12:27
Right
I'll test it again as soon as I'm off the train..
I was going to look at reduce anyways
Paul Dubs
@treo
Apr 24 2016 12:30
I've updated the gist
it now contains the full compile log
Adam Gibson
@agibsonccc
Apr 24 2016 12:30
Cool
raver119
@raver119
Apr 24 2016 12:38
@treo mind running syntheticrnn through nvidia profiler on windows?
Patrick Skjennum
@Habitats
Apr 24 2016 12:51
has the histogram slowdown been fixed yet?
raver119
@raver119
Apr 24 2016 12:51
no
Patrick Skjennum
@Habitats
Apr 24 2016 12:52
well, thanks for the superfast answer:p
raver119
@raver119
Apr 24 2016 13:04
sry, i’m spending all my time on cuda profiling and related things.
after i’m done there - i’ll get back to usual things
Adam Gibson
@agibsonccc
Apr 24 2016 13:05
nothing else matters atm ;/
Paul Dubs
@treo
Apr 24 2016 13:14
@raver119 sure, can do, using your branch, i guess?
raver119
@raver119
Apr 24 2016 13:14
yes, my branch, but please make sure that in libnd4j, bool debug = false
in NativeOps.cu
i’m not sure which state was committed
Paul Dubs
@treo
Apr 24 2016 13:15
on libnd4j it is just master?
raver119
@raver119
Apr 24 2016 13:15
yes, pull master
but check manually NativeOps.cu
if bool debug = false
at the head of that file
Paul Dubs
@treo
Apr 24 2016 13:17
yep, had to set it to false
raver119
@raver119
Apr 24 2016 13:18
ok. i’m making that manageable from java now
so, make uber jar, and launch through nvidia visual profiler
and gief screenshots :)
let’s check what else we can do to tune things from java side
Alex Black
@AlexDBlack
Apr 24 2016 13:20
so just a heads up: I've started on those lstm optimizations I've been talking about
probably won't get too much implemented tonight, just checking the math etc on what I want to do
Paul Dubs
@treo
Apr 24 2016 13:20
:+1:
Alex Black
@AlexDBlack
Apr 24 2016 13:20
but there's definitely a few things there
raver119
@raver119
Apr 24 2016 13:22
cool :)
Adam Gibson
@agibsonccc
Apr 24 2016 13:24
new reduce is done *drops mic*
Paul Dubs
@treo
Apr 24 2016 13:26
those for loops look like they should be simd-able
Adam Gibson
@agibsonccc
Apr 24 2016 13:26
poke around I wouldn't doubt it
you have this neat commit access thing
it's kinda cool
I was going to look at reduce3 etc tomorrow
My brain is fried from TAD ;/
raver119
@raver119
Apr 24 2016 13:27
haha
This was reverse engineered through much pain
Paul Dubs
@treo
Apr 24 2016 13:29
reverse engineered?
Adam Gibson
@agibsonccc
Apr 24 2016 13:30
I've been playing around with every possible permutation of shapes/strides for the last week or so
explain this to me
please
I would love to know what it does
raver119
@raver119
Apr 24 2016 13:31
i feel strange.
i had hoped that it was you (at least) who knows how all that crazy stuff works :)
Adam Gibson
@agibsonccc
Apr 24 2016 13:31
TAD involves permuting dimensions (by shifting all of the ones you want to target all the way to the end)
oh I understand it
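(A rough ND4J-level illustration of the permute/TAD relationship Adam describes; this is a sketch with made-up shapes, not the internal libnd4j TAD code.)

import java.util.Arrays;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class TadSketch {
    public static void main(String[] args) {
        INDArray arr = Nd4j.rand(new int[]{2, 3, 4, 5});

        // a tensor-along-dimension over dims (0,1): each such tensor has shape [2, 3]
        INDArray firstTad = arr.tensorAlongDimension(0, 0, 1);

        // conceptually the same thing: shift the target dims to the end,
        // then every trailing [2, 3] slice of the permuted view is one TAD
        INDArray permuted = arr.permute(2, 3, 0, 1);    // shape [4, 5, 2, 3]

        System.out.println(Arrays.toString(firstTad.shape()));
        System.out.println(Arrays.toString(permuted.shape()));
    }
}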
Paul Dubs
@treo
Apr 24 2016 13:32
@raver119 which information should I collect with the nsight profiler?
raver119
@raver119
Apr 24 2016 13:33
graphic view of kernel launches
Adam Gibson
@agibsonccc
Apr 24 2016 13:33
@treo but seriously please do add simd etc
raver119
@raver119
Apr 24 2016 13:33
through different streams
Adam Gibson
@agibsonccc
Apr 24 2016 13:33
after you're done with raver's stuff
it's a few lines
@raver119 what's the issue for broadcast and co?
raver119
@raver119
Apr 24 2016 13:34
deeplearning4j/nd4j#848
last message at the bottom
Paul Dubs
@treo
Apr 24 2016 13:34
I'll try to see if there is any perceivable difference.
Adam Gibson
@agibsonccc
Apr 24 2016 13:38
oh I see
Paul Dubs
@treo
Apr 24 2016 13:38
the if should probably be commented back in again :D
Adam Gibson
@agibsonccc
Apr 24 2016 13:38
So basically, I just need to do the cuda version for the block offsets
raver119
@raver119
Apr 24 2016 13:38
shit
no
i’m making all that configurable from java
Adam Gibson
@agibsonccc
Apr 24 2016 13:38
I'll do that pass tomorrow
raver119
@raver119
Apr 24 2016 13:38
will commit soon
everything cuda-related will be configurable from java
literally everything
  • some more on top of that
Paul Dubs
@treo
Apr 24 2016 13:39
I'll comment it back in for my test for now, just wanted to let you know that it's the one thing that currently spams the output
raver119
@raver119
Apr 24 2016 13:39
yea
there’s probably still some holders related to memory relocations
i hope you’ll be able to see them and prove it
Paul Dubs
@treo
Apr 24 2016 13:43
I should create the uberjar simply by running mvn package, right?
raver119
@raver119
Apr 24 2016 13:44
i’d add a shade transformer to make the jar executable with the proper main class
by default i mean
Paul Dubs
@treo
Apr 24 2016 13:47
ok, now I get what my problem was: It picked up the allocation suite as the entry point, not runTests
raver119
@raver119
Apr 24 2016 13:47
:)
Paul Dubs
@treo
Apr 24 2016 13:48
ok, so now it is running and collecting
oh great, the first 5 iterations are done :)
raver119
@raver119
Apr 24 2016 13:51
can i have screenshots? :)
Paul Dubs
@treo
Apr 24 2016 13:51
a single iteration takes 19.4 seconds at the moment
I'll let it finish the second 5 iterations, to see how much the rising async hit ratio affects it
it started out at 12, then quickly went to 18, and is currently on 21.6
raver119
@raver119
Apr 24 2016 13:52
it won’t go above 21-22% there yet
but i need screenshots
Paul Dubs
@treo
Apr 24 2016 13:52
and, with the second 5 iterations done, a single iteration still takes 19.4 seconds
raver119
@raver119
Apr 24 2016 13:52
there are still bound to be bugs, i was really sleepy yesterday
Paul Dubs
@treo
Apr 24 2016 13:55
[screenshots]
raver119
@raver119
Apr 24 2016 13:56
this thing rocks too
what about streams stuff?
any pix there?
Paul Dubs
@treo
Apr 24 2016 13:57
where would I find that?
raver119
@raver119
Apr 24 2016 13:57
nvidia visual profiler
sec
check those screenshots at left side
but screenshots you’ve linked are awesome too
Paul Dubs
@treo
Apr 24 2016 13:59
those are from nsight in visual studio
raver119
@raver119
Apr 24 2016 13:59
they prove my view of broadcasts being the bottleneck
damn, i have an awful feeling that i need windows again
Paul Dubs
@treo
Apr 24 2016 14:00
why? :D
raver119
@raver119
Apr 24 2016 14:01
for those neat pics :)
Paul Dubs
@treo
Apr 24 2016 14:01
:)
raver119
@raver119
Apr 24 2016 14:02
and i’ll check wtf with transform there
it shouldn’t be that bad
d2d copies are nice.
h2d should be reconsidered.
Paul Dubs
@treo
Apr 24 2016 14:03
lol, the profiler encountered an out of memory error
raver119
@raver119
Apr 24 2016 14:04
profiler, or nd4j under profiler?
Paul Dubs
@treo
Apr 24 2016 14:04
profiler
raver119
@raver119
Apr 24 2016 14:04
phew
limit xmx to 1.5gb
for nd4j
within my branch that’ll affect everything
Paul Dubs
@treo
Apr 24 2016 14:06
still failing with that, something is odd
according to jvisualvm, the profiler has 8gb to use
even with 20g memory it doesn't work :confused:
Paul Dubs
@treo
Apr 24 2016 14:23
oh, apparently I already had it, and should have just looked at the timeline :D
[screenshot]
That is from a run going just to the first 5 iterations
[screenshot]
Paul Dubs
@treo
Apr 24 2016 14:34
[screenshots]
and a bit zoomed in to show that those things that look like they are parallel, aren't really
or at least not that much
raver119
@raver119
Apr 24 2016 16:19
not that much = async hit ratio
+-
however, as i’ve said earlier i have two more ideas which might improve that ratio a bit
anyway, thanks for screenshots, they are awesome
especially those with timings
Patrick Skjennum
@Habitats
Apr 24 2016 17:07
is memory usage supposed to go through the roof when increasing the number of epochs?
if yes, why?
raver119
@raver119
Apr 24 2016 17:08
no.
it’s NOT supposed to.
Patrick Skjennum
@Habitats
Apr 24 2016 17:08
well it's definitely happening here:s
raver119
@raver119
Apr 24 2016 17:08
memory should stay within the -Xmx value
Patrick Skjennum
@Habitats
Apr 24 2016 17:08
got a BSOD because of some page access fault
while training
raver119
@raver119
Apr 24 2016 17:08
cpu/gpu?
Patrick Skjennum
@Habitats
Apr 24 2016 17:09
cpu
raver119
@raver119
Apr 24 2016 17:09
$%@#%@$
got photo?
Patrick Skjennum
@Habitats
Apr 24 2016 17:15
no photo, however it doesn't seem to happen anymore
i ran training and got the usual msvcrt.dll EXCEPTION_ACCESS_VIOLATION crash, and just ran it a few more times and then suddenly BSOD
but the crashes are super inconsistent, i'm not really sure if it's dl4j or just my computer being a turd
raver119
@raver119
Apr 24 2016 17:26
we need crash logs.
Patrick Skjennum
@Habitats
Apr 24 2016 17:27
it's the same one that i've posted multiple times before: https://gist.github.com/Habitats/f41990e8150fb42f48692beea5b9f38b
it happens even when the memory usage is well below xmx
sometimes after 20 mins of training, sometimes after 9 hours
however, error is always the same
raver119
@raver119
Apr 24 2016 17:31
org.bytedeco.javacpp.Pointer.memset(Lorg/bytedeco/javacpp/Pointer;IJ)Lorg/bytedeco/javacpp/Pointer; (0 bytes)
are you just training model?
or you’re saving it?
or, by chance, you’re using spark?
Patrick Skjennum
@Habitats
Apr 24 2016 17:31
i train multiple models, and save all of them
i'm using spark to load data, not for training
load with spark -> train single model -> save model -> train -> save etc
raver119
@raver119
Apr 24 2016 17:33
is that text?
what kind of data do you receive from spark?
or serialized INDArrays?
Patrick Skjennum
@Habitats
Apr 24 2016 17:33
nah, all is text
raver119
@raver119
Apr 24 2016 17:34
how do you make INDArrays out of it?
Patrick Skjennum
@Habitats
Apr 24 2016 17:34
but i had a really weird bug last week that happened when i put INDArrays in an RDD. 90% of the arrays got corrupted. and when i just used a .collect() before mapping my data to INDArrays it all worked fine
i create INDArrays from float[] with Nd4j.create(...)
atm none of the INDArrays are in contact with spark though. i collect everything on the driver before doing that stuff
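(That pattern, for reference, looks roughly like this; the data and shape are made up.)

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class CreateFromFloats {
    public static void main(String[] args) {
        float[] data = new float[]{1f, 2f, 3f, 4f, 5f, 6f};
        // Nd4j.create(float[], int[]) copies the java array into an INDArray with the given shape
        INDArray features = Nd4j.create(data, new int[]{2, 3});
        System.out.println(features);
    }
}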
Patrick Skjennum
@Habitats
Apr 24 2016 17:40
fyi i haven't had a bsod in years on this computer, so it's pretty stable normally :p
raver119
@raver119
Apr 24 2016 17:42
i have no doubts that it’s our problem
Justin Long
@crockpotveggies
Apr 24 2016 17:57
hey guys, I know there's quite a bit of C++ code floating around, so if I create a fat JAR with mvn package I assume it won't properly build DL4J? Mostly because of ND4J?
Paul Dubs
@treo
Apr 24 2016 17:58
depends on what you are actually doing... I created an uberjar just hours ago and it worked
... on my computer.
And it would work on any other skylake based windows pc
Justin Long
@crockpotveggies
Apr 24 2016 17:59
just trying to build master really. I'm dropping it in my project until the release is ready
using OS X
Paul Dubs
@treo
Apr 24 2016 17:59
why would you want to have a fat jar?
I mean for that purpose?
Justin Long
@crockpotveggies
Apr 24 2016 17:59
Yea so I can just drop it into my project and not worry about the dependencies
Paul Dubs
@treo
Apr 24 2016 18:00
Why do you worry about dependencies? Are you not using maven, like you are supposed to?
Justin Long
@crockpotveggies
Apr 24 2016 18:00
using SBT
Paul Dubs
@treo
Apr 24 2016 18:00
that is using maven under the hood
Justin Long
@crockpotveggies
Apr 24 2016 18:00
which has some issues with dependencies (had to manually add the twelvemonkeys dependencies, etc.)
so I'd rather just drop a fat jar
for whatever reason SBT uses an Ivy mechanism and it doesn't pick up all dependencies in DL4J
for example...
Paul Dubs
@treo
Apr 24 2016 18:01
That will probably make your dependency problems worse, not better
Justin Long
@crockpotveggies
Apr 24 2016 18:02
libraryDependencies ++= Seq(
  "commons-io" % "commons-io" % "2.4",
  "com.google.guava" % "guava" % "19.0",
  "org.deeplearning4j" % "deeplearning4j-core" % "0.4-rc3.8",
  "org.deeplearning4j" % "deeplearning4j-nlp" % "0.4-rc3.8",
  "org.deeplearning4j" % "deeplearning4j-ui"  % "0.4-rc3.8",
  "org.nd4j"          % "nd4j-x86"            % "0.4-rc3.8",
  "org.nd4j" % "canova-nd4j-codec" % "0.0.0.14",
  "org.nd4j" % "canova-nd4j-image" % "0.0.0.14",
  "org.apache.spark" % "spark-core_2.11" % "1.6.1",
  "org.jblas" % "jblas" % "1.2.4",
  "com.twelvemonkeys.imageio" % "imageio-core" % "3.1.2",
  "com.twelvemonkeys.common" % "common-lang" % "3.1.2"
)
that's literally what I had to do to ensure that DL4J would work at runtime
Paul Dubs
@treo
Apr 24 2016 18:03
that looks quite wrong to me
Justin Long
@crockpotveggies
Apr 24 2016 18:03
it should
because it is
yet twelvemonkeys and a bunch of other stuff wasn't picked up by SBT
anyways I'm just at a prototyping stage and the faster I can iterate the better. the plan is to use the maven release when it's ready
for the time being if there's a quick way to create a fat jar it would be helpful
Paul Dubs
@treo
Apr 24 2016 18:06
You probably can create a dummy maven project and include the shade plugin, put all of your DL4J dependencies in there, and mvn package that up
Justin Long
@crockpotveggies
Apr 24 2016 18:07
ah didn't think about shade. thanks for your insight :)
Paul Dubs
@treo
Apr 24 2016 18:07
it is a crutch, but I think that is the fastest way you can probably get it sorted out - but it will include all the dependencies of dl4j, and can break your build in a lot of other unexpected ways
Justin Long
@crockpotveggies
Apr 24 2016 18:12
if I run into anything I'll report it here to help people in the future
Justin Long
@crockpotveggies
Apr 24 2016 19:04
okay so I assume that by creating this "proxy" project I'll also need to build ND4J separately and include it in the proxy's pom?
Patrick Skjennum
@Habitats
Apr 24 2016 19:07
does net.numParams() include biases, or only weights?
it says "weights" in the docs, but params imply biases as well?
wutzebaer
@wutzebaer
Apr 24 2016 19:18
will a geforce m840 still be faster than my cpu?
raver119
@raver119
Apr 24 2016 19:48
@wutzebaer right now cpu is faster than gpu. we still have things to polish/improve on cuda
however, not too much left
if you're interested - i usually have an open issue @ nd4j/issues with updates on the current state
or just read history here
current state: deeplearning4j/nd4j#848
wutzebaer
@wutzebaer
Apr 24 2016 19:57
ok thanks
raver119
@raver119
Apr 24 2016 19:57
in other words: we're actively working on that
things get changed/improved each day
wutzebaer
@wutzebaer
Apr 24 2016 19:58
i know, that's why i pull every few days =)
raver119
@raver119
Apr 24 2016 19:58
:)
wutzebaer
@wutzebaer
Apr 24 2016 19:59
today i pulled and got "CMake Error: The source directory "D:/dl4j/libnd4j/blasbuild/cuda" does not appear to contain CMakeLists.txt." when trying to compile cuda, did anything change this week?
raver119
@raver119
Apr 24 2016 19:59
yea
new argument was added
./buildblala.sh blas cuda debug
wutzebaer
@wutzebaer
Apr 24 2016 19:59
ok seems to start compiling
raver119
@raver119
Apr 24 2016 19:59
:)
async/config changes were not merged to master yet
wutzebaer
@wutzebaer
Apr 24 2016 20:01
do you have any idea how far away you are from final 3.9?
raver119
@raver119
Apr 24 2016 20:01
yes
we know release date :)
wutzebaer
@wutzebaer
Apr 24 2016 20:01
yeah !
when?
raver119
@raver119
Apr 24 2016 20:01
may 16th
wutzebaer
@wutzebaer
Apr 24 2016 20:01
and is anything in time? ^^
raver119
@raver119
Apr 24 2016 20:02
sorry?
english isn't my first language :)
so sometimes i can fail getting the question
wutzebaer
@wutzebaer
Apr 24 2016 20:03
do you think all issues will be ready by then? .. no problem, i cannot tell if my sentence is correct either, i'm german =)
raver119
@raver119
Apr 24 2016 20:03
i think most of them.
i personally see only one problem that i don't see a viable solution for yet
but i haven't spent time on it yet. i just remember it's there, and that i should think about it :)
all other problems are getting solved one by one
wutzebaer
@wutzebaer
Apr 24 2016 20:06
maybe this one? deeplearning4j/nd4j#832
raver119
@raver119
Apr 24 2016 20:06
lol
definitely no
wutzebaer
@wutzebaer
Apr 24 2016 20:07
ok then i'm happy when i can save and load my rnn next month ^^
raver119
@raver119
Apr 24 2016 20:07
the problem i'm really concerned with is operations on 2+ tensors, where each of them has a different, != 1 element-wise stride
that's a performance problem
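(For anyone wondering what that means from the java side: below is a made-up sketch of two views where neither operand is contiguous and the strides differ; it is not code from nd4j itself.)

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;
import org.nd4j.linalg.indexing.NDArrayIndex;

public class NonUnitEwsSketch {
    public static void main(String[] args) {
        // a transposed view: elements are no longer laid out contiguously
        INDArray a = Nd4j.linspace(1, 12, 12).reshape(3, 4).transpose();            // shape [4, 3]

        // a strided column slice of another array: a different, also non-unit stride
        INDArray b = Nd4j.linspace(1, 24, 24).reshape(4, 6)
                         .get(NDArrayIndex.all(), NDArrayIndex.interval(0, 2, 6));  // shape [4, 3]

        // a pairwise op where the operands have different, != 1 element-wise strides
        a.addi(b);
        System.out.println(a);
    }
}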
wutzebaer
@wutzebaer
Apr 24 2016 20:08
sounds complicated =)
raver119
@raver119
Apr 24 2016 20:11
it's a primitive thing to do. but i'm not sure yet how to do that in one kernel and keep it fast enough
we'll see
maybe separate access is worth a try there...
wutzebaer
@wutzebaer
Apr 24 2016 20:12
ok, but these problems are more fun, just finding the crux instead of a big refactoring
raver119
@raver119
Apr 24 2016 20:13
that's luck-based :)
wutzebaer
@wutzebaer
Apr 24 2016 20:13
write a nn for finding the solution ^^
raver119
@raver119
Apr 24 2016 20:13
got training data?
:)
give me training data, and i'll pick the best solution next minute :)
wutzebaer
@wutzebaer
Apr 24 2016 20:16
ok wait i generate some millions of randomly arranged commands ^^
Dipendra K. Misra
@dkmisra
Apr 24 2016 20:16
why is Nd4j.rand using a synchronized random number generator? That thing is freezing my threads :(
raver119
@raver119
Apr 24 2016 20:16
i hope you remember old rule: "shit in -> shit out" :)
Dipendra K. Misra
@dkmisra
Apr 24 2016 20:16
I mean, if someone wants synchronized generation they might as well use a synchronized block.
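(In sketch form, the suggestion is to leave the generator unsynchronized and let callers who actually share it across threads wrap the call themselves; the lock object and shapes below are purely illustrative.)

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class CallerSideSync {
    private static final Object RNG_LOCK = new Object();

    public static INDArray randSynchronized(int rows, int cols) {
        // only callers that need thread-safe generation pay for the lock
        synchronized (RNG_LOCK) {
            return Nd4j.rand(rows, cols);
        }
    }

    public static void main(String[] args) {
        System.out.println(randSynchronized(2, 2));
    }
}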
raver119
@raver119
Apr 24 2016 20:27
@dkmisra please file an issue about that.
and someone will take a look at whether the sync stuff is really needed
Patrick Skjennum
@Habitats
Apr 24 2016 21:02
who should i bug to get a better understanding of the UiServer graphs in 3.9? because the only one that makes sense to me is the score vs iter, and the docs are not very detailed
raver119
@raver119
Apr 24 2016 21:03
they are the same as in 3.8, just look a bit different
only layout was changed
Patrick Skjennum
@Habitats
Apr 24 2016 21:03
yeah alright, but i never used this in 3.8
thought it was new in 3.9
Patrick Skjennum
@Habitats
Apr 24 2016 21:04
yeah i read that, like 20 times
raver119
@raver119
Apr 24 2016 21:04
@treo i think now that the idea about off-device shapeInfo was dumb.
i should either bring it back to device
or, better, get rid of those in kernels altogether
Paul Dubs
@treo
Apr 24 2016 21:04
@raver119 get rid of it in the kernels
the shapeinfo is used in a lot of places in java
raver119
@raver119
Apr 24 2016 21:05
i have another reason to get rid of that
global memory access isn't free in cuda, you know
Paul Dubs
@treo
Apr 24 2016 21:06
it never is, but on cpu we just tend to ignore our fast memory
raver119
@raver119
Apr 24 2016 21:06
even if i make that happen only in threadIdx.x==0, that's still not free.
Patrick Skjennum
@Habitats
Apr 24 2016 21:07
@treo nowhere on that page does it say what RW params are
raver119
@raver119
Apr 24 2016 21:08
at least i definitely should move all those things into shared memory
so we'll not rely on cache there
Paul Dubs
@treo
Apr 24 2016 21:08
@Habitats then you should probably hassle @chrisvnicholson or @AlexDBlack for better documentation :)
Patrick Skjennum
@Habitats
Apr 24 2016 21:09
none of you guys know:P?
raver119
@raver119
Apr 24 2016 21:09
yep
Paul Dubs
@treo
Apr 24 2016 21:09
I simply don't use it :D
For Score tracking I have my own listener that reports it via the MBeans interface
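(Not the actual listener, but a minimal sketch of that idea, assuming the IterationListener interface of this era; actual JMX exposure would register the listener, or an MXBean wrapping it, with the platform MBeanServer.)

import org.deeplearning4j.nn.api.Model;
import org.deeplearning4j.optimize.api.IterationListener;

public class ScoreTrackingListener implements IterationListener {
    private volatile double lastScore = Double.NaN;
    private boolean invoked = false;

    @Override public boolean invoked() { return invoked; }
    @Override public void invoke() { invoked = true; }

    @Override public void iterationDone(Model model, int iteration) {
        invoke();
        lastScore = model.score();   // the same value the score-vs-iteration chart plots
    }

    // read this getter from a JMX client (e.g. via ManagementFactory.getPlatformMBeanServer())
    public double getLastScore() { return lastScore; }
}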
raver119
@raver119
Apr 24 2016 21:09
and i'm just a minor guy fighting for nanoseconds here
hm.
Patrick Skjennum
@Habitats
Apr 24 2016 21:10
yeah, score tracking i understand. but i have no idea how to interpret the other stuff
raver119
@raver119
Apr 24 2016 21:10
actually, moving all shapebuffers to shared buffers during op init isn't a bad idea.
fast fix that helps.
i know max sharedbuffer size
so i can make it static any day
shape0 defines actual buffer length
so all other threads might read things
or i can add shape ranks as actual argument
Paul Dubs
@treo
Apr 24 2016 21:12
@Habitats You can always try to read the code of what it does, that should give you at least some idea where to look for more information. That might even result in a pull request with better documentation :D
raver119
@raver119
Apr 24 2016 21:12
to avoid extra sync
i'll add prototype now.
hm.
Paul Dubs
@treo
Apr 24 2016 21:13
I'd go for it as an actual argument
Makes it easier to understand what is actually going on
raver119
@raver119
Apr 24 2016 21:14
an argument will require always having a device pointer there
sry, a host pointer
and i'm not sure we should have it
Paul Dubs
@treo
Apr 24 2016 21:14
thought it would be an actual value
raver119
@raver119
Apr 24 2016 21:15
nono
to get information about that on c side
i should read it
but to read it on host side (to pass it to kernel later) - i should provide host memory pointer
right now - i have it there
but i can't guarantee that tomorrow i won't move it back to device
we pass pointers here, that are accessible for kernel
doesn't matters if that's host or device memory
that's already kernel
but it's a wrapper
so i can swap pointers :)
read things into shmem
and let them be cached there
Paul Dubs
@treo
Apr 24 2016 21:18
somewhat tangential to the problem, but I really wonder how well a K-Style Interpreter would work on a GPU
raver119
@raver119
Apr 24 2016 21:19
wtf is that?
Paul Dubs
@treo
Apr 24 2016 21:19
K Code does look pretty much like line noise, but usually outperforms C code
even though it is interpreted
raver119
@raver119
Apr 24 2016 21:19
oh
Paul Dubs
@treo
Apr 24 2016 21:19
it is a language in the APL family
this one?
Paul Dubs
@treo
Apr 24 2016 21:20
https://kx.com/ this one
or http://kparc.com/ if you care more about the language than the business
raver119
@raver119
Apr 24 2016 21:21
i wonder what the maintenance cost for such code is...
Paul Dubs
@treo
Apr 24 2016 21:21
the people who can read it, say it is not much harder to read than pure math
i'm not one of those :(
Paul Dubs
@treo
Apr 24 2016 21:22
who can read it
raver119
@raver119
Apr 24 2016 21:22
oh, all those 19 guys, including developer? :)
j/k
Paul Dubs
@treo
Apr 24 2016 21:22
you are not that much off :D
probably only one or two orders of magnitude
as I hear it is quite big in finance
Patrick Skjennum
@Habitats
Apr 24 2016 21:24
@treo i've tried looking at the code but it's not exactly self-documenting
Paul Dubs
@treo
Apr 24 2016 21:24
Most code isn't :D
raver119
@raver119
Apr 24 2016 21:24
nono
that code isn't part of "most"
you know, i've noticed some differences
and they are not minor, you know :)
Patrick Skjennum
@Habitats
Apr 24 2016 21:25
the iterationDone() method is 100+ lines
raver119
@raver119
Apr 24 2016 21:25
oh
you're talking about the histogram ui
Patrick Skjennum
@Habitats
Apr 24 2016 21:26
yeah
raver119
@raver119
Apr 24 2016 21:26
it grabs model params and model gradients
and renders them per-layer
Patrick Skjennum
@Habitats
Apr 24 2016 21:26
yeah i realized that much, but what are RW params?
raver119
@raver119
Apr 24 2016 21:26
that's up to @AlexDBlack
Patrick Skjennum
@Habitats
Apr 24 2016 21:26
and why do my bias graphs look like this:
[screenshot]
Paul Dubs
@treo
Apr 24 2016 21:27
yeah, K code isn't that bad at self-documenting, you just have to be able to read it :D http://kparc.com/k.txt is the complete language spec
Patrick Skjennum
@Habitats
Apr 24 2016 21:27
@raver119 yeah, exactly:P
raver119
@raver119
Apr 24 2016 21:29
@treo tbh, every time i hear that an interpreted language is better than C (which was used to write the interpreter), i just conclude that the guy who got that result is a worse programmer than the guy who wrote the interpreter
yet to see any exceptions to that rule
a worse C programmer, i mean
Paul Dubs
@treo
Apr 24 2016 21:30
There is actually only a single reason why it outperforms C on average: It knows how to live in L1 Cache
raver119
@raver119
Apr 24 2016 21:30
an interpreter written in C knows that
:)
Paul Dubs
@treo
Apr 24 2016 21:31
Right :D
with K being from the APL family, it is an array oriented language, so for the right problems, it is harder to shoot yourself in the foot (performance wise)
raver119
@raver119
Apr 24 2016 21:34
lisp... apl... i sense a lambda fan here :)
Paul Dubs
@treo
Apr 24 2016 21:34
I enjoy working at higher abstraction levels :)
There is an old version of an ancestor of the K interpreter, which showcases the C Style in which it is written... which is unusual to say the least
But really, for me it is more of a curiosity, as I also enjoy getting things done pragmatically
Paul Dubs
@treo
Apr 24 2016 21:44
found it :D Apparently it was for J: http://code.jsoftware.com/wiki/Essays/Incunabulum
raver119
@raver119
Apr 24 2016 21:45
...
Paul Dubs
@treo
Apr 24 2016 21:45
As I said, idle curiosity :D
but there is a reason to this madness
each line is meant to be read as a sentence
And I do understand that at the first look, it looks pretty much like lovecraftian horror :D
raver119
@raver119
Apr 24 2016 21:52
:)
so, at least the prototype compiles