These are chat archives for deeplearning4j/deeplearning4j/earlyadopters

25th
May 2016
Patrick Skjennum
@Habitats
May 25 2016 00:00
yeah, but maybe there was some shit left in the blasbuild folder
Justin Long
@crockpotveggies
May 25 2016 00:00

found out why I was having trouble with nd4j-kryo:

Failed to execute JavaCPP Builder: Cannot run program "g++"

on mvn install...

how can I switch to my g++-6?
Patrick Skjennum
@Habitats
May 25 2016 00:01
@treo rm -rf'd blasbuild and rebuilt with mkl, problem persists
so it's not that
Paul Dubs
@treo
May 25 2016 00:02
@crockpotveggies if you want to use GCC 6 everywhere, why don't you just use update-alternatives like here: https://gist.github.com/treo/e33259c7a530bb70e25dbd2a98952682#file-setup-dependencies-sh-L9
Justin Long
@crockpotveggies
May 25 2016 00:03
not a bad idea
I just created symlinks so let's see if that works
Paul Dubs
@treo
May 25 2016 00:03
That's basically what update-alternatives does as well :)
Justin Long
@crockpotveggies
May 25 2016 00:03
oh ha
TIL
Paul Dubs
@treo
May 25 2016 00:03
@Habitats I'm pretty much out of ideas there
@crockpotveggies it does a bit more though :)
Justin Long
@crockpotveggies
May 25 2016 00:04
yea I'm going to implement it (symlink probably isn't comprehensive)
Patrick Skjennum
@Habitats
May 25 2016 00:16
@treo alright, found the solution
the solution was to install libopenblas-dev
without ever using it
and still linking with mkl
it's working
Patrick Skjennum
@Habitats
May 25 2016 00:18
yeah i don't even care anymore at this point
why do i need openblas installed for mkl to work!?
i thought they were like literally the same thing, from the outside
Justin Long
@crockpotveggies
May 25 2016 00:19
@Habitats do you have libiomp5 installed?
Patrick Skjennum
@Habitats
May 25 2016 00:19
yes
Justin Long
@crockpotveggies
May 25 2016 00:19
yea I don't know
Paul Dubs
@treo
May 25 2016 00:20
Really, you shouldn't need openblas installed, not sure what is going on there
Patrick Skjennum
@Habitats
May 25 2016 00:21
:(
maybe if i reinstall mkl that'll also work
Adam Gibson
@agibsonccc
May 25 2016 00:21
@Habitats set MKL_VERBOSE=1
Patrick Skjennum
@Habitats
May 25 2016 00:21
i'll try that on the next one
Adam Gibson
@agibsonccc
May 25 2016 00:21
I want to see if you're actually using mkl
Patrick Skjennum
@Habitats
May 25 2016 00:21
it says mkl a bunch of places when building libnd4j
Adam Gibson
@agibsonccc
May 25 2016 00:22
just do it :P
Patrick Skjennum
@Habitats
May 25 2016 00:22
but why would my program run without errors if the libs are fucked:(?
Adam Gibson
@agibsonccc
May 25 2016 00:24
The problem is def between chair and keyboard
Anything could happen with you mixing and matching stuff
I know you were doing all sorts of benchmarking etc
Patrick Skjennum
@Habitats
May 25 2016 00:24
"use deeplearning" they said
"it'll be fun" they said
Adam Gibson
@agibsonccc
May 25 2016 00:25
If you had been anywhere else the only support you probably would have gotten were unanswered emails
Could have been worse ;)
Patrick Skjennum
@Habitats
May 25 2016 00:26
if i was anywhere else i'd never use these excruciating snapshots:P
Adam Gibson
@agibsonccc
May 25 2016 00:27
you'd still deal with the same problems though eg: native code etc
It'd also be harder to do multithreading
you have to give java props for that at least
Patrick Skjennum
@Habitats
May 25 2016 00:28
sure, it's nice when it works
had a friend come up to me the other day, said he was using deeplearning for his thesis
i asked him what he used, and he said he just put his csv through tensorflow and let it run for 2 hours and he had f-scores in the 90s
><
Adam Gibson
@agibsonccc
May 25 2016 00:29
you're doing nlp though
tensorflow doesn't have word2vec
It varies by the problem
not to mention setting up an LSTM + word2vec etc
Patrick Skjennum
@Habitats
May 25 2016 00:30
yeah i like my problem, and i've learned a metric ton of stuff hanging around
but my f-scores are like in the 30s lol
it's worse than naive bayes atm
nothing works out:D
Adam Gibson
@agibsonccc
May 25 2016 00:30
A lot of it is tuning though
That's not particular to any DL framework
Patrick Skjennum
@Habitats
May 25 2016 00:31
yeah but i should be able to surpass my naive bayes with a simple feedforward network
with minimum effort
Adam Gibson
@agibsonccc
May 25 2016 00:31
Right but again framework doesn't matter here
I'm more trying to show you that the problem matters here
Patrick Skjennum
@Habitats
May 25 2016 00:32
no i'm just frustrated my results suck ass:P
but yeah, i have until sunday to figure something out, and i can use all the google servers i want
so if you have any suggestions shoot
Justin Long
@crockpotveggies
May 25 2016 01:26
@agibsonccc that Kryo fix seems incomplete, Kryo wants to register different classes like DataSet
Adam Gibson
@agibsonccc
May 25 2016 01:27
I only added INDArray to that
What does it ACTUALLY need to do?
That's what I'm trying to get at
Like what parts of the object graph does it need?
Does it also need DataBuffer etc as well
Justin Long
@crockpotveggies
May 25 2016 01:42
Right now I don't know enough, so I'm going to research this and see if there's a one-size-fits-all answer
at this moment I'm tuning Spark to see if the G1GC works, etc.
Patrick Skjennum
@Habitats
May 25 2016 01:55
i benchmarked g1gc vs cms while training earlier, and cms is way faster @crockpotveggies
tried with different xmx's
Justin Long
@crockpotveggies
May 25 2016 01:56
nice good to know
Patrick Skjennum
@Habitats
May 25 2016 01:57
[chart image]
avg time/iteration, cms to the left
cms job finishes in 8 min, g1gc in 10
Justin Long
@crockpotveggies
May 25 2016 01:58
@Habitats I'm getting a lot of Remote RPC client disassociated errors
any chance you ran into this? for some reason my whole cluster is shitting the bed
Patrick Skjennum
@Habitats
May 25 2016 01:58
not that i'm aware of. tried tuning timeouts?
i just have trouble with heartbeats dying
Justin Long
@crockpotveggies
May 25 2016 02:01
oh wait this appears to be trouble:
java.lang.UnsatisfiedLinkError: /tmp/javacpp1058917180115244/libjnind4j.so: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /tmp/javacpp1058917180115244/libjnind4j.so)
standard C++ library is not found apparently
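The error above says that the installed libstdc++.so.6 does not carry the version tag CXXABI_1.3.8, rather than that the library file is missing. A minimal sketch of one way to list the tags a given library file mentions, assuming that scanning the raw bytes for the tag strings is good enough (similar in spirit to running strings on the file). The function names are hypothetical and not part of JavaCPP or nd4j:

```python
import re
from pathlib import Path
from typing import List


def abi_versions(data: bytes, prefix: bytes = b"CXXABI_") -> List[str]:
    """Return the unique numeric version tags (e.g. 'CXXABI_1.3.8') found in raw bytes."""
    pattern = re.compile(re.escape(prefix) + rb"\d+(?:\.\d+)*")
    tags = {match.group().decode("ascii") for match in pattern.finditer(data)}
    # Sort numerically so that 1.3.10 would come after 1.3.9.
    return sorted(
        tags,
        key=lambda tag: tuple(int(part) for part in tag[len(prefix):].split(".")),
    )


def library_provides(path: str, version: str) -> bool:
    """Check whether the file at `path` mentions the given version tag."""
    prefix = version.split("_")[0].encode("ascii") + b"_"
    return version in abi_versions(Path(path).read_bytes(), prefix)
```

Calling library_provides("/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "CXXABI_1.3.8") on each cluster node would show which machines still have an older libstdc++ than the one libjnind4j.so was built against.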
Patrick Skjennum
@Habitats
May 25 2016 02:02
i'm not running distributed:\
Justin Long
@crockpotveggies
May 25 2016 02:03
@saudet thoughts here?
going to try an install of gcc/g++ across all machines
hopefully it fixes it
Samuel Audet
@saudet
May 25 2016 02:16
CentOS 6?
Justin Long
@crockpotveggies
May 25 2016 02:17
Ubuntu
Samuel Audet
@saudet
May 25 2016 02:21
I'm pretty sure we can update the version of glibc...?
Looks like we need to install glibcxx separately
Justin Long
@crockpotveggies
May 25 2016 02:40
Okay will give it a shot just out for a run
Justin Long
@crockpotveggies
May 25 2016 05:11
@saudet you know what package has glibcxx in it? can't find it in apt-cache
libstdc++6 is already installed, I wonder if it's not on the path
Justin Long
@crockpotveggies
May 25 2016 05:45
@saudet added /usr/lib/x86_64-linux to the LD_LIBRARY_PATH since libstdc++ is there but still got this error:
Caused by: java.lang.UnsatisfiedLinkError: /tmp/javacpp1072583068714756/libjnind4j.so: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /tmp/javacpp1072583068714756/libjnind4j.so)
Samuel Audet
@saudet
May 25 2016 07:10
right, seems to be a version thing
Justin Long
@crockpotveggies
May 25 2016 16:25
I'm going to try removing gcc-6 and recompile with @treo config @saudet
Paul Dubs
@treo
May 25 2016 16:31
with my config? gcc-5.3?
:D
Justin Long
@crockpotveggies
May 25 2016 16:50
exactly. your moment to shine :D
Justin Long
@crockpotveggies
May 25 2016 16:58
@treo got the following warning:
update-alternatives: warning: skip creation of /usr/bin/g++ because associated file /usr/bin/g++-5 (of link group gcc) doesn't exist
Paul Dubs
@treo
May 25 2016 16:58
did you install gcc-5?
Justin Long
@crockpotveggies
May 25 2016 16:59
yup
for whatever reason, g++ isn't included with gcc
@treo installed g++-5 and got this when running update-alternatives:
update-alternatives: warning: forcing reinstallation of alternative /usr/bin/gcc-5 because link group gcc is broken
Paul Dubs
@treo
May 25 2016 17:01
which repository are you using?
Justin Long
@crockpotveggies
May 25 2016 17:02
I've added a bunch lately, though I did use the ones in your gist
which file has them all listed? repo.d?
(the drawback of using pure native)
Paul Dubs
@treo
May 25 2016 17:03
hmm.... just unexpected that it didn't install g++ automatically, anyway, the other warning tells you that it now works
also my setup gist is more of a guideline and less of an actual install script :)
Justin Long
@crockpotveggies
May 25 2016 17:03
okay cross our fingers that it solves my javacpp problem
the deeplearning4j repo is getting kinda big....1.26GB of history
Paul Dubs
@treo
May 25 2016 17:05
yep, I usually just don't build dl4j and get it from snapshots :)
but that is with maven
Justin Long
@crockpotveggies
May 25 2016 17:07
I need pure master experience :leaves:
Patrick Skjennum
@Habitats
May 25 2016 18:39
"You are attempting to install this product on more systems than the license
allows. See https://software.intel.com/en-us/faq/licensing for additional
information about license issues."
thanks intel
Paul Dubs
@treo
May 25 2016 18:41
so build openblas from source, and see if their claim about being as fast as MKL is right :D
Patrick Skjennum
@Habitats
May 25 2016 18:42
>_>
what's the signup link for mkl again
Patrick Skjennum
@Habitats
May 25 2016 19:00
thanks @treo
@crockpotveggies reason canova was cloned into nd4j earlier is because on github canova has a capital C, and the folder was already present with a capital C on my server. maybe include ignore-case rm -rf when removing repos?
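The case-insensitive cleanup suggested above could look roughly like this. It is a sketch only, not the actual build script: remove_repo_ignore_case is a hypothetical helper, and it assumes the repos are cloned as direct children of one parent directory.

```python
import shutil
from pathlib import Path
from typing import List


def remove_repo_ignore_case(parent: str, name: str) -> List[str]:
    """Delete every directory under `parent` whose name equals `name`, ignoring case.

    Returns the names that were removed, so the caller can log them.
    """
    removed = []
    for entry in Path(parent).iterdir():
        if entry.is_dir() and entry.name.lower() == name.lower():
            shutil.rmtree(entry)
            removed.append(entry.name)
    return sorted(removed)
```

With this, remove_repo_ignore_case(workdir, "canova") would also clear an existing "Canova" folder before the fresh clone.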
Justin Long
@crockpotveggies
May 25 2016 19:13
smart, I'll update the script
@Habitats looks like there are proper ways to do it
Justin Long
@crockpotveggies
May 25 2016 19:44
@saudet you think std lib 6 is being loaded instead of the correct version? what if I preload using LD_PRELOAD?
Justin Long
@crockpotveggies
May 25 2016 19:49
@treo where does the stdlib for gcc 5 go?
I did a find /usr -name "libstd*" and all I can find are 4.8 and 6 files
ah here we go:
/usr/lib/x86_64-linux-gnu/5/
Patrick Skjennum
@Habitats
May 25 2016 20:36
guys, i did a fresh install now, no funny business: installed everything on a google instance, working fine, make snapshot, create EXACT same instance again, and score goes to NaN, using mkl
Justin Long
@crockpotveggies
May 25 2016 20:37
@Habitats I thought that was because the binaries are compiled specifically to the instance hence making the snapshot useless?
Patrick Skjennum
@Habitats
May 25 2016 20:37
which binaries
i ran your build-stack script on all of the instances before training
Justin Long
@crockpotveggies
May 25 2016 20:38
quoting @treo
When you build libnd4j it is compiled with -march=native, and if you get moved to other hardware that might create some problems
Patrick Skjennum
@Habitats
May 25 2016 20:38
like i said, i built everything locally
i even reinstalled mkl and that didn't change anything either
Patrick Skjennum
@Habitats
May 25 2016 20:47
also rebuilt openmp fwiw
Justin Long
@crockpotveggies
May 25 2016 20:49
but I thought you created a snapshot and then redeployed it?
Patrick Skjennum
@Habitats
May 25 2016 20:50
i did all of this on the deployed snapshots
and it still doesn't work
Justin Long
@crockpotveggies
May 25 2016 20:50
interesting that snapshots create these issues
you also used the build-from-master script and not maven central?
sometimes datacenters cache the maven repo
or sonatype
not that they should....
Patrick Skjennum
@Habitats
May 25 2016 20:53
build from maven? i build from github
i don't even have the snapshot url in my gradle repo list
Justin Long
@crockpotveggies
May 25 2016 21:01
I meant pulling in sonatype snapshots but yea, sorry, I'm not sure how to help
it seems to be related to your server snapshot
IIRC Google has a virtualization layer in their cloud
Patrick Skjennum
@Habitats
May 25 2016 21:03
you'd think snapshot isn't something they'd fuck up though
Justin Long
@crockpotveggies
May 25 2016 21:04
on a positive note, FINALLY fixed the libstdc++ issue and javacpp is linking correctly again :feelsgood:
Patrick Skjennum
@Habitats
May 25 2016 21:13
neat! i just got a native server
finally
Justin Long
@crockpotveggies
May 25 2016 21:26
it's more of a bitch to set up but native pays off for performance
Patrick Skjennum
@Habitats
May 25 2016 21:26
it's ubuntu so it's not that much of a problem
but it's pure
didn't even have vim installed
Justin Long
@crockpotveggies
May 25 2016 21:27
Spark distributed training takes 17min per iteration
note that iteration is parallelized
Patrick Skjennum
@Habitats
May 25 2016 21:27
that sounds about right
:P
Justin Long
@crockpotveggies
May 25 2016 21:27
heh
right now I'm averaging each iteration
however I'm going to let this run continually
even though I could turn off individual iteration averaging, I don't know how that would affect quality of training
if you take a look at that log, there's a ton of spark chatter (which is probably slowing down training)
lots of optimizations possible here
Patrick Skjennum
@Habitats
May 25 2016 21:30
it'd be nice with a blas switch on libnd4j build script
Justin Long
@crockpotveggies
May 25 2016 21:31
is it that hard to add?
you could probably do it in < 10 min
Patrick Skjennum
@Habitats
May 25 2016 21:32
probably not hard, no
Justin Long
@crockpotveggies
May 25 2016 21:41
I think it'd require tests to ensure the blas lib being linked is actually present
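A presence check like the one mentioned above might be sketched as follows. This is not how the libnd4j build script works; pick_blas is a hypothetical helper, and the library names "mkl_rt" and "openblas" are assumptions about what the system linker would be asked for.

```python
from ctypes.util import find_library
from typing import Callable, Iterable, Optional


def pick_blas(
    preferred: Iterable[str],
    finder: Callable[[str], Optional[str]] = find_library,
) -> Optional[str]:
    """Return the first library name in `preferred` that the system can locate.

    `finder` defaults to ctypes.util.find_library and can be swapped out in tests.
    """
    for name in preferred:
        if finder(name) is not None:
            return name
    return None
```

A build wrapper could call pick_blas(["mkl_rt", "openblas"]) and abort with a clear message when it returns None, instead of failing later at link time.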
Justin Long
@crockpotveggies
May 25 2016 21:48
@Habitats are you using the SparkMultiLayer now? or vanilla DL4J?
Patrick Skjennum
@Habitats
May 25 2016 21:51
still no spark
Paul Dubs
@treo
May 25 2016 22:17
@Habitats how do the snapshots work on the google cloud? While running? Maybe mkl doesn't like to be cloned?
Patrick Skjennum
@Habitats
May 25 2016 22:26
while running yes
@treo it should be fixed when reinstalling then
haha now i got the same error on my native ubuntu box that i just got
what
only common theme here is that the boxes it's happening to are running either debian or ubuntu + mkl
Justin Long
@crockpotveggies
May 25 2016 22:33
so what's the current vision for a Spark parameter server? @agibsonccc @AlexDBlack
is each executor communicating with it specifically and sort of bypassing the mapreduce paradigm?
I'm reading through the code right now, just learning every bit of it
Adam Gibson
@agibsonccc
May 25 2016 22:39
@crockpotveggies parameter server won't live in spark
It's terrible for perf
Justin Long
@crockpotveggies
May 25 2016 22:40
looking through the code that's what I thought
I'm watching my cluster process along, and there's too much communication
Adam Gibson
@agibsonccc
May 25 2016 22:41
Spark doing any compute whatsoever terrifies me
Justin Long
@crockpotveggies
May 25 2016 22:41
I'm having a look at SparkNet...the DL4J implementation pretty much follows the same paradigm?
(in its current form)
AVERAGE_EACH_ITERATION is probably too much
what I like about a parameter server is that parameters can be broadcast only when needed...it appears that each executor has to wait for the rest to finish before updating the master model
there's probably a streaming mechanism already developed for a different problemset that can apply here
Justin Long
@crockpotveggies
May 25 2016 22:47
what could happen is the parameter server performs averaging based on queues and/or timing mechanisms
when an executor finishes an iteration, it can send it to the server's queue to await processing
and when a model is broadcast to the executors, the model update is put in a queue until the executor's current iteration is finished
this would be an async way of handling broadcasts, and probably avoid the issue of all the Spark shuffling
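The server side of the queue idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the DL4J or Spark implementation: ParameterServerSketch is a hypothetical class, parameters are plain lists of floats, and the new model is simply the average of whatever worker results are queued when drain() runs.

```python
import queue
from typing import List, Sequence


class ParameterServerSketch:
    """Collects worker results in a queue and averages them on demand."""

    def __init__(self, initial: Sequence[float]) -> None:
        self.params: List[float] = list(initial)
        self.pending: "queue.Queue[List[float]]" = queue.Queue()

    def submit(self, worker_params: Sequence[float]) -> None:
        """Called by an executor when it finishes an iteration; never blocks."""
        self.pending.put(list(worker_params))

    def drain(self) -> int:
        """Average all queued results into the model; return how many were used."""
        batch = []
        while True:
            try:
                batch.append(self.pending.get_nowait())
            except queue.Empty:
                break
        if batch:
            count = len(batch)
            # Element-wise mean across the queued parameter vectors.
            self.params = [sum(column) / count for column in zip(*batch)]
        return len(batch)
```

Executors that finish early do not wait for the slow ones; their results sit in the queue until the next drain(), which could be triggered by a timer or by queue length.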
Adam Gibson
@agibsonccc
May 25 2016 22:50
C++
That's my only response
Huge reason for that
Justin Long
@crockpotveggies
May 25 2016 22:50
explain
perf?
Adam Gibson
@agibsonccc
May 25 2016 22:50
Yup
I don't trust heap or anything jvm for something that handles large matrices
Justin Long
@crockpotveggies
May 25 2016 22:51
not a bad idea. the averaging would be done natively which would achieve damn good performance
Adam Gibson
@agibsonccc
May 25 2016 22:51
Spark should only be orchestration
Justin Long
@crockpotveggies
May 25 2016 22:52
I haven't programmed in C++ though, but I can help lay groundwork
I'm going to write up a paper for discussion purposes
Adam Gibson
@agibsonccc
May 25 2016 22:52
Cool
Justin Long
@crockpotveggies
May 25 2016 22:52
the premise is that we develop a streaming, async parameter server
that should yield huge performance gains
Adam Gibson
@agibsonccc
May 25 2016 22:53
Right
Justin Long
@crockpotveggies
May 25 2016 22:54
regarding native code, are you thinking the parameter server is implemented entirely in C++?
or only the model operations?
Adam Gibson
@agibsonccc
May 25 2016 22:56
Everything
I hate the jvm
Should only be used for client code
It's good at that
Not for matrices
Justin Long
@crockpotveggies
May 25 2016 22:57
for usability purposes the Application Master should at least set it up
Adam Gibson
@agibsonccc
May 25 2016 22:57
Javacpp
Justin Long
@crockpotveggies
May 25 2016 22:57
yup
what do you think of using something like zeroMQ for model broadcasts?
Adam Gibson
@agibsonccc
May 25 2016 22:58
You are in the right neighborhood
Justin Long
@crockpotveggies
May 25 2016 22:59
streaming, async, deep learning...
I have a feeling this would scale in a very interesting way
Adam Gibson
@agibsonccc
May 25 2016 23:00
Need to think of how gpus will work
Eg nvlink
RDMA
I want the parameter server to work on a rack as we
Well
Justin Long
@crockpotveggies
May 25 2016 23:01
(hint: if you press arrow-up key you can edit previous post)
would RDMA work in a Spark environment?
The RDMA for Apache Spark package is a derivative of Apache Spark. This package can be used to exploit performance on modern clusters with RDMA-enabled interconnects for Big Data applications. Major features of RDMA for Apache Spark 0.9.1 are given below.
Adam Gibson
@agibsonccc
May 25 2016 23:08
@crockpotveggies mobile
Currently at airport ;)
Anyways yeah
That's the coolest thing caffe on spark did actually
Justin Long
@crockpotveggies
May 25 2016 23:10
@agibsonccc are you going to be in SFO on June 8-9?
Adam Gibson
@agibsonccc
May 25 2016 23:11
Yup
Tokyo first then sf
Justin Long
@crockpotveggies
May 25 2016 23:12
@agibsonccc check DM
Patrick Skjennum
@Habitats
May 25 2016 23:20
alright i threw out mkl altogether and everything works great now. it's even faster.
any tuning tips for a native xeon with ubuntu?
Adam Gibson
@agibsonccc
May 25 2016 23:21
Ubuntu is slow start over
:P
Patrick Skjennum
@Habitats
May 25 2016 23:22
ah god damn you :P
Adam Gibson
@agibsonccc
May 25 2016 23:22
I would imagine it's the same
I mean it's all openmp
Treo
Helped you already
Patrick Skjennum
@Habitats
May 25 2016 23:22
the virtualization layer apparently fucked things up with libnd4j
now i'm not virtualizing anymore
Adam Gibson
@agibsonccc
May 25 2016 23:22
In what way?
Speed or?
Patrick Skjennum
@Habitats
May 25 2016 23:22
yeah
Adam Gibson
@agibsonccc
May 25 2016 23:22
Huh
Patrick Skjennum
@Habitats
May 25 2016 23:22
libnd4j craps its pants in a vm
didn't you follow the discussion over the last 4 days:P? you even commented
Adam Gibson
@agibsonccc
May 25 2016 23:23
Right on
Ok
I thought maybe I was missing something
Patrick Skjennum
@Habitats
May 25 2016 23:23
had to force omp_num_threads=1 for anything to work
then treo did some magic and we didn't have to do that anymore
Adam Gibson
@agibsonccc
May 25 2016 23:24
Well what else was there?
Could you compare native vs virt?
That would help
Patrick Skjennum
@Habitats
May 25 2016 23:25
i'm running a bench as we speak
which includes everything from io to general processing to training
Adam Gibson
@agibsonccc
May 25 2016 23:25
Cool
Patrick Skjennum
@Habitats
May 25 2016 23:26
4 different servers, 3 with same hardware, different os
just need to wait for my stupid joyent to finish, which obviously takes forever
Patrick Skjennum
@Habitats
May 25 2016 23:33
Ubuntu native:     8 minutes 40 seconds (520355 ms),  average iteration: 340 ms
Google dataproc:  13 minutes 50 seconds (830619 ms),  average iteration: 578 ms
Joyent:           18 minutes (1080674 ms),            average iteration: 820 ms
Desktop:          18 minutes 10 seconds (1090409 ms), average iteration: 810 ms
@agibsonccc @treo
messed up the number order, fixed now
all servers are xeons
more or less same model
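To compare the four runs above at a glance, the totals can be turned into slowdown factors relative to the fastest machine. The numbers are the millisecond totals from the table; slowdown_vs_fastest is just a hypothetical helper name.

```python
from typing import Dict

# Total wall-clock time per machine in milliseconds, copied from the table above.
TOTALS_MS = {
    "Ubuntu native": 520355,
    "Google dataproc": 830619,
    "Joyent": 1080674,
    "Desktop": 1090409,
}


def slowdown_vs_fastest(totals_ms: Dict[str, int]) -> Dict[str, float]:
    """Return each total divided by the smallest total (1.0 means fastest)."""
    fastest = min(totals_ms.values())
    return {name: total / fastest for name, total in totals_ms.items()}


if __name__ == "__main__":
    for name, factor in slowdown_vs_fastest(TOTALS_MS).items():
        print(f"{name:16s} {factor:.2f}x")
```

By this measure the dataproc instance takes roughly 1.6 times as long as the native Ubuntu box, and Joyent and the desktop a little over 2 times.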
Justin Long
@crockpotveggies
May 25 2016 23:45
@agibsonccc on average how large is a NN (in megabytes)? let's use AlexNet as a baseline
I'm trying to figure out how much data is being passed around with model broadcasts
@Habitats Ubuntu native FTW
In my logs I see the following broadcast:
16/05/25 14:32:21 INFO MemoryStore: Block broadcast_12 stored as values in memory (estimated size 258.3 KB, free 192.1 MB)
16/05/25 14:32:21 INFO HttpBroadcast: Reading broadcast variable 12 took 4.954941824 s
so can I assume each broadcast is <1MB in size?
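The two log lines above can be combined into a rough effective rate for that broadcast. A sketch, assuming the "estimated size" reported by MemoryStore is a fair proxy for what was transferred (it is an in-memory estimate, so the wire size may differ) and that Spark's KB/MB units are 1024-based. The function name is hypothetical.

```python
import re

SIZE_RE = re.compile(r"estimated size ([\d.]+) (KB|MB|GB)")
TIME_RE = re.compile(r"Reading broadcast variable \d+ took ([\d.]+) s")
UNIT_TO_KB = {"KB": 1.0, "MB": 1024.0, "GB": 1024.0 * 1024.0}


def broadcast_kb_per_second(store_line: str, read_line: str) -> float:
    """Divide the estimated block size by the time spent reading the broadcast."""
    size_match = SIZE_RE.search(store_line)
    time_match = TIME_RE.search(read_line)
    if size_match is None or time_match is None:
        raise ValueError("log lines do not look like MemoryStore/HttpBroadcast output")
    size_kb = float(size_match.group(1)) * UNIT_TO_KB[size_match.group(2)]
    return size_kb / float(time_match.group(1))
```

For the lines quoted above this works out to roughly 52 KB/s, which suggests the 5 seconds is dominated by something other than raw network bandwidth.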
Patrick Skjennum
@Habitats
May 25 2016 23:47
LSTM with 100 nodes is 1.7mb
FFN with 90k inputs and 1000, 700, 500 hidden layers is 70mb
FFN with 1000 inputs and 500, 300 hidden is 4mb
Justin Long
@crockpotveggies
May 25 2016 23:48
that's still not as bad as I expected
worst case scenario must be around 200MB
Alex Black
@AlexDBlack
May 25 2016 23:49
I think about 5M in GoogLeNet, 25M in inception v3
I've seen LSTMs with 380M parameters though
Patrick Skjennum
@Habitats
May 25 2016 23:49
my nets are pretty small
Alex Black
@AlexDBlack
May 25 2016 23:49
that's parameter numbers
Patrick Skjennum
@Habitats
May 25 2016 23:49
yeah i guess it's proportional to parameters
Alex Black
@AlexDBlack
May 25 2016 23:49
right, multiply/divide by 4 bytes :)
there's also updater info (momentum etc history), one value for each parameter
so double the parameters overall
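The rule of thumb above (4 bytes per parameter, doubled when updater state is included) is easy to put into numbers. A small sketch, assuming 32-bit floats and exactly one updater value per parameter; model_size_mb is a hypothetical helper and MB here means 10^6 bytes.

```python
def model_size_mb(num_params: int, with_updater: bool = False, bytes_per_value: int = 4) -> float:
    """Estimate the raw size of a model's parameters in megabytes (10^6 bytes)."""
    values = num_params * (2 if with_updater else 1)
    return values * bytes_per_value / 1_000_000


if __name__ == "__main__":
    # Parameter counts as quoted in the chat above.
    for label, params in [("GoogLeNet", 5_000_000),
                          ("inception v3", 25_000_000),
                          ("large LSTM", 380_000_000)]:
        print(label, model_size_mb(params), model_size_mb(params, with_updater=True))
```

On these assumptions a 380M-parameter LSTM is about 1.5 GB of parameters alone, well past the 200MB worst case guessed earlier.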
Justin Long
@crockpotveggies
May 25 2016 23:51
in a gigabit network this doesn't seem so bad though
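As a back-of-the-envelope check on the gigabit remark, the best-case transfer time for one model copy can be estimated like this. It assumes the full line rate is available and ignores protocol overhead and serialization cost, so real numbers will be worse; transfer_seconds is a hypothetical helper.

```python
def transfer_seconds(size_mb: float, link_gbps: float = 1.0) -> float:
    """Best-case seconds to move `size_mb` megabytes (10^6 bytes) over a link.

    `link_gbps` is the nominal link speed in gigabits (10^9 bits) per second.
    """
    megabits = size_mb * 8
    return megabits / (link_gbps * 1000)
```

A 200 MB model would need at least 1.6 seconds per copy on a 1 Gbit/s link, and that cost is paid once per executor for every broadcast.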
Alex Black
@AlexDBlack
May 25 2016 23:51
right, as long as serialization is efficient
which I suspect is a long way from optimally efficient right now
Justin Long
@crockpotveggies
May 25 2016 23:54
I'm asking some distributed computing friends about best practices for streaming the model updates
I'll include serialization
Samuel Audet
@saudet
May 25 2016 23:57
@crockpotveggies There are versions within glibc libraries. Check the output of readelf -V /usr/lib64/libstdc++.so.6. So, are you having problems with Ubuntu 14.04, or what version of Linux is that?
Justin Long
@crockpotveggies
May 25 2016 23:57
@saudet it's Ubuntu trusty
one sec I'll give you the output of that