These are chat archives for deeplearning4j/deeplearning4j/earlyadopters

14th
May 2016
Justin Long
@crockpotveggies
May 14 2016 00:23
yea dealing with some weird stuff but once I'm past it all I'm sure this is going to give some exciting results
for instance:
16/05/14 00:22:26 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
16/05/14 00:22:26 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, slave6.cluster, 40694)
16/05/14 00:22:26 INFO storage.BlockManagerMaster: Removed 1 successfully in removeExecutor
16/05/14 00:22:26 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 3)
16/05/14 00:22:26 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/05/14 00:22:26 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, slave4.cluster, 42089)
16/05/14 00:22:26 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/05/14 00:22:26 INFO scheduler.DAGScheduler: Host added was in lost list earlier: slave4.cluster
I've got executors randomly dropping out and timing out
I actually wonder if this is a weave problem
Error sending message [message = KillExecutors(List(5))] in 2 attempts
Justin Long
@crockpotveggies
May 14 2016 01:54
@agibsonccc do you have any spark monitoring tools that would be useful for debugging network traffic?
Adam Gibson
@agibsonccc
May 14 2016 01:54
You could build that if you'd like
We're thinking about what kind of tools to build now (especially for the distro)
I haven't decided what larger companies are going to want
Justin Long
@crockpotveggies
May 14 2016 01:55
yea I'm thinking it might be necessary
Adam Gibson
@agibsonccc
May 14 2016 01:55
On prem tools are getting dangerously close to distro territory
I'd have to see what's going to be generally applicable vs what is specific to companies
Justin Long
@crockpotveggies
May 14 2016 01:55
everything I'm seeing here is along the lines of java.io.IOException: Connection reset by peer
Adam Gibson
@agibsonccc
May 14 2016 01:55
I mean that's a spark thing...
You're heavily reinventing the wheel not using cdh etc
Justin Long
@crockpotveggies
May 14 2016 01:56
hmmm
Adam Gibson
@agibsonccc
May 14 2016 01:56
this is why no one does this stuff by themselves :P
it really is spark though
none of this is specific to us
Justin Long
@crockpotveggies
May 14 2016 01:56
yea I get that. what's dangerous for me going the CDH route is I'm afraid I'm going to get locked into a tool that I don't understand and it will cost me dearly
Adam Gibson
@agibsonccc
May 14 2016 01:57
yeah I don't blame you - that's true of any software out there though :(
Justin Long
@crockpotveggies
May 14 2016 01:58
very much so
Adam Gibson
@agibsonccc
May 14 2016 01:58
You're just in a strange place
on prem hardware
startup
self provisioning
Justin Long
@crockpotveggies
May 14 2016 01:58
hahaha am I ever!
Adam Gibson
@agibsonccc
May 14 2016 01:59
You're like .0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001% of the world
Justin Long
@crockpotveggies
May 14 2016 01:59
mind you I think I would have spent just as much time learning how to use other stuff and making it work for that
I could be wrong
Adam Gibson
@agibsonccc
May 14 2016 01:59
Right which is why I'm saying build it yourself
We don't have the bandwidth to provide that stuff
I'm not doing anything more specific to spark
We're focusing on DL
If we do something like a management layer you're talking about something we'd be selling
Lotta work goes in to that
esp to get "right"
We aren't a 10000 person company with dedicated teams for various parts of a stack
I'd like to be
but that's partially what oss is for
You'd have to figure out what is worth it specific to spark (eg: email them about this kind of stuff) vs what's specific to us here
we're just a spark job
"network management for spark" isn't a tool we'd build
Justin Long
@crockpotveggies
May 14 2016 02:13
@agibsonccc to be clear I wasn't asking you to build a network profiling tool ;)
I think what I meant by asking earlier, is if you knew of any that existed
I think I've narrowed the problem, it appears to be the network switch on our rack
Adam Gibson
@agibsonccc
May 14 2016 02:16
@crockpotveggies right - but "Connection reset" is networking ;)
that's specific to spark
Justin Long
@crockpotveggies
May 14 2016 02:17
fair enough, I understand this isn't a Spark support forum
I do think it's going to come up a lot though
Adam Gibson
@agibsonccc
May 14 2016 02:21
Right but maybe for enterprise customers?
Even then our whole point is to share that support load with some other guy
That's how we will be able to operate as an actual business
It's in either the spark community's or the vendors' best interest to support the whole stack
We'll focus on the ML part
and integrating that the right way
not the whole stack
Justin Long
@crockpotveggies
May 14 2016 02:25
just a heads up looking at the amount of traffic with DL4J you will probably see it more often than other applications
our network switch says it's dropping packets like crazy
Adam Gibson
@agibsonccc
May 14 2016 02:25
Right and I'm not seeing that on say: AWS
Justin Long
@crockpotveggies
May 14 2016 02:25
whereas when I ran something like SparkPi it handled it like a pro
Adam Gibson
@agibsonccc
May 14 2016 02:26
where most of the startups are
yeah but a trivial problem is very different from a real workload
spark isn't good at large amounts of data
A lot of things have to be tuned
esp the amount of traffic that's let through and the like
You should see what spark + imagenet looks like
spark literally can't even handle it
we're working on a lot of that kind of stuff in the internals
but spark is a terrible fit for binary data in general
hence so much emphasis on c++ and the like
Justin Long
@crockpotveggies
May 14 2016 02:27
what I'd like to do here is transfer what I learn to you, since you (may?) see it in enterprise
Adam Gibson
@agibsonccc
May 14 2016 02:27
right sure but we also have to scope it
we can't build everything
I appreciate that but I also have to look at this as a non traditional setup
a self managed switch and hardware has its own quirks you might not see in say: a co located data center
You have such a rare setup that I have to take it with a grain of salt
There's parts of this that are great
but also things I have to be mindful of
something as simple as network throughput speeds for example
a lot of the data center internals I see have higher throughput/speeds and more ram
half of our job is to say no
the core parts are oss in the hopes that people can help us work with different configurations
but we have to figure out for ourselves what needs to be optimized
we'll be doing a spark pass next
so I'm sure we will see SOME optimizations
Justin Long
@crockpotveggies
May 14 2016 03:35
so I managed to start profiling traffic....
in the first iteration the network switch experienced 10,000+ buffer failures
hahahahh
and that was just a single interface
Justin Long
@crockpotveggies
May 14 2016 05:23
@agibsonccc picked up a near-enterprise level switch that boasts 74 Mpps which is 5x our previous one. if this doesn't solve the problem then there's a network bottleneck in Docker or Weave and I'll instead write an ansible formula for all this crap
I'll gladly share it if I go that route
Adam Gibson
@agibsonccc
May 14 2016 05:26
:D
Please do
I would love to know
Here's my thing
I need data not anecdotes :D
I'm listening and writing things down
but waiting for more info
I just don't know
You're doing a lot of R&D for us
we just need to do a lot more
to predict "generalizations"
I'm just hesitant to make guesses
I also KNOW that there are problems with our spark stuff
Like I said NEXT on our list
a lot of network i/o profiling etc needs to happen
but I also need to see WHAT to focus on here
Justin Long
@crockpotveggies
May 14 2016 05:28
to put things in perspective our last switch was a Cisco 3550 series which in its prime was top. however, in a single Spark job we reset the statistics and on average per network interface we saw 1,000 plus buffer output errors
Adam Gibson
@agibsonccc
May 14 2016 05:29
right but like I said we haven't seen the hardware be a problem
Justin Long
@crockpotveggies
May 14 2016 05:29
regarding data, it does look like Weave is a huge bottleneck according to some stats already available here http://paulbakker.io/docker/docker-cloud-network-performance/
Adam Gibson
@agibsonccc
May 14 2016 05:29
so do me a favor
try to find the parts of dl4j that cause that
that's what I need
I'm not going to support spark stuff ;/
I'm sure we'll run in to it
but there will be certain scenarios
Justin Long
@crockpotveggies
May 14 2016 05:31
for sure. I might be able to use something like wireshark to extract and analyze TCP packets
I think it's all non-ssl so should be easy to analyze
could also profile the JVM for network calls
Adam Gibson
@agibsonccc
May 14 2016 05:32
right
so even from that
we're only a subset of that
direct causes of that from something we do is what I'd need
Justin Long
@crockpotveggies
May 14 2016 05:32
let me ping a couple JVM buddies who might already have solved problems like this
raver119
@raver119
May 14 2016 08:21
oh, i see you're having fun with network io here :)
Adam Gibson
@agibsonccc
May 14 2016 08:31
Yeahhh...
Alex Black
@AlexDBlack
May 14 2016 10:39
@/all just merged some CNN optimizations... should be a heap faster now
liu-gmo
@liu-gmo
May 14 2016 13:16
hi, everyone. I want to know what algorithm ParagraphVectors is based on. Could anybody tell me?
Is it implemented based on the original paper "Distributed Representations of Sentences and Documents"?
raver119
@raver119
May 14 2016 13:23
yes
liu-gmo
@liu-gmo
May 14 2016 13:29
@raver119 thank you!
AkshitaT
@AkshitaT
May 14 2016 16:14

Hi @agibsonccc, I ran the Bag of Words example for Spam email classification (https://github.com/AkshitaT/StackedDenoisingAutoencoder-Spam/tree/SDA-Edit). It gives me no results. Please see the gist: https://gist.github.com/AkshitaT/1af4c0c0fa83f66484e206b2e3f6fce7

On debugging, I realized the problem is, when the initialize() method is called by TfidfRecordReader, the TfidfVectorizer does generate a vocab cache, but the variables ‘record’ and ‘recordIter’ do not receive any values, and are null (https://github.com/deeplearning4j/Canova/blob/3cfd810f6ddd54fb01719167494f025e3692c522/canova-nd4j/canova-nd4j-nlp/src/main/java/org/canova/nd4j/nlp/reader/TfidfRecordReader.java#L77). I think, there’s a bug here.

PanDupa
@PanDupa
May 14 2016 18:32
Yo, Raver sent me built on windows
currently i am trying on ubuntu
raver119
@raver119
May 14 2016 18:35
ubuntu is easier
PanDupa
@PanDupa
May 14 2016 18:35
cool
raver119
@raver119
May 14 2016 18:35
just ./buildnativesomething.sh
and you're done
PanDupa
@PanDupa
May 14 2016 18:36
./buildnativeoperations.sh
raver119
@raver119
May 14 2016 18:36
there's instructions
PanDupa
@PanDupa
May 14 2016 18:36
I did
raver119
@raver119
May 14 2016 18:36
right in root folder
pay attention to deps and env variables
PanDupa
@PanDupa
May 14 2016 18:36
sure, ill check them out
raver119
@raver119
May 14 2016 18:37
@treo At iteration 10 a single iteration takes 1163 MILLISECONDS
Justin Long
@crockpotveggies
May 14 2016 18:38
@PanDupa I have a script you can use one sec
PanDupa
@PanDupa
May 14 2016 18:38
o! awsome!
You can easily add DL4J if you need 😉
PanDupa
@PanDupa
May 14 2016 18:41
sure! Unfortunately I have already done all the things your script does :( give me a second, I'll update all the repos and rebuild them with maven :)
PanDupa
@PanDupa
May 14 2016 19:08
java.lang.UnsatisfiedLinkError: no jnind4j in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:654)
at org.bytedeco.javacpp.Loader.load(Loader.java:492)
at org.nd4j.nativeblas.NativeOps.<clinit>(NativeOps.java:26)
at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.<init>(NativeOpExecutioner.java:27)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at java.lang.Class.newInstance(Class.java:442)
at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:4770)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:4716)
at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:148)
at org.deeplearning4j.nn.conf.MultiLayerConfiguration$Builder.build(MultiLayerConfiguration.java:324)
at org.deeplearning4j.nn.conf.NeuralNetConfiguration$ListBuilder.build(NeuralNetConfiguration.java:211)
at org.deeplearning4j.nn.layers.TestDropout.testDropoutSimple(TestDropout.java:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:119)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:42)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:234)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:74)
Caused by: java.lang.UnsatisfiedLinkError: no nd4j in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:654)
at org.bytedeco.javacpp.Loader.load(Loader.java:483)
... 35 more
raver119
@raver119
May 14 2016 19:09
have you strictly followed instructions?
have you built javacpp manually?
i've used ubuntu a week ago without any problems
PanDupa
@PanDupa
May 14 2016 19:10
I hope so, ye I built javacpp manually
raver119
@raver119
May 14 2016 19:12
could you please check, if libnd4j was actually built?
PanDupa
@PanDupa
May 14 2016 19:15
[100%] Built target nd4j
libnd4j$ bash buildnativeoperations.sh cpu
no errors
make[2]: Leaving directory '/home/dupa/deeplearning/libnd4j/blasbuild/cpu'
/usr/local/bin/cmake -E cmake_progress_report /home/dupa/deeplearning/libnd4j/blasbuild/cpu/CMakeFiles 1 2
[100%] Built target nd4j
make[1]: Leaving directory '/home/dupa/deeplearning/libnd4j/blasbuild/cpu'
/usr/local/bin/cmake -E cmake_progress_start /home/dupa/deeplearning/libnd4j/blasbuild/cpu/CMakeFiles 0
idk if thats correct
raver119
@raver119
May 14 2016 19:18
you dont need cpu keyword
just bash file
ianpjohnson
@ianpjohnson
May 14 2016 19:18

Just wondering if i am doing something wrong here

Just attempted to build from source following recent instructions :

  • sources pulled MINUTES ago
  • javacpp, libnd4j, nd4j, deeplearning4j, dl4j-0.4-examples

All seemed to build ok - using crockpotveggies script and a bit extra

I am testing examples against 3.8 and then 3.9 SNAPSHOT - i think am using openblas (7 dll's) and libnd4j.dll is in java.library.path
"[100%] Built target nd4j"

Core i5/2500K Windows 10, Openblas 16 GB

The question is: How do i know i am running the latest nd4j-native ?

LenetMnistExample
v3.8:ScoreX at iteration 0 is 2.0966430312134325 1324ms

v3.9 @ 14/5/2016 (3 times faster)
ScoreX at iteration 0 is 2.108984974297287 398ms

DeepAutoEncoderExample
v3.8: ScoreX at iteration 1 is 390.5312109375 223ms

v3.9 @ 14/5/2016 (nearly 2x slower)
ScoreX at iteration 1 is 390.6196875 376ms

StackedAutoEncoderMnistExample
v3.8: ScoreX at iteration 1 is 390.5312109375 206ms

v3.9 @ 14/5/2016 (nearly 2x slower)
ScoreX at iteration 1 is 390.6196875 352ms

DeepAutoEncoderExample
v3.8: layer 1+
ScoreX at iteration 0 is 507.11 2569ms
ScoreX at iteration 0 is 255.689 1355ms
ScoreX at iteration 0 is 128.779 653ms
ScoreX at iteration 0 is 49.872 133ms
ScoreX at iteration 0 is 15.864 284ms
ScoreX at iteration 0 is 52.061 607ms
ScoreX at iteration 0 is 130.485 1537ms

v3.9 @ 14/5/2016 (very similar to 3.8) layer 1+
ScoreX at iteration 0 is 506.785 5972ms
ScoreX at iteration 0 is 255.582 1854ms
ScoreX at iteration 0 is 128.227 999ms
ScoreX at iteration 0 is 49.556 187ms
ScoreX at iteration 0 is 15.956 663ms
ScoreX at iteration 0 is 51.85 1058ms
ScoreX at iteration 0 is 129.518 2626ms
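
To make the "3 times faster" and "nearly 2x slower" labels above easier to compare, here is a minimal sketch that turns the first three pairs of timings into ratios. The class and method names are made up for illustration. Each figure is a single-iteration timing copied from the message, so it may well include JVM warm-up and should not be read as a benchmark.

```java
import java.util.Locale;

// Hypothetical helper for comparing the per-iteration timings quoted above.
class TimingRatio {

    /** How many times faster the new run is; values below 1 mean it is slower. */
    static double speedup(double oldMs, double newMs) {
        return oldMs / newMs;
    }

    public static void main(String[] args) {
        // Timings copied from the message above (v3.8 vs v3.9 SNAPSHOT).
        String[] names = {
            "LenetMnistExample", "DeepAutoEncoderExample", "StackedAutoEncoderMnistExample"
        };
        double[] oldMs = {1324, 223, 206};
        double[] newMs = {398, 376, 352};
        for (int i = 0; i < names.length; i++) {
            System.out.printf(Locale.ROOT, "%s: %.2fx%n", names[i], speedup(oldMs[i], newMs[i]));
        }
    }
}
```

On these numbers the Lenet run comes out at about 3.3x, while the two autoencoder runs come out below 1x, which corresponds to roughly 1.7x slower.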

raver119
@raver119
May 14 2016 19:19
so, in other words - rbm needs some love
PanDupa
@PanDupa
May 14 2016 19:19
made: bash buildnativeoperations.sh
same result
raver119
@raver119
May 14 2016 19:20
what’s output of nd4j mvn clean install? all success?
@ianpjohnson could you please file an issue @ nd4j with your investigation?
PanDupa
@PanDupa
May 14 2016 19:21
[INFO] nd4j ............................................... SUCCESS [ 1.665 s]
[INFO] nd4j-common ........................................ SUCCESS [ 3.519 s]
[INFO] nd4j-context ....................................... SUCCESS [ 0.163 s]
[INFO] nd4j-buffer ........................................ SUCCESS [ 0.808 s]
[INFO] nd4j-backends ...................................... SUCCESS [ 0.015 s]
[INFO] nd4j-api-parent .................................... SUCCESS [ 0.013 s]
[INFO] nd4j-api ........................................... SUCCESS [ 8.561 s]
[INFO] nd4j-jdbc .......................................... SUCCESS [ 0.019 s]
[INFO] nd4j-jdbc-api ...................................... SUCCESS [ 0.393 s]
[INFO] nd4j-jdbc-mysql .................................... SUCCESS [ 0.502 s]
[INFO] nd4j-native-api .................................... SUCCESS [ 0.374 s]
[INFO] nd4j-backend-impls ................................. SUCCESS [ 0.073 s]
[INFO] nd4j-native ........................................ SUCCESS [ 7.808 s]
[INFO] nd4j-instrumentation ............................... SUCCESS [ 0.474 s]
[INFO] nd4j-perf .......................................... SUCCESS [ 0.384 s]
[INFO] nd4j-serde ......................................... SUCCESS [ 0.015 s]
[INFO] nd4j-jackson ....................................... SUCCESS [ 0.397 s]
[INFO] nd4j-bytebuddy ..................................... SUCCESS [ 0.452 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 26.066 s
[INFO] Finished at: 2016-05-14T21:21:00+02:00
ianpjohnson
@ianpjohnson
May 14 2016 19:22
Noticeably, CPU usage settles at around 40% - is that normal ?
raver119
@raver119
May 14 2016 19:22
could you please check, if you suddenly running 32bit version of your ide? happens sometimes with intellij
@ianpjohnson no
PanDupa
@PanDupa
May 14 2016 19:22
sure, 1 sec
ianpjohnson
@ianpjohnson
May 14 2016 19:24
Thats what i thought - b4 i ran openblas it used to use low CPU then jacked up when i finally got the 7 or so dll's in place (and several times speedup) - now cpu is back down i am thinking i am not running openblas (or native) and am just seeing latest CNN improvements in deeplearning4j ?
If thats what it smells like ill hack around with library path & dll's - i ran v3.8 from eclipse but now am running v3.9 with mvn exec:java so need to be triply sure DLL's are being seen - i just wanted a sanity check first
ianpjohnson
@ianpjohnson
May 14 2016 19:29
The eclipse launched DeepAutoEncoder on v3.8 is using about 50+% CPU, the mvn launched v3.9 is using about 40% CPU
raver119
@raver119
May 14 2016 19:31
3.8 cpu use isn't actual
what's your env openblas variable?
ianpjohnson
@ianpjohnson
May 14 2016 19:33
I have never used one - i just used to make sure the 7 DLL's were in place - i suspect that might be the problem - just going to the site to make sure it is installed properly
Justin Long
@crockpotveggies
May 14 2016 19:39
FYI straight from the SparkEarlyStoppingTrainer:
16/05/14 19:32:15 WARN earlystopping.BaseSparkEarlyStoppingTrainer: Early stopping training terminated due to exception at epoch 0, iteration 0
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 11 tasks (1043.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
lots of configs that I'm discovering ;)
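
For anyone hitting the same exception: the message says the serialized results of 11 tasks (1043.1 MB) exceeded spark.driver.maxResultSize (1024.0 MB). Below is a minimal sketch of that arithmetic, plus a way to render a larger limit as a spark-submit `--conf` argument. The helper names and the 1.5x headroom factor are assumptions for illustration, not a DL4J or Spark recommendation. Raising the limit only helps if the driver heap can actually hold the collected results.

```java
// Hypothetical helper, not part of Spark or DL4J.
class MaxResultSizeSketch {

    /** True when the collected task results would exceed the driver limit. */
    static boolean exceedsLimit(double totalResultMb, double limitMb) {
        return totalResultMb > limitMb;
    }

    /** Suggest a limit in whole MB: observed size times a headroom factor, rounded up. */
    static long suggestLimitMb(double totalResultMb, double headroom) {
        return (long) Math.ceil(totalResultMb * headroom);
    }

    /** Render the setting as a spark-submit --conf argument. */
    static String toConfFlag(long limitMb) {
        return "--conf spark.driver.maxResultSize=" + limitMb + "m";
    }

    public static void main(String[] args) {
        double observedMb = 1043.1; // from the exception above
        double limitMb = 1024.0;    // the limit reported in the exception above
        if (exceedsLimit(observedMb, limitMb)) {
            // 1.5 is an arbitrary headroom factor chosen for this sketch.
            System.out.println(toConfFlag(suggestLimitMb(observedMb, 1.5)));
        }
    }
}
```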
PanDupa
@PanDupa
May 14 2016 19:45
Yee :( I have downloaded https://www.jetbrains.com/idea/download/#section=linux community version
seems to be 64bit
jvm is also 64bit
raver119
@raver119
May 14 2016 19:46
nono
with intellij even if you download 64 bit
PanDupa
@PanDupa
May 14 2016 19:47
hmm?
raver119
@raver119
May 14 2016 19:47
there's also 32 bit launcher
PanDupa
@PanDupa
May 14 2016 19:47
aww
raver119
@raver119
May 14 2016 19:47
so please check
which one you're using
PanDupa
@PanDupa
May 14 2016 19:47
sure, where is that in settings?
raver119
@raver119
May 14 2016 19:47
it's not in settings, just check your launcher icon
where it points to
idea64.sh or 32
PanDupa
@PanDupa
May 14 2016 19:49
i am running from bin/idea.sh, there is no 64 or 32
raver119
@raver119
May 14 2016 19:50
that's 64
should be
no ideas beyond that
i've spent week on windows, so might be i dont know about some changes :(
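
One way to settle the 32 vs 64 bit question without looking at launcher scripts is to ask the JVM itself. This is a minimal sketch with made-up names. It relies on the `sun.arch.data.model` system property, which is HotSpot-specific and may be missing on other JVMs, so it falls back to `os.arch`. It reports on the JVM that runs the snippet, so it should be run with the same run configuration as the failing test.

```java
// Hypothetical diagnostic; run it with the same JVM that runs the failing test.
class JvmBitness {

    /**
     * Guess the JVM word size from two system properties.
     * "sun.arch.data.model" is HotSpot-specific and may be null elsewhere,
     * so os.arch is used as a fallback.
     */
    static String describe(String dataModel, String osArch) {
        if ("64".equals(dataModel) || "32".equals(dataModel)) {
            return dataModel + "-bit";
        }
        if (osArch != null && osArch.contains("64")) {
            return "64-bit";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        String dataModel = System.getProperty("sun.arch.data.model");
        String osArch = System.getProperty("os.arch");
        System.out.println("JVM: " + describe(dataModel, osArch));
        System.out.println("java.home: " + System.getProperty("java.home"));
    }
}
```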
PanDupa
@PanDupa
May 14 2016 19:52
shit :( ok then :( Ill try to fight further :(
thx for try ;)
PanDupa
@PanDupa
May 14 2016 19:59
(OpExecutioner)opExecutionerClazz.newInstance(); what is that OpExecutioner?
ChrisN
@chrisvnicholson
May 14 2016 21:44
check out the nd4j javadoc for that: http://nd4j.org/doc/
ianpjohnson
@ianpjohnson
May 14 2016 22:00
Do we have any BALL PARK figures on what to expect performance wise from v3.9-native vs v3.8 on, say, CNN, RNN & autoencoders yet ? (I see CUDA is still very much in the air) - as part of my sanity check
Adam Gibson
@agibsonccc
May 14 2016 22:00
not really
raver sorta does?
we haven't tested it for all possible permutations of cases
frankly I'm not going to let that hold up a release
ianpjohnson
@ianpjohnson
May 14 2016 22:02
no prob - as i said its's not a complaint, just to establish i am not doing something wrong - thx
Adam Gibson
@agibsonccc
May 14 2016 22:03
haha I know
we definitely still have work to do yet
we'll be continuing perf improvements for a bit yet
still some low hanging fruit
raver119
@raver119
May 14 2016 22:04
cuda situation changes twice a day
ianpjohnson
@ianpjohnson
May 14 2016 22:04
Looks like i might be warming up my GPU then :)
raver119
@raver119
May 14 2016 22:04
right now i've got problem with speed - when it came to 1k ms on benchmark, gc is almost idle
ianpjohnson
@ianpjohnson
May 14 2016 22:05
Nothing special - a GTX 970 and a GTX 780 - and not even gaming on them - lol
raver119
@raver119
May 14 2016 22:05
so 1.2k ms -> 1k ms is kinda cool
but damn
what should i do with that now :/
ianpjohnson
@ianpjohnson
May 14 2016 22:06
Call the Aeron guys - lol
PanDupa
@PanDupa
May 14 2016 22:22
Yes ChrisN checked that. I thought it might be sth with openblas :/ still I couldn't resolve that problem, anyone?
some ideas?
Adam Gibson
@agibsonccc
May 14 2016 22:26
@PanDupa why are you trying to monkey with the internals?
Just use Nd4j.getOpExecutioner()
we set it up for you
that's meant to be a singleton
At some point we'll make that a bit cleaner with a context
we have that in the internals now but haven't really exposed/documented in a meaningful way for users
FWIW though, the purpose of an nd4j backend is to set all that up for you
you really shouldn't need it
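
To illustrate the singleton point above: the sketch below uses hypothetical stand-in classes (not the nd4j ones) to show the difference between asking a factory for its shared instance and creating a new one reflectively, as the quoted `newInstance()` line does. It is only a picture of the general pattern, assuming nothing about how nd4j implements it.

```java
// All names here are hypothetical stand-ins, not nd4j classes.
class SingletonSketch {

    /** Stand-in for an executioner-like object. */
    static class Executioner {
        Executioner() {}
    }

    /** Stand-in for a factory that owns one shared instance. */
    static class Factory {
        private static final Executioner SHARED = new Executioner();

        static Executioner getExecutioner() {
            return SHARED;
        }
    }

    /** Create a fresh instance reflectively, bypassing the shared one. */
    static Executioner reflectiveInstance() {
        try {
            return Executioner.class.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The accessor hands out the same object every time.
        System.out.println(Factory.getExecutioner() == Factory.getExecutioner());
        // A reflective instance is a different object from the shared one.
        System.out.println(reflectiveInstance() == Factory.getExecutioner());
    }
}
```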
PanDupa
@PanDupa
May 14 2016 22:28
well
Adam Gibson
@agibsonccc
May 14 2016 22:28
There's no "well" ;)
Just: use it as is
PanDupa
@PanDupa
May 14 2016 22:29

java.lang.UnsatisfiedLinkError: no jnind4j in java.library.path

at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:654)
at org.bytedeco.javacpp.Loader.load(Loader.java:492)
at org.nd4j.nativeblas.NativeOps.<clinit>(NativeOps.java:26)
at org.nd4j.linalg.cpu.nativecpu.ops.NativeOpExecutioner.<init>(NativeOpExecutioner.java:27)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at java.lang.Class.newInstance(Class.java:442)
at org.nd4j.linalg.factory.Nd4j.initWithBackend(Nd4j.java:4770)
at org.nd4j.linalg.factory.Nd4j.initContext(Nd4j.java:4716)
at org.nd4j.linalg.factory.Nd4j.<clinit>(Nd4j.java:148)
at org.deeplearning4j.nn.conf.MultiLayerConfiguration$Builder.build(MultiLayerConfiguration.java:324)
at org.deeplearning4j.nn.conf.NeuralNetConfiguration$ListBuilder.build(NeuralNetConfiguration.java:211)
at org.deeplearning4j.nn.layers.TestDropout.testDropoutSimple(TestDropout.java:44)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:119)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:42)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:234)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:74)

Caused by: java.lang.UnsatisfiedLinkError: no nd4j in java.library.path
at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1864)
at java.lang.Runtime.loadLibrary0(Runtime.java:870)
at java.lang.System.loadLibrary(System.java:1122)
at org.bytedeco.javacpp.Loader.loadLibrary(Loader.java:654)
at org.bytedeco.javacpp.Loader.load(Loader.java:483)
... 35 more

what about this exception then?
Adam Gibson
@agibsonccc
May 14 2016 22:29
That has zero (eg: nothing) to do with the internals
java.library.path is a jni thing
You didn't compile or setup libnd4j
if you don't want to monkey around with compiling c++ wait till monday please
otherwise feel free to set this up: https://github.com/deeplearning4j/libnd4j
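
Since java.library.path is a plain JVM system property, it can be inspected directly. The sketch below uses made-up class and method names. It lists which directories on that path contain the platform-specific file names for `jnind4j` and `nd4j`, the two names in the trace. It only looks where the `System.loadLibrary` call in the trace looks, so an empty result says nothing about whether a backend jar on the classpath would supply the library.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical diagnostic for "no <lib> in java.library.path" errors.
class LibraryPathCheck {

    /** Split a path list such as java.library.path into its directories. */
    static List<String> splitPath(String pathList, String separator) {
        List<String> dirs = new ArrayList<>();
        if (pathList == null) {
            return dirs;
        }
        for (String dir : pathList.split(Pattern.quote(separator))) {
            if (!dir.isEmpty()) {
                dirs.add(dir);
            }
        }
        return dirs;
    }

    /** Return the directories that contain the platform-specific file for libName. */
    static List<String> findLibrary(String libName, List<String> dirs) {
        // mapLibraryName yields e.g. libfoo.so on Linux or foo.dll on Windows.
        String fileName = System.mapLibraryName(libName);
        List<String> hits = new ArrayList<>();
        for (String dir : dirs) {
            if (new File(dir, fileName).isFile()) {
                hits.add(dir);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> dirs =
            splitPath(System.getProperty("java.library.path"), File.pathSeparator);
        System.out.println("java.library.path entries: " + dirs);
        for (String lib : new String[] {"jnind4j", "nd4j"}) {
            System.out.println(System.mapLibraryName(lib) + " found in: " + findLibrary(lib, dirs));
        }
    }
}
```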
PanDupa
@PanDupa
May 14 2016 22:30
I think that i have set up it already
Adam Gibson
@agibsonccc
May 14 2016 22:30
So you followed those instructions?
No matter what in all situations
that error message isn't specific to our internals
that's a general jni problem
No one else is running in to that
I almost guarantee it's specific to how you set it up
PanDupa
@PanDupa
May 14 2016 22:31
i did follow instructions
Adam Gibson
@agibsonccc
May 14 2016 22:31
right
But I guarantee there's something off with your env/how it's being run
PanDupa
@PanDupa
May 14 2016 22:32
You are probably right since i got java.lang.UnsatisfiedLinkError
xD
Adam Gibson
@agibsonccc
May 14 2016 22:32
All I can say right now is write a blog post (I mean a github issue) on your environment and make it so we can reproduce it
I know for a fact I'm right
We've had a lot of people running this
I have to have a bit of skepticism
it's not 1 off
Go here: https://github.com/deeplearning4j/libnd4j/issues and we'll see what we can do
click new issue
PanDupa
@PanDupa
May 14 2016 22:34
K I can make this issue. What information do I need to provide?
Adam Gibson
@agibsonccc
May 14 2016 22:34
Everything
your ide
java version
OS
the code you're trying to run if possible (maybe in a stripped down form)
how you're trying to run it
relevant environment variables
PanDupa
@PanDupa
May 14 2016 22:35
k
What happens on Monday? Will it be easier to run then?
Adam Gibson
@agibsonccc
May 14 2016 22:35
A release to maven central etc?
I mean like I said
there's over 90 people in this channel
we've had a lot of people testing this with us
I highly doubt what you're doing is that special
You never know (hence me telling you to file an issue etc)
but I have to have some level of skepticism about linkage at this point
javacpp shouldn't be a problem at this point
PanDupa
@PanDupa
May 14 2016 22:37
Ye, ye I know
Adam Gibson
@agibsonccc
May 14 2016 22:37
environment will help a lot here
if there is something wrong somewhere we can update the docs with missing instructions
not that huge of a deal, I just have to be careful calling it a bug at this point
PanDupa
@PanDupa
May 14 2016 22:38
what about the cuda part of the libnd4j instructions? I omitted it since I want to run on cpu
Adam Gibson
@agibsonccc
May 14 2016 22:39
Put all of that in a github issue
I have a plane to catch in a bit
that's the kind of stuff we need
worst case scenario we close it really quickly because the issue is obvious
I'd rather have an info dump than you trying to figure out what to report
PanDupa
@PanDupa
May 14 2016 22:40
Ok
Adam Gibson
@agibsonccc
May 14 2016 22:40
thanks
PanDupa
@PanDupa
May 14 2016 22:58
holy shit! It works now... I know what was the issue.
ianpjohnson
@ianpjohnson
May 14 2016 22:59
You were lacking faith :-)
PanDupa
@PanDupa
May 14 2016 22:59
yeeaa.
Adam Gibson
@agibsonccc
May 14 2016 22:59
good to hear
These things are usually weird/one off
What was it?
PanDupa
@PanDupa
May 14 2016 23:02
Firstly I was getting another error (the cause was a wrongly installed openblas) and had issues with the wrong backend. I was struggling not to run on native backend and I saw nd4j-jcublas in pom.xml. I thought it was only necessary for cuda so I commented it out and forgot about that. I reinstalled openblas and got that error ;d A few hours later I uncommented it and it works fine now. My fault.
I am done. Goodnight everyone :)
raver119
@raver119
May 14 2016 23:34
@treo At iteration 15 a single iteration takes 495 MILLISECONDS
Alex Black
@AlexDBlack
May 14 2016 23:36
@raver119 that's RNNs?
Justin Long
@crockpotveggies
May 14 2016 23:51
I understand that Spark isn't optimized so anyways...I noticed that iteration 0 is taking 20 minutes on my Spark cluster whereas on my machine it takes around 1-5 minutes...
given that gap, is this a topology issue or is the SparkDL4JMultilayer just not that optimized yet?
Adam Gibson
@agibsonccc
May 14 2016 23:52
I mean we really just don't have documented numbers
I haven't profiled our spark traffic with wireshark or done any insane load testing
we really just haven't had the bandwidth
I don't even have a good grasp on where the problems are yet
Just haven't looked at it
We knew there were memory limitations when trying to run this thing on imagenet
This would take a week of benchmarking likely (assuming no dev time at all, not possible) to nail the question you're asking here
there's too many configurations for us to look at
From nailing what workloads to be looking at to batch sizes/partitions
we really just don't have numbers
Justin Long
@crockpotveggies
May 14 2016 23:55
fair enough
I'm skeptical about my cluster topology in a number of ways
so let me try a couple things:
1) ditch Docker/Weave virtual network
Adam Gibson
@agibsonccc
May 14 2016 23:55
We aren't something to validate your cluster config against
Justin Long
@crockpotveggies
May 14 2016 23:56
yup
Adam Gibson
@agibsonccc
May 14 2016 23:56
if we had control variables maybe
Justin Long
@crockpotveggies
May 14 2016 23:56
2) go native and try Spark standalone
the benefit of me having a proper cluster set up is I can start to feed control variables (as discussed in the past) and provide profiling data
Adam Gibson
@agibsonccc
May 14 2016 23:59
What I'd like are different workloads with different heap sizes for executors
We also need to profile activity on the driver as well
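
The "different workloads with different heap sizes" idea above amounts to a small test matrix. Here is a rough sketch of enumerating it as spark-submit argument fragments. The workload class names and heap sizes are placeholders invented for this example. `spark.executor.memory` and `--class` are standard Spark options, but the rendered strings are fragments only and still need the application jar and the rest of the submit command.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a benchmark matrix; workload names are placeholders.
class BenchmarkGrid {

    /** One spark-submit argument fragment per (workload, executor heap) pair. */
    static List<String> submitArgs(String[] workloads, String[] executorHeaps) {
        List<String> runs = new ArrayList<>();
        for (String workload : workloads) {
            for (String heap : executorHeaps) {
                runs.add("--conf spark.executor.memory=" + heap + " --class " + workload);
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Placeholder main classes and heap sizes, chosen only for illustration.
        String[] workloads = {"example.CnnWorkload", "example.AutoencoderWorkload"};
        String[] heaps = {"4g", "8g", "16g"};
        for (String run : submitArgs(workloads, heaps)) {
            System.out.println(run);
        }
    }
}
```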