    jxtps
    @jxtps
    @rnett yeah, that sounds like the culprit, thanks! @Craigacp I think you're absolutely right about that, but 30 gigs for TFJ!? I run two models, one "small" and one "large", but that's an outrageous amount of RAM being used?! The graphics card they trained on had 32 gigs, and used batches!?
    Adam Pocock
    @Craigacp
    So it might not actually be a lot of memory, just an interaction between JavaCPP's accounting (which uses Linux's resident set size, which is known to be inaccurate in ZGC's use case) and ZGC.
    jxtps
    @jxtps
    Ah, ok, yeah, 3x makes a big difference. No specific switches for ZGC, I just turned it on. Xmx is set to 60% of whatever free -m returns (Mem row, Total column).
    Sounds like I'll be reverting some of those recent changes then, thanks!
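For reference, that Xmx heuristic (60% of the total shown by free -m) can be scripted; this is a sketch assuming the standard procps free output, where the awk pattern and the 0.6 factor just encode the rule described above:

```shell
# Take the Total column of the Mem row from `free -m` (MiB) and use 60% of it
XMX_MB=$(free -m | awk '/^Mem:/ {print int($2 * 0.6)}')
echo "-Xmx${XMX_MB}m"
```

The printed value can then be passed straight to the java command line.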
    Adam Pocock
    @Craigacp
    I think that turning off JavaCPP's native memory accounting should be sufficient.
    You're actually staying within the acceptable memory limits, but the RSS based method that JavaCPP uses can't figure that out (because Linux doesn't know that either).
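JavaCPP's native memory accounting is controlled through system properties. A config sketch of turning the checks off, assuming JavaCPP's documented maxbytes/maxphysicalbytes limit properties (a value of 0 disables the corresponding check; app.jar is a placeholder for your application):

```shell
# Disable JavaCPP's deallocation limit and its RSS-based physical memory check,
# so ZGC's inflated resident set size no longer trips the accounting.
java -Dorg.bytedeco.javacpp.maxbytes=0 \
     -Dorg.bytedeco.javacpp.maxphysicalbytes=0 \
     -XX:+UseZGC \
     -jar app.jar
```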
    Gili Tzabari
    @cowwoc
    Based on https://docs.oracle.com/en/java/javase/15/gctuning/available-collectors.html#GUID-C7B19628-27BA-4945-9004-EC0F08C76003 it sounds like one should use the parallel or G1 garbage collector tuned for max throughput when training a model. That said, ZGC might still make sense when evaluating a pre-trained model.
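The collector choices from that tuning guide map to standard HotSpot flags; a config sketch (the `...` stands for the rest of your java command line):

```shell
# Throughput-oriented collectors, suited to training workloads:
java -XX:+UseParallelGC ...   # maximum throughput
java -XX:+UseG1GC ...         # balanced throughput/latency, the default since JDK 9
# Latency-oriented collector, e.g. for serving a pre-trained model:
java -XX:+UseZGC ...
```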
    Gili Tzabari
    @cowwoc
    Hey. I posted a modeling question over at https://ai.stackexchange.com/q/27657/46787. I would appreciate your advice even though it's not TensorFlow-specific.
    Jakob Sultan Ericsson
    @jakeri

    Hello, we run TensorFlow Java 0.2.0 (TF 2.3.1). We have a model that produces an image as a byte[]. This works fine when running the model in Python: we can write the bytes to a file. But when trying to do the same thing with TF Java, we get too few bytes from the output. I think I have managed to boil it down to a unit test with TString.

        @Test
        public void testTFNdArray() throws Exception {
            ClassLoader contextClassLoader = Thread.currentThread().getContextClassLoader();
            byte[] readAllBytes = Files.readAllBytes(Path.of(contextClassLoader.getResource("img_12.png").getFile()));
            NdArray<byte[]> vectorOfObjects = NdArrays.vectorOfObjects(readAllBytes);

            Tensor<TString> tensorOfBytes = TString.tensorOfBytes(vectorOfObjects);
            TString data = tensorOfBytes.data();

            byte[] asBytes = tensorOfBytes.data().asBytes().getObject(0);

            System.out.println("Bytes original file: " + readAllBytes.length);
            System.out.println("NdArray byte[] length: " + vectorOfObjects.getObject().length);
            System.out.println("Tensor numbytes: " + tensorOfBytes.numBytes());
            System.out.println("TString size: " + data.size());
            System.out.println("Bytes with reading from TString (WRONG):  " + asBytes.length);
        }

    This is the same problem I get when running through a real model. How can we get the full byte[] out again?

    Samuel Audet
    @saudet
    The implementation of TString has changed a lot recently. Could you please try again with TF Java 0.3.1?
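For reference, in the 0.3.x API the Tensor and NdArray interfaces were merged, so there is no separate data() call any more. A sketch of the round trip from the earlier test against the 0.3.x org.tensorflow.types API (img_12.png path as in the original test):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import org.tensorflow.ndarray.NdArray;
import org.tensorflow.ndarray.NdArrays;
import org.tensorflow.types.TString;

public class TStringRoundTrip {
    public static void main(String[] args) throws Exception {
        byte[] original = Files.readAllBytes(Path.of("img_12.png"));

        // In 0.3.x a TString is itself an NdArray<String>; asBytes() exposes
        // the raw byte[] view without going through String decoding.
        NdArray<byte[]> vector = NdArrays.vectorOfObjects(original);
        try (TString tensor = TString.tensorOfBytes(vector)) {
            byte[] roundTripped = tensor.asBytes().getObject(0);
            System.out.println(original.length == roundTripped.length);
        }
    }
}
```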
    Jakob Sultan Ericsson
    @jakeri
    Thanks, I will give it a try.
    We wanted to stay on the same version as the generated Python code.
    Jakob Sultan Ericsson
    @jakeri
    I rewrote the above test with 0.3.1 and the results seem somewhat better, but the process core dumps 9 out of 10 times. On my MacBook Pro, Big Sur 11.3.1.
    Jakob Sultan Ericsson
    @jakeri
    It seems to be enough to write TString.scalarOf("hello"); to get the core dump.
    Jakob Sultan Ericsson
    @jakeri
    This doesn't fail if I run it in a Linux VM, so the problem is probably isolated to macOS (our dev environment).
    Adam Pocock
    @Craigacp
    What does your version of the test look like with 0.3.1?
    Jakob Sultan Ericsson
    @jakeri
        @Test
        public void testTString() throws Exception {
            TString.scalarOf("hello");
        }
    The JVM core dumps.
    But running this in a Docker Linux VM, the test passes.
    It fails 9 out of 10 times, and the dump usually looks like this.
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x0000000133cf0fbb, pid=19618, tid=5891
    #
    # JRE version: OpenJDK Runtime Environment (11.0.2+9) (build 11.0.2+9)
    # Java VM: OpenJDK 64-Bit Server VM (11.0.2+9, mixed mode, tiered, compressed oops, g1 gc, bsd-amd64)
    # Problematic frame:
    # C  [libtensorflow_cc.2.dylib+0x8228fbb]  TF_TensorData+0xb
    #
    # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # An error report file with more information is saved as:
    # /Users/jakob/proj/tfjava031/hs_err_pid19618.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    # The crash happened outside the Java Virtual Machine in native code.
    # See problematic frame for where to report the bug.
    #
    Adam Pocock
    @Craigacp
    Ok, I can replicate that on macOS. I wonder why it didn't get caught by the tests.
    Can you open an issue here - https://github.com/tensorflow/java/issues - and we'll move the discussion over to GitHub.
    Jakob Sultan Ericsson
    @jakeri
    Thanks!
    Will open a ticket
    Adam Pocock
    @Craigacp
    This is very odd. I can't make a scalar TString from jshell or when using a different project, however pasting that test into the TF-Java TString tests and running it does not fail.
    Karl Lessard
    @karllessard
    It is strange, yes, because we do have unit tests creating string tensors, and those are run on all platforms.
    Jakob Sultan Ericsson
    @jakeri
    tensorflow/java#320 issue created.
    It might be a threading issue, because it actually succeeded ~1 out of 10 times; or it might depend on how many cores the unit test runs on.
    Gili Tzabari
    @cowwoc
    Hey guys, is there any documentation for https://github.com/tensorflow/java/tree/master/tensorflow-framework beyond the tiny readme file? For example, is there any public Javadoc for it equivalent to what https://javadoc.io/doc/org.tensorflow/tensorflow-core-api/latest/index.html does for the core API?
    Also javadoc.io will auto generate the javadoc for an artifact if you try to load it (which I just did for tensorflow-framework)
    Gili Tzabari
    @cowwoc
    Oh, very cool. I did not know that. Thank you!
    Out of curiosity, how do I know which version of the API https://www.tensorflow.org/jvm/api_docs/java/org/tensorflow/package-summary is written against? I don't see any versioning information.
    The API changes enough that it would be helpful to know if I'm looking at outdated documentation.
    Adam Pocock
    @Craigacp
    It's the docs from 0.3 at the moment, because that was when we got docs generation working. I believe in the future we'll have the version dropdown that Python has, but I'm not sure if that's been set up yet.
    Gili Tzabari
    @cowwoc

    Are there any examples that show the usage of the Java framework API for a basic workflow like this?

    training_images = training_images/255.0
    test_images = test_images/255.0
    
    model = tf.keras.models.Sequential([#tf.keras.layers.Flatten(),
                                        tf.keras.layers.Dense(64, activation=tf.nn.relu),
                                        tf.keras.layers.Dense(10, activation=tf.nn.softmax)])
    
    model.compile(optimizer = 'adam',
                  loss = 'sparse_categorical_crossentropy')
    
    model.fit(training_images, training_labels, epochs=5)
    
    model.evaluate(test_images, test_labels)
    
    classifications = model.predict(test_images)

    Specifically, I don't see the equivalent of models.Sequential, model.compile, model.fit etc.

    Adam Pocock
    @Craigacp
    We haven't got an implementation of Model yet. It's being worked on. Ditto layers.
    Gili Tzabari
    @cowwoc
    Okay, let's go with the flow then... Say I want to train in Python and run model.predict() in Java.
    1. What is the easiest way to build a dataset in Java and pass it into Python for training? I want to leverage existing Java code to retrieve the input dataset from a database; I don't want to reimplement it in Python.
    2. Is there an easy way for me to load a graph in Java that was constructed and trained in Python? Can you point me at any sample/test code?
    Adam Pocock
    @Craigacp
    If you want to train an MLP and specify it purely in Java you can do that.
    It's just lower level than Keras.
    For 1, you could emit it as TFRecords, but I'm not sure we've got saving support lined up for that. For 2, yes, you can load in a SavedModel; see https://github.com/tensorflow/java-models
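For question 2, loading and running a Python-trained SavedModel from Java looks roughly like this. A sketch only: the model directory and the input/output tensor names below are placeholders that depend on your export (inspect them with `saved_model_cli show --dir model/ --all`):

```java
import org.tensorflow.SavedModelBundle;
import org.tensorflow.Tensor;
import org.tensorflow.ndarray.StdArrays;
import org.tensorflow.types.TFloat32;

public class Predict {
    public static void main(String[] args) {
        // "model/", "serving_default_input:0" and "StatefulPartitionedCall:0"
        // are placeholders; substitute the names from your own export.
        try (SavedModelBundle model = SavedModelBundle.load("model/", "serve");
             TFloat32 input = TFloat32.tensorOf(StdArrays.ndCopyOf(new float[][] {{0.1f, 0.2f}}))) {
            try (Tensor output = model.session().runner()
                    .feed("serving_default_input:0", input)
                    .fetch("StatefulPartitionedCall:0")
                    .run().get(0)) {
                System.out.println(output.shape());
            }
        }
    }
}
```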
    Gili Tzabari
    @cowwoc
    Thank you. If one of my placeholders is a matrix of TFloat32, what's the best way to populate the matrix column by column? I assume I can't push values directly into the Placeholder. Do I invoke Ops.tensorArray() of the same size as the placeholder, then populate that, then eventually invoke Runner.feed(placeholder, array)? Or is there a better way?
    Adam Pocock
    @Craigacp
    Create a TFloat32, either an empty one or by copying from some source, then feed it to the placeholder. Note that the empty one returns memory which has not been zeroed yet; see tensorflow/java#271.
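A sketch of both options against the 0.3.x ndarray API (the 2x3 shape and values are placeholders):

```java
import org.tensorflow.ndarray.Shape;
import org.tensorflow.ndarray.StdArrays;
import org.tensorflow.types.TFloat32;

public class FillTensor {
    public static void main(String[] args) {
        // Option 1: copy from an existing Java array.
        float[][] source = {{1f, 2f, 3f}, {4f, 5f, 6f}};
        try (TFloat32 copied = TFloat32.tensorOf(StdArrays.ndCopyOf(source))) {
            System.out.println(copied.getFloat(1, 2)); // 6.0
        }

        // Option 2: allocate an empty tensor and fill it column by column.
        // Beware: the memory is not zeroed (tensorflow/java#271), so write every cell.
        try (TFloat32 empty = TFloat32.tensorOf(Shape.of(2, 3))) {
            for (long col = 0; col < 3; col++) {
                for (long row = 0; row < 2; row++) {
                    empty.setFloat(source[(int) row][(int) col], row, col);
                }
            }
            // Then feed it: session.runner().feed(placeholderName, empty)...
        }
    }
}
```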
    Gili Tzabari
    @cowwoc
    Thanks. How about creating the model in Python but running model.fit() and model.predict() from Java? Do you have any example code for doing that?
    Adam Pocock
    @Craigacp
    I had code for doing it for models emitted by TF 1. Keras's models look different and have different entry points, so while it's probably possible to pull the graph out of a function such that you could train it, it's probably quite tricky with the tools we have at the moment.
    Gili Tzabari
    @cowwoc
    I assume you guys haven't added many of the debugging tools that are present on the Python side, so I think it's best if I construct the model in Python and debug model.fit() until I get something working... then I can move model.fit() to Java, and model.predict() would always sit in Java.
    Adam Pocock
    @Craigacp

    So to some extent it depends how complicated your model is. Tribuo's next release exposes TF models but wraps up all the fitting, evaluation and prediction in its interface to make it a lot simpler. It's not the same as Keras; it's a little more like scikit-learn, as we don't have callbacks in Tribuo.

    However, TF-Java will have this in the future; it's just a lot of stuff to build with a much smaller team than the Keras team.

    Gili Tzabari
    @cowwoc
    Does it really matter how complex the model is? Don't I just need to access the inputs and outputs? I don't need to touch the hidden layers.
    Adam Pocock
    @Craigacp
    Well writing a complex model in Java is more painful than doing it in Python at the moment as we don't have Keras style layers.
    Gili Tzabari
    @cowwoc
    Right, but remember I said that for now I plan to write the model in Python and just do model.fit() and model.predict() in Java.
    Adam Pocock
    @Craigacp
    Yes, but figuring out how to save out a Keras model such that you can fit it in Java is very difficult.