    Stanislav Zemlyakov
    @rcd27
    Is there a way to expand NdArray with one dimension?
    Adam Pocock
    @Craigacp
    Yes, but as it's stored as a blob of memory, all it does is change the shape variable, so I don't think it's likely to fix your issue (assuming you're using code similar to what you posted above)
    Stanislav Zemlyakov
    @rcd27
    @Craigacp I'll try to approach it from the other side: train the model so that it expects shape = (30, 30, 3)
    Adam Pocock
    @Craigacp
    No, I think you probably want to keep the batch size as the leading dimension. An ndarray of shape (1, 30, 30, 3) has the same data as one of shape (30, 30, 3), but you should check that you're passing the pixels in the right order and that your data is laid out in row-major order.
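    A minimal sketch of that idea (assuming the TF Java 0.3.x API; the names are illustrative): the same row-major pixel data can be wrapped with a leading batch dimension of 1, with no retraining needed.

        import org.tensorflow.ndarray.Shape;
        import org.tensorflow.ndarray.buffer.DataBuffers;
        import org.tensorflow.types.TFloat32;

        float[] pixels = new float[30 * 30 * 3]; // raw RGB values in row-major order
        // Same data, viewed as a batch of one: shape (1, 30, 30, 3)
        try (TFloat32 input = TFloat32.tensorOf(Shape.of(1, 30, 30, 3), DataBuffers.of(pixels))) {
            // feed `input` to the model here
        }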
    Stanislav Zemlyakov
    @rcd27
    @Craigacp I've got a 1.0 value from one of the resulting TFloat elements!
    Stanislav Zemlyakov
    @rcd27
    That's a big deal for me, a person who knows a bit more than nothing about how TF works. Thank you guys so much! I'll continue implementing my idea. But I also want to understand TF. Any recommendations for courses?
    I understand Python, but my main stack is Java/Kotlin (so I will probably train models in Python infrastructure and use them in Kotlin)
    Stanislav Zemlyakov
    @rcd27
    I would never have gotten this done without tensorflow/java and this open community, which helps people. Damn... Special thanks to @karllessard and @Craigacp
    Stanislav Zemlyakov
    @rcd27
    @karllessard I've updated your stackoverflow answer: https://stackoverflow.com/a/67289545/6748943
    Karl Lessard
    @karllessard
    @rcd27, it is pretty common to normalize the image data when converting it to floats as well, i.e. dividing it by 255.0f. See here, we use TF eagerly to execute this preprocessing. If you were using Keras in Python, then your data was already normalized.
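    A hedged sketch of that preprocessing with the TF Java 0.3.x eager API (pixelArray is an illustrative placeholder for the decoded image):

        import org.tensorflow.Operand;
        import org.tensorflow.op.Ops;
        import org.tensorflow.types.TFloat32;

        Ops tf = Ops.create(); // default eager session, so ops execute immediately
        float[][][] pixelArray = new float[30][30][3]; // raw 0-255 pixel values (placeholder)
        Operand<TFloat32> pixels = tf.constant(pixelArray);
        // Scale to [0, 1] by dividing by 255.0f
        Operand<TFloat32> normalized = tf.math.div(pixels, tf.constant(255.0f));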
    Adam Pocock
    @Craigacp
    Understanding TF is probably pretty tricky from a standing start. It depends on what your goals are and your current familiarity with ML/DL.
    Adam Pocock
    @Craigacp
    I'm seeing some deeply weird non-determinism in the training code, which seems to be in the way that the TF C API generates gradients. I've opened an issue upstream, but if people are seeing non-determinism in their training runs, this might be why - tensorflow/tensorflow#48855
    jxtps
    @jxtps

    I've run into a weird Out Of Memory issue with TFJ 0.3.1. I initialize a couple of models on server startup & process a couple of images. CPU, not GPU. That works. Then the server idles for some time (half an hour?). Then if I process another image, it sometimes throws java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes (57976M) > maxPhysicalBytes (37568M) at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:695). The server was typically using <10 gigs (as reported by Java, not Pointer.physicalBytes) before this, so it's very weird that it would shoot up to almost 60 gigs with no TF usage!?

    This all started happening after I introduced a GPU microservice - the servers used to process one image per couple of seconds with CPU inference and never crashed, but now that they only process an image if the GPU microservice is being too slow (= fallback), they've all of a sudden started crashing!? The GPU microservice is holding up fine - it's processing say 1 image per second or something and not crashing.

    Now, the resident set (RES column in top) is vastly greater than the RAM usage reported by Java - I'm seeing, for example, >50 gigs of RES but <10 gigs of Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory().
    Any suggestions for how to track this down? Is there a way to list / enumerate the RAM used by TFJ?
    jxtps
    @jxtps
    And any ideas why RES (= 50 gigs) would be so much higher than e.g. Runtime.getRuntime().totalMemory() (= 20 gigs)?
    Hmm... I'm realizing we switched from G1 GC to ZGC in the interim (unrelated issues on a separate site where G1 GC would mode-switch and start using >20% CPU after 12-24 hours of significant memory churn). We also switched from Java 14 to 15 (couldn't switch to 16 due to issues with the Play framework not working with that version).
    Ryan Nett
    @rnett
    @jxtps Try using G1 again maybe? tensorflow/java#315 seems like the same issue, but just w/ ZGC
    Adam Pocock
    @Craigacp
    I don't think Runtime.getRuntime().totalMemory() includes memory allocated via JNI (e.g. inside TF or through JavaCPP).
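    For a rough picture of the off-heap side, JavaCPP's Pointer class exposes a few counters (a diagnostic sketch; JavaCPP is already on the classpath as a TF Java dependency):

        import org.bytedeco.javacpp.Pointer;

        long heapUsed = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
        System.out.println("JVM heap in use:        " + heapUsed);
        System.out.println("JavaCPP tracked bytes:  " + Pointer.totalBytes());     // native allocations JavaCPP knows about
        System.out.println("Process physical bytes: " + Pointer.physicalBytes());  // resident set size, as in the OOM message
        System.out.println("JavaCPP physical limit: " + Pointer.maxPhysicalBytes());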
    Adam Pocock
    @Craigacp
    What GC tuning parameters are you setting in addition to the use of ZGC? Things like Xmx, and then anything ZGC-specific, or -XX:+DisableExplicitGC etc. ZGC maps the memory multiple times, which is known to inflate the resident set size (https://mail.openjdk.java.net/pipermail/zgc-dev/2018-November/000540.html), throwing off JavaCPP's calculation.
    jxtps
    @jxtps
    @rnett yeah, that sounds like the culprit, thanks! @Craigacp I think you're absolutely right about that, but 30 gigs for TFJ!? I run two models, one "small" and one "large", but that's an outrageous amount of RAM being used?! The graphics card they were trained on had 32 gigs, and used batches!?
    Adam Pocock
    @Craigacp
    So it might not actually be a lot of memory - just an interaction between ZGC and JavaCPP's accounting, which relies on Linux's resident set size, known to be inaccurate in ZGC's use case.
    jxtps
    @jxtps
    Ah, ok, yeah, 3x makes a big difference. No specific switches for ZGC, just turned it on. Xmx is set to 60% of whatever free -m returns in the Mem row, Total column
    Sounds like I'll be reverting some of those recent changes then, thanks!
    Adam Pocock
    @Craigacp
    I think that turning off JavaCPP's native memory accounting should be sufficient.
    You're actually staying within the acceptable memory limits, but the RSS-based method that JavaCPP uses can't figure that out (because Linux doesn't know that either).
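    For reference, a sketch of how to lift those limits via JavaCPP's system properties (it treats 0 as "no limit"; the jar name is a placeholder):

        java -Dorg.bytedeco.javacpp.maxphysicalbytes=0 \
             -Dorg.bytedeco.javacpp.maxbytes=0 \
             -jar server.jar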
    Gili Tzabari
    @cowwoc
    Based on https://docs.oracle.com/en/java/javase/15/gctuning/available-collectors.html#GUID-C7B19628-27BA-4945-9004-EC0F08C76003 it sounds like you should be using the Parallel or G1 garbage collectors, tuned for max throughput, if you're training a model. That said, ZGC might still make sense in the context of evaluating a pre-trained model.
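    Illustratively (collector flags only; the jar names are placeholders):

        # throughput-oriented collector for long training runs
        java -XX:+UseParallelGC -jar train.jar
        # low-latency collector for serving a pre-trained model
        java -XX:+UseZGC -jar serve.jar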
    Gili Tzabari
    @cowwoc
    Hey. I posted a modeling question over at https://ai.stackexchange.com/q/27657/46787. I would appreciate your advice even though it's not TensorFlow-specific.
    Jakob Sultan Ericsson
    @jakeri

    Hello, we run TensorFlow Java 0.2.0 (TF 2.3.1), and we have a model that produces an image as a byte[]. This works fine when running the model in Python: we can write the bytes to a file. But when trying to do the same thing using TF Java, we get too few bytes from the output. I think I have managed to boil it down to a unit test with TString.

        // Imports assumed for TF Java 0.2.0, where the ndarray classes lived
        // under org.tensorflow.tools in the tensorflow-tools artifact:
        import java.nio.file.Files;
        import java.nio.file.Path;
        import org.tensorflow.Tensor;
        import org.tensorflow.tools.ndarray.NdArray;
        import org.tensorflow.tools.ndarray.NdArrays;
        import org.tensorflow.types.TString;

        public void testTFNdArray() throws Exception {
            // Load the raw PNG bytes from the test resources
            ClassLoader contextClassLoader = Thread.currentThread().getContextClassLoader();
            byte[] readAllBytes = Files.readAllBytes(Path.of(contextClassLoader.getResource("img_12.png").getFile()));

            // Wrap the byte[] as a rank-1 NdArray with a single element
            NdArray<byte[]> vectorOfObjects = NdArrays.vectorOfObjects(readAllBytes);

            // Build a string tensor from the bytes, then read them back out
            Tensor<TString> tensorOfBytes = TString.tensorOfBytes(vectorOfObjects);
            TString data = tensorOfBytes.data();
            byte[] asBytes = data.asBytes().getObject(0);

            System.out.println("Bytes original file: " + readAllBytes.length);
            System.out.println("NdArray byte[] length: " + vectorOfObjects.getObject(0).length);
            System.out.println("Tensor numbytes: " + tensorOfBytes.numBytes());
            System.out.println("TString size: " + data.size());
            System.out.println("Bytes when reading back from TString (WRONG): " + asBytes.length);
        }

    This is the same problem I get when running through a real model. How can we get the full byte[] out again?

    Samuel Audet
    @saudet
    The implementation of TString has changed a lot recently. Could you please try again with TF Java 0.3.1?
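    For reference, a hedged sketch of the same round trip against the 0.3.1 API, where Tensor<TString> and TString were merged into a single type:

        // 0.3.1-style: the tensor itself is the NdArray, no .data() step
        TString tensor = TString.tensorOfBytes(NdArrays.vectorOfObjects(readAllBytes));
        byte[] roundTripped = tensor.asBytes().getObject(0);
        System.out.println("Bytes after round trip: " + roundTripped.length);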
    Jakob Sultan Ericsson
    @jakeri
    Thanks, I will give it a try.
    We wanted to stay on the same version as the generated Python code.
    Jakob Sultan Ericsson
    @jakeri
    I rewrote the above test with 0.3.1 and the results seem somewhat better, but the process core dumps 9 times out of 10, on my MacBook Pro (Big Sur 11.3.1).
    Jakob Sultan Ericsson
    @jakeri
    It seems to be enough to write TString.scalarOf("hello"); to get the core dump.
    Jakob Sultan Ericsson
    @jakeri
    This doesn't fail if I run it in a Linux VM, so the problem is probably isolated to macOS (our dev env).
    Adam Pocock
    @Craigacp
    What does your version of the test look like with 0.3.1?
    Jakob Sultan Ericsson
    @jakeri
        @Test
        public void testTString() throws Exception {
            TString.scalarOf("hello");
        }
    The JVM core dumps.
    But running this in a Docker Linux VM, the test passes.
    It fails 9 times out of 10, and the dump usually looks like this:
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x0000000133cf0fbb, pid=19618, tid=5891
    #
    # JRE version: OpenJDK Runtime Environment (11.0.2+9) (build 11.0.2+9)
    # Java VM: OpenJDK 64-Bit Server VM (11.0.2+9, mixed mode, tiered, compressed oops, g1 gc, bsd-amd64)
    # Problematic frame:
    # C  [libtensorflow_cc.2.dylib+0x8228fbb]  TF_TensorData+0xb
    #
    # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # An error report file with more information is saved as:
    # /Users/jakob/proj/tfjava031/hs_err_pid19618.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    # The crash happened outside the Java Virtual Machine in native code.
    # See problematic frame for where to report the bug.
    #
    Adam Pocock
    @Craigacp
    Ok, I can replicate that on macOS. I wonder why it didn't get caught by the tests.
    Can you open an issue here: https://github.com/tensorflow/java/issues? We'll move the discussion over to GitHub.
    Jakob Sultan Ericsson
    @jakeri
    Thanks!
    Will open a ticket
    Adam Pocock
    @Craigacp
    This is very odd. I can't make a scalar TString from jshell or when using a different project; however, pasting that test into the TF-Java TString tests and running it does not fail.
    Karl Lessard
    @karllessard
    It is strange, yes, because we do have unit tests creating string tensors, and those are run on all platforms.
    Jakob Sultan Ericsson
    @jakeri
    Issue created: tensorflow/java#320.
    It might be a threading issue, because it actually succeeded ~1 time out of 10. Or it might depend on how many cores the unit test runs on.
    Gili Tzabari
    @cowwoc
    Hey guys, is there any documentation for https://github.com/tensorflow/java/tree/master/tensorflow-framework beyond the tiny readme file? For example, is there any public Javadoc for it equivalent to what https://javadoc.io/doc/org.tensorflow/tensorflow-core-api/latest/index.html does for the core API?
    Also, javadoc.io will auto-generate the Javadoc for an artifact if you try to load it (which I just did for tensorflow-framework).
    Gili Tzabari
    @cowwoc
    Oh, very cool. I did not know that. Thank you!