    Michael Merrill
    @mhmerrill
    and the amount of physical memory we can discover from the OS
    dgrichardson
    @dgrichardson
    So sounds like everything is working and it is basically using all the RAM. :)
    One thing I had a bit of trouble with was verifying IB was being used. I ended up doing a capture on the HCA and making sure there was RDMA traffic where I thought chapel was running. In the end it looked good.
    I also got a warning that CHPL_TARGET_CPU was set to unknown. Is this a big deal?
    Michael Merrill
    @mhmerrill
    @ronawho ^^^
    Elliot Ronaghan
    @ronawho
    CHPL_TARGET_CPU (https://chapel-lang.org/docs/usingchapel/chplenv.html#chpl-target-cpu) basically controls CPU specialization (-march in gcc). It defaults to unknown for multi-locale since we don't know if Chapel is cross-compiling. If your login node has the same ISA as the compute nodes you could set it to native. Otherwise, setting it to none will quiet the warning (though that will probably trigger a rebuild). In all the tests we've done, Arkouda performance does not benefit from target-architecture specialization at the moment, so leaving it unknown or setting it to none should not impact performance.
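[Editor's note] Elliot's two options translate to environment settings made before rebuilding Chapel; a minimal sketch (the rebuild step itself follows whatever build workflow you already use):

```shell
# If the login node has the same ISA as the compute nodes,
# specialize for the build machine's CPU:
export CHPL_TARGET_CPU=native

# Otherwise, silence the "unknown" warning without specializing
# (expect this to trigger a Chapel runtime rebuild):
export CHPL_TARGET_CPU=none
```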
    Michael Merrill
    @mhmerrill
    i was wrong on the call about my build times: old was ~190sec, new is 275sec
    so that is 85sec difference
    dgrichardson
    @dgrichardson
    We got ~1.2 TB of data into the server with pretty close to 100% CPU utilization on all 4 nodes when doing operations. Sometimes trying to use too much memory gives an error on the client. Other times it looks like the server crashes. When the crash happened the server printed this:

    /home/richard/chapel/arkouda//src/RadixSortLSD.chpl:159: error: Out of memory allocating "array elements"

    /home/richard/chapel/arkouda//src/RadixSortLSD.chpl:159: error: Out of memory allocating "array elements"

    /home/richard/chapel/arkouda//src/RadixSortLSD.chpl:159: error: Out of memory allocating "array elements"

    /home/richard/chapel/arkouda//src/RadixSortLSD.chpl:159: error: Out of memory allocating "array elements"

    Michael Merrill
    @mhmerrill
    we made some changes to the memory tracking recently and these are probably escapes from the tracking
    you could tune down the percentage of memory usable by the server
    using the command line parameter
    sorry, this is a finicky thing; this tracking is in place to try to keep the server from crashing, but sometimes after changes/PRs things get a little brittle
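[Editor's note] Michael doesn't name the command-line parameter here. Assuming the server exposes the percentage as a Chapel config constant (the flag name `--perLocaleMemLimit` below is an assumption based on Arkouda's ServerConfig; verify against `./arkouda_server --help` on your build), tuning it down would look like:

```shell
# Hypothetical flag name -- confirm with ./arkouda_server --help.
# Cap the server's tracked memory at 80% of per-locale physical memory:
./arkouda_server -nl 4 --perLocaleMemLimit=80
```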
    Elliot Ronaghan
    @ronawho
    The other aspect is that Arkouda memory tracking is based on available physical memory, and not all memory is available to Chapel in all configs. Could you try setting export GASNET_MAX_SEGSIZE="0.95/H" to see if that helps (should prevent you from getting Chapel OOMs, but you will likely still get the Arkouda client ones)
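[Editor's note] As a shell fragment, with the segment-size syntax spelled out (per GASNet's documentation, a fractional value suffixed with `/H` is interpreted relative to the host's physical memory):

```shell
# Ask GASNet for a segment of up to 95% of each host's physical memory,
# so Chapel sees (nearly) all of the RAM rather than a smaller default segment.
export GASNET_MAX_SEGSIZE="0.95/H"
```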
    Michael Merrill
    @mhmerrill
    @ronawho (I think) made a memory tracking change recently to make the server go faster but I think we now have an issue
    maybe
    Michael Merrill
    @mhmerrill
    @dgrichardson you could open an issue if you want in the arkouda repo detailing your crash
    please
    dgrichardson
    @dgrichardson
    There seem to be two sets of formatting in the stdout. Most have the timestamp and look like they are from some kind of logging module that formats things. The others have no timestamp and a line number from .chpl, and look more like a raw printf. Maybe those are the escapees? I'll make a report if I can get it to happen again. I didn't realize how much stdout there would be, and it has scrolled off my terminal.
    Is there a way from the client to query how much memory is being used?
    Michael Merrill
    @mhmerrill
    yes, ak.get_mem_used()
    or something like that
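[Editor's note] A minimal client-side sketch, assuming an arkouda_server is already running and reachable at localhost:5555 (the host/port and the unit returned by `ak.get_mem_used()` may vary by Arkouda version; this cannot run without a live server):

```python
import arkouda as ak

ak.connect("localhost", 5555)   # point at your running arkouda_server
used = ak.get_mem_used()        # bytes of symbol-table memory in use server-side
print(f"server memory used: {used / 2**30:.2f} GiB")
ak.disconnect()
```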
    dgrichardson
    @dgrichardson
    I also saw this in the server stdout, although it wasn't clear it was a problem:

    Spawner: read() returned 0 (EOF)

    * Caught a signal (proc 3): SIGTERM(15)

    I don't think anything sent a SIGTERM to the server.
    Elliot Ronaghan
    @ronawho
    Any chance you hit a workload manager timelimit?
    Michael Merrill
    @mhmerrill
    do you use SLURM? it could be what Elliot said.
    Michael Merrill
    @mhmerrill
    @dgrichardson there are client-side queries to the server, documentation is found here https://bears-r-us.github.io/arkouda/autoapi/arkouda/client/index.html
    dgrichardson
    @dgrichardson
    I'm using ssh. I took the nodes out of our cluster so I'm the only one using them.
    Elliot Ronaghan
    @ronawho

    Hmm, we've had one other user getting a SIGTERM like this, but we haven't found the root cause yet. Part of the problem is just how vague the error message is.

    Do you have any more information about what was happening when this error occurred? (Time since you launched, was this in the middle of an operation or while the server was idle?)

    Michael Merrill
    @mhmerrill
    @dgrichardson what version of HDF5 are you using? we are working on support for 1.12.x; we currently require 1.10.x. @glitch
    glitch
    @glitch
    and I'm really close to getting it working for HDF5 1.12.x
    Bears-R-Us/arkouda#975
    at least from our unit/functional test perspective
    dgrichardson
    @dgrichardson
    I let arkouda's makefile download whatever version of HDF5 it wanted.
    glitch
    @glitch
    ah, ok, you probably got 1.10.5
    dgrichardson
    @dgrichardson
    @ronawho: The server had been up for less than an hour, and there was one user connected. They were in the process of creating arrays to see what happened if we used too much memory. It wasn't clear if it happened as part of a command or in between commands. All the command messages were complete, so it didn't look like two threads printing at the same time. Do you think the "Spawner" part was related? That looks like it could be a socket getting closed and someone getting a 0-byte read. Who knows how likely that is to be the problem, but it might be easy to just look for socket operations (my guess is arkouda does not have that many, but maybe there are lots in lower layers like chapel and gasnet).
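[Editor's note] The 0-byte read dgrichardson is guessing at is standard socket semantics: once the peer closes its end of the connection, `recv()`/`read()` returns 0 bytes instead of blocking or raising an error. A self-contained sketch, with a plain Python socket pair standing in for the spawner's out-of-band channel:

```python
import socket

# A connected socket pair stands in for the spawner's out-of-band channel.
a, b = socket.socketpair()
b.close()              # the peer goes away, e.g. the remote process exits
data = a.recv(1024)    # returns b'' immediately: the "read() returned 0 (EOF)"
print(repr(data))      # b''
a.close()
```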
    The server has been up for ~18 hours with no problems, but we also haven't been using it as much. It feels kind of load- or memory-use related...
    Elliot Ronaghan
    @ronawho
    Spawner part could be related. In the steady state, gasnet-ibv uses InfiniBand for communication between locales. However, to launch onto the compute nodes and exchange information before IB is set up, gasnet has to use "out-of-band" communication. If you're using the ssh-spawner, launching is done with ssh and out-of-band comm is done with sockets.
    The spawner seeing a read of 0 likely just indicates the launcher on the login node can't communicate with the server anymore, which doesn't really tell us too much. I was starting to wonder if maybe we're hitting an ssh timeout or something somewhere, but that's just a stab in the dark.
    For OOMs our runtime should produce an error message and halt; I'm not sure I'd expect that to produce a SIGTERM. But if you see that error again, any additional info you could provide would be really helpful.
    glitch
    @glitch
    Hi folks, just a brief note, we released Arkouda v2021.12.02 today. You can check out the release notes here: https://github.com/Bears-R-Us/arkouda/releases/tag/v2021.12.02 Cheers!
    Louis Jenkins
    @LouisJenkinsCS
    2021-12-07:11:25:33 [arkouda_server] main Line 266 INFO [Chapel] >>> "readAllHdf" "True 1 0 False False [\"srcIP\"] | []"
    /localdisk/ljenkin4/arkouda//src/GenSymIO.chpl:521: error: halt reached - array index out of bounds
    note: index was 0 but array bounds are 0..-1
    glitch
    @glitch
    Hi @LouisJenkinsCS can you give us some details about what was going on when you ran into this error?
    • What version of Arkouda are you running on the server?
    • What version of HDF5 is installed?
    • What is the internal data structure of the files you were trying to read? (i.e. were they generated by Arkouda or generated somewhere else?)
      Thanks!
    Michael Merrill
    @mhmerrill
    We will be discussing multi-dim array views today on the Arkouda weekly call.
    Hope @bmcdonald3 and @glitch can make the call today along with anyone else who would like to be in on the conversation!
    Louis Jenkins
    @LouisJenkinsCS
    @glitch I responded in thread form btw
    glitch
    @glitch
    @LouisJenkinsCS gotcha, thanks!