    Michael Merrill
    @mhmerrill
    you could tune down the percentage of memory usable by the server
    using the command line parameter
    sorry, this is a finicky thing. This tracking is in place to try to keep the server from crashing, but sometimes after changes/PRs things get a little brittle
    Elliot Ronaghan
    @ronawho
    The other aspect is that Arkouda memory tracking is based on available physical memory, and not all memory is available to Chapel in all configs. Could you try setting export GASNET_MAX_SEGSIZE="0.95/H" to see if that helps? (It should prevent you from getting Chapel OOMs, but you will likely still get the Arkouda client ones.)
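    A minimal sketch of applying that setting from Python, for setups that start the server from a script instead of an interactive shell. Only the variable name GASNET_MAX_SEGSIZE and the value "0.95/H" come from the message above; the helper name gasnet_env and the commented-out launch command are hypothetical.

```python
import os


def gasnet_env(fraction=0.95, base=None):
    """Return a copy of the environment with GASNET_MAX_SEGSIZE set.

    The value uses the "<fraction>/H" form quoted in the message above.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    env = dict(os.environ if base is None else base)
    env["GASNET_MAX_SEGSIZE"] = f"{fraction:g}/H"
    return env


env = gasnet_env(0.95)
print(env["GASNET_MAX_SEGSIZE"])

# Hypothetical launch; the binary path and flags depend on your install:
# import subprocess
# subprocess.run(["./arkouda_server", "-nl", "2"], env=env, check=True)
```

    Building a copy of the environment keeps the change scoped to the launched process, so other programs started from the same session are unaffected.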
    Michael Merrill
    @mhmerrill
    @ronawho (I think) made a memory tracking change recently to make the server go faster but I think we now have an issue
    maybe
    Michael Merrill
    @mhmerrill
    @dgrichardson you could open an issue if you want in the arkouda repo detailing your crash
    please
    dgrichardson
    @dgrichardson
    There seem to be two sets of formatting in the stdout. Most have the timestamp and look like they are from some kind of logging module that formats things. The others have no timestamp and a line number from .chpl and look more like a raw printf. Maybe those are the escapees? I'll make a report if I can get it to happen again. I didn't realize how much stdout there would be, and it has scrolled off my terminal.
    Is there a way from the client to query how much memory is being used?
    Michael Merrill
    @mhmerrill
    yes, ak.get_mem_used()
    or something like that
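    A rough sketch of such a client-side query, assuming the call really is ak.get_mem_used() as suggested above and that it returns a byte count (neither is verified here). The helpers format_bytes and report_memory are hypothetical names, and the host and port in the comments are placeholders.

```python
def format_bytes(n):
    """Render a byte count with a binary unit suffix."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


def report_memory(get_mem_used):
    """Call the supplied query function and return a readable string."""
    return f"server memory in use: {format_bytes(get_mem_used())}"


# Stand-in for the real query so the sketch runs without a server.
print(report_memory(lambda: 3 * 1024**3))

# With a running server (host and port are placeholders):
# import arkouda as ak
# ak.connect("localhost", 5555)
# print(report_memory(ak.get_mem_used))
```

    Passing the query in as a function keeps the formatting logic usable without a live server, which is handy when checking it before a long session.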
    dgrichardson
    @dgrichardson
    I also saw this in the server stdout, although it wasn't clear it was a problem:

    Spawner: read() returned 0 (EOF)

    * Caught a signal (proc 3): SIGTERM(15)

    I don't think anything sent a SIGTERM to the server.
    Elliot Ronaghan
    @ronawho
    Any chance you hit a workload manager timelimit?
    Michael Merrill
    @mhmerrill
    do you use SLURM? it could be what Elliot said.
    Michael Merrill
    @mhmerrill
    @dgrichardson there are client-side queries to the server, documentation is found here https://bears-r-us.github.io/arkouda/autoapi/arkouda/client/index.html
    dgrichardson
    @dgrichardson
    I'm using ssh. I took the nodes out of our cluster so I'm the only one using them.
    Elliot Ronaghan
    @ronawho

    Hmm, we've had one other user getting a SIGTERM like this, but we haven't found the root cause yet. Part of the problem is just how vague the error message is.

    Do you have any more information about what was happening when this error occurred? (Time since you launched, was this in the middle of an operation or while the server was idle?)

    Michael Merrill
    @mhmerrill
    @dgrichardson what version of HDF5 are you using? We are working on support for 1.12.x, but we currently require 1.10.x, @glitch
    glitch
    @glitch
    and I'm really close to getting it working for HDF5 1.12.x
    Bears-R-Us/arkouda#975
    at least from our unit/functional test perspective
    dgrichardson
    @dgrichardson
    I let arkouda's makefile download whatever version of hdf it wanted.
    glitch
    @glitch
    ah, ok, you probably got 1.10.5
    dgrichardson
    @dgrichardson
    @ronawho: The server had been up for less than an hour, and there was one user connected. They were in the process of creating arrays to see what happened if we used too much memory. It wasn't clear if it happened as part of a command or in between commands. All the command messages were complete, so it didn't look like two threads printing at the same time. Do you think the "Spawner" part was related? That looks like it could be a socket getting closed and someone getting a 0 byte read. Who knows how likely that is to be the problem, but it might be easy to just look for socket operations (my guess is arkouda does not have that many, but maybe there are lots in lower layers like chapel and gasnet).
    The server has been up for ~18 hours with no problems, but we also haven't been using it as much. It feels kind of load or memory use related....
    Elliot Ronaghan
    @ronawho
    Spawner part could be related. For communication between locales in the steady-state gasnet-ibv is using InfiniBand. However, to launch onto the compute nodes and exchange information prior to IB being set up gasnet has to use "out-of-band" communication. If you're using the ssh-spawner launching is done with ssh and out-of-band comm is done with sockets.
    The spawner seeing read of 0 likely just indicates the launcher on the login node can't communicate with the server anymore, which doesn't really tell us too much. I was starting to wonder if maybe we're hitting an ssh timeout or something somewhere, but just taking a stab in the dark.
    For OOMs our runtime should produce an error message and halt, I'm not sure I'd expect that to produce a SIGTERM. But if you see that error again, any additional info you could provide would be really helpful
    glitch
    @glitch
    Hi folks, just a brief note, we released Arkouda v2021.12.02 today. You can check out the release notes here: https://github.com/Bears-R-Us/arkouda/releases/tag/v2021.12.02 Cheers!
    Louis Jenkins
    @LouisJenkinsCS
    2021-12-07:11:25:33 [arkouda_server] main Line 266 INFO [Chapel] >>> "readAllHdf" "True 1 0 False False [\"srcIP\"] | []"
    /localdisk/ljenkin4/arkouda//src/GenSymIO.chpl:521: error: halt reached - array index out of bounds
    note: index was 0 but array bounds are 0..-1
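    The output above shows both server output styles that came up earlier in the conversation: a timestamped logger line and an untimestamped line pointing at a .chpl source line. A small sketch for separating them in a captured log follows. The regular expressions are inferred only from the lines quoted in this chat, so other server versions may format things differently; the names classify and raw_chapel_lines are hypothetical.

```python
import re

# Timestamped logger line, e.g. "2021-12-07:11:25:33 [arkouda_server] ..."
LOGGER_RE = re.compile(r"^\d{4}-\d{2}-\d{2}:\d{2}:\d{2}:\d{2} \[")
# Raw runtime line, e.g. "/path/GenSymIO.chpl:521: error: ..."
CHPL_RE = re.compile(r"^(?P<file>\S+\.chpl):(?P<line>\d+): (?P<msg>.*)$")


def classify(line):
    """Return 'logger', 'chapel', or 'other' for one line of server output."""
    text = line.strip()
    if LOGGER_RE.match(text):
        return "logger"
    if CHPL_RE.match(text):
        return "chapel"
    return "other"


def raw_chapel_lines(lines):
    """Keep only the untimestamped lines that point at a .chpl source line."""
    return [ln.strip() for ln in lines if classify(ln) == "chapel"]


# Sample lines taken from the chat above (the logger line is shortened).
sample = [
    '2021-12-07:11:25:33 [arkouda_server] main Line 266 INFO [Chapel] >>> "readAllHdf"',
    "/localdisk/ljenkin4/arkouda//src/GenSymIO.chpl:521: error: halt reached - array index out of bounds",
    "note: index was 0 but array bounds are 0..-1",
]

for entry in sample:
    print(classify(entry), "|", entry)
```

    Filtering for the raw lines makes it easier to find a halt message after the rest of the output has scrolled past.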
    glitch
    @glitch
    Hi @LouisJenkinsCS can you give us some details about what was going on when you ran into this error?
    • What version of Arkouda are you running on the server
    • What version of HDF5 is installed
    • What is the internal data structure of the files you were trying to read? (i.e. were they generated by Arkouda or generated somewhere else?)
      Thanks!
    Michael Merrill
    @mhmerrill
    We will be discussing multi-dim array views today on the Arkouda weekly call.
    Hope @bmcdonald3 and @glitch can make the call today along with anyone else who would like to be in on the conversation!
    Louis Jenkins
    @LouisJenkinsCS
    @glitch I responded in thread form btw
    glitch
    @glitch
    @LouisJenkinsCS gotcha, thanks!
    Michael Merrill
    @mhmerrill
    ATTENTION: We are no longer generating Arkouda Documentation on the Read-The-Docs site, we are only generating documentation onto GitHub Pages
    glitch
    @glitch
    Here's a direct link to the docs: https://bears-r-us.github.io/arkouda/
    You can also find the link in the README.md :thumbsup:
    Michael Merrill
    @mhmerrill
    We are NOT having an Arkouda Weekly Call today
    Happy Holidays!
    Michael Merrill
    @mhmerrill
    sorry I missed having the weekly call today
    Michael Merrill
    @mhmerrill
    Today on the Arkouda Weekly Call we will discuss the code that handles binary operators and its structure. We will also take a minute to discuss the next PRs we are going to merge.
    Michael Merrill
    @mhmerrill
    I am cancelling today's Arkouda Weekly Call
    glitch
    @glitch
    got it
    Zhihui Du
    @zhihuidu
    @glitch The BFS pull request code has been updated based on your suggestions. If you have time, please take a look and let us know your comments. We could also have a Zoom discussion about it if that works for you. Thanks!
    Michael Merrill
    @mhmerrill
    On today's Arkouda weekly call, @bmcdonald3 will talk about his PR #1034 to refactor the binary operators