    natefoo
    @natefoo:matrix.org
    [m]
    @almahmoud: Did you write anything yet? I should add to that I guess.
    Helena Rasche
    @hexylena:matrix.org
    [m]
    This looks a lot like Sorting Hat and many of the features we implemented in there to be honest
    We had 'auth' for running tools as well, support for training roles going to specific places
    I'm amazed someone is rewriting sorting hat, but cool, there are more people interested in this!
    for meta scheduling, when I left EU we were considering having the job runners report information back to galaxy somehow, e.g. pulsar giving regular updates on queue depth, rather than pulling from influx (and thus tying us to influx)
    I'm not sure if that ever went anywhere though, I've pinged Gianmauro separately so maybe he'll join us here to discuss as well.
    I know Nate Coraor has something similar, EU has something similar, it'd really be great to have ONE thing that meets all the goals rather than all three of us maintaining separate job sorting hats.
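To make the push-based idea above a bit more concrete, here is a purely hypothetical sketch (none of these names exist in Pulsar or Galaxy): runners periodically push a small queue-depth report, and a mapping rule consults the freshest reports instead of pulling metrics from InfluxDB.

```python
# Hypothetical sketch only: runners push queue depth to Galaxy instead of
# Galaxy pulling metrics from InfluxDB. All names here are invented.
import time
from dataclasses import dataclass


@dataclass
class QueueDepthReport:
    destination_id: str   # e.g. a hypothetical "pulsar_eu_01"
    queued_jobs: int      # jobs waiting on the remote cluster
    running_jobs: int
    reported_at: float    # unix timestamp of the report


# In-memory view Galaxy would keep up to date from incoming reports.
latest_reports: dict[str, QueueDepthReport] = {}


def record_report(report: QueueDepthReport) -> None:
    """Called whenever a runner (e.g. Pulsar) pushes an update."""
    latest_reports[report.destination_id] = report


def least_loaded_destination(candidates: list[str], max_age: float = 300.0) -> str | None:
    """A mapping rule could prefer the candidate with the shortest queue,
    ignoring reports older than max_age seconds."""
    now = time.time()
    fresh = [
        latest_reports[d] for d in candidates
        if d in latest_reports and now - latest_reports[d].reported_at < max_age
    ]
    if not fresh:
        return None
    return min(fresh, key=lambda r: r.queued_jobs).destination_id
```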
    Nate Coraor
    @natefoo:matrix.org
    [m]
    Agreed. It started as sort of an overhaul of my big dynamic rule and ended up with something that overlapped both DTDs and Sorting Hat, but my goal was always (when I found the time...) to try to merge that all back with Sorting Hat so we had something mutually maintained and useful.
    Nuwan Goonasekera
    @nuwang
    What are the features that sorting hat has in addition to what DTDs have?
    I had thought that DTDs were the successor to the sorting hat, but is it the other way around?
    natefoo
    @natefoo:matrix.org
    [m]
    Helena Rasche can expand on the features, but DTDs pre-date the Sorting Hat and they have somewhat less overlap I think than the job router stuff I wrote.
    hexylena
    @hexylena:matrix.org
    [m]
    Sorting hat lacks rules. That's a big thing we just never got around to adding since we've got resources to spare. The primary thing sorting hat brought was defining multiple cluster types that declared a bit about how jobs would get mapped to them, and then defining CPU/mem resources in a more generic way
    Since at one point we used a combination of slurm/condor
    Sorting hat itself, outside of the yaml, also brings the training tags, so, another similarity.
    I love the idea of rejecting big data for training (though I suspect there are enough trainings that use realistic datasets that it'd be more work than benefit for EU to do)
    But point is, great that you're working on this, maybe we can all pool our requirements a bit somehow
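For illustration only (this is not Sorting Hat's actual format or code, and the tool ids are made up), declaring CPU/mem generically and translating it per cluster type might look roughly like this for a slurm/condor split:

```python
# Illustrative sketch (not Sorting Hat's real code): a generic cores/mem
# declaration per tool, translated into scheduler-specific submit parameters.
GENERIC_RESOURCES = {
    # hypothetical tool id -> generic resource request
    "hisat2": {"cores": 8, "mem_gb": 32},
    "upload1": {"cores": 1, "mem_gb": 2},
}


def to_slurm(res: dict) -> dict:
    # Slurm expresses the request as a native specification string.
    return {
        "nativeSpecification": f"--ntasks={res['cores']} --mem={res['mem_gb'] * 1024}"
    }


def to_condor(res: dict) -> dict:
    # Condor expresses the same request as classad-style submit parameters.
    return {
        "request_cpus": str(res["cores"]),
        "request_memory": f"{res['mem_gb']}G",
    }


def destination_params(tool_id: str, cluster: str) -> dict:
    res = GENERIC_RESOURCES.get(tool_id, {"cores": 1, "mem_gb": 4})
    return to_slurm(res) if cluster == "slurm" else to_condor(res)
```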
    Nate Coraor
    @natefoo:matrix.org
    [m]
    This has been a tricky thing for me to get working with training.
    Helena Rasche
    @hexylena:matrix.org
    [m]
    Yeah that's another thing. We all run trainings differently
    Nate Coraor
    @natefoo:matrix.org
    [m]
    Also some tools (like kraken) load entire databases that use huge amounts of memory no matter how small the input data is.
    Helena Rasche
    @hexylena:matrix.org
    [m]
    EU does separate clusters/condor classads, US does different run time/memory limits
    And yeah, that's miserable. The memory-invariant tools are where it breaks: for maybe 80% of tools you can cut memory for training uses, but the rest will crash and burn
    Which really requires detailed tool knowledge, and decisions on how you want to segregate different user classes, without a generic answer
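A minimal sketch of that kind of per-job decision, with invented tool names and thresholds rather than anyone's production rule: training users get a reduced memory request unless the tool is known to be memory-invariant.

```python
# Hypothetical sketch of the decision described above: training jobs get
# reduced memory unless the tool is flagged as memory-invariant.
MEMORY_INVARIANT_TOOLS = {"kraken2", "metaphlan"}  # assumed example list


def training_memory_gb(tool_id: str, default_mem_gb: int, is_training_user: bool) -> int:
    if not is_training_user:
        return default_mem_gb
    if tool_id in MEMORY_INVARIANT_TOOLS:
        # Cutting memory here would just make the job crash and burn.
        return default_mem_gb
    # For the ~80% of tools that scale with input size, halve the request.
    return max(1, default_mem_gb // 2)
```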
    Nate Coraor
    @natefoo:matrix.org
    [m]
    yup
    Nuwan Goonasekera
    @nuwang
    Thanks for the details. From what I can see, the fundamental difference in this approach is that tags are the means by which resources are mapped to destinations. The meanings of tags are not understood by the mapper itself - and by and large - neither are resources. They are entirely user defined but the routing itself is done by tag type. The purpose is to allow expressing preference/aversion to destinations, as well as to define relatively flexible pairings of resources. As such, the whole approach needs to be prototyped and validated I think - it kind of makes sense on paper - but whether it works in practice...
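As a rough prototype of the semantics described above (all names invented): the mapper understands only tag types such as require/prefer/reject, never what an individual tag means, and it scores destinations by matching a job's tags against each destination's tags.

```python
# Rough, invented sketch of tag-type-based routing: the mapper knows only the
# tag type (require/prefer/reject); tag meanings are entirely user defined.
from dataclasses import dataclass, field


@dataclass
class Tagged:
    require: set[str] = field(default_factory=set)
    prefer: set[str] = field(default_factory=set)
    reject: set[str] = field(default_factory=set)


def score(entity: Tagged, destination: Tagged) -> float | None:
    """Return None if the destination is unusable, otherwise a preference score."""
    dest_tags = destination.require | destination.prefer
    # Required tags must be present; rejected tags must be absent.
    if not entity.require <= dest_tags:
        return None
    if entity.reject & dest_tags:
        return None
    # Preferred tags just raise the score (expressing preference, not meaning).
    return len(entity.prefer & dest_tags)


def route(entity: Tagged, destinations: dict[str, Tagged]) -> str | None:
    scored = {name: score(entity, d) for name, d in destinations.items()}
    usable = {name: s for name, s in scored.items() if s is not None}
    return max(usable, key=usable.get) if usable else None
```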
    natefoo
    @natefoo:matrix.org
    [m]
    Slides here, all I did was copy the Q1 final update slides and s/2021 Q1/21.09/, so the topics also need to be updated/deleted:
    And we need goals for 22.01
    Nate Coraor
    @natefoo:matrix.org
    [m]
    Also, if someone would like to take the WG lead for the next period that would be awesome. It'd be nice if we could rotate that position.
    Alexandru Mahmoud
    @almahmoud
    I can volunteer for the rotation
    Unless anyone else volunteers :)
    Nate Coraor
    @natefoo:matrix.org
    [m]
    Also I can present this week if you want but I bet you can present your work much better than I can.
    Nate Coraor
    @natefoo:matrix.org
    [m]
    Anyone want to highlight their stuff for this WG in a blog post? I'm not sure that anything I did is exciting enough for that.
    Also please make any changes to the slides that you'd like included by 9:00 PM UTC on Wednesday, at which point I'll dump them into the official deck.
    Nuwan Goonasekera
    @nuwang
    I made one change to the issue about IT containers and k8s user namespace remapping. Even with user namespace remapping, we'll probably have problems because we probably also need id remapping on the filesystem side. So I rephrased the issue as "containers shouldn't be running as root".
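For reference, a minimal sketch of the non-root direction using the Kubernetes Python client; the UID/GID values and image are placeholders, not what any Galaxy deployment actually uses.

```python
# Minimal sketch (assumed UID/GID values, placeholder image): run the
# interactive tool container as a non-root user and set fsGroup so files it
# writes on mounted volumes are group-owned in a way Galaxy can clean up.
from kubernetes import client

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="interactive-tool-example"),
    spec=client.V1PodSpec(
        security_context=client.V1PodSecurityContext(
            run_as_user=10001,   # hypothetical non-root UID
            run_as_group=10001,
            fs_group=10001,      # applied to mounted volumes
        ),
        containers=[
            client.V1Container(
                name="it-container",
                image="example/rstudio:latest",  # placeholder image
                security_context=client.V1SecurityContext(
                    run_as_non_root=True,
                    allow_privilege_escalation=False,
                ),
            )
        ],
    ),
)
```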
    Helena Rasche
    @hexylena:matrix.org
    [m]
    That's going to be a problem for the RStudio container I think.
    rocker-org/rocker-versioned#153 is an open issue about rocker, which we use in there.
    Nuwan Goonasekera
    @nuwang
    How are you handling files created by root? That’s the main issue we’ve run into
    Or more specifically, files created by root, cannot subsequently be cleaned up by Galaxy
    and I think there were some other issues as well. @luke-c-sargent might be able to explain better
    Helena Rasche
    @hexylena:matrix.org
    [m]
    Ahh ok, we don't have that issue. Root runs the processes, users log in as a non-root user (not sure what happens with id mapping for that user, or cleaning; that's Björn's job)
    But no files should be getting persisted
    They should be doing everything through the api
    Why do we mount the workdir into the container again? That seems super issue prone
    Nuwan Goonasekera
    @nuwang
    Is there an alternative to mounting the workdir?
    In general I meant.
    Nate Coraor
    @natefoo:matrix.org
    [m]
    Coexecution pods
    Nuwan Goonasekera
    @nuwang
    Oh right. Since this is through the k8s runner, that hasn’t been explored yet.
    Anton Nekrutenko
    @nekrut
    @all: Here is the logic for merging admin and deployment. This was crafted by Björn and Enis:

    As you know, there are currently two main distinct approaches for deploying Galaxy: Ansible/Terraform and Kubernetes. An issue here is that much of the work is replicated in developing and maintaining these two approaches. We recognize that the administration of our deployments is understaffed, regardless of whether it's one of the usegalaxy.* servers or, for example, AnVIL. We are looking at hiring more help, but the reality is that there will always be more work than available effort. To help this situation, we'd like to see more synchronization between the deployment approaches and consolidation on parts of the deployment stack.

    With that, we’d like to see the Admin and Deployment working groups merge into a new Systems working group. The aim here is to find synergies between the deployment approaches. All of you as members of these working groups have complementary experience and expertise that can be leveraged to create more robust solutions. We would like to encourage open-minded and strategic thinking on how Galaxy instances are deployed, starting by creating an inventory of the components that make up Galaxy installations. Then, deciding if a single deployment strategy for any individual component is achievable, realizing it, and deploying it across different environments. Even if the number of such components is minimal, engagement within one group will raise the level of understanding between the members for different deployments, strategies, and use cases.

    Furthermore, problems encountered during running instances should not be hacked around in deployments, but rather fixed upstream. For this, it is essential that the Systems working group engage closely with the backend working group. Every problem that is fixed upstream will make your deployments and the ones by the community easier. This is also where long-term strategic thinking comes into play, where Galaxy architectural decisions can, and should, be made to ease our deployments.

    Let us know what you think or if you have alternative suggestions.

    Nate Coraor
    @natefoo:matrix.org
    [m]
    I'm going to use the wg-admin channel for systems WG discussion for now, we can decide what to do more formally with the 2 channels from there.