    Kiran Telukunta
    @telukir_gitlab
    ok .. then I was looking at old bugs of OpenStack .. it's great if it is working .. yes, I was looking for the CoFest ..
    I have not yet tried it in OpenStack but would like to .. as you were saying, I need to give the OpenStack configuration .. please let me know which information is required for CloudLaunch
    Alexandru Mahmoud
    @almahmoud
    Where were you looking, out of curiosity? If there are outdated issues on CB that imply things don't work, we can/should close them
    The info that we need is Auth URL, region name & ID, and zone name & ID as applicable
    If there are different zones for different resources (e.g. we have zone-r1, zone-r2 for compute and nova for storage/networking), we need that info to make a mapping
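
    A minimal sketch of how that information might be collected before sending it over (the field names and the auth URL are illustrative assumptions, not an exact CloudLaunch schema; the zone names are the ones from the example above):

      # Hypothetical summary of the OpenStack details CloudLaunch needs
      auth_url: https://keystone.example.org:5000/v3   # Keystone auth URL (placeholder)
      region:
        name: RegionOne                                # region name (placeholder)
        id: RegionOne                                  # region ID (placeholder)
      zones:
        compute: [zone-r1, zone-r2]                    # compute zones (example names from above)
        storage_networking: nova                       # storage/networking zone (example name from above)
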
    Alexandru Mahmoud
    @almahmoud
    [image attachment: image.png]
    For reference/example from our use
    Essentially, if you provide that information, we can try a launch the same as we do for our OS clouds, after which we can see if any problems arise from your specific configuration and address them as they come up
    If you're having problems finding any, lmk and I can help you locate them
    Nate Coraor
    @natefoo
    Just a heads up, someone from this group might be able to provide a good answer to this q: https://help.galaxyproject.org/t/galaxy-local-installing/4384
    Alexandru Mahmoud
    @almahmoud
    Thanks Nate for pointing that out! I just answered it
    Nate Coraor
    @natefoo
    Cool, thanks!
    Alexandru Mahmoud
    @almahmoud

    New PR for probes (including adding probes for workflow handlers, and adding startup probes). Particularly interested in any feedback on the proposed default times: https://github.com/galaxyproject/galaxy-helm/pull/163#issuecomment-698637826 @nuwang @luke-c-sargent @pcm32

    TLDR:

      startupProbe:
        initialDelaySeconds: 30
        periodSeconds: 5
        failureThreshold: 720
        timeoutSeconds: 5
      readinessProbe:
        periodSeconds: 10
        failureThreshold: 12
        timeoutSeconds: 5
      livenessProbe:
        periodSeconds: 10
        failureThreshold: 30
        timeoutSeconds: 5

    i.e. (roughly failureThreshold × periodSeconds, plus the initial delay for startup):
    Max startup time: 30 + 720 × 5 s ≈ 1 hour
    Max unresponsiveness before the container is taken out of circulation for new traffic: 12 × 10 s = ~2 min
    Max unresponsiveness before the container is restarted: 30 × 10 s = ~5 min

    Dannon
    @dannon
    Total startup time seems really high to me; if this is over ~15 minutes something is very wrong, right?
    Alexandru Mahmoud
    @almahmoud
    I wouldn't put it as low as 15. If the machine is slow and/or has a slow or more distant connection (so CVMFS is slower), it can take longer than what we would expect running on commercial clouds. Even just on jetstream it's historically been slower than AWS/GCP. There's also the possibility that users are running with different settings that require a longer startup (e.g. enabling conda) and/or running with a slower filesystem, etc... I originally thought 30 would be good, so that it's around double what we consider normal-ish startup time, but then figured making it an hour is safer for any scenario. I do think an hour is very high, though, so I'd be inclined to lower it; I'm just not sure by how much.
    Dannon
    @dannon
    Hrmm. Maybe? Just thinking as a user, if this is taking more than 10 minutes to start I'm probably killing it and doing something else.
    I'm certainly not walking away for an hour to check back later.
    Alexandru Mahmoud
    @almahmoud
    For single-user setups, I definitely agree. But admins could use this to deploy on arbitrary infrastructure with arbitrary settings. Our slowest startup on GKE for 20.09 is 12 min, so 15 min would only give 3 min leeway from that, which might not be enough even for jetstream clusters (although I haven't tried the new image on jetstream yet, so could be wrong, but before it was generally at least 5 minutes slower on jetstream iirc)
    Conda in particular can take a really long time, especially if running on a shared filesystem with a server that is outside the cluster. Ultimately it's all customizable at launch, though, so we could lower it to something that is enough for our most common scenarios and expect people who are setting up sophisticated things to change this themselves if they're doing something that drastically affects startup time?
    I would be comfortable with 20 min, if 30 seems too much, after testing on jetstream and nectar and making sure it's under 20 min there
    And maybe add a note in the README explaining these probes so it's easier for people to decide what to set them to at launch without having to read k8s docs to understand what each probe does
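
    A minimal sketch of what such a launch-time override could look like (the keys mirror the snippet in the PR above but aren't guaranteed to match the chart's exact values schema; the file would be passed with -f at install/upgrade time):

      # my-values.yaml (hypothetical): lower the max startup window to ~20 min
      startupProbe:
        initialDelaySeconds: 30
        periodSeconds: 5
        failureThreshold: 240    # 240 × 5 s = 20 min instead of ~1 hour
        timeoutSeconds: 5
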
    Dannon
    @dannon
    I guess my main concern is the situation where someone waits an hour for it to just go dead, and gets billed for who knows what :)
    Alexandru Mahmoud
    @almahmoud
    But it will just cause a restart, so the cost is the same whether the container just continues or restarts. Also, if the process fails for any reason (e.g. an error in Galaxy startup), the container will restart anyway. This is only determining how long it should wait before forcing a restart. The cost is associated with the cluster; a running, idling, or restarting Galaxy within that cluster won't affect cost
    Dannon
    @dannon
    I guess most of the time there's no explicit and simple communication like "hey, this took too long, we're starting over"
    Given that, just waiting an hour isn't any different from restart-looping for an hour, I guess, and probably better, so no worries.
    Alexandru Mahmoud
    @almahmoud
    I guess I'm more worried about having it too low and causing unnecessary restarts than having it too high, since if the process did not error out but is still not responding, I can think of more scenarios where it needs to wait longer than where it needs to be forced to restart... but this is also just speculation so idk heh
    I do think an hour is too long, and if someone has a setup that slow they should change it themselves.
    I feel 30 might be safe for most scenarios
    20 should be good for all default CVMFS configs I think
    Dannon
    @dannon
    Probably right? And, yeah, it's fine. I was thinking of a world where we'd bail as soon as it's 'highly likely' it's not happening, so we could tell the user. But since that's not really going to happen anyway, I don't know if it's worth being super aggressive.
    Nuwan Goonasekera
    @nuwang
    Even with a 5-minute timeout, Galaxy will eventually start after many retries, right? So I think catering to the average expected case may be better. This may be a bit idealistic, but I feel like we need to enforce the fix on the CVMFS caching side, rather than making the startup probes cater to such a wide range of startup speeds.
    Eventually, we expect a <30-second startup time with the caching, right? It's just that we haven't got a mechanism to pre-generate the cache on CVMFS. That seems to be the issue we need to work on. @mvdbeek Is there a simple command that we can use, like run.sh generate-tool-cache /path/to/cache, to pregenerate the dogpile cache, so that administrators can do this more easily? If so, we could maybe even have an init container that does this, in the event that the CVMFS cache doesn't exist.
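
    A rough sketch of that init-container idea (purely hypothetical: the run.sh generate-tool-cache command is exactly what is being asked about and may not exist, and the image and volume names are placeholders):

      # Hypothetical initContainer pre-generating the tool cache before Galaxy starts
      initContainers:
        - name: pregenerate-tool-cache
          image: galaxy-image:tag                                # placeholder: same image as the main Galaxy container
          command: ["sh", "-c", "./run.sh generate-tool-cache /path/to/cache"]   # command taken from the question above; may not exist
          volumeMounts:
            - name: galaxy-data                                  # placeholder volume that holds the cache path
              mountPath: /path/to/cache
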
    Alexandru Mahmoud
    @almahmoud
    What's the advantage of having multiple retries, though? That would make the startup take even longer, whereas a bigger leeway doesn't affect the faster setups and doesn't hurt the slower ones. When we have the caching we can definitely lower it, because that would be many times faster, but ideally we'd merge this by next week to have the new version for ASHG. Not sure if the caching is going to happen in that timeframe, so we're working with an 8-12 min range on GKE for now, and I'm guessing a few minutes faster on AWS and a few minutes slower on jetstream/nectar. Can potentially benchmark all of them tomorrow
    Marius van den Beek
    @mvdbeek
    @nuwang we'll have the cache on CVMFS next week; test.galaxyproject.org already has the cache on CVMFS
    it's also not dogpile anymore
    I think 40 seconds total startup for Galaxy should be a realistic target
    Praveen Kumar Tiwari
    @Praveen1177_gitlab
    hi all
    how r u
    I get the following errors ----

    Validating charts/influxdb2/latest with helm lint:

    • Running helm lint charts/influxdb2/latest
      ==> Linting /root/zeus-charts/charts/influxdb2/latest
      [ERROR] templates/: template: influxdb2/templates/ingress.yaml:1:14:
      executing "influxdb2/templates/ingress.yaml" at <.Values.ingress.enabled>: nil pointer evaluating interface {}.enabled

    Error: 1 chart(s) linted, 1 chart(s) failed
    🚫 helm lint charts/influxdb2/latest failed.

    • Running helm template in charts/influxdb2/latest
      Error: template: influxdb2/templates/ingress.yaml:1:14:
      executing "influxdb2/templates/ingress.yaml" at <.Values.ingress.enabled>: nil pointer evaluating interface {}.enabled
      🚫 helm template --values values.yaml /root/zeus-charts/charts/influxdb2/latest failed.

    🚫 Helm lint failed. Not pushing

    please suggest what I can do
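
    (For reference, the usual cause of this nil-pointer lint failure is that the chart's values.yaml has no ingress block at all, so .Values.ingress.enabled dereferences a nil map. A minimal sketch of the two common fixes, assuming an otherwise standard chart layout:)

      # charts/influxdb2/latest/values.yaml: declare the key the template dereferences
      ingress:
        enabled: false

      # or guard the lookup in templates/ingress.yaml so a missing block is tolerated
      {{- if and .Values.ingress .Values.ingress.enabled }}
      ...
      {{- end }}
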
    Praveen Kumar Tiwari
    @Praveen1177_gitlab
    waiting
    Pablo Moreno
    @pcm32
    Hi guys! Has anyone tested the Galaxy on k8s deployment using spot instances? I would expect it to recover, and jobs with retries should also help deal with this, but I haven't actually tried it.
    Nuwan Goonasekera
    @nuwang
    @pcm32 Haven’t tried this yet. Would be great to hear how it goes
    Pablo Moreno
    @pcm32

    Hi @almahmoud, is this still the correct way of importing the helm repo?

    helm repo add galaxy-gvl https://raw.githubusercontent.com/cloudve/helm-charts/master

    thanks!

    Pablo Moreno
    @pcm32
    yeah, seems to work, please ignore me.. had a different glitch before.
    Pablo Moreno
    @pcm32

    Trying a run with an older version of the helm chart (3.4.2) and CVMFS disabled, I get some hard-coded CVMFS requirements that make the process fail:

    No handlers could be found for logger "__main__"
    Traceback (most recent call last):
      File "/galaxy/server/scripts/galaxy-main", line 299, in <module>
        main()
      File "/galaxy/server/scripts/galaxy-main", line 295, in main
        app_loop(args, log)
      File "/galaxy/server/scripts/galaxy-main", line 142, in app_loop
        attach_to_pools=args.attach_to_pool,
      File "/galaxy/server/scripts/galaxy-main", line 108, in load_galaxy_app
        **kwds
      File "/galaxy/server/lib/galaxy/app.py", line 115, in __init__
        self._configure_tool_data_tables(from_shed_config=False)
      File "/galaxy/server/lib/galaxy/config/__init__.py", line 1077, in _configure_tool_data_tables
        config_filename=self.config.tool_data_table_config_path)
      File "/galaxy/server/lib/galaxy/tools/data/__init__.py", line 80, in __init__
        self.load_from_config_file(single_config_filename, self.tool_data_path, from_shed_config=False)
      File "/galaxy/server/lib/galaxy/tools/data/__init__.py", line 117, in load_from_config_file
        tree = util.parse_xml(filename)
      File "/galaxy/server/lib/galaxy/util/__init__.py", line 236, in parse_xml
        root = tree.parse(fname, parser=ElementTree.XMLParser(target=DoctypeSafeCallbackTarget()))
      File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 647, in parse
        source = open(source, "rb")
    IOError: [Errno 2] No such file or directory: '/cvmfs/main.galaxyproject.org/config/shed_tool_data_table_conf.xml'

    This is against a container tag fixed to that particular timepoint of the helm chart. So we should check that disabling CVMFS isn't letting any env vars or the like slip through...

    Pablo Moreno
    @pcm32
    mmm… this is affecting all releases of the helm chart marked for 20.01
    Pablo Moreno
    @pcm32
    Ok, the default galaxy.yml added in the config map has (or had back then) a few CVMFS paths that get applied even when CVMFS is disabled, i.e. they are hard-coded there:
    galaxy.yml:
    ----
    galaxy:
      admin_users: you-email-user@your-email-domain.edu,other-admin@email.co.uk
      build_sites_config_file: /galaxy/server/config/build_sites.yml
      builds_file_path: '/cvmfs/data.galaxyproject.org/managed/location/builds.txt'
    there are a few of them… will open an issue.
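
    A minimal sketch of the kind of galaxy.yml override that works around this until the chart is fixed (the replacement paths are assumptions to verify against your image, not the chart's actual defaults):

      # Hypothetical overrides when CVMFS is disabled; each path must exist locally in the container
      galaxy:
        builds_file_path: /galaxy/server/tool-data/shared/ucsc/builds.txt
        tool_data_table_config_path: /galaxy/server/config/tool_data_table_conf.xml.sample
        shed_tool_data_table_config: /galaxy/server/config/shed_tool_data_table_conf.xml   # must point somewhere local/writable, not /cvmfs
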