The project will try to hire a DevOps engineer or a system administrator.
devops makes me think they'll be a k8s person, I hope systems-wg can have a say in that so we can get someone who could maybe accomplish the goals of our WG (like pulsar, remote data, etc.)
As you know, there are currently two main distinct approaches for deploying Galaxy: Ansible/Terraform and Kubernetes. An issue here is that much of the work is duplicated in developing and maintaining these two approaches. We recognize that the administration of our deployments is understaffed, regardless of whether it’s one of the usegalaxy.* servers or AnVIL, for example. We are looking at hiring more help, but the reality is that there will always be more work than available effort. To help this situation, we’d like to see more synchronization between the deployment approaches and consolidation on parts of the deployment stack.
With that, we’d like to see the Admin and Deployment working groups merge into a new Systems working group. The aim here is to find synergies between the deployment approaches. All of you as members of these working groups have complementary experience and expertise that can be leveraged to create more robust solutions. We would like to encourage open-minded and strategic thinking on how Galaxy instances are deployed, starting by creating an inventory of the components that make up Galaxy installations. The next steps are to decide whether a single deployment strategy for any individual component is achievable, realize it, and deploy it across different environments. Even if the number of such components is minimal, engagement within one group will raise the level of understanding among the members of different deployments, strategies, and use cases.
Furthermore, problems encountered during running instances should not be hacked around in deployments, but rather fixed upstream. For this, it is essential that the Systems working group engage closely with the Backend working group. Every problem that is fixed upstream will make your deployments, and those of the community, easier. This is also where long-term strategic thinking comes into play: Galaxy architectural decisions can, and should, be made to ease our deployments.
Let us know what you think or if you have alternative suggestions.
problems encountered during running instances should not be hacked around in deployments, but rather fixed upstream
I really love this in theory, but as an ex-admin, I never had time to do that :(
I filed lots of issues during my term with EU, but we ran SQL on cron to work around a number of issues that I simply wasn't knowledgeable enough to fix in the codebase. As for solutions, is there any chance that some devs could be made available to the admin group? I think we're very heavy on the admin side and light on the people who can upstream solutions.
.*just because pulsar wasn't working for our use case yet.
I guess I could just paste it here as well...
Dear Admin + Deployment people:
We have two main deployment approaches: (1) Ansible-driven and (2) Helm-driven. Each approach has its own responsible group (the first is critical for usegalaxy.*; the second is for AnVIL). Both groups are understaffed in terms of system administration and ideally need two new sysadmins each. The reality is that we do not have the resources to afford two additional sysadmins, considering that we have serious deficiencies in UI and so on. On top of this, the PIs are increasingly removed from fine technical details due to a number of factors related to day-to-day needs.
So, here is what we are asking: can you first develop a sufficiently detailed "map" of what is involved in each type of deployment (usegalaxy.* versus AnVIL)? Once such a map exists, can you find common components or other ways in which the two approaches can be engineered and maintained in parallel as much as technically possible?
Finally, we need to reexamine our previous technical choices and see if they still serve our needs. For example, why do we complicate the tool installation on usegalaxy.org in such a way that we need to have CVMFS for tools? Is that a shortcoming in our backend that makes large-scale deployments harder than they should be? Why is the AnVIL design an "Airbus A380" deployment for a single-user use case? Is it over-architected for that specific use case?
Have you decided on your meeting schedule yet?
Thanks A + B + E
But I would suggest that a better distinction between deployment methods would be (1) Bare-Metal, and (2) Kubernetes. AnVIL is a bit of an odd duck in that it is a single-user Kubernetes deployment, but at heart it is just a Kubernetes deployment. As to which is the Airbus: for comparison, the Ansible playbooks to set up Galaxy contain 4k lines of YAML, while the Galaxy Helm chart contains 2.4k lines of YAML.
As for the question of what to do with the Gitter channels, I would vote for archiving the wg-deployment and wg-admin channels and starting a wg-systems channel just to make it clear for posterity what happened. Although renaming one or the other works for me as well.