Dear all,
I am trying to launch Slurm jobs with Pulsar using the CLI interface (instead of DRMAA).
The Pulsar playbook below runs without problems, but the analyses still run outside of Slurm (bypassing the scheduler).
What did I miss?
# Put your Galaxy server's fully qualified domain name (FQDN) (or the FQDN of the RabbitMQ server) above.
pulsar_root: /opt/pulsar

pulsar_pip_install: true
pulsar_pycurl_ssl_library: openssl
pulsar_systemd: true
pulsar_systemd_runner: webless

pulsar_create_user: false
pulsar_user: {name: pulsar, shell: /bin/bash}

pulsar_optional_dependencies:
  - pyOpenSSL
  # For remote transfers initiated on the Pulsar end rather than the Galaxy end
  - pycurl
  # drmaa required if connecting to an external DRM using it.
  - drmaa
  # kombu needed if using a message queue
  - kombu
  # amqp 5.0.3 changes behaviour in an unexpected way, pin for now.
  - 'amqp==5.0.2'
  # psutil and pylockfile are optional dependencies but can make Pulsar
  # more robust in small ways.
  - psutil

pulsar_yaml_config:
  conda_auto_init: True
  conda_auto_install: True
  staging_directory: "{{ pulsar_staging_dir }}"
  persistence_directory: "{{ pulsar_persistence_dir }}"
  tool_dependency_dir: "{{ pulsar_dependencies_dir }}"
  # The following are the settings for the pulsar server to contact the message queue with related timeouts etc.
  message_queue_url: "pyamqp://galaxy_au:{{ rabbitmq_password_galaxy_au }}@{{ galaxy_server_url }}:5671//pulsar/galaxy_au?ssl=1"
  managers:
    _default_:
      type: queued_cli
      job_plugin: slurm
      native_specification: "-p batch --tasks=1 --cpus-per-task=2 --mem-per-cpu=1000 -t 10:00"
  min_polling_interval: 0.5
  amqp_publish_retry: True
  amqp_publish_retry_max_retries: 5
  amqp_publish_retry_interval_start: 10
  amqp_publish_retry_interval_step: 10
  amqp_publish_retry_interval_max: 60

# We also need to create the dependency resolver file so pulsar knows how to
# find and install dependencies for the tools we ask it to run. The simplest
# method which covers 99% of the use cases is to use conda auto installs similar
# to how Galaxy works.
pulsar_dependency_resolvers:
  - name: conda
    args:
      - name: auto_init
        value: true
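With type: queued_cli and job_plugin: slurm, Pulsar submits each job via sbatch; if analyses bypass the scheduler, the manager block is likely not being picked up. A few quick checks on the Pulsar host may help narrow it down (a sketch: the rendered-config location and the pulsar service user are assumptions based on the playbook variables above, so adjust to your install):

```shell
# 1. sbatch must be on the PATH of the user the Pulsar service runs as
#    (with the webless systemd runner above, that is the 'pulsar' user).
command -v sbatch || echo "sbatch not found -> Pulsar cannot reach Slurm"

# 2. The rendered Pulsar config (app.yml somewhere under pulsar_root -- an
#    assumption, adjust) must contain the manager block with correct nesting;
#    if 'managers:' is mis-nested, Pulsar falls back to its default local queue.
grep -rnA3 '_default_' /opt/pulsar 2>/dev/null || echo "manager block not found"

# 3. While an analysis is running, the job should appear in the Slurm queue:
squeue -u pulsar 2>/dev/null || echo "squeue unavailable"
```

If the second check finds nothing, a YAML indentation problem in the rendered config is a likely explanation for jobs silently running locally.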
galaxy.util.task DEBUG 2021-06-27 06:55:53,852 [pN:main.web.1,p:21179,w:1,m:0,tN:HistoryAuditTablePruneTask] Executed periodic task HistoryAuditTablePruneTask (73.014 ms)
Gracefully killing worker 1 (pid: 21179)…
My apologies for the lengthy-ish post, but I am at my wit's end and need some sage advice from y'all on what to try next.
The error:
A 500 redirect error that first crashes at the Galaxy level, then cascades down the entirety of our services (i.e. Rancher goes down, then everything goes down).
Our (the CPT) services have been intermittently dropping over the past few weeks due to space issues on our head node. I've been resolving it by finding old, large files/dirs on the head node, removing them, and rebooting. Our services come back up, and everything works until the space gets eaten up again. Note the df -h output below: it shows a fair number of GB still available, far less than I would expect, yet reports the filesystem as 100% used.
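For what it's worth, a filesystem that df -h reports as 100% used while GBs still seem free is classically either out of inodes or holding space for files that were deleted while a process still had them open. Two standard checks (plain df/lsof, nothing Galaxy-specific):

```shell
# 1. Inode usage: an IUse% of 100% blocks new file creation regardless of
#    how many free GB 'df -h' shows.
df -i /

# 2. Space held by files deleted while a process still has them open
#    (e.g. a log file removed under a running service); it is only freed
#    when the process closes the file or restarts.
lsof +L1 2>/dev/null | head -n 20 || true
```

If the second check lists large deleted files, restarting (or signalling) the owning service releases the space without a full reboot.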
A way to initiate the problem:
We have a head node that works with 4 other nodes to provide compute for Galaxy (and some other web services). When I run a docker container (a new web service) on another node, it works for about 10 minutes, then everything starts cascading into failure. Since the most recent problem I've just shut that web service down. Perhaps its log files are growing large.
A new problem emerges:
We have a problem with Condor not connecting to all of our compute nodes, likely due to the space issues on ONE of our compute nodes (see the compute-2 df -h readout). For the past few days I've had to manually run the condor_master command on each node, which brings it back up, but then it eventually drops again; and now it's not connecting at all (and I'm NOT a Condor wizard :grimacing:).
What I have done:
I'm starting to pull my hair out since this issue keeps creeping back. And we're getting different issues now, which I'm not certain are separate or related.
See thread for readouts (THANKS IN ADVANCE FOR READING AND THINKING ABOUT THIS <3)
Hi folks, I have been trying to install several tools (in this example, bam_to_bigwig v0.2.0) in our Galaxy instance (20.01 on CentOS 7, built from source). The tool starts to install package dependencies, but then gets to a point where it dumps the following:
install_environment.STDOUT DEBUG 2021-06-28 16:48:05,682 [p:36595,w:1,m:0] [Thread-9] b''
install_environment.STDOUT DEBUG 2021-06-28 16:48:05,685 [p:36595,w:1,m:0] [Thread-9] b''
install_environment.STDOUT DEBUG 2021-06-28 16:48:05,686 [p:36595,w:1,m:0] [Thread-9] b''
This can only be resolved by restarting Galaxy. While ideally we would like to move to the latest version of Galaxy using Ansible (and then not have to deal with these kinds of tool dependency problems), we're not in a position to upgrade yet.
While it seems very similar to this issue (galaxyproject/galaxy#9328), the fixes for it were already implemented, so I don't know where the problem lies.
Any thoughts on how to attack this? Happy to provide any details from log files. Thanks all
Run git pull if you're on release_20.01, or check your Galaxy installation to see whether the changes shown in https://github.com/galaxyproject/galaxy/pull/9424/files have been applied.
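A sketch of that check as a small helper (check_fix is a hypothetical name; merge commits don't always carry the PR number, so a negative result still warrants comparing the files manually):

```shell
# Hypothetical helper: report the current branch and look for commits that
# mention the fix's PR number in a Galaxy checkout.
check_fix() {
  repo="$1"
  # Expect release_20.01 if you are tracking that release branch.
  echo "branch: $(git -C "$repo" rev-parse --abbrev-ref HEAD)"
  git -C "$repo" log --oneline | grep -i '#9424' \
    || echo "no commit mentioning #9424 -- compare files against the PR diff manually"
}
```

Usage: check_fix /srv/galaxy/server (example path, an assumption). If no commit turns up, diff the files listed at https://github.com/galaxyproject/galaxy/pull/9424/files against your checkout directly.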