Running under HTCondor
The recommended way to start and stop GWCelery on the LIGO Data Grid cluster is using HTCondor DAGMan. GWCelery uses the ezdag library to dynamically generate HTCondor DAG (Directed Acyclic Graph) files that orchestrate all worker processes, the Flask web application, and Flower.
Prerequisites
To run GWCelery under HTCondor, you must:
Install GWCelery with the [condor] extra to include ezdag and the HTCondor Python bindings:
$ uv sync --extra condor
Start the Redis server yourself (e.g. via systemd); see the Redis configuration section for details.
DAG-based Architecture
GWCelery uses HTCondor DAGMan to manage its worker processes. When you run
gwcelery condor submit, it dynamically generates a DAG file along with
individual submit files for each component. These files are written to
.local/state/dag/ in your current directory.
The DAG includes the following nodes (see Design and anatomy of GWCelery for more information on the nodes).
gwcelery-beat: Celery beat scheduler for periodic tasks
gwcelery-worker: Main Celery worker for general tasks
gwcelery-flask: Flask web application
gwcelery-flower: Flower monitoring dashboard
gwcelery-kafka-worker: Worker for IGWN Alert/Kafka message handling
gwcelery-exttrig-worker: Worker for external trigger processing
gwcelery-superevent-worker: Worker for superevent management
gwcelery-embright-worker: Worker for em-bright calculations
gwcelery-highmem-worker: Worker for high-memory tasks
gwcelery-multiproc-worker: Multiprocessing worker
gwcelery-openmp-worker-01 through gwcelery-openmp-worker-15: 15 parallel OpenMP workers for BAYESTAR sky localization
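As a rough illustration of the node list above, the following sketch builds the 25 node names and prints one HTCondor DAG `JOB` line per node. It does not use ezdag and is not how GWCelery generates its files; the helper names and the `<node>.sub` submit-file naming are assumptions made for the example.

```python
# Node names as listed above. The "<node>.sub" submit-file naming is an
# assumption for illustration; the files GWCelery generates may differ.
SINGLE_NODES = [
    "gwcelery-beat",
    "gwcelery-worker",
    "gwcelery-flask",
    "gwcelery-flower",
    "gwcelery-kafka-worker",
    "gwcelery-exttrig-worker",
    "gwcelery-superevent-worker",
    "gwcelery-embright-worker",
    "gwcelery-highmem-worker",
    "gwcelery-multiproc-worker",
]


def openmp_nodes(count=15):
    """Return the zero-padded OpenMP worker node names, 01 through count."""
    return [f"gwcelery-openmp-worker-{i:02d}" for i in range(1, count + 1)]


def dag_lines(nodes):
    """Format one DAG entry per node: JOB <node name> <submit file>."""
    return [f"JOB {name} {name}.sub" for name in nodes]


nodes = SINGLE_NODES + openmp_nodes()
dag_text = "\n".join(dag_lines(nodes))
print(dag_text)
```

Comparing this list against the output of gwcelery condor q is a quick way to spot a node that is missing from the queue.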
HTCondor per-job stdout/stderr files for the local-universe workers are written
to .local/state/dag/log/ with names like
gwcelery-worker-<cluster>-<process>.out and .err. Celery itself writes
its own per-worker log to .local/state/log/<worker>.log via the
--logfile option.
The vanilla-universe workers (gwcelery-openmp-worker-NN and
gwcelery-multiproc-worker) run on execute nodes with a read-only
$HOME, so they cannot write a Celery --logfile directly. Instead,
their Celery output is streamed to stderr, which Condor captures and writes
to .local/state/log/vanilla/<worker>.<cluster>.log on the submit node
in real time. A new per-run file is created each time DAGMan re-submits a
worker; logrotate cleans these up.
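The two log layouts described above can be summarized in a small helper that returns where to look for a given worker's Celery log. This is a sketch under assumptions: the function names are hypothetical, and it assumes that <worker> in the log file names is the DAG node name.

```python
import glob
from pathlib import PurePosixPath


def is_vanilla(worker):
    """True for the workers described above as running in the vanilla universe."""
    return (worker == "gwcelery-multiproc-worker"
            or worker.startswith("gwcelery-openmp-worker-"))


def celery_log_pattern(worker, state_dir=".local/state"):
    """Return a glob pattern matching the Celery log file(s) for a worker."""
    log_dir = PurePosixPath(state_dir) / "log"
    if is_vanilla(worker):
        # One file per (re)submission: <worker>.<cluster>.log
        return str(log_dir / "vanilla" / f"{worker}.*.log")
    # A single file written by Celery's --logfile option
    return str(log_dir / f"{worker}.log")


# List whatever log files currently exist for one vanilla-universe worker.
for path in sorted(glob.glob(celery_log_pattern("gwcelery-openmp-worker-01"))):
    print(path)
```

Because a vanilla-universe worker gets a new file on each re-submission, the pattern can match several files; the name sort used here is lexical, not chronological.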
Starting GWCelery
Navigate to the directory where you want log files and DAG state to be stored:
$ mkdir -p ~/gwcelery && cd ~/gwcelery
Then submit the DAG using the gwcelery command:
$ gwcelery condor submit
This creates the DAG files in .local/state/dag/ and submits the DAGMan job
to HTCondor. The DAGMan job will then submit all the individual worker jobs.
Stopping and Restarting GWCelery
To stop GWCelery, use the gwcelery condor rm command:
$ gwcelery condor rm
To hold (pause) GWCelery jobs, run the condor_hold command:
$ condor_hold -constraint 'JobBatchName == "gwcelery" && JobUniverse != 7'
To release (resume) held jobs, run condor_release:
$ condor_release -constraint 'JobBatchName == "gwcelery" && JobUniverse != 7'
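The JobUniverse != 7 clause in the constraints above excludes HTCondor's scheduler universe (numeric code 7), which is where the DAGMan job itself runs, so only the worker jobs are held or released. The sketch below builds the same constraint and command line; the function names are hypothetical, and the code only prints the command, it does not run it.

```python
import shlex

# HTCondor's numeric code for the scheduler universe, used by DAGMan.
SCHEDULER_UNIVERSE = 7


def worker_constraint(batch_name="gwcelery"):
    """ClassAd expression matching the batch's jobs except DAGMan itself."""
    return (f'JobBatchName == "{batch_name}" '
            f'&& JobUniverse != {SCHEDULER_UNIVERSE}')


def condor_command(tool, batch_name="gwcelery"):
    """Build the argument list for condor_hold or condor_release."""
    return [tool, "-constraint", worker_constraint(batch_name)]


# Print the shell-quoted command lines for inspection.
print(shlex.join(condor_command("condor_hold")))
print(shlex.join(condor_command("condor_release")))
```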
Note that there is normally no need to re-submit GWCelery if the machine is rebooted, because the jobs will persist in the HTCondor queue.
Shortcuts
The following commands are provided as shortcuts for the above operations:
$ gwcelery condor submit # Submit the DAG to HTCondor
$ gwcelery condor rm # Remove all GWCelery jobs
$ gwcelery condor q # Query status of GWCelery jobs
$ gwcelery condor hold # Hold (pause) all GWCelery jobs
$ gwcelery condor release # Release (resume) all held GWCelery jobs
The following command is a shortcut for
gwcelery condor rm; gwcelery condor submit:
$ gwcelery condor resubmit # Remove and re-submit GWCelery
Managing multiple deployments
There should generally be at most one full deployment of GWCelery per GraceDB
server running at one time. The gwcelery condor shortcut command is
designed to protect you from accidentally starting multiple deployments of
GWCelery by inspecting the HTCondor job queue before submitting new jobs. If
you try to start GWCelery a second time on the same host in the same directory,
you will get the following error message:
$ gwcelery condor submit
error: GWCelery jobs are already running in this directory.
First remove existing jobs with "gwcelery condor rm".
To see the status of those jobs, run "gwcelery condor q".
However, there are situations where you may actually want to run multiple instances of GWCelery on the same machine. For example, you may want to run one instance for the ‘production’ GraceDB server and one for the ‘playground’ server. To accomplish this, just start the two instances of gwcelery in different directories. Here is an example:
$ mkdir -p production
$ pushd production
$ CELERY_CONFIG_MODULE=gwcelery.conf.production gwcelery condor submit
$ popd
$ mkdir -p playground
$ pushd playground
$ CELERY_CONFIG_MODULE=gwcelery.conf.playground gwcelery condor submit
$ popd
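The same two submissions could be scripted. The following is a minimal sketch, not part of GWCelery: submission_plan and submit_all are hypothetical helpers that pair each working directory with its CELERY_CONFIG_MODULE value and then run gwcelery condor submit in that directory.

```python
import os
import subprocess
from pathlib import Path

# Directory name -> configuration module, as in the example above.
DEPLOYMENTS = {
    "production": "gwcelery.conf.production",
    "playground": "gwcelery.conf.playground",
}


def submission_plan(base_dir, deployments=DEPLOYMENTS):
    """Return (working directory, extra environment) for each deployment."""
    return [
        (Path(base_dir) / name, {"CELERY_CONFIG_MODULE": module})
        for name, module in deployments.items()
    ]


def submit_all(base_dir, run=subprocess.run):
    """Create each directory and submit one deployment from inside it."""
    for workdir, extra_env in submission_plan(base_dir):
        workdir.mkdir(parents=True, exist_ok=True)
        run(
            ["gwcelery", "condor", "submit"],
            cwd=workdir,
            env={**os.environ, **extra_env},
            check=True,  # stop if a submission fails, e.g. jobs already running
        )
```

Calling submit_all(".") from the parent directory would mirror the shell session above. Keeping each deployment in its own directory matters because the duplicate-deployment check is made per directory.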
Job accounting
When GWCelery is started using gwcelery condor submit or gwcelery condor
resubmit, the HTCondor accounting group is set
based on which GWCelery configuration you are using:
ligo.prod.o3.cbc.pe.bayestar for production
ligo.dev.o3.cbc.pe.bayestar for all other configurations, including playground
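The selection rule can be written as a small function. This is only a sketch of the rule as stated here: the function name is hypothetical, and it assumes the configuration is identified by the CELERY_CONFIG_MODULE value, which may not match how GWCelery makes the decision internally.

```python
PRODUCTION_MODULE = "gwcelery.conf.production"


def accounting_group(config_module):
    """Map a configuration module name to the HTCondor accounting group."""
    if config_module == PRODUCTION_MODULE:
        return "ligo.prod.o3.cbc.pe.bayestar"
    # Every other configuration, including playground, uses the dev group.
    return "ligo.dev.o3.cbc.pe.bayestar"


print(accounting_group("gwcelery.conf.playground"))
```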