Monitoring worker processes


#1

Is there a built-in way to monitor workers? I'm not talking about running them under supervisor or similar (which would probably work, though I haven't tried it yet), but rather something like a ping that answers with a pong if the workers are alive and crunching. For now I could probably live with running a simple ps aux | grep dramatiq on the worker instances, but I'm more worried about workers that hang: the process stays up but stops processing messages. I went through this when I was using celery and it gave me nightmares, so I'm curious to know whether there is some mechanism to achieve this and, if not, whether it would be possible to implement one. Thanks!


#2

One option is the Prometheus middleware that ships with dramatiq.

It creates a webserver on each dramatiq instance that exports metrics about messages processed, errors, etc.
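
Wiring it up looks roughly like this (a minimal sketch assuming a RabbitMQ broker; depending on your dramatiq version the middleware may already be in the default stack, and you'll need the prometheus-client dependency installed):

```python
# Minimal sketch, assuming a RabbitMQ broker; the actor is just an example.
import dramatiq
from dramatiq.brokers.rabbitmq import RabbitmqBroker
from dramatiq.middleware.prometheus import Prometheus

broker = RabbitmqBroker(url="amqp://guest:guest@127.0.0.1:5672")
broker.add_middleware(Prometheus())  # serves metrics over HTTP (127.0.0.1:9191 by default)
dramatiq.set_broker(broker)


@dramatiq.actor
def crunch(n):
    # Each processed message is reflected in counters like dramatiq_messages_total.
    print(n * n)
```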

You can use that with a Prometheus server to monitor the number of messages being processed or to get a count of message errors. It's probably best to couple it with something watching the queue itself, like https://github.com/kbudde/rabbitmq_exporter. You could then alert on a high-water mark for queue depth, or alert if dramatiq_messages_total flatlines.

An alert rule might look like this if we combine those:
sum(rate(dramatiq_messages_total[5m])) == 0 and sum(rate(rabbitmq_queue_messages_delivered_total[5m])) > 0


#3

I already have my own rule system (using lambdas and the RabbitMQ HTTP API) to monitor queue size, and the problem is that I'd like to avoid deploying more instances for Prometheus/Grafana. Is the only way to achieve this to call http://127.0.0.1:9191 and manually parse the returned text to find the number of messages?
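
If that's the way, I guess the quick-and-dirty version would be something like this (just a rough sketch on my side; the metric name comes from your reply and the label format is a guess):

```python
# Rough sketch: scrape the local metrics endpoint once and sum up
# dramatiq_messages_total by hand, without running a Prometheus server.
import urllib.request


def processed_message_count(url="http://127.0.0.1:9191", timeout=5):
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")

    total = 0.0
    for line in body.splitlines():
        # Sample lines look roughly like:
        #   dramatiq_messages_total{queue_name="default",actor_name="crunch"} 42.0
        if line.startswith("dramatiq_messages_total{") or line.startswith("dramatiq_messages_total "):
            total += float(line.rsplit(" ", 1)[1])
    return total


if __name__ == "__main__":
    print(processed_message_count())
```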

Another thing I have in place: the LB pings a tiny web server on my machines, which in turn pings the celery workers, and if the ping fails the LB shuts down the Docker container and brings a new one up. That's why, in my case, a local web server would be more than enough.

Still, having a web server just for this feels like a bit of overhead to me.


#4

As long as you’re running the server on a background thread and as long as you’re not making requests to it too often (say once a minute or once every 5 minutes or w/e), then the overhead should be absolutely tiny because the vast majority of the time the thread will be asleep waiting on accept(2).

It would help if you could describe what the ideal functionality would be for your use case here. I’ve thought about supporting things like sending a signal to the main process to have it dump stack traces from all the running threads to stderr/logging, but I doubt that would be helpful in your case.
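
Just to illustrate that last idea, it would boil down to something like this with nothing but the standard library (a sketch, not something dramatiq does for you today):

```python
# Sketch only: once this runs in the main process, `kill -USR1 <pid>` makes
# Python dump the current stack trace of every thread to stderr.
# (faulthandler.register is not available on Windows.)
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```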


#5

What I was thinking of is more like asking the main process directly whether all of its subprocesses are alive and responsive. To me, if a worker is free to respond, that's enough to say "OK, it's free to crunch tasks." Let's pretend one process is deadlocked or unresponsive and can't process any more tasks: I'd like to be able to see that problem and react (restarting supervisor, or maybe the entire container).

This could be a command like ping_workers that returns the list of known workers and their state (OK|UNRESPONSIVE), with a timeout. I get that what counts as "responsive" can vary from person to person; that's just one quick idea.

The problem with parsing the Prometheus data is that it would take a lot more logic to achieve this: I'd have to store the statistics somewhere and compare them between queries to get an accurate picture of what the workers are doing, while the only thing I'm interested in is whether a worker is responsive and running, so that we can restart it if it isn't.
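
To make that concrete, here's roughly the shape of the check I have in mind; none of this is dramatiq API, it's a purely hypothetical sketch using plain multiprocessing pipes:

```python
# Purely hypothetical sketch of the "ping_workers" idea (not dramatiq API).
# The main process pings each worker process and reports OK/UNRESPONSIVE
# depending on whether it answers within a timeout.
import multiprocessing as mp


def worker(conn):
    while True:
        if conn.recv() == "ping":   # a real worker would also be consuming tasks
            conn.send("pong")


def ping_workers(conns, timeout=5.0):
    states = {}
    for name, conn in conns.items():
        conn.send("ping")
        alive = conn.poll(timeout) and conn.recv() == "pong"
        states[name] = "OK" if alive else "UNRESPONSIVE"
    return states


if __name__ == "__main__":
    conns = {}
    for i in range(3):
        parent_end, child_end = mp.Pipe()
        mp.Process(target=worker, args=(child_end,), daemon=True).start()
        conns[f"worker-{i}"] = parent_end

    print(ping_workers(conns))  # e.g. {'worker-0': 'OK', 'worker-1': 'OK', 'worker-2': 'OK'}
```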

Not sure if I was able to explain myself well, @bogdan.