I have a script that has been in the “Still Running” state for 2 days. It is only supposed to run for at most 30 seconds. It looks like the script had issues because my snapcast integration became unavailable at the time it ran.
My question is: how can I monitor or alert on these situations? The stuck script now causes other automations and scripts that call it to fail, and I do not get notifications when those fail to run either.
Here is the script that was stuck running for 2 days.
Here is an automation that did not run. Is there a way to know that it didn’t run? I think changing the mode to parallel may be worth it, but then I would never know if an instance was stuck in a “Still Running” state.
If I were faced with this situation, I would focus on making my scripts/automations more fault-tolerant (instead of devising an automation to monitor other automations/scripts).
But if monitoring is what you want, automations have a current attribute when they’re busy executing actions. The attribute disappears when the automation is no longer busy. Perhaps you can use that to detect when the automation is “stuck” (i.e. the attribute persists for longer than some threshold you choose).
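As a rough sketch of that idea, something like the following could raise a notification when the current attribute stays above zero for too long. The entity ID script.example_snapcast_script and the 5-minute threshold are placeholders I made up, so adjust them to your setup. It assumes current is either missing or a number, and it uses the older trigger/service keys (newer releases also accept triggers/actions).

```yaml
automation:
  - alias: "Alert when a script appears stuck"
    trigger:
      - platform: template
        # Treat a missing attribute as 0 so the template is false when idle.
        value_template: "{{ state_attr('script.example_snapcast_script', 'current') | int(0) > 0 }}"
        # Only fire if the template stays true for this long (placeholder threshold).
        for: "00:05:00"
    action:
      - service: persistent_notification.create
        data:
          title: "Script may be stuck"
          message: "script.example_snapcast_script has been running for more than 5 minutes."
```

The template trigger only fires when the template changes from false to true, so a script that is already stuck when you add this will not be caught until it runs again.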
I’d be interested in making it more fault-tolerant. How would I go about that? I am not sure if the sensor being unavailable is what caused its stuck state. I left everything the way it was so I can still investigate.
If you think the failure might be due to the sensor being unavailable, you can modify the script so it first confirms the sensor’s state is not unavailable before executing the action.
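A minimal sketch of such a guard, assuming the check goes at the top of the script's sequence. The entity IDs and the volume_set call are placeholders I invented, since the original script isn't shown here; swap in your own sensor and actions.

```yaml
script:
  example_snapcast_script:
    mode: single
    sequence:
      # Stop here if the entity is unavailable or unknown; the steps below never run.
      - condition: template
        value_template: "{{ states('sensor.example_snapcast_sensor') not in ['unavailable', 'unknown'] }}"
      # Placeholder action: replace with whatever the real script does.
      - service: media_player.volume_set
        target:
          entity_id: media_player.example_snapcast_client
        data:
          volume_level: 0.3
```

When a condition in a script sequence evaluates to false, the script simply ends at that point, so it should finish right away instead of getting stuck on a later step.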