First stop the bleeding, then decide whether the design itself needs surgery.
I've seen failover scripts fail dozens of times, and in my experience stopping the bleeding first is the way to go. Once things work again, we can move on to tweaking and finding a more sophisticated solution.
Thanks for troubleshooting and finding the core issue... ;)