EOS BPs: Auto Failover for Producing Nodes


It is imperative for active BPs to ensure their producing nodes are reliable, and that in the event of failure they can continue to sign blocks from standby nodes without any human intervention.

Currently there are few "ideal" solutions for this. Various issues have been raised on the EOS GitHub to make this process easier for us Block Producers, but until those changes have shipped we must do the best we can with the tools at our disposal.

At Block Matrix we have been battle testing an automated failover solution using keepalived to handle the nodeos process being killed. We now have a lightweight solution in place, which automatically promotes a backup node via the producer API. You can watch this in action here:
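For readers who want something concrete alongside the demo, here is a minimal Python sketch of what the promotion step could look like as a keepalived notify hook. It is not the code from our repository. It assumes nodeos is listening on 127.0.0.1:8888 with producer_api_plugin enabled, and that keepalived calls the script with its usual TYPE, NAME, STATE and PRIORITY arguments. The names NODEOS_URL, action_for_state, producer_endpoint and apply_state are made up for illustration.

```python
#!/usr/bin/env python3
"""Sketch of a keepalived notify hook that pauses or resumes nodeos block production."""
import sys
import urllib.request

# Assumption: local nodeos with producer_api_plugin enabled on the default HTTP port.
NODEOS_URL = "http://127.0.0.1:8888"

# keepalived state -> producer API action. Only the MASTER should be signing blocks.
ACTIONS = {"MASTER": "resume", "BACKUP": "pause", "FAULT": "pause"}


def action_for_state(state):
    """Return 'resume', 'pause', or None when the state is not recognised."""
    return ACTIONS.get(state.upper())


def producer_endpoint(base_url, action):
    """Build the producer API URL for the given action."""
    return f"{base_url.rstrip('/')}/v1/producer/{action}"


def apply_state(state, base_url=NODEOS_URL, timeout=2.0):
    """POST to the producer API for this state. Returns 0 on success, 1 otherwise."""
    action = action_for_state(state)
    if action is None:
        return 1
    request = urllib.request.Request(
        producer_endpoint(base_url, action), data=b"", method="POST"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 0 if response.status < 300 else 1
    except OSError:
        # Covers connection refused, timeouts and HTTP error responses.
        return 1


if __name__ == "__main__":
    # keepalived passes: TYPE NAME STATE PRIORITY
    if len(sys.argv) >= 4:
        sys.exit(apply_state(sys.argv[3]))
```

Pointing the `notify` directive of a `vrrp_instance` at a script like this would pause production on any node that drops to BACKUP or FAULT and resume it on whichever node becomes MASTER.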

We have put together the code for this on our GitHub, along with an explanation of the process and a special addendum for AWS users covering the multicast/unicast issue, which prevents a vanilla keepalived solution from working within their environment.
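To give a feel for the unicast side of this, the sketch below renders the relevant part of a keepalived `vrrp_instance` block from Python. It is illustrative only and not a copy of the AWS addendum: the instance name, interface, router ID, priority and IP addresses are placeholder values, the helper render_unicast_vrrp is hypothetical, and a real configuration needs further settings (notify scripts, health checks and so on) that are left out here.

```python
def render_unicast_vrrp(name, interface, router_id, priority, local_ip, peer_ips):
    """Render a vrrp_instance block that sends VRRP adverts by unicast, not multicast."""
    lines = [
        f"vrrp_instance {name} {{",
        "    state BACKUP",
        f"    interface {interface}",
        f"    virtual_router_id {router_id}",
        f"    priority {priority}",
        "    advert_int 1",
        # Source address for adverts sent from this node.
        f"    unicast_src_ip {local_ip}",
        # Every other node in the failover group is listed explicitly.
        "    unicast_peer {",
    ]
    lines += [f"        {ip}" for ip in peer_ips]
    lines += ["    }", "}"]
    return "\n".join(lines) + "\n"


# Placeholder values for a two-node setup.
print(render_unicast_vrrp("VI_EOS", "eth0", 51, 100, "10.0.0.10", ["10.0.0.11"]))
```

Each node would get the same block with its own `unicast_src_ip` and the other nodes listed under `unicast_peer`.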

We have several improvements planned, catering for cases where nodeos continues to run but stalls or stops signing blocks. Once we have the relevant updates from the EOS dev team we will extend our examples to include them.
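In the meantime, one possible interim check for the "running but stalled" case is to compare the head block time reported by `/v1/chain/get_info` with the current time. The Python sketch below is an assumption about how such a check might look, not something we ship: the 10 second threshold, the local URL and the function names are all hypothetical, and it only detects a chain head that has stopped advancing, not a node that is syncing fine but failing to sign.

```python
import json
import urllib.request
from datetime import datetime, timezone

NODEOS_URL = "http://127.0.0.1:8888"  # assumption: local nodeos HTTP endpoint
MAX_LAG_SECONDS = 10.0                # hypothetical threshold, tune to taste


def head_block_lag(info, now):
    """Seconds between `now` and the head block time in a get_info response."""
    # head_block_time is a UTC timestamp without a zone suffix.
    head = datetime.fromisoformat(info["head_block_time"]).replace(tzinfo=timezone.utc)
    return (now - head).total_seconds()


def is_stalled(info, now, max_lag=MAX_LAG_SECONDS):
    """True when the head block is older than the allowed lag."""
    return head_block_lag(info, now) > max_lag


def check(base_url=NODEOS_URL, timeout=2.0):
    """Return 0 when nodeos looks healthy, 1 when it is unreachable or stalled."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/chain/get_info", timeout=timeout) as response:
            info = json.load(response)
        stalled = is_stalled(info, datetime.now(timezone.utc))
    except (OSError, ValueError, KeyError):
        # Unreachable node or unexpected response: treat as unhealthy.
        return 1
    return 1 if stalled else 0
```

A small wrapper that exits with the value of `check()` could be referenced from a keepalived `vrrp_script`, so that a non-zero exit marks the node as failed and triggers the same failover as a killed process.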

Happy HA'ing to all BPs!


Block Matrix are an EOS block producer candidate, producer name: blockmatrix1