I’m publishing this post a little early, as I expect to be pretty busy on Monday. Before I go into my normal reporting on the detailed coding issues that the BlockTrades team worked on last week and our plans for the upcoming week, I first want to give a brief overview of the hardfork process as it happened last week, since that’s what has been driving our workflow.
Review of hardfork 24 (and an “unplanned” hardfork before that)
On the date of hardfork 24 (Oct 14th), the apps developers were still making hardfork 24 related changes and reporting API problems they were finding with hivemind, while the BlockTrades team was working on fixing those bugs as they were reported (I’ll discuss the bug fixes later in this post). Meanwhile, the top 20 witnesses were standing by, waiting for an “all clear” signal that enough apps were stable that we could safely execute the hardfork by upgrading their nodes to the hardfork 24 code (tagged in the hived node repository as either v1.24.2 or v1.24.3).
Several of the top 20 witnesses had already updated their code to the hardfork 24 code, but this was considered OK by me and other devs, as the HF24 code requires a super-majority of the top 20 witnesses to switch to it to trigger the new HF24 protocol it contains. This allowed some top 20 witnesses to not have to hang around as we waited for the apps/hivemind integration to get to an acceptable level before we triggered the hardfork.
Unfortunately, this led to an unexpected side-effect: the HF24 code contained a protocol change that wasn’t properly guarded against execution before hardfork 24. This means that a HF24 node could produce a block that wouldn’t be accepted by HF23 nodes. We had never seen this bug triggered before, because it could only cause a problem when a HF24 node produced a block and only then under special circumstances.
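To make the failure mode concrete, here is a minimal Python sketch of a guarded versus an unguarded protocol change. Everything in it is hypothetical (`ChainState`, the `uses_new_rule` flag, the producer functions); the real change and its guard live in hived's C++ code and are not reproduced here.

```python
from dataclasses import dataclass

HF24 = 24  # hardfork number, used only for illustration


@dataclass
class ChainState:
    """Toy stand-in for a node's view of which hardfork is active on chain."""
    active_hardfork: int


def produce_unguarded(state: ChainState) -> dict:
    # Bug pattern: the new rule is applied as soon as the new code runs,
    # regardless of whether the hardfork has actually activated.
    return {"uses_new_rule": True}


def produce_guarded(state: ChainState) -> dict:
    # Correct pattern: the new rule only applies once the hardfork is active.
    return {"uses_new_rule": state.active_hardfork >= HF24}


def old_node_accepts(block: dict) -> bool:
    # A node running the old code knows nothing about the new rule,
    # so it rejects any block that relies on it.
    return not block["uses_new_rule"]


before_hardfork = ChainState(active_hardfork=23)
unguarded_ok = old_node_accepts(produce_unguarded(before_hardfork))
guarded_ok = old_node_accepts(produce_guarded(before_hardfork))
```

In this toy model, `unguarded_ok` is `False` and `guarded_ok` is `True`: only the guarded producer stays compatible with old-code nodes before the hardfork activates.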
There have been a few times prior to the hardfork date when we’ve run a HF24 node as a producing node, but in the past, such a node was in the minority, so the worst thing that might have happened if this bug got triggered was that the HF24 node would temporarily fork, then fall back into consensus with the chain when the block it generated wasn’t accepted by the HF23 nodes.
But on the hardfork date, even though we didn’t have a super-majority of top 20 nodes running HF24, we did have a majority running HF24. And a majority is enough to determine how chain forks get resolved. So when one of the HF24 nodes produced a block that was rejected by HF23 nodes, but accepted by HF24 nodes, the fork resolution logic kept the HF24 nodes on a separate unplanned hard fork from the HF23 nodes (effectively splitting the chain into two forks).
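The difference between "majority" and "super-majority" can be sketched as a small classification function. The numbers below (21 scheduled block producers, 17 needed to activate a hardfork) are assumptions made for illustration and should be checked against the hived configuration; the function name and labels are hypothetical.

```python
SCHEDULED_PRODUCERS = 21   # assumption: top 20 plus one rotating backup witness
HARDFORK_THRESHOLD = 17    # assumption: super-majority needed to activate


def classify(new_code_producers: int,
             total: int = SCHEDULED_PRODUCERS,
             threshold: int = HARDFORK_THRESHOLD) -> str:
    """Describe what a given number of upgraded producers implies."""
    if new_code_producers >= threshold:
        # Enough upgraded producers for the new protocol to activate.
        return "hardfork-activates"
    if new_code_producers * 2 > total:
        # Not enough to activate, but enough to win fork resolution:
        # an unguarded protocol change can split the chain in this range.
        return "split-risk"
    # Upgraded nodes are outvoted; a bad block only forks them temporarily.
    return "minority"
```

Under these assumed numbers, `classify(5)` gives `"minority"`, `classify(11)` gives `"split-risk"`, and `classify(17)` gives `"hardfork-activates"`. The middle band is the situation described above.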
The top 20 witnesses quickly realized what was happening, so they decided to execute the hardfork by upgrading the remaining HF23 nodes to HF24, so that all nodes rejoined the majority fork. This also required all the API node operators to upgrade their API nodes to HF24, and all Hive apps to switch to their HF24 versions to use those API nodes.
Because hardfork 24 was executed a little sooner than we would have liked due to the chain split, we still hadn’t resolved all the bugs and performance issues in hivemind and Hive apps at the time of the hardfork. This led to various glitches and slowdowns experienced by apps users over the past few days. But Hive devs have been working hard to resolve the issues as fast as possible and things are already looking much better, and I expect the remaining issues to be resolved quite soon.
One thing for the future: I want to look at ways to detect problems like the chain split before they happen. One possibility could be to set up a special secondary witness node running the new code that signs blocks as a top 20 witness, but where the blocks it produces are only broadcast to one isolated old-code node that would report if it was unable to accept any of the blocks it received from the new node. We can also reduce the chance of this problem occurring in practice by having most of the top 20 witnesses upgrade at very nearly the same time, but that can only get us so far: the ideal solution would be to have a better test method to detect such problems, and I think some variation on my proposal above should work.
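The proposed check can be sketched as a tiny harness: feed every block from the new-code node to an old-code validator and report the ones it rejects. This is only a toy model of the idea; the block format, the operation names, and both functions are hypothetical, and a real setup would involve two actual hived nodes on an isolated network.

```python
from typing import Any, Callable, Dict, Iterable, List

Block = Dict[str, Any]

# Hypothetical set of operations that the old code understands.
OLD_CODE_KNOWN_OPS = {"transfer", "vote", "comment"}


def old_node_accepts(block: Block) -> bool:
    """Toy validator: the old node rejects any operation it does not know."""
    return all(op in OLD_CODE_KNOWN_OPS for op in block["ops"])


def canary_report(blocks: Iterable[Block],
                  accepts: Callable[[Block], bool]) -> List[int]:
    """Return the numbers of the blocks that the old-code node rejected."""
    return [block["num"] for block in blocks if not accepts(block)]


# Blocks as they might come from the isolated new-code producer.
blocks_from_new_node = [
    {"num": 1, "ops": ["transfer"]},
    {"num": 2, "ops": ["vote", "hypothetical_hf24_op"]},
    {"num": 3, "ops": ["comment"]},
]
rejected = canary_report(blocks_from_new_node, old_node_accepts)
```

Any non-empty `rejected` list would flag an unguarded protocol change before the new code is deployed widely; here it contains only block 2.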
Hived work (blockchain node software)
We made several changes to API responses returned by hived, mostly in response to reports from apps developers:
We also did general cleanup to docker, scripts, and configuration files for hived:
We also fixed a problem with the cli-wallet: it was still using the old chain-id after the hardfork, so it couldn’t generate proper transactions. It seems there were very few, if any, tests written previously for the cli-wallet.
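As a rough illustration of why a stale chain-id breaks transactions: in Graphene-style chains, the digest that gets signed is, to my understanding, a SHA-256 hash over the chain-id followed by the serialized transaction. The sketch below assumes that scheme and uses placeholder chain-id values and a placeholder payload; it does not perform real transaction serialization or signing.

```python
import hashlib

# Placeholder 32-byte chain-ids in hex (illustrative values only).
OLD_CHAIN_ID = "00" * 32
NEW_CHAIN_ID = "beeab0de" + "00" * 28


def signing_digest(chain_id_hex: str, serialized_tx: bytes) -> bytes:
    """Assumed scheme: sha256(chain_id || serialized transaction)."""
    return hashlib.sha256(bytes.fromhex(chain_id_hex) + serialized_tx).digest()


tx_bytes = b"placeholder-serialized-transaction"
old_digest = signing_digest(OLD_CHAIN_ID, tx_bytes)
new_digest = signing_digest(NEW_CHAIN_ID, tx_bytes)
```

Because the two digests differ, a signature made by a wallet that still uses the old chain-id would not verify on nodes that expect the new one.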
The cli-wallet fix was necessary for exchanges, so we tagged a new version v1.24.4 that includes this change (and the other fixes above). Note that none of the above changes are needed by consensus witnesses, which is why witnesses are primarily still running 1.24.2. These changes are only needed by API nodes and exchanges.
We started a full replay yesterday to check all the above changes (this takes around 18 hours). We don’t expect any issues, since the changes were designed as non-consensus changes, only changes to the API, but better safe than sorry.
Hivemind (2nd layer social media microservice)
Most of our time was still spent on hivemind, but we made very good progress.
Our improvements to hivemind can be separated into two categories: bug fixes (wrong or missing data in API responses) and slow queries that result in unacceptable response times. Our bug fixes are usually made in response to reports from apps devs, but slow queries are usually detected by observing the performance of the postgres servers used by our API node with the pghero tool. We’ve found pghero to be very handy for finding which SQL queries are consuming the most time to complete (it functions as a profiler). It’s also useful for finding duplicate and unnecessary indexes which can impact performance.
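To give a flavor of the duplicate-index check, here is a simplified sketch of one common heuristic: an index is likely redundant if its column list is a leading prefix of another index on the same table. This is not pghero's actual implementation, and it ignores unique, partial, and expression indexes; the table and index names are made up.

```python
from typing import List, Tuple

# (index name, table name, ordered column list)
Index = Tuple[str, str, Tuple[str, ...]]


def find_redundant_indexes(indexes: List[Index]) -> List[Tuple[str, str]]:
    """Return (redundant_index, covering_index) pairs."""
    redundant = []
    for name, table, cols in indexes:
        for other_name, other_table, other_cols in indexes:
            if name == other_name or table != other_table:
                continue
            is_prefix = other_cols[:len(cols)] == cols
            # For exact duplicates, report only one of the two names.
            strictly_shorter = len(cols) < len(other_cols)
            same_and_later = cols == other_cols and name > other_name
            if is_prefix and (strictly_shorter or same_and_later):
                redundant.append((name, other_name))
                break
    return redundant


example_indexes = [
    ("ix_author", "posts", ("author",)),
    ("ix_author_permlink", "posts", ("author", "permlink")),
    ("ix_created", "posts", ("created_at",)),
]
redundant = find_redundant_indexes(example_indexes)
```

In this example, `ix_author` is reported as covered by `ix_author_permlink`. Dropping such an index usually saves write overhead, though it is worth confirming with the query planner first.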
Here’s a list of improvements and bug fixes we made to hivemind:
Decentralized list changes were also merged into the develop branch, after updates (the code had diverged a lot since these changes were made, so it required a decent amount of manual merging and testing): https://gitlab.syncad.com/hive/hivemind/-/merge_requests/275
With the latest optimizations (the last big one was merge request 310, made on Saturday), hivemind seems to be working fairly well, but we still have a few more optimizations to make, and we also need to re-enable reputation updating (this was temporarily disabled because it needed further optimization to avoid unacceptable sync slowdowns that caused excess loading on hivemind nodes when receiving real-world traffic).
We have an optimized version of the reputation sync algorithm on a local dev system, but we’ll be testing it further on one of our experimental API servers with real-world API traffic before making it part of the official build.
We are currently running a full hivemind sync (this has generally been a 4-day process) to see if there are any problems, as we’ve been skipping this process for the past week and doing incremental upgrades to our existing hivemind database in order to test new changes quickly.
Experimenting with optimum API node configuration
Another thing we've been doing this week is experimenting with configuring our API node for optimal performance. I've been sharing some of what we've discovered in the API node operators channel, but I'll make a full report on our findings here later, once we've completed that work.
Condenser (open-source code for hive.blog)
We made some more changes to hive.blog and its wallet related to changes in hardfork 24 (mostly to the wallet, as we had already made several updates to condenser itself), especially removing usages of the get_state function, which is being obsoleted in favor of more efficient API calls.
One fix we need to deploy soon is a change so that condenser correctly updates the vote button state after a user votes:
What’s next for the week?
We have a few more optimizations to make to hivemind, and I expect we’ll get a few more bug reports, plus we still need to deploy the final reputation calculation code. But I expect that work to slow down in the next couple of days, although the full hivemind sync test likely won’t complete until near the end of the week (we already observed one slowdown today in the full sync with the latest changes that needs analysis).
We’ll also be testing condenser and the wallet and looking for fixes and optimizations we can make.