Upcoming Gridcoin 3.7.0.0 changes

in #gridcoin, 6 years ago (edited)


I know things are looking stale on the Gridcoin development side, but we have been working to add stability to the wallet and are now in the final test phase of Gridcoin 3.7.0.0, which will be a mandatory upgrade. There is no set release date yet, as there are two more changes I want to include and test to further improve the forking situation when reorganizing. When released, we will set a V9 trigger height roughly 2 weeks beyond the release date to give exchanges time to react. After that the fork fixes will kick in and we can start removing the obsolete tally code.

While many, many more features and fixes have been done, I will try to cover some of the larger ones. Please see the closed GitHub pull requests for a hard-to-read but complete list.

TL;DR:

  • The chain should now fork less often.
  • Windows clients will hopefully freeze less.
  • Nodes should sync faster.
  • The wallet should use a little less CPU.

Fork improvements

We have had a lot of problems with wallets disagreeing on rewards and taking different routes on the chain. That is, different forks. We believe that the reason for this is that wallets have different views on how much each user is owed due to the way the nodes collect historical rewards and magnitudes. V9 blocks introduced in version 3.7.0.0 change this with two important fixes:

  • Rewards are now validated when connecting the block to the chain instead of when the block is received, to avoid blocks that are ahead of the current chain tip not matching the tallies.
  • Reward tallies are now done in a more deterministic and synchronized way. Previously they started out in sync but were easily disturbed.

These changes solve two very fork-happy and hard-to-debug issues. The caveat is that they may not solve all fork issues, just the ones we have managed to track down.


Improved syncing

Gridcoin has a mechanism which allows clients to request blocks in bursts to improve the synchronization speed. Roughly speaking, a node sending block metadata remembers the last block it announced to the syncing node, making that block a sentry. Whenever the syncing client requests the sentry block, another burst of block metadata is sent along with it.

This repeats until the syncer stops requesting blocks or until the node does not have any more blocks to send. Note: the image is not entirely accurate, as the communication is done with hashes, not heights. The basic flow still applies.

A while ago this block burst size was changed from 500 to 1000 blocks, which caused the burst to exceed the maximum allowed transmission size, so the syncing node never got information about the sentry block. You would see that as hiccups in the chain synchronization: your node would receive a burst of blocks, pause for a long time, receive the next burst and so on. The pause bug is fixed, and pauses now only occur while the remote end loads its blocks from disk.
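
To make the sentry mechanism and the old pause bug concrete, here is a minimal Python sketch. It is not the wallet's code: ServingNode, sync, max_message_items and the "re-announce after a stall" recovery are all hypothetical names and simplifications, and the size limit is modelled as plain truncation of the announcement, which is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ServingNode:
    """Hypothetical node that announces block hashes in bursts."""
    chain: list              # block hashes, oldest first
    burst_size: int          # hashes the node tries to announce per burst
    max_message_items: int   # assumed cap on hashes that fit in one message
    sentry: dict = field(default_factory=dict)  # peer -> sentry hash

    def announce(self, peer, start_index):
        burst = self.chain[start_index:start_index + self.burst_size]
        if burst:
            # The last hash of the burst becomes the sentry for this peer.
            self.sentry[peer] = burst[-1]
        # Assumption: anything beyond the cap never reaches the peer.
        return burst[:self.max_message_items]

    def on_block_request(self, peer, block_hash):
        # Requesting the sentry block triggers the next burst.
        if self.sentry.get(peer) == block_hash:
            return self.announce(peer, self.chain.index(block_hash) + 1)
        return []


def sync(node, peer="syncer"):
    """Fetch the whole chain; return (received hashes, number of stalls)."""
    received, stalls = [], 0
    pending = node.announce(peer, 0)
    while pending:
        next_burst = []
        for block_hash in pending:
            received.append(block_hash)
            next_burst = node.on_block_request(peer, block_hash) or next_burst
        if not next_burst and len(received) < len(node.chain):
            # The sentry was never announced to us, so nothing triggered a
            # new burst. Model the long pause as a stall followed by a
            # fresh announcement from the first missing block.
            stalls += 1
            next_burst = node.announce(peer, len(received))
        pending = next_burst
    return received, stalls
```

With burst_size equal to max_message_items the sync runs without stalls. With a burst_size of twice the cap, every burst except the last loses its sentry and the syncer stalls each time, which mirrors the receive, pause, receive pattern described above.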


Deadlocks

This is going to be a bit technical but I'll try to explain it as well as I can.

In computer programming it is often beneficial to do things in parallel to avoid making the program feel sluggish. For example, you do not want the user interface to freeze while the program is processing a burst of received blocks. The easiest way to solve this is to use threads. This also has the benefit of utilizing more cores on the CPU. However, using threads is not free.

Since there are now multiple data producers and consumers, you have to make sure that they are not manipulating data simultaneously. The way you solve this is by using locks. Each thread which wants to read or write shared data has to wait for a lock to be released before it can acquire the lock itself. One single lock won't bring down a program on its own. The devious behavior comes when you have multiple locks and acquire them in different order.

In Gridcoin we use a lot of locks to protect various resources. In one recent issue, two particular locks, cs_Main and cs_vSend, were acquired by different threads in different order. Even though the risk of those threads deadlocking is very small, the order has now been changed so that the deadlock problem in this case is eliminated.

If we are right about this deadlock it would explain why Windows wallets are more prone to running into this issue than Linux wallets. The reason is that Windows wallets hold the cs_Main lock while performing NeuralNet operations before they also take the cs_vSend lock. Since the NeuralNet operation can take several seconds they are way more likely to deadlock. The way this manifests itself is a user interface freeze.

Note that since we have not been able to reproduce the Windows freezes in a debugger, it is very likely that the problem remains. Only time will tell. Threading and locks are tricky business in a code base of this size, so we cannot guarantee that all the deadlocks are gone, but it should at least be better now.


Crashes

Windows users have been plagued by silent shutdowns for a while now. We tracked down a very likely cause: the NeuralNet could start scraping the BOINC statistics data while a scraping operation was already in progress. The first thing the scraper does is delete the currently downloaded statistics files. In this case the files were obliterated right under the feet of the first scrape operation, causing it to lie down and die. We now block concurrent stat syncing and gracefully handle file I/O errors.
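
As a rough illustration of "block concurrent stat syncing and gracefully handle file I/O errors", here is a small Python sketch. It is not the NeuralNet code: sync_stats, the download callback and the status strings are hypothetical, and the guard is simply a lock that is tried without waiting.

```python
import threading

# Held for the duration of one statistics sync.
_sync_lock = threading.Lock()


def sync_stats(download):
    """Run download() unless a sync is already running; return a status."""
    if not _sync_lock.acquire(blocking=False):
        # Another scrape is in progress: do not touch its files.
        return "skipped"
    try:
        download()
        return "ok"
    except OSError as err:
        # File I/O problems are reported instead of ending the process.
        return f"failed: {err}"
    finally:
        _sync_lock.release()
```

A second call that arrives while the first is still running returns "skipped" right away, so the first scrape keeps its downloaded files. A missing or unreadable file turns into a "failed: ..." status rather than a crash.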

There is also an included fix for an issue which caused the wallet to crash when the user issued a backup from the menu.


Performance improvements

A lot has been done to improve the overall performance of the wallet. Existing code has been tweaked and optimized while some obsolete code has been removed, opening up for further improvements.

Data structures

Following changes in the Bitcoin base we have changed the underlying data structure holding blocks to a more efficient one. This will consume around 1-1.5% more memory but every time we access a block in the existing chain we save a good amount of CPU cycles. This will especially affect chain loading but the improvement ripples throughout the entire code base.

To put some numbers to it, after syncing the chain on a Raspberry Pi 3 the old implementation spent 46% of the total execution time querying the chain for blocks. This is now down to 13%.

Checkpoints

We previously had mechanisms for relaying checkpoints between nodes. This was not needed, as checkpoints are hard coded in the client, something that is good enough for their purpose. By removing the relaying we could greatly simplify the checkpoint validations, which will cause the nodes to use a lot less CPU when processing blocks. This is especially noticeable when synchronizing the chain, something which should be a lot faster now.
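
A hard-coded checkpoint check boils down to a table lookup. The Python sketch below shows the general idea under stated assumptions: CHECKPOINTS and passes_checkpoints are hypothetical names, and the heights and hashes are placeholders, not real Gridcoin checkpoints.

```python
# Placeholder heights and hashes, not real Gridcoin checkpoints.
CHECKPOINTS = {
    0: "genesis-hash-placeholder",
    1000: "hash-at-1000-placeholder",
}


def passes_checkpoints(height, block_hash):
    """Accept a block unless it contradicts a hard-coded checkpoint."""
    expected = CHECKPOINTS.get(height)
    # Heights without a checkpoint are always accepted by this check.
    return expected is None or expected == block_hash
```

Because the table ships with the client, there is nothing to relay or verify between nodes, and the check costs one dictionary lookup per block.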

String conversions

The code responsible for converting floating point values to and from strings has been greatly simplified and has gained a large performance boost. In the same sync test on the Pi 3 as above, we made 22 million calls to cdbl (round a double contained in a string), which took 18% of the total execution time. Unfortunately I didn't keep the numbers from after the change, but it is much, much better now.
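
For readers unfamiliar with cdbl, the Python sketch below shows what "round a double contained in a string" means. Only the function name comes from the text above. The parameter names and the choice to return 0.0 for unparsable input are assumptions, and the wallet's real implementation is C++.

```python
def cdbl(text, places):
    """Parse a number from a string and round it to `places` decimals."""
    try:
        return round(float(text), places)
    except ValueError:
        # Assumption: unparsable input yields 0.0 rather than an error.
        return 0.0
```

A helper like this looks trivial, but at 22 million calls during a sync even a small per-call cost adds up, which is why simplifying it paid off.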

Post 3.7.0.0

Many of you are probably wondering where the heck the rebranding changes have gone. Don't worry, we intentionally postponed the UI changes in favor of focusing only on stability. The rebranding will be done in 3.7.1.0 as a leisure update.


Thanks so much for your hard work on this ravonn. I imagine this has become a bit of a thankless job but those of us who have been with GRC for a while have seen the vast improvements you and the dev team have made and really appreciate it!

Thanks :) Quite frankly, most of the work is done by @tomasbrod, @ifoggz, @huppdiwupp and @thecharlatan these days. I just chip in when I can.

Poifekt!!

Thanks for all the hard work, if it's anything to go by the Testnet Client is way more stable and has been running for me for days at a time without any issues at all.

Sometimes a 'boring' testnet wallet is a great thing ..

boinc
Courtesy of @joshoeah

Is there an API for programmers on this project?

We don't have a documented API at the moment, but the program flow and data structures roughly follow that of Bitcoin and Blackcoin. When we start refactoring the Gridcoin specific parts we have the opportunity to improve the documentation and make it more friendly to new developers.

Excellent. I'm also having silent crashes and it always was a bummer after coming back to my laptop that the logo disappears when mousing over it, signalling a crash.

Oh by the way, is there also a fix under way for the wallet backup crash? I know you can manually copy the wallet in roaming but just polling.

Also: "The wallet should use a little less CPU." -> more BOINC power yay ;-)

Wallet backup crash is also fixed. Thanks! I'll add it.

The checkpoint relaying was inherited from another coin and it was meant for checkpoints that were not hard-coded into the client, but issued by an administrator. In Gridcoin no one was sending these checkpoints, so the code was unused and only slowing things down. Good job Ravon!

I was overwhelmed by school for the past two months thus I could not test things that I wanted. The changeset is running fairly stable in testnet mode, but no one was poking it to provoke issues. Releasing prematurely will be dangerous.
So that is why it is taking so long to release the branding update. Maybe we should consider backporting the branding and less important things and do 3.6.4.0.

Maybe we should try to do a stress test run this week. Let me know on Slack if you have any tests in mind.

I know there isn't a set release date but can you ballpark it? The Branding Changes would be nice sooner but I totally understand the delay.

Without the final PR we can probably release right away. With it we should let the nodes run for the rest of the week to test.

Fair enough. Thanks.

Always good to see the development ongoing! Keep up the good work.

Excellent update, looking forward to the 3.7.0.0 release. Keep up the excellent work, dev team!

Awesome news, Ravonn!

Upvoted, resteemed, shared on twitter and telegram too. Thank you to you and all the devs for your hard work. Also thank you for the communication!!!

I read in this about further stress testing, and I see Mercosity mention testnet along with others. I have (had) been in #gridcoin-testnet for the past year now and it seems rather dead. At one time Caraka gave me 50k testnet GRC and I sent it out in 10/100/250/420/666 GRC increments, spending 1-2 hours a day just generating traffic to help generate data into the testnet blockchain, but the person never returned the favor. After a year of being told by someone that they would ask Rob to mint some more, I want to say I spent the whole time wasting my CPU and memory resources trying to help development of this client with idle testnet wallets with 0 coin. You would think you guys want people. No, let me rephrase that... You would think that you guys, the Gridcoin devs, would appreciate and need/want as many testers as possible and not ignore people who try to volunteer resources. Maybe I am wrong, maybe it's just my opinion that you guys have your dev clique and grc8 clique of the irc.freenode.net #gridcoin world, @ and +, and piss on everybody else.
Please, before releasing 3.7.0.0, review /src/net.cpp and check my submitted issue and the security issues of having the user "customminer"'s 3 DNS entries hard coded in along with node.gridcoin.us and the dead DNS entry for seed node #2, gridcoin.asia. This issue with the gridcoin.co.uk nodes is a security issue that needs to be addressed. Seed DNS entry #3 is the same IP as entry #4, and then entry #5 is their round robin DNS (1 DNS entry, multiple server IP addresses), so why are users forced by the compiled release from the heads of Gridcoin to be on his nodes? Can that user manipulate CPID data while hosting more than 50% of the Gridcoin wallet network under their thumb? Can they force a blockchain fork once they host more than 50% of the user load of the Gridcoin network?
I could speculate why, but maybe we just need some network-savvy people added to the dev team who do not code but care about auditing network/DNS related issues such as security. As the wise NeuralMiner pointed out, somebody controlling a round robin DNS entry our users use is a security threat and risk.
Also, let's get a new default gridcoinresearch.conf packaged with the Windows installer, as the node list is far outdated and has subdomains with no DNS records such as sepulcher.wha.la, typh00n.net, grcmagnitude.com, grc.z9.de, gridcoin.coleman-it.com etc., and let's keep the wiki up to date so that new users and you developers stay current too.

We do need and appreciate help. However, we use Slack as it's way more tailored towards development. Check out #testnet for more activity, including fund transfers to those who need.

Regarding the DNS issue, yes it will be addressed but in 3.7.1.0. The current staging version is too far into the test phase to add more non-stability changes.