MasterNode Alternative, Forking Solution, Efficient Self-Healing Network with simple Math


I’ve been thinking recently about forking and the network reliability of Gridcoin. Without too much politics, here is a simple mathematical algorithm that should do the job and completely eliminate the need for master nodes. It’s a very simple approach, but a very effective one. The goals:


  • Solve forking once and for all.
  • Provide much better network reliability.
  • Decentralize the network even further.
  • Protect against intentional forking abuse.
  • A full alternative to master nodes, with a simple and elegant approach.

Scoring.

  • Assume a node is connected to N other nodes.
  • Measure the latency of each, multiply it by a factor derived from the number of rejected blocks, and assign that value as a score.
  • Rank the nodes by score, best first (lowest latency and fewest rejected blocks get rank 1). A short sketch follows this list.
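
A minimal sketch of the scoring and ranking step in Python (the field names latency_ms and rejected_blocks are just placeholders for whatever the client actually measures):

from dataclasses import dataclass

@dataclass
class Peer:
    addr: str
    latency_ms: float     # measured round-trip latency to this peer
    rejected_blocks: int  # blocks from this peer that we rejected

def score(peer: Peer) -> float:
    # Lower is better: latency multiplied by a rejected-block factor.
    # The +1 keeps a clean peer (zero rejections) from collapsing the score to zero.
    return peer.latency_ms * (1 + peer.rejected_blocks)

def rank(peers: list[Peer]) -> list[Peer]:
    # Best (lowest) score first, so index 0 is "grade one".
    return sorted(peers, key=score)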

Network Level Routing.

  • The node ranked first (score 1: best latency, minimal rejected blocks) gets most of the traffic (30%).
  • Grade two (score 2 from the top) gets 28%.
  • Grade three gets 25%.
    And so on, proportionally (or logarithmically, which is even better).
  • The forking one will still get 0.1%, just to check if it stopped forking and to give it a chance to climb back. (A sketch of this split follows the list.)
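
A sketch of the traffic split above, reusing the ranked list from the previous sketch. The 30/28/25% steps could be hard-coded; here a simple decaying weight stands in, and the floor plays the role of the 0.1% kept for a forking node:

def traffic_shares(ranked_peers, floor=0.001):
    # Decaying weights: the best-ranked peer gets the largest share,
    # each lower rank gets progressively less.
    weights = [1.0 / (i + 1) for i in range(len(ranked_peers))]
    total = sum(weights)
    shares = [max(w / total, floor) for w in weights]
    # Re-normalise so the shares still sum to 1 after applying the floor.
    s = sum(shares)
    return {p.addr: share / s for p, share in zip(ranked_peers, shares)}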

Every 10 minutes, re-sort the list, assigning new scores and priorities based on the results from the previous 10 minutes.


What’s the point here, and how does it solve forking?

Assume a node has started to fork (or is simply unreliable): it drops to the bottom automatically, but still gets a very small amount of traffic. Why? Because we are testing whether it got out of the fork in the meantime. If it is no longer forking and is performing well, it will climb back up the ranking. The same applies to the node that performs best but, for any reason, starts to fork: it will automatically go down.

If a good node becomes bad, it drops to the bottom where it can’t cause trouble, but it is still tested with a very small amount of blocks, just to check whether “the behavior has improved”.

No forking, a stable network, automated testing, and much more.

A self-testing, self-healing system that provides the best possible reliability to all users on its own.
The algorithm is as simple as it could be. This would not only solve many problems, it would also look nice in magazines ;)

Pseudo code:

t = now + 10 minutes

while (true) {
    1. Measure rejected blocks and latency per node over the previous 10-minute window.
    2. Score each node (latency multiplied by the rejected-block factor).
    3. Rank the nodes by score.
    4. Sort the list accordingly.
    while (now < t) {
        5. Distribute P2P traffic according to the ranked proportions.
    }
    t = now + 10 minutes
}

Each t interval not only uses the best possible nodes and prevents forking, it also gathers the sample for generating the next round of distribution.



The ranking factor could simply be calculated as 1/√score to generate a distribution percentage that matches the quality/distribution ratio automatically and proportionally.
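
For example (a sketch; whether the square root is the right dampening factor is exactly the kind of thing a simulation should confirm):

import math

def shares_from_scores(scores):
    # Lower score is better; weight each node by 1 / sqrt(score),
    # then normalise so the shares sum to 1.
    weights = [1.0 / math.sqrt(s) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Scores of 50, 100 and 400 give shares of roughly 49%, 34% and 17%.
print(shares_from_scores([50, 100, 400]))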


Looking forward to opinions. (Please think twice; sometimes very simple things can be the answer to very complicated problems.)

Comments:

When a node rejects a block that is invalid from its point of view, it adds some “black points” to the connection. When a connection reaches 100 points (configurable), it will be temporarily banned. So that’s already implemented.
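
Roughly, the behaviour described above looks like this (an illustrative sketch only, not the actual Gridcoin implementation; the threshold and ban duration stand in for the configurable values mentioned in this thread):

import time

BAN_THRESHOLD = 100          # "black points" before a ban (configurable)
BAN_SECONDS = 24 * 60 * 60   # the temporary 24-hour ban mentioned later in the thread

class Connection:
    def __init__(self, addr):
        self.addr = addr
        self.misbehavior = 0
        self.banned_until = 0.0

    def misbehaving(self, points):
        # Add points whenever the peer sends us something we reject.
        self.misbehavior += points
        if self.misbehavior >= BAN_THRESHOLD:
            self.banned_until = time.time() + BAN_SECONDS

    def is_banned(self):
        return time.time() < self.banned_until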

In your proposal, good nodes get assigned more traffic. But they do not need the traffic, as they are already synced and only need a few transactions here and there.

Blocks that we call forked are still valid blocks, but they do not belong to the current node’s best chain. Nodes generally accept these blocks into a (virtual) “fork” chain for later and observe how this fork develops. The “fork” chain may be inferior at one point, but later blocks might make it superior.

Whenever a block is accepted on some chain, a fork resolver is invoked, which compares the chain trust score of both and initiates a reorganization to the superior chain (disconnecting all blocks from one and connecting the others).
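
In sketch form, that resolver does something like the following (illustrative only; the real chain-trust and reorganize logic live in the actual codebase):

def resolve_fork(active_chain, fork_chain, reorganize):
    # Invoked whenever a block is connected to either chain: keep the
    # chain with the higher cumulative trust, reorganizing if needed.
    if fork_chain.trust > active_chain.trust:
        # Disconnect the active chain's blocks back to the common
        # ancestor, then connect the blocks of the superior fork.
        reorganize(from_chain=active_chain, to_chain=fork_chain)
        return fork_chain
    return active_chain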

Nodes can start forks for three reasons:

  1. occasional glitch
  2. rejects a block
  3. is evil

The occasional glitches are common, most resolve within 3 blocks, and can’t be avoided. Option 3 is not very common either. What happens often is that node(s) reject a block because they falsely identify it as invalid.

Think about it. When an invalid block is encountered, the node must not connect it, but the show still has to go on. The only valid way to keep the chain going in this case is to “fork” the chain.

What causes problems are those false positives. A node thinks a block is invalid, so it decides (given luck from the PoS kernel) to “do better”.

In your proposal, good nodes get assigned more traffic. But they do not need the traffic, as they are already synced and only need a few transactions here and there.

Actually not; it just adds more randomization from the “client” point of view, let’s call it that. More connections in a single-threaded blocking FIFO add latency, so this leads to randomization towards the node from the client’s point of view (client is not the right term, but let’s use it for the demonstration). The overall result should be a fair distribution, not accumulating traffic. That’s the important point, and it addresses many other questions, such as forming regional forks.

In regards to what’s already implemented, it’s no secret this is heavily inspired by the current fixes. This is just an attempt to push things a bit further, while basically not changing the direction.

However, you raised some good points that will require me to think a bit before replying. Thanks for the good points!

Won't each node simply connect to the geographically closest nodes, since latency depends on distance? Could this maybe lead to more regional forks instead of forks dependent on other factors?

Geographic location is definitely a factor affecting latency, although not as much as it used to be. Nowadays ISPs trade traffic deficits to reconcile, and it frequently happens that traffic is routed via the least-cost rather than the least-latency route. In most cases you are right, but it's not a rule.

On the other hand, the current blocking FIFO topology causes additional overhead, especially while forking (a rejected block is an operation that does not count towards the score) and on nodes that handle many connections, which leads to a much better "fair distribution" of traffic. A bad block not only increases latency as the first factor, it also increases the number of rejected blocks as the second one, leading to a lower score. With better latency, forks are theoretically less likely to happen, as there is less chance that two nodes produce a block at the same time. The blocking FIFO principle (the topology we currently have) causes much higher latency on nodes that handle far too many connections. So the final answer is yes, it should lead to finding a geographically closest node, if it's not overcorrected. Theoretically it should not cause regional forks, unless the process becomes multithreaded.

Um, how do you figure that? First there are the seed nodes, since Gridcoin is NOT decentralized; check /src/net.cpp around lines 113-118, now that Barton26 removed customminers' BS 3 entries and left his single one, where he can do round robin with 1. Are you aware of how BGP works, and routing tables, and peering? The Gridcoin client is not, and it connects where it's told to connect, or where you tell it to. I thought up a certificate system when we were talking about paying for full nodes based upon how much of the network the node contributes to, based upon connection count. I think it was an interview with Peercoin we had on a hangout that I stole the idea from, if I remember correctly. Something like node.gridcoin.us would delegate to the end-user nodes that have connected to it and been confirmed on the right chain, and then, as if it were load balancing, the Gridcoin grid/spider web of clients tells the wallet to connect to a specific node. I bet geo-IP location could be added in, so you connect to the closest node with the least latency to you, versus say someone in Texas, USA connecting to the UK or Asia when there is a node in Dallas. This would be a neat implementation: node.gridcoin.us would be the only node that knows which other nodes are full nodes, and they could not get DDoSed, because the client connects to node.gridcoin.us, says HEY I WANT TO BE A FULL NODE, it does checks and balances (block hashes), and then adds that node and its location to a master node list only node.gridcoin.us would know, versus a wiki page.

Sounds good, I'm eager to hear what the devs' opinions are...

Me too; hopefully they will find time to consider this and form an opinion on the logic. In the overflow of information it's sometimes difficult to think through every proposal, plus my English is not so great. But I'm really looking forward to hearing what they think of it.

The forking one will still get 0.1% just to check if it stopped forking and give it a chance to climb back.

So how do you define a fork?

And how do you define a fork you want compared to one you don't want?

Every fork is valid. To decide which one is the "correct" one, you would need to solve the "math" of forks, at which point, you already have the solution to the problem you want to do your stuff on.

With all due respect, please read this once more and, if possible, try to use a whiteboard. Many problems, including the ones you are referring to, are solved as a side effect of this approach.

The side effect is automatic consensus on which chain is going to be applied, with minimal, if any, damaging effects from forks, as the 1%, or even 0.01%, of traffic is just used to check whether the node is back on the consensus chain.

This is just one of the side effects. While the solution is simple, the whole puzzle is much more complicated. I spent days in front of a whiteboard with this 'simple' thing, trying to find a reason why it would not be applicable, but I have not managed to yet.

Moreover, this is applicable in practice. (My primary project was related to a BGP routing mechanism; the idea came from that project, as I managed to keep BGP sessions stable at all times using the least-cost route. You can compare routes and their different costs to forks.)

Putting the other answer aside, you are already assuming more connections than a node normally has.
Also (without a whiteboard), that sounds like centralizing the decision process, which will lead to more forks going on, since with the reorganizing of weights one chain suddenly becomes more important than before.

You are right, very stupid of me. xD

The question that would help here, for you, is: why will this not work, or why will it give more cons than pros?

Thanks.

you can edit posts.

I didn't say it won't work. I did say that you need to know which is the "good" fork for this to work, but if you know that, you don't need it. Because you already know what the best fork is.

I know, thanks. But I'm a bit busy currently trying to find out how to downvote myself xD

The flag on the upper right of each comment/post is the downvote.

Thanks, I think I found it.

This isn't an area I have much experience in, so please forgive me if my concern isn't valid: could this lead to a "rich get richer" phenomenon? I.e., is it possible that the first-ranked node gets a higher percentage of the traffic over time, i.e. traffic for some particular node -> 100% as t -> infinity?

Edit: I realize that you designed this in order that such centralization wouldn't happen, but I'm not completely convinced that this wouldn't occur - though I am open to being convinced.

Edit #2: I see that lennstar already raised a similar point to this.

These are good questions, the t is designed never to be an infinity in order to prevent this from happening. t could be a minute, or 10 minutes, but in implementation such as this one it's a hardcoded value. (unless we extend the algorithm further and make it dynamic too, but I was not going that deep).

On the other side, there needs to be a gap between the traffic percentages so that it never happens that a single node takes 100% from another one. Let's assume there are 10 connections, and most of the traffic (70%) goes via the node with grade 1 (of that practical "client", let's say). One small latency increase, in combination with another node performing well, will lead to a swap, so the first one is no longer in first place. It should add much randomization due to factors difficult to predict.

Check out my answer to @donkeykong9000; it explains a lot about how it would affect this as well.

To make things easier: I am not 100% convinced either; actually, I never am until something is in production. This is just a draft to be discussed.

As for my conversation with @lennstar, I was tired, so the whole thing turned into trolling :), but I tried to address those issues in the response to @donkeykong9000.

Also, after days of thinking about it, everything looked very self-explanatory in my head, but now that I go back over how I started designing this, I totally understand it is not nearly as simple as it looks.

If I manage to find the time, I'll try to run a simulation of how this would perform. But as that is most likely not going to happen before the end of March, if somebody else could do it, I believe that would be the only way to confirm or reject this as a possible solution.

I feel (at least about myself) that it's a layman's discussion. There is no documentation, and I would have to dig through the code to find out how the chain grows and how consensus is reached.
There are several fixes for the current forking, which suggests nobody knows what the real cause is, or that there are many. The network already has some inbuilt form of Byzantine fault tolerance, so it is not working as intended right now.

Only deep theoretical analysis, implementation, and testing would answer whether your model is good. I think it's promising and worth further research.

I totally agree, it is a layman's discussion. The idea is to engage more people in thinking about the problems and solutions, even those who have not been directly in touch with the low-level technicalities. If I throw out a bunch of Bayesian probabilities and calculations, it's unlikely most people will understand, unless they are from our niche.

On the other side, making full 'by the book' documentation would require a full-time job. It's also not strange that even in my full-time job I expose draft models to layman's discussions, in order to pass or fail them. If a draft model fails such a discussion, then obviously thinking about it further is a waste of time. Sometimes people who are not educated in the niche show a very high level of 'common logic' (even beyond the educated ones), as we are all sometimes vulnerable to 'thinking the right way'.

As for the understanding, I fully admit I am treating the symptoms, not the diagnosis :)

Fully agree; the first step towards turning this draft into a proper model is documentation and a deep theoretical analysis. (That should probably move to a more specialized place such as GitHub.)

On the other side, if the solutions we already have in place prove good enough, this will just stay as something 'to think of' once we encounter another problem.

While it is treating the symptoms, doing it systematically usually eliminates the root cause as well.

And if you ask me, the root cause is using Boost libs and other third-party gigs of code that none of us understand. It was a time saver at the beginning, but it leads to effects such as the ones we have. In my opinion, no system was ever good enough unless it uses no more than iosocket and stdout. :)

If you ask me, the root cause of forking is the single-threaded network stack, which increases latency, which leads to the possibility of multiple nodes generating the same block... and so on (among many other factors that I also don't understand, nor would I be able to without reading gigs of code).

These are good questions, the t is designed never to be an infinity in order to prevent this from happening. t could be a minute, or 10 minutes, but in implementation such as this one it's a hardcoded value. (unless we extend the algorithm further and make it dynamic too, but I was not going that deep).

I should have asked: is it possible that, after every iteration of the loop, a single node accumulates more and more traffic?

It should add much randomization due to factors difficult to predict.

Are you saying that 1) there is already randomization inherent to the model, and/or 2) that randomization should be intentionally introduced?

I should have asked: is it possible that, after every iteration of the loop, a single node accumulates more and more traffic?

No, with the current topology it's not possible. The principle we currently have in place is a single-threaded blocking FIFO, meaning more connections lead to higher latency. That's why this implies fair traffic distribution: the more traffic a node accumulates, the higher its latency will be, leading to a lower rank and thus less traffic.

Are you saying that 1) there is already randomization inherent to the model, and/or 2) that randomization should be intentionally introduced?

  1. It's definitely inherent to the model, as per the answer above, controlled by both predictable and unpredictable values. It's also intentional: while the unpredictable values add a level of randomization, the predictable values keep it within a threshold (dynamically calculated from known values). The draft design of the model aims, as much as possible, to produce the desirable effects as side effects, in order to cut down the need for too much coding and change.

No, with the current topology it's not possible. The principle we currently have in place is a single-threaded blocking FIFO, meaning more connections lead to higher latency. That's why this implies fair traffic distribution: the more traffic a node accumulates, the higher its latency will be, leading to a lower rank and thus less traffic.

Measure latency of each, multiply that by the factor of rejected blocks, assign that value as ‘a score’.

What if the top node has no rejected blocks? Is that possible? What if the number of rejected blocks by the other nodes is so high that the top node remains the top node, even taking the higher latency into account? Is there a guarantee that one node will never stay at the top? Could a malevolent coalition of nodes take advantage of this model?

Sorry if these questions are addressed by basic knowledge, again I have little experience here.

Under that condition, that node will start accumulating more connections than usual, which directly leads to much higher latency. While it's theoretically possible that it will survive a few iterations, the probability falls drastically with each one. Latency plays an important part here, and the number of connections increases latency exponentially, hence the drastic fall in the probability that the node will stay at the top while accumulating a large number of connections at the same time.

The final result should be that nodes with zero rejected blocks get approximately the same number of connections.

The number of connections is also an important deterministic parameter in this model.

No problem for asking; the post was intentionally explained in common language and common logic so everyone can get involved.

In addition, for the purpose of better explanation: say 10 of us communicate with the same node. In order for my packet to get an answer, assuming I'm the last one who sent it, the node needs to answer yours first (with a whole round trip). That's the key point of how latency caused by single-threaded networking prevents the scenario you describe.
(Of course, it's possible that I am missing something; this is still at the level of theoretical discussion, so don't take anything as 100% correct.)
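
If somebody does want to simulate this, a toy version of that feedback loop could look like the following (all numbers are made up; the point is only that piling connections onto one single-threaded node raises its latency and pushes it down the ranking in the next round):

import random

BASE_LATENCY_MS = 20.0    # assumed baseline network latency (made up)
PER_CONNECTION_MS = 5.0   # assumed queueing delay per open connection (made up)

def effective_latency(connections):
    # Single-threaded blocking FIFO: every extra connection adds waiting time.
    return BASE_LATENCY_MS + PER_CONNECTION_MS * connections + random.uniform(0, 2)

def simulate(rounds=10, nodes=5, total_connections=100):
    conns = [total_connections] + [0] * (nodes - 1)  # start with everything piled on node 0
    for _ in range(rounds):
        latencies = [effective_latency(c) for c in conns]
        order = sorted(range(nodes), key=lambda i: latencies[i])  # best latency first
        weights = [1.0 / (r + 1) for r in range(nodes)]
        total = sum(weights)
        conns = [0] * nodes
        for r, i in enumerate(order):
            # The best-ranked node gets the biggest share of connections this round,
            # which raises its latency and tends to drop it in the next round.
            conns[i] = round(total_connections * weights[r] / total)
        print(conns)

simulate()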

How do you know "what you want" when ranking nodes?

Say you're on a fork. In this process your node will rank all the nodes that are also on your fork high on its list, because those forked nodes agree with your chain, while all the nodes on the real chain are ranked low because they disagree with your chain.

Now, I'm not too keen on the technicals, so please correct me if I'm wrong, but this sounds rather similar to the banscore system we currently have. If a node sends you invalid blocks (like blocks on a fork), its banscore goes up. If a node's banscore reaches 100, then your node will refuse to talk to the banned node.

The major contrast I see between your proposal and the current system is that in your proposal nodes at the bottom of the list still hold influence over connected nodes (albeit a small influence). This sounds like it would help with forking (it might not, I really don't know), but it may leave the network open to a DDoS attack, as those nodes still maintain a connection to each other even though they're at the bottom of each other's lists.

I'd like to see some dev feedback here because I'm really not educated enough to know if this would work in practice.

Say you're on a fork. In this process your node will rank all the nodes that are also on your fork high on its list, because those forked nodes agree with your chain, while all the nodes on the real chain are ranked low because they disagree with your chain.

Actually not; that's why latency plays an important role. The node that's first on my list is going to have many disagreements with other nodes, causing its latency to increase. Therefore, its internal list of priorities is going to change. As a result, it's expected that the node will have much higher latency due to the overhead caused by blocks rejected by other nodes (which do not count as successful operations), and that will directly increase its latency towards my node, putting it at the bottom of the list.

The banscore system was one of the inspirations for this; to be clear, this is not from scratch. I went through the code in order to understand how we can improve, not replace. As for the DDoS, yes, that method could be used to destabilize good nodes, but that's exactly why we are giving a small influence even to nodes that are on forks or placed on a really bad ISP connection, meaning they are not necessarily on a fork. In this scenario, those nodes are actually helping the network to heal until the DDoS stops (at least in theory).

Another inspiration for this came from thinking about how to improve the Steem witnesses model.

And the third one is from real-life experience with a BGP routing implementation, where I was able to confirm that this could be a working model. (Not necessarily for GRC and Steem, but for BGP it has shown great results so far.)

I like the idea of a more robust system for checking and testing node quality. The current system of "if it sends bad blocks too much = 24-hour ban" isn't robust enough to do the job, mainly because the 24-hour ban can cause more harm than good in some cases (for example, bugs in the software causing nodes to get banned all over the place, fewer nodes, less security, etc.).

amen to that :)