HF20: Why the STEEM software development process is broken and why we must fix it NOW!

in #witness-category, 6 years ago (edited)

This post is technical, but don't worry: I explain the ideas for people who are not tech-savvy. Also, I'm not blaming anyone in particular, but everyone! Yet I'll be cruel when needed. You have been warned!

Hardfork 20 is by far the most traumatic release I have ever been through, and maybe the same goes for you. I was very angry at first, but fortunately I quickly remembered that this is a Libre Software project. We are all responsible for what is happening, and we can't say that we were not warned. There was a previous attempt to release HF20, but nobody realized that the few issues found then were just the tip of the iceberg. It all means one thing: all software is subject to technical issues, but we can prevent a shame like HF20's with the right process.

subject_to_technical_issues.jpeg
Image source: TDD and the gift of failure by Imogen Hardy

Got it? Let's dive deeper into software development then. Software development is the art of analyzing, designing, implementing and maintaining software applications. It is an art because software developers must master both technical and soft skills in order to collectively deliver a working solution on time and on budget. Given the current situation, we can't say that HF20 is a working solution; it is rather a working set of problems.

HF20 is not working because it has major and critical bugs

What is a software bug then in the first place?

A software bug is an error, flaw, failure or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways.
Source: https://everipedia.org/wiki/Software_bug/

In the software development industry we just call them "bugs" and we classify them by relevance. A bug is more relevant if it prevents delivery. Say, for example, that you produce milk and you realize that a 100-liter barrel smells weird. You know that you can't sell it, so you have to find a way to produce acceptable milk and to avoid producing smelly milk again. HF20 was smelly, so we added some sugar, water and vanilla. It seemed acceptable... but after a week it started to smell even worse, and it was too late to prevent the disaster because it was sold already! That is not how major and critical bugs should be fixed. The right thing is to isolate all the smelly milk, send a sample to the lab and let customers know that we can't deliver until we fix the outstanding issues. Obviously, with the best possible apology, so as not to drain the customers' trust.
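The "classify by relevance, and don't deliver while the serious ones are open" rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the Severity scale, the Bug class, the function names and the example bug titles are all made up here, and none of them come from the STEEM codebase or from Steemit, Inc.'s tooling.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; a higher value means a more relevant bug."""
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass
class Bug:
    title: str
    severity: Severity
    fixed: bool = False


def blocking_bugs(bugs, threshold=Severity.MAJOR):
    """Return the open bugs that are severe enough to prevent delivery."""
    return [b for b in bugs if not b.fixed and b.severity >= threshold]


def can_release(bugs, threshold=Severity.MAJOR):
    """A release is allowed only when no blocking bug is still open."""
    return not blocking_bugs(bugs, threshold)


# Invented example backlog: one cosmetic issue and one "smelly milk" issue.
backlog = [
    Bug("cosmetic glitch in a status message", Severity.MINOR),
    Bug("the whole barrel smells weird", Severity.CRITICAL),
]

print(can_release(backlog))   # the critical bug is open, so delivery is blocked
backlog[1].fixed = True       # the lab found the cause and it was really fixed
print(can_release(backlog))   # only a minor bug remains, so delivery can go ahead
```

The point of the sketch is that the gate looks at open bugs, not at how good the milk smells after adding vanilla: a major or critical bug stays blocking until it is actually fixed.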

"Process" - What it has to do with @ned, Steemit, Inc and the witnesses

If the example above was clear to you, then you have a high-level understanding of how software development should be handled. The next thing to understand is what process means in the context of software development. In short: software development process.

[...] a software development process is the process of dividing software development work into distinct phases to improve design, product management, and project management [...]. The methodology may include the pre-definition of specific deliverables and artifacts that are created and completed by a project team to develop or maintain an application.
Source: https://everipedia.org/wiki/Software_development_process/

In short, a process is a set of phases that should help to deliver working software. Who is first responsible for the STEEM software development process? Short answer: Steemit, Inc., and the head of that company is @ned. Yet we also have the witnesses, who are responsible for keeping the STEEM network (the blockchain) running. They are also a way to lighten the load on @ned's shoulders, since even if he pushes too hard to release buggy software, the top 20 witnesses must approve it. Unfortunately, none of the top 20 witnesses was able to prevent the disaster. Yes, they are responsible as well. But wait, do you remember that this is a Libre Software project? It means that the source code is publicly available and that everyone (with enough budget) can do their own due diligence. We know that many people in the community are not witnesses but certainly have enough resources to audit the source code. Third responsible party found!

Duty of everyone, responsibility of none

Can't plan for the unexpected
Image source: Everyone has a story: Can't plan for the unexpected by Senior Airman Racheal Watson

In Spanish we have another saying: "After the battle, everyone is a general". I told you, this post is not intended to blame anyone in particular, but everyone. We know we could have prevented this failure, but we didn't. The next step is to start fixing it!

So... this is a humble call to everyone with positive mana. We could complain about many things, but we need your voices to focus on fixing this whole mess. The problem is not the software. The problem is that the process was not handled professionally. We can never, ever release like this again!

Now that we understand that the STEEM software development process is broken, we should take responsibility for it. We can't trust a failed process anymore. We should demand a working process, in order to prevent this from happening again: https://steemit.com/steem/@steemitblog/hf20-update-hardfork-successful.

Solution: Implement a formal testing process RIGHT NOW!

Here is where my passive-aggressive and demanding tone starts...

Let's use Everipedia's help for the last time in this post:

Software testing is an investigation conducted to provide stakeholders with information about the quality of the software product or service under test. Software testing can also provide an objective, independent view of the software to allow the business to appreciate and understand the risks of software implementation. Test techniques include the process of executing a program or application with the intent of finding software bugs (errors or other defects), and verifying that the software product is fit for use.
Source: https://everipedia.org/wiki/Software_testing/

That language should sound pretty clear now that you know the keywords. But let me quote again to highlight a very important part of the definition above:

Test techniques include the process of executing a program or application with the intent of finding software bugs (errors or other defects), and verifying that the software product is fit for use.
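To show what "executing a program with the intent of finding bugs" can look like in practice, here is a minimal sketch using Python's standard unittest module. The transfer function and its toy ledger are hypothetical and invented for this post; they are not how steemd handles balances. Note that most of the tests try to break the function rather than confirm the happy path.

```python
import unittest


def transfer(balances, sender, receiver, amount):
    """Move `amount` from sender to receiver in a toy, hypothetical ledger."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if balances.get(sender, 0) < amount:
        raise ValueError("insufficient funds")
    balances[sender] -= amount
    balances[receiver] = balances.get(receiver, 0) + amount


class TransferTests(unittest.TestCase):
    def setUp(self):
        # Fresh ledger for every test so that tests cannot affect each other.
        self.balances = {"alice": 10, "bob": 0}

    def test_valid_transfer_moves_funds(self):
        transfer(self.balances, "alice", "bob", 4)
        self.assertEqual(self.balances, {"alice": 6, "bob": 4})

    def test_overdraft_is_rejected(self):
        with self.assertRaises(ValueError):
            transfer(self.balances, "alice", "bob", 11)
        # A rejected transfer must leave the ledger untouched.
        self.assertEqual(self.balances, {"alice": 10, "bob": 0})

    def test_negative_amount_is_rejected(self):
        with self.assertRaises(ValueError):
            transfer(self.balances, "bob", "alice", -5)

    def test_total_supply_is_preserved(self):
        before = sum(self.balances.values())
        transfer(self.balances, "alice", "bob", 10)
        self.assertEqual(sum(self.balances.values()), before)
```

Saved as, say, test_transfer.py, the tests run with `python -m unittest test_transfer`. A formal testing process means that checks like these are written down, run on every change, and allowed to block a release when they fail.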

Somehow all the responsible people (starting with @ned) seem not to care about the importance of a formal testing process. We had several unclean softforks in the past, and we are actually used to having the unexpected happen too frequently. Perhaps the logic is: if softforks are unclean and we are still in business, then people can tolerate a disruptive (in the worst sense of the word) hardfork. Sorry, but absolutely NO! That is not acceptable!

Let's do something different!

I miss the days when @steemitblog used to get a great echo from the witnesses and there was more visibility into what was coming and what the downsides were. I'm tired of seeing witnesses hide the risks with the false intent of preserving market value. Actually, we were all too busy milking the cow to death. Fortunately, we still have a chance to do something different and to get back to work with dignity. Let's refocus on delivering a working product; it is way better than marketing hot air all along!

Finally, to all those working long hours against the clock to stabilize the network: I have 13 years in the software industry, so I know your pain. I owe you a huge

Much thanks!

Just don't get used to the overnight hell. It is not healthy for your body, nor for your pocket ;)

Keep steeming hard!


Aloha! I wrote my own post on this issue, here. It is difficult to address specific points without having an overview of the details and insights into the code. This is a highly specialised and technical project and while it is probably true that fairly major holes exist in the process engineering side of things here, we don't actually know enough to comment fully accurately (in other words, judgements don't help).

I do feel though, that the points I made in my post regarding structured and formal processes, which you made to some extent too - probably apply here and need to be resolved better than just 'Everything is fine' or 'be patient'.

You pretty much summarized everything accordingly. I still believe Steemit has a shot to deliver; heck, we have endured the downs like champs. The day we realize this social media is really ours, and that what we do as a community will define the road to success or failure, that day we will finally take things more seriously. Ned made the mistake of thinking that the fact that this community is decentralized gives him the bonus of leaving things entirely to the community; he should invest, assemble a proper team and be more ambitious. Steemit is a hidden gem that should be mainstream already, and competition is just around the corner. Let's hope for some better days. Great analysis.

I agree @jonsnow1983, I think steemit is amazing... nothing is perfect everything has its ups and downs.
But as you said... competition is now snapping at their heels; if steemit doesn't pull together and pick up its game, it's going to get left behind.

@ned has failed too many times and the witnesses demonstrated that they blindly follow his lead. The worst thing is that the whole community allowed this to happen. So we are all champs! Hopefully, we can wake up now and act like adults. My personal take is to reset my witness votes. I have 28 witness votes waiting for people with competence in the software industry. So far we have no competent witness sitting in the top 20, because they all agreed to release HF20! We need the right witnesses making the right decisions!

I think most people know what the situation is, but I'm willing to see specific proposals for bringing STEEM to the next level. So far, much cheap talk IMHO.

Holger80 and emrebeyler.
Those are good witnesses

You touch on a lot of subjects in this post. As promised, here is my take on this:

  • the main issue is the lack of process and, more specifically, the lack of process in quality assurance. I see that the tactic here is very agile.
  • most of the top 20 blindly follow the Steem Inc decisions due to the sweet spot they are in. I can't blame them, because this is human nature.
  • the communication is bad in many ways. After the HF, you have a non-functioning blockchain (and by functioning I don't mean merely producing blocks) and you come up with a post saying that it was a success. In a normal world, stopping the functioning of a 300 mil USD company would be considered a major failure and heads would have rolled.
  • yes, it is true that the code is available and we all can check it. The positive thing is that there is a testnet available, which was introduced recently.
  • the technology is new but almost every piece of technology or software like in this case was new at a certain time in history and was done under a certain process and testing.
  • I think one of the main problems is that we still have the Beta mark. It is like a waiver for all failures.

These are some ideas. As I said, the communication is not the best and most people are not aware of what is needed or where the community should help and intervene. An idea to motivate people would be to delegate some SP for a month to those who find the bugs.

Thanks for taking time to reply @alexvan! Feedback like yours makes my writing worth the effort.

So...

You are right, @ned has a very agile approach (which is the right choice for a project like STEEM), and he just confirmed that here:

https://steemit.com/steem/@develcuy/re-ned-re-ned-re-sapphic-re-ned-re-ats-david-re-steemitblog-hf20-update-restoring-continuity-20181001t004050638z

Your other strong point is communication, and you are right again. Unfortunately, I see this happen too many times, not just with STINC but in many organizations. Effective communication should be a priority, so I'm happy to see many updates from @steemitblog recently, but that should be the norm from now on!

I recommend adding the tags steem and steemdevs

The issue is with steem; steemit is just part of the problem. Granted, it is the BIGGEST part of the problem and the root of it, but there are others. These are just the worst.

Thanks for recommending the tags! Just added them

I have been saying this for a year or so now.
Proper service management for operations is needed!
Living up to ITIL standards.
Separate integration and production environments.
Introduce RfC, deployment and testing plans.

This is not easy, but right now the development and production areas are just mixed... or even the same. :(

Having any process is better than having none! Whether ITIL, CMMI or your very own. Given the current situation, I can only say that STINC/steemit has no formal process at all!

The really sad thing is, that even their software development process doesn't exist.
There seems to be nothing in place.
I can understand that devs ignore operations. But please get at least a grip on your dev processes!

Hardfork 20 is by far the most traumatic release for me ever and maybe for you.

No. By far, HF17/18 was more disruptive.

There is always something worse, right? Are you saying that we should be happy because HF20 is not the worst of all time?

Check it out:

https://steemd.com/@shaneamaya

That's one of those annoying follow bots. It was (probably) going to follow every single account on the blockchain. RC limited them to 38. They could start up again, but haven't.

That's what the RC system was designed to do and it's working. Yes, there's a little bit of pain now while RC gets up and running, but a yuge benefit in the long-run.

Yes, I'm very happy with HF20.

It would be a shame not to have that part of HF20 working; my protest is against its unexpected (yet preventable) consequences! And I'm not talking nonsense, I'm just demanding witnesses that require STINC to follow industry standards. Is that too much to ask?

we should temper that demand with the understanding that this has never been done before

@inertia indeed, the software industry is one of the most innovative ever and it is full of projects doing things never done before, which is key for market disruption. Yet that isn't an excuse to ignore well-established industry standards, nor to fail to communicate the risks involved effectively. You can't cover up the sun with one finger, so STINC should fix its mess before it is too late.

I'm sure you're referring to well-established coding practices when you mention "follow industry standards." It's just a funny phrase to use in this context, where we're literally doing something that's never been done: developing a decentralized, no-fee, stake-based content platform that is regulated by delegated proof of stake.

You know what I mean? What industry standards?

I know, you mean code review, full specifications, and such. But what do we check those against?

So yes, we should demand more from all of the witnesses and Steemit, Inc., but we should temper that demand with the understanding that this has never been done before, certainly not on this scale.