The collective body that's deciding which version of code to run has a fiduciary responsibility. To some extent, that includes everyone, but in reality, it's a relatively small group of people.
You can't know with certainty what will happen without trying it, but you can gain an increased level of confidence by doing formal analysis of the change, and developing research-backed theories that are more reliable than our intuitions. You can also increase your level of confidence by running simulations.
Flag wars, much like self-voting, cannot be countered through code changes.
Where is the evidence for this? If self-voting can't be countered through code changes, there's no point in implementing the change. As suggested in item (iii), however, I suspect that they actually can be mitigated by realigning the voting incentives.
I don't want you to provide anything, and I'm not the one who needs to be convinced. I'm content with either decision. I'm saying that the people who decide which version of code to run should demand more than an intuitive demonstration that the change will make things better.
What does better mean? In curation, "better" means that it is more likely to rank a set of posts in the correct order, according to user preferences. So, it seems to me that the witnesses who will run the code should ask whoever is proposing the change to provide some level of evidence that the post ranking after the change is likely to be more correct (closer to matching user preferences) than post ranking before the change.
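To make "more correct" concrete, here is a minimal sketch of one way it could be measured. It assumes we somehow have a reference ordering that reflects user preferences (getting that ordering is the hard part, and is not solved here), and it uses the Kendall tau distance, which is my choice of metric rather than anything the proposal specifies. The post names and all three orderings are made up for illustration.

```python
from itertools import combinations

def kendall_tau_distance(ranking_a, ranking_b):
    """Fraction of item pairs that the two rankings order differently.

    0.0 means the rankings agree on every pair, 1.0 means one is the
    exact reverse of the other. Both rankings must contain the same items.
    """
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    pairs = list(combinations(ranking_a, 2))
    discordant = sum(
        1 for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)

# Hypothetical data: the ordering users would actually prefer, and the
# orderings produced by the voting rules before and after a change.
preferred = ["post_a", "post_b", "post_c", "post_d"]
before = ["post_d", "post_a", "post_b", "post_c"]
after = ["post_a", "post_c", "post_b", "post_d"]

print("before:", kendall_tau_distance(preferred, before))
print("after: ", kendall_tau_distance(preferred, after))
```

A lower number means the ranking is closer to the preferred one, so a proposed change could be called "better" under this metric if it lowers the distance across a range of realistic scenarios, not just one.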
Your argument seems self-contradictory. On one hand, you say that the rules don't matter and that we just need to depend on curators to downvote. On the other hand, you're making that argument in support of a rule change. If we can't solve the problem of incorrect ranking of posts by changing the rules of the game, then why are we having this conversation at all?
The point of a content curation system is to produce a ranked list of content. Yes, from the voter's perspective, it's just "I rank this as x dollars", but a good content curation system will aggregate all of those individual decisions into an ordered set that approximates the actual combined preferences of the users so that readers can quickly find things of interest.
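As a rough sketch of that aggregation step, assuming a deliberately simplified model where each vote contributes the voter's power times a weight between -1 and 1: this is not the actual Steem reward calculation, and the function name, vote format, and numbers are all hypothetical.

```python
from collections import defaultdict

def rank_posts(votes):
    """Aggregate individual votes into an ordered list of post IDs.

    votes: iterable of (voter_power, post_id, weight) tuples, where
    weight is in [-1.0, 1.0] and a negative weight is a downvote.
    Posts are sorted by total weighted power, highest first, with the
    post ID as a tie-breaker so the result is deterministic.
    """
    totals = defaultdict(float)
    for power, post_id, weight in votes:
        totals[post_id] += power * weight
    return sorted(totals, key=lambda post_id: (-totals[post_id], post_id))

# Hypothetical votes: one large stakeholder, a few small voters, one downvote.
votes = [
    (100.0, "post_a", 1.0),
    (5.0, "post_b", 1.0),
    (5.0, "post_b", 1.0),
    (50.0, "post_c", 0.5),
    (20.0, "post_a", -1.0),
]

print(rank_posts(votes))  # ['post_a', 'post_c', 'post_b']
```

Each voter only ever says "I rank this as x", but the output is an ordered list, and it is that list that can be compared against what users actually want.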
In that context, it is possible to analyze the strengths and weaknesses of a particular voting scheme before injecting it into the blockchain.
You should read "A Puff of Steem: Security Analysis of Decentralized Content Curation". There is much to learn, and it suggests several techniques by which the strengths and weaknesses of any proposal might be quantified before slapping it into the running blockchain.
It's very simple to calculate. What is the median power of an average voter? What is the power of a community vote? Of a bid-bot? Of a whale?
Case closed :D