Part of the process in the technology world is the moment when information moves from the highly technical and scientific press to the mainstream media.
When it comes to data, the growing problem is now appearing in the latter. Even the NY Times is getting into the act, covering what we have detailed here.
Artificial intelligence, specifically large language models (LLMs), requires a great deal of data. This comes in the form of text, photos, and video.
Fortunately, the Internet, which we can say started in 1991 with the World Wide Web, is a massive repository of data. Over the last 3 decades, enormous volumes of data have been posted. This really took off with the popularity of social media: roughly 500 million tweets are added per day, billions of photos sit on Instagram, and roughly 275K hours of video are uploaded to YouTube.
If this is the case, how can there be a data shortage?
This is what we will dive into in this article, while also detailing how we can alter this outcome.
Image generated by VeniceAI
Locking Down Data
For years, users of social media knew they were the product. Any data posted to a social media platform becomes the property of the company behind it.
Over much of the Internet's life, these entities processed data and sold it. The most common buyers were advertisers, for whom targeted messaging became the norm.
With the advent of LLMs, things changed a great deal. Suddenly these companies were sitting on a gold mine of information. They had decades' worth of content to feed through the system.
At the same time, the Internet was basically open. Start-ups such as OpenAI could go and scrape it, taking information from other sites and using it to train these models.
It is this move that is getting OpenAI sued by just about everyone. The company is facing mountains of lawsuits, something that is far from resolved.
The value of data was realized.
For this reason, we see the response:
The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.
The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
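For readers unfamiliar with the mechanics, a minimal robots.txt of the kind the study describes could look like the sketch below. GPTBot (OpenAI's crawler) and CCBot (Common Crawl's) are real crawler names used here as examples; any bot that honors the protocol will skip the disallowed paths:

```
# Block known AI training crawlers from the entire site
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# All other crawlers may continue to index normally
User-agent: *
Allow: /
```

Note that this is a request, not an enforcement mechanism: compliance depends entirely on the crawler choosing to honor it, which is why platforms like Reddit also lock down their APIs.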
This is very concerning.
We saw what happened last year when both Elon Musk's X (formerly Twitter) and Reddit caused an uproar by limiting access to their APIs. This was done in an effort to reduce the ability to scrape their systems. While some legitimate high-volume users were likely caught up in the restrictions, those trying to extract data in bulk were mostly stopped.
This allowed Reddit to turn around and sell the data. One deal was for $60 million with Google.
Data Inequality
The challenge here is that we are rapidly moving to a time when many companies are excluded.
Essentially, Big Tech has all the data.
But widespread data restrictions may pose a threat to A.I. companies, which need a steady supply of high-quality data to keep their models fresh and up-to-date.
They could also spell trouble for smaller A.I. outfits and academic researchers who rely on public data sets, and can’t afford to license data directly from publishers. Common Crawl, one such data set that comprises billions of pages of web content and is maintained by a nonprofit, has been cited in more than 10,000 academic studies, Mr. Longpre said.
Of course, the ones protecting the data are not the users who actually created it. Instead, it is the mega corporations that own and run the platforms. After all, the data is on their servers, which makes it theirs.
That said, even these entities are likely to hit a wall. Here is where a major problem arises.
If much of the newer data is locked behind paywalls, then the information provided by the LLMs is going to grow stale and inaccurate.
Then there is the bigger problem: the exclusion of smaller firms (start-ups).
How many companies can afford to give Reddit $60 million? This is where Big Tech has another advantage. If the Silicon Valley behemoths have to pay for data, it is not a major issue for them. However, this basically excludes new entrants, making the LLM world limited to a handful of companies.
Web3 Provides The Answer
The advantage of most blockchains is that they are open, permissionless databases. When dealing with a blockchain like Hive, anyone can set up an API.
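To make that concrete, here is a minimal sketch of reading from Hive's open API using only Python's standard library. The node URL (api.hive.blog) and the condenser_api.get_content method are real, publicly documented interfaces; the permlink in the usage line is a placeholder:

```python
import json
import urllib.request

# Any public Hive node works; api.hive.blog is one community-run endpoint.
HIVE_NODE = "https://api.hive.blog"

def get_post(author: str, permlink: str) -> dict:
    """Fetch a single post from the Hive blockchain via JSON-RPC."""
    payload = {
        "jsonrpc": "2.0",
        "method": "condenser_api.get_content",
        "params": [author, permlink],
        "id": 1,
    }
    request = urllib.request.Request(
        HIVE_NODE,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["result"]

# Usage: no API key, no license, no scraping deal required.
# "example-permlink" is a placeholder; substitute any real post's permlink.
post = get_post("taskmaster4450", "example-permlink")
print(post.get("title"))
```

Contrast this with the $60 million Reddit deal mentioned above: the same read access that Big Tech pays for is free to anyone here, and no robots.txt or API paywall can revoke it.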
Here is where people need to start questioning what they are doing on a daily basis.
Unfortunately, people do not realize what they are dealing with. If data is fully in the control of a few major technology firms, what do you think the world will look like? Many are fearful of the "Skynet" scenario. If that is the case, why feed these firms even more power?
Ultimately, this is what we are dealing with.
When adding to an owned, controlled database, we are simply feeding that entity. Data is shaping up to be power. This is something that we must keep in mind.
How do we spread the power of the future around? Simply by adding to databases that are not owned by the usual suspects. Here is where Web3 offers an alternative.
This is a topic that is going to keep getting more attention over the next 12-18 months. It is not a situation that will magically improve. The value of data keeps rising, meaning the industry will become fragmented along the lines of the haves and have-nots.
If this does not sound appealing to you, then it is time to stop feeding Big Tech as much as possible and divert your data to those networks which are open and can be used by anyone.
Here is where Web 3.0 is democratizing data.
Posted Using InLeo Alpha
The way I see it, Web2 is a huge city, let's say Las Vegas. Web3 is an island with coconut trees and some small wooden houses. Soon Vegas will fill up and people will move to the island.
Interesting analogy. The trick is to get them to move to the new city now, before Vegas fills up, so the AI companies can start mining free data now.
No doubt, big brother. Web3 could change everything by making data more open and fair. It's great to see new ways of sharing info. Data is drying up fast.
I feel somehow web 3.0 does a decent job in protecting data.
A very interesting article; it got the wheels in my head turning, and soon I had come up with more ideas for how Hive and InLeo could benefit.
There are multiple potential business niches and ways to promote Web 3.0 communities as alternatives, which AI firms should nurture as data banks they can mine for free.
What if OpenAI decided to start upvoting everything on Hive to encourage the rapid growth of the database?
What if... perish the thought.
What if plagiarism was no longer frowned upon, but encouraged as a method of moving data from databases locked behind firewalls and enormous user fees to databases which were free?
A Hive whale consortium controlled by AI platforms could completely transform Hive in a month or two into an army of web crawlers moving data from the paid web to the free web, a growing source of data for the AI consortium which could in time rival the Google and Twitter databases.
Or, if that is distasteful, this same whale pod could just incentivize the creation of new data by people who have seemingly unlimited productivity potential; they just lack incentive. Provide incentives and the problem is solved.
What if a whale ten times your size incentivized you to do more? Just imagine the enormous productivity which could be obtained by a taskmaster rewarded tenfold for every post. That would mean amazing data production. Then imagine 10x that by selectively incentivizing those with the most productivity potential.
It would take time, but the history of capitalism shows us that, with enough money, time can be overcome as a limiting factor.
Put more bodies on a task to complete it faster.
Using Dollars or Euros in the developing world is a sure way to shift workers from one industry to another.
Sure, it sounds mercenary, but the world has always needed mercenaries, from the time of Rome until now. And pay-to-archive could be the next big thing in Web 3.0.
After all, what are mercenaries? People standing on the wall, performing the tasks which the rich and powerful do not want to do themselves, so they pay others to do them.
It is the way of the world.
I wonder if website scrapers can find a way to work around this. Websites like Reddit are still accessible to the public. They only changed their API to protect against these bots. While putting data on Web3 is the best case scenario, until that becomes the norm, I hope monopoly of data can be addressed.
Our infrastructure will need to be reconfigured for web3, and codes of conduct will need to change in order to respect the original creators of whatever data is in question.
Imagine a future in which the owner of a meme NFT receives some type of royalty each time it is reposted. The same would go for photographs, videos, and songs. A world where you sign a transaction authorizing your data to be fed into some AI.
It would require each individual to safely guard their seed phrases, which I don't think is practical at this moment, but we are certainly trending in that direction.
Honestly, I do agree with you; this is a crisis to behold in the future, and it will definitely occur. Books have been written to warn people, but to no avail. I'm guessing we should find a safety net to counteract it when it happens. Knowledge of Web 3.0 platforms like Hive is not really widespread, which makes the problem that is surely coming even worse. In my personal opinion, people are addicted to using Web 2.0.
I'm less worried about the data and more worried about the energy consumption. Not to mention our lakes and rivers drying up!