Stack Overflow AI Scraper

in LeoFinance, 14 days ago (edited)


An interesting tale of WEB2 vs WEB3

Stack Overflow, a legendary internet forum for programmers and developers, is coming under heavy fire from its users after it announced it was partnering with OpenAI to scrub the site's forum posts to train ChatGPT. Many users are removing or editing their questions and answers to prevent them from being used to train AI — decisions which have been punished with bans from the site's moderators.

I actually find this a little bit funny and sad at the same time. Stack Overflow is a site that I've used many times over to answer programming questions that I had at the time. I've used it a couple times more recently as well for more specific questions that basic tutorials can't handle. Well, apparently... users on Stack Overflow seem to think that their data belongs to them, which of course we all know it doesn't. We can think of SO as a very technical type of social media site. As we all know: social media companies own all the data; if you mess with the company and do something that makes them lose money, you're banned. End of argument.

This false belief that Stack Overflow posts somehow belong to the users that put their blood, sweat, and tears into the site in order to be helpful and cultivate their reputation led to many being angry that AI would scrape their data and cut them out of the equation. After all, if some random AI is going to rehash the data, how will the user that actually answered the question get any credit for their troubles? What a pickle!

So out of protest people started deleting their own highly ranked posts, and got suspended until they changed it back... and if they didn't change it back it just got changed back anyway. Whoopsie!

Ben continues in his thread, "[The moderator crackdown is] just a reminder that anything you post on any of these platforms can and will be used for profit. It's just a matter of time until all your messages on Discord, Twitter etc. are scraped, fed into a model and sold back to you."

Oh... are we finally starting to get it now?

Funny it's taken people this long. If I had to guess I'd say that a lot of the highly ranked users on Stack Overflow are the exact type of people who would take one look at crypto and be like, "Well obviously that's a scam." I've encountered these types of older-school devs who litter the corporate world many times over. They refuse to see the potential and only see the fallout. Well it looks like they'll be forced to change their minds sooner or later at the rate this is all going.

Users are also asking why ChatGPT could not simply share the source of the answers it will dispense in this new partnership, both citing its sources and adding credibility to the tool. Of course, this would reveal how the sausage of LLMs is made, and would not look like the shiny, super-smart generative AI assistant of the future promised to users and investors.


LoL!

We don't want to show the end-users that all Large Language Models are actually just counterfeiting plagiarists built on the backs of others doing the work... so instead we're just going to pretend that the AI came up with the answer on its own. Amazing logic, that. I mean, let's be honest: they own that data so they can do whatever they want with it.

Site moderators preventing high-popularity posts from being deleted is legally above-board. Angry users claim they are enabled to delete their own content from the site through the "right to forget," a common name for a legal right most effectively codified into law through the EU's General Data Protection Regulation (GDPR).

Wow these users have no shame.

They want to exploit a law designed to protect people from forever having their dirty laundry posted on the internet in order to delete actually helpful data so that AI can't monetize it. I can't believe I'm saying this, but I'm on the side of the corporations on this one. This is a completely hypocritical stance to take as a user. On the one hand you say you want the AI to credit your work, but on the other you're trying to leverage a law designed for the exact opposite of that. This is honestly despicable, childish behavior. Sorry you didn't read the contract you signed. Take the loss and move on. It's time for these users to learn the hard lesson.

Users who disagree with having their content scraped by ChatGPT are particularly outraged by Stack Overflow's rapid flip-flop on its policy concerning generative AI. For years, the site had a standing policy that prevented the use of generative AI in writing or rewording any questions or answers posted. Moderators were allowed and encouraged to use AI-detection software when reviewing posts.

Beginning last week, however, the company began a rapid about-face in its public policy towards AI.

Wow are people learning just now that corporations change their mind when they can make money?
These Stack Overflow users really are autistic aren't they?
And I mean that in the most respectful way.
We all love an autist around here especially in crypto.
I am not here to blame the "victim".
Perhaps there's another way?


Stack is not alone in reversing a principled stance on AI for profit; Valve also silently removed its AI-art ban on Steam, allowing over 1,000 AI-powered games to flood the storefront. Stack Overflow's partnership with OpenAI also follows the LLM company's recent push for increased partnerships and marquee deals, including their major announcement of a $100 billion datacenter to be built with Microsoft.

THIS IS THE FUTURE

DEAL WITH IT.

This is a tidal wave; it cannot be stopped. People need to stop fighting the obvious path of least resistance and learn to pivot and adapt to the new environment. LLMs and AI aren't going away; they are spiraling outward as far as they possibly can. Calling corporations hypocrites because they change their mind when money is involved is ironically hypocritical in itself. That's EXACTLY WHAT THEY DO EVERY TIME. Please, stop acting shocked. It's a bad look. We aren't that naïve.


And this is actually not a problem that WEB3 solves.

In fact... one might argue that data in WEB3 is even easier to scrape because all the data is public to everyone. At least with a centralized WEB2 corporation it's their decision whether they want to make such a move. In crypto: anyone could move in to monetize the data in any way they saw fit.

The difference with WEB3 is that we can actually get paid up front for our contributions. On top of that, if someone privatizes data profiteering, anyone else can come along and make that same data public for all to see, which completely undercuts the private model and makes it exponentially more risky to pursue. No one can legally send a cease and desist order on WEB3. These variables may prove useful going forward.


Conclusion

Are Stack Overflow users completely out of touch with reality and the impending direction of modern technology? It seems like they are, which is really ironic considering it's a technical site used only by software developers who should obviously know better than that. I guess nobody cares until they're on the receiving end of the stick. Go figure.

Will WEB3 solve this issue... or make it even worse? I believe that Hive is a particularly useful solution to a situation like this. After all, how many crypto social media sites can we actually get paid on? There's been a lot of hype but they've all been failures. Sometimes it feels like we are the only survivors in a river of carnage.


This is kind of ironic. I have been in IT for over 18 years now. How many times do people get frustrated because they find half a solution on a forum or things are not properly documented? Along comes AI as an index and they get pissed off. To make things worse, they don't consider that everything you put out on the internet stops being yours the moment you publish it on all these centralized platforms. A few months ago I deployed Pixelfed for my family's use; their documentation is not so user friendly and I had to dive into multiple forums to get it to work properly... if only I had an index for all this data and someone I could contact to answer my basic questions... hmmm, if only?

Wow
So that means AI is practicing theft in one way or another, if Stack Overflow can be stealing people's questions and their posts.
Honestly, I was always wondering about the people who upload the questions or answers the AI gives us, but I think I get the drill now.

All the art that AI creates is also sourced from artists and rehashed.
The art theft is way more blatant than the text theft.

I have to agree with this. The data and information is going to get out there one way or another. Fighting it is kind of a moot point. I'm sure some would argue that apathy is the first step towards being oppressed, but in this case it just seems petty.

To me it just feels like a knee-jerk reaction coming from a community that never had to deal with this type of oppression. For those of us who have been in the muck for a while it's a bit difficult to see things from that perspective.

I'm one of those guys who always likes to shoot ideas. Many of them are probably not that great, but just on sheer numbers there's a hit every once in a while. I usually share them openly. People have questioned me on this before. "Aren't you worried that someone will 'steal' your idea?" No, I'm not. When they ask why, I always answer that they won't approach it as well as I will. Should this 'help' that was offered be considered content? I guess. The original intent was to help people though. Isn't that what AI is doing?

Exactly.

People ask me this as well.
And I tell them: if someone stole my idea that would be great.
It would mean my idea exists and I didn't have to do any work at all.
This is especially true for WEB3 ideas where everyone benefits.

"It's just a matter of time until all your messages on Discord, Twitter etc. are scraped, fed into a model and sold back to you."

(emphasis added)

Isn't Discord a real-time chat platform? If so, doesn't this mean that chat content is as prone to data mining as content found on Facebook, Instagram, YouTube, and even bulletin board forums from the 1990s?

If that's true, isn't that yet another reason for us as Hivers and Leos to move away from Discord to our own solutions (or at least a Web3 alternative offered outside the Hive ecosystem)?

(I rarely use Discord for anything, so I quickly confess to being ignorant of the true capabilities and nature of Discord.)

We should use the best tools for the job.

We use Discord because it is a powerful tool.
Nothing in crypto even comes close to what Discord provides.
And even if it did, that wouldn't stop anyone from scraping that service for data.

In certain respects this irrational need to stop these corporations from making money is silly.
These corporations are providing free services.
Of course they're going to monetize the data.
This isn't going to be a problem worth solving until it gets a little bigger, IMO.

It doesn't bother me that companies want to make money. Companies are made of people, and people go into business to make money. That's a good thing.

I only commented because many people (and I'm included among them) aren't used to thinking of chat content as publicly available content for data scraping. After all, chat isn't like tweets from X/Twitter, threads from LeoThreads, or threads from Threads (by Meta). Microblog entries on those platforms are historical in nature; chats are supposed to be ephemeral unless the chatter makes them available.

OMG, I have really learnt something nice from this post, so the AI theft is actually possible and it exists. Wow, thanks so much @edicted

Great, so now Copilot/ChatGPT will start giving it the "this post has been marked as a duplicate" shit it learns from there lol

About giving credit to the data source, I think Meta's Llama LLM does pretty well at that. I'm sure of that as I use it often in WhatsApp