The 100,000+ IP Question: Why is a Massive Botnet Obsessed with Old Blockchain Data?

Hey everyone,

I usually write posts sharing cool technical implementations or updates on my projects. Today's post is a bit different. It's a detective story, and I'm currently standing in the middle of the crime scene.

After setting up what I considered to be a fairly quiet, niche Flask application, all was well for a moment. A small part of its job is resolving blockchain posts from the Hive (and previously Steem) networks, serving up content based on the standard @username/permlink URL structure. It's useful, but it's not exactly Netflix.

Or so I thought.

Recently, I noticed my server resources creeping up. Memory usage on my Gunicorn workers was hitting 20%, and things just felt sluggish. I decided to pop the hood and look at the Caddy logs, expecting to see maybe a burst of traffic or a buggy script.

What I found instead was staggering.

The Scale of the Attack

In a single 24-hour period, my unassuming little server processed requests from over 128,000 unique IP addresses.

165,000+ at time of screenshot...

Let that sink in. This wasn't one over-enthusiastic person hitting refresh. It wasn't even a standard "loud" scraper hitting me thousands of times from a single server. This was a highly coordinated, distributed botnet attack.

The pattern is insidious. It's a "low and slow" attack. Each individual IP address might only hit the server once or twice a day. Standard rate-limiting (which looks for too many requests from one IP in a short timeframe) is completely useless against this strategy.
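To make the problem concrete, here is a minimal Python sketch of why a per-IP limit never trips against this kind of traffic. The data is synthetic (1,000 made-up addresses, two requests each), and the function name and the limit of 100 requests per day are assumptions chosen for illustration, not values from my actual setup.

```python
from collections import Counter


def rate_limit_report(client_ips, per_ip_daily_limit=100):
    """Summarize one day of requests, given one client IP string per request."""
    hits = Counter(client_ips)
    # A classic rate limiter only reacts to addresses that exceed the limit.
    flagged = [ip for ip, count in hits.items() if count > per_ip_daily_limit]
    return {
        "total_requests": sum(hits.values()),
        "unique_ips": len(hits),
        "max_hits_from_one_ip": max(hits.values(), default=0),
        "ips_over_limit": len(flagged),
    }


# Synthetic "low and slow" traffic: 1,000 distinct addresses, 2 requests each.
synthetic = [f"10.0.{i // 250}.{i % 250 + 1}" for i in range(1000)] * 2
report = rate_limit_report(synthetic)
print(report)
```

The total load is real, but no single address comes anywhere near the limit, so nothing gets flagged.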

The Anatomy of a Scraper

As I dug deeper, it became clear this wasn't legitimate traffic.

1. Ignoring the Rules: The very first thing a legitimate crawler (like Google or Bing) does is check robots.txt to see what they are allowed to index. This botnet completely ignores it.

User-agent: *
Disallow: /
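For reference, this is roughly what a well-behaved crawler does with those two lines. It is a small sketch using Python's standard `urllib.robotparser`; the bot name and the example URL are placeholders, not anything taken from my logs.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


ROBOTS_TXT = ["User-agent: *", "Disallow: /"]

# Placeholder crawler name and URL, for illustration only.
allowed = is_allowed(ROBOTS_TXT, "ExampleBot/1.0", "https://example.com/@alice/some-post")
print(allowed)
```

A polite crawler would stop right there. This botnet never asks the question.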

2. The Infrastructure: When I analyzed the top offending IP ranges, they weren't residential internet users. They were clusters of cheap VPS hosting providers and known data centers: places like ColoCrossing, RackNerd, and, weirdly, a massive amount of traffic routed through Azure and DigitalOcean data centers. These are classic launchpads for cheap, disposable compute power.

[
  {
    "ip": "192.210.150.198",
    "network": "192.210.150.0/23",
    "asn": "AS36352",
    "org": "AS-COLOCROSSING"
  },
  {
    "ip": "195.178.110.199",
    "network": "195.178.110.0/24",
    "asn": "AS48090",
    "org": "Techoff Srv Limited"
  }
]
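Once you have records like these, matching a new address against them is straightforward with Python's standard `ipaddress` module. This is a minimal sketch: the two records are the ones shown above, while the function name and the test addresses are my own placeholders.

```python
import ipaddress

# The two sample records from the lookup output above.
OFFENDERS = [
    {"network": "192.210.150.0/23", "asn": "AS36352", "org": "AS-COLOCROSSING"},
    {"network": "195.178.110.0/24", "asn": "AS48090", "org": "Techoff Srv Limited"},
]


def org_for_ip(ip, records=OFFENDERS):
    """Return the org whose network contains ip, or None if nothing matches."""
    addr = ipaddress.ip_address(ip)
    for record in records:
        if addr in ipaddress.ip_network(record["network"]):
            return record["org"]
    return None


# A /23 spans two adjacent /24s, so 192.210.151.x also falls inside it.
print(org_for_ip("192.210.151.7"))
print(org_for_ip("8.8.8.8"))
```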

3. The "Fake Human" Behavior: The most frustrating part is what they are targeting. They aren't just hitting 404s or probing for vulnerabilities or getting 403 access denied. They are hitting valid, real pages. They are scraping actual Hive/Steem posts. Because these pages return a "200 OK" status, it makes filtering them out incredibly difficult without blocking real users.

4. The Spam Signups: Alongside the scraping, I noticed an uptick in bogus account creations. They use data-broker style email addresses, a fake name, and randomly generated usernames that are usually 6-8 letter gibberish. It's clear they are trying to gain write-access to the platform, likely to comment-spam links.

The Billion-Dollar Question: WHY?

This is what keeps me up at night. Why go through the immense trouble of renting or compromising 120,000+ IP addresses just to scrape old blockchain data? Most of the content they are pulling is from the Steem era, so it's many, many years old.

I have a few theories.

Theory 1: The AI Hunger Games (Most Likely)
We are in the golden age of Large Language Models (LLMs). These models require absolutely unfathomable amounts of text data to train. The Hive/Steem blockchain is a goldmine of public, immutable, varied human text. I strongly suspect my site is being used as a straw to suck up training data for some entity's new AI model. They need the text, and they don't care about my server bills.

Theory 2: The Content Farms
SEO spam is still a massive industry. Scrapers pull existing content, spin it using basic AI tools to make it look "unique," and repost it on ad-filled spam sites to game Google rankings. Blockchain content is easy pickings for this.

The fake signups support both theories: they either want to post spam links back to their content farms, or they are testing credential lists to see if they work elsewhere.

Fighting Back

I couldn't just let the server melt. I had to get creative with mitigation.

Since standard rate limiting failed, I moved to behavioral analysis. I shifted the Gunicorn setup to Unix sockets for better performance under load and started pre-filtering aggressive user agents right at the Caddy edge.
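My actual user-agent filtering lives in the Caddy config, but the same idea can be sketched at the application level as a small WSGI middleware in plain Python. To be clear, this is an app-level stand-in for the Caddy rule, not the rule itself, and the blocked substrings are hypothetical examples rather than my real list.

```python
# Hypothetical examples of user-agent fragments to reject; tune to your own logs.
BLOCKED_AGENT_SUBSTRINGS = ("python-requests", "scrapy", "curl")


class UserAgentFilter:
    """WSGI middleware that answers 403 for blocked user agents."""

    def __init__(self, app, blocked=BLOCKED_AGENT_SUBSTRINGS):
        self.app = app
        self.blocked = tuple(fragment.lower() for fragment in blocked)

    def __call__(self, environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(fragment in agent for fragment in self.blocked):
            body = b"Forbidden"
            headers = [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ]
            start_response("403 Forbidden", headers)
            return [body]
        # Everything else passes through to the wrapped application untouched.
        return self.app(environ, start_response)
```

With Flask, wrapping is one line: `app.wsgi_app = UserAgentFilter(app.wsgi_app)`. Doing it at the edge is still cheaper, because rejected requests never reach a Gunicorn worker at all.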

I also realized that trying to block 150,000+ IPs individually is whack-a-mole. Instead, I identified the worst-offending data center subnets and blocked entire /24 CIDR blocks in the firewall.
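Finding those subnets is mostly a counting exercise. Here is a minimal Python sketch that collapses a list of IPv4 addresses into /24 networks and keeps the ones with many distinct offenders. The threshold of three and the sample addresses (taken from the reserved documentation ranges) are assumptions for illustration only.

```python
import ipaddress
from collections import Counter


def busiest_subnets(ips, prefix=24, min_ips=3):
    """Group distinct IPv4 addresses by subnet; return (network, count) pairs.

    Only subnets containing at least min_ips distinct addresses are returned,
    sorted from most to least populated.
    """
    counts = Counter()
    for ip in set(ips):
        # strict=False masks off the host bits, giving the enclosing network.
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(network)] += 1
    return [(net, n) for net, n in counts.most_common() if n >= min_ips]


sample = ["203.0.113.5", "203.0.113.77", "203.0.113.200", "198.51.100.9", "203.0.113.5"]
print(busiest_subnets(sample))
```

Each network in the result can then go into a single firewall rule instead of hundreds of per-address ones.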

But my favorite defense is the honeypot. I implemented hidden links in the HTML that humans can't see, but bots blindly follow. As soon as an IP hits that trap URL, Fail2Ban instantly slaps a one-week ban on it. It's incredibly satisfying to watch them ban themselves, although it's very slow going when each IP only shows up once or twice a day.
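Fail2Ban does the real work with its own filter and jail configuration, which I'm not reproducing here. The detection half of the idea can be sketched in Python, though, assuming Caddy is writing JSON access logs in which each line carries `request.uri` and `request.remote_ip`. The trap path below is a made-up placeholder, not my real one.

```python
import json

# Hypothetical trap URL; the real one should be disallowed in robots.txt
# and linked only from markup that human visitors never see.
TRAP_PATH = "/trap-do-not-follow"


def trapped_ips(log_lines, trap_path=TRAP_PATH):
    """Return the set of client IPs that requested the honeypot path."""
    ips = set()
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        request = entry.get("request") or {}
        # Compare only the path, ignoring any query string.
        path = str(request.get("uri", "")).split("?", 1)[0]
        ip = request.get("remote_ip")
        if path == trap_path and ip:
            ips.add(ip)
    return ips
```

Feeding it the access log yields the addresses that walked into the trap, which is the same set a Fail2Ban filter matching that URL would ban.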

The Endgame

I'm getting a handle on the traffic now, but the "why" still nags at me.

Does anyone else out there host a site that resolves @username/permlink style posts for Hive or Steem? Are you seeing similar patterns? Is this a targeted attack against my specific DApp, or is every blockchain explorer getting hammered right now by this same hungry botnet?

Let me know in the comments if you're in the same boat. It's a wild time to be hosting public data on the open web.

As always,
Michael Garcia a.k.a. TheCrazyGM

I reckon your AI training theory is the most likely.

I'm seeing a huge amount of AI-generated traffic hitting my own website, although because it's an e-commerce site it's mostly competitors' price scraping together with "carding" attempts (low-value transactions used to verify whether stolen credit cards can be used before the data gets sold on), with the AI being used to try to make it look like human traffic. I'm guessing there are also AI bots taking product data to help train other AIs in how to write descriptions for the kind of things I sell.

Is there a pattern to the posts being looked at? It sounds like they are working chronologically, but if they are hitting (for example) the creative writing communities, or specific Hivers who post about particular topics (e.g. crypto news, current affairs or whatever), it might be that they are training AI to write in those specific genres.

alonicus (70) 2 days ago

$0.08

5 votes

thecrazygm (72) 2 days ago

Oh, that's an interesting theory. I'll get back to you; gonna go see if I see a pattern in the data type.

$0.09

Looks chronological. I see ALL kinds of topics.

6 votes

Maybe they are making a copy of the whole blockchain for some reason. Although it would probably be easier to just join and set up a witness node, lol.

$0.41

7 votes

steevc (79) 2 days ago

That's interesting and weird. I like the honeypot trap idea. I have contemplated putting my old personal blogs back on a site I rent, but if it's going to get trawled like that then I'm not sure I want the hassle.

I have concerns that people are creating lots of bot accounts on Hive and using up account names. Requiring an email address may not be much defence against such people.

$0.05

2 votes

hivebuzz (74) 2 days ago

Congratulations @thecrazygm! You have completed the following achievement on the Hive blockchain and have been rewarded with new badge(s):

You published more than 550 posts.
Your next target is to reach 600 posts.

You can view your badges on your board and compare yourself to others in the Ranking.
If you no longer want to receive notifications, reply to this comment with the word STOP.

$0.03

1 vote

tydynrain (72) yesterday

In one form or another, I've been seeing a huge uptick in all sorts of the same sort of coordinated, distributed attacks on various parts of the Hive infrastructure. The AI-training possibility is one I hadn't heard before, but it certainly makes sense given Hive's text-based nature. I'm curious what you'll find with further investigations. 😁🙏💚✨🤙

sopel (38) yesterday

They could just start their own node, maybe even a pruned node with a HAF database (filtering for comment operation types). That way they would get the most comfortable access to all comments on the blockchain, as it is a simple Postgres query, and it needs far fewer resources than owning a botnet. I wish I could share this info with them.

invest4free (64) 2 days ago

That’s puzzling indeed.

Not fun to have to defend against that.
