
A short history of anti-scraping techniques and their use by website owners

Anti-bot systems have evolved from simple IP blocking to sophisticated fingerprinting. Here's how we got here, how we handle it, and where it's going.

Julien Demoor

The web is full of data that organizations legitimately need. Price intelligence, market monitoring, competitive analysis. The people who control that data increasingly don’t want you to have it.

Anti-bot systems like DataDome look at dozens of signals to decide whether a visit should be allowed. As backend architectures have grown more complex and hosting costs have risen, site owners have acquired a direct monetary incentive to block bots. Since the release of ChatGPT in late 2022, bot traffic has exploded, and with it the pressure on site owners to fight back. The result: anti-bot technologies are now widespread, and they get more aggressive every year.

The IP blocking era

In the 2010s, as organizations came to rely on web scraping to monitor, access and interact with data, bots were detected primarily by their IP address. The simplest signal was volume: bots generate far more traffic than human users, so a flood of requests from a single IP address is easy to spot. Consider the use case of watching prices every day on a 20,000-SKU website: that is many times the volume of requests that even a very motivated human user would make. Requests with suspicious or missing user-agent strings were also trivially blocked.
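To make the volume signal concrete, here is a minimal sketch of the sort of per-IP rate check an early detection layer might run. The window and threshold are illustrative numbers, not taken from any real product.

```python
import time
from collections import defaultdict, deque

# Illustrative values: a human session rarely needs hundreds of page loads
# per hour; a 20,000-SKU price crawl from a single IP does.
WINDOW_SECONDS = 3600
MAX_REQUESTS_PER_WINDOW = 300

_hits = defaultdict(deque)

def is_suspicious(ip: str) -> bool:
    """Record a hit and return True once this IP exceeds the budget for the window."""
    now = time.time()
    hits = _hits[ip]
    hits.append(now)
    while hits and hits[0] < now - WINDOW_SECONDS:
        hits.popleft()  # drop timestamps that fell out of the sliding window
    return len(hits) > MAX_REQUESTS_PER_WINDOW
```

Countermeasures followed the same logic in reverse: spread the same volume across enough IP addresses and the per-IP signal disappears.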

The question being asked of anti-bot systems is: does this traffic come from a human or from a bot? Since legitimate traffic is valuable, a level of tolerance should apply: in case of doubt, assume a human. Sites used to make that tolerance quite wide, on the assumption that blocking even 1% of legitimate visitors would mean an unacceptable loss of revenue. The trend has since shifted towards more aggressive blocking. A large French DIY retailer, for example, is known for blocking significant numbers of legitimate, potentially paying users, reportedly in its quest to reduce the cost of serving content to bots.

The most widely used countermeasure against detection has been pools of IP addresses made accessible through web proxies. Various qualities of IP addresses are available. Data center IPs are inexpensive and ideal for accessing lightly protected websites. Residential IPs are provided by ISPs or by individual users in exchange for small incentives, and sometimes by illegitimate botnets running on compromised computers, home routers, or other IoT devices (like the IPIDEA botnet recently dismantled by Google). Because they look like real users, they are almost universally accepted by anti-bot systems. The more expensive mobile IPs are often seen as the Rolls-Royce of scraping IP addresses.

The fingerprinting era

In the last few years, anti-bot techniques have both evolved and become more widely deployed. As residential IP addresses became cheap and widely available, falling from around $15 to $3 per GB over roughly 10 years, IP-based blocking lost much of its effectiveness. Detection systems were forced to look elsewhere, turning to the web browser itself through an array of techniques collectively known as fingerprinting.

Consider a scraper that has solved the IP problem. It rotates through residential proxies, varies its request timing, and passes basic checks. Then it hits a site running fingerprint detection. The site quietly queries the browser: what GPU is rendering this page? What fonts are installed? What does a canvas element look like when drawn? What does the AudioContext API output? The scraper is running headless Chrome on a data center VM. It has no real GPU, a handful of default fonts, and a viewport that has never been resized by a human hand. Every answer it gives is wrong, or rather, too perfect. Within milliseconds, the request is flagged.
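To see what those probes look like from the scraper's side, the sketch below (assuming Playwright as the automation tool; any browser automation framework would do) asks a headless browser the same questions a detection script would ask. On a GPU-less data-center VM the answers tend to give the game away.

```python
from playwright.sync_api import sync_playwright

# The same kind of JavaScript probe a detection script would inject.
PROBE_JS = """
() => {
  const canvas = document.createElement('canvas');
  const gl = canvas.getContext('webgl');
  const ext = gl && gl.getExtension('WEBGL_debug_renderer_info');
  return {
    webdriver: navigator.webdriver,   // true under naive automation
    renderer: ext ? gl.getParameter(ext.UNMASKED_RENDERER_WEBGL) : null,
    languages: navigator.languages,
    screen: [screen.width, screen.height],
  };
}
"""

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("about:blank")
    # On a bare VM the renderer often reports a software rasteriser
    # (e.g. SwiftShader) rather than real consumer GPU hardware.
    print(page.evaluate(PROBE_JS))
    browser.close()
```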

This is what makes fingerprinting so effective. While scrapers can buy residential proxies and captcha-solving services off the shelf, fingerprinting forces them to solve a harder problem: making their browsers look real. Operators prefer uniform, headless browsers at scale, but fingerprinting demands variety. Each instance needs a plausible combination of hardware signals, and that is very difficult to fake when the underlying server has a completely different GPU, driver stack, and OS configuration than a real laptop.

Scraping is getting harder

The cumulative effect of these defences is measurable. Across the industry, scraping success rates are declining. More missions require manual intervention than they did two or three years ago. We hear this from customers who run some of their own scraping and from other actors in the space. Targets that used to be accessible with basic tooling now require careful engineering. Targets that required careful engineering now sometimes require dedicated infrastructure.

This is not a temporary dip. The defences described above are compounding. Each new layer (IP analysis, fingerprinting, behavioural detection) raises the baseline cost of access. And the next generation of techniques will raise it further.

The next 5 years in anti-bot techniques

The environment is evolving quickly. Here are some predictions from Stratalis.

Private Access Tokens will become a significant hurdle. Apple introduced Private Access Tokens (PATs) in 2022. These tokens let a device prove it is real, owned by a real person, and running legitimate software, all without revealing identity. The server never sees who you are. It just knows you’re not a bot. For scrapers, this is a problem. You can’t fake a PAT without access to the secure enclave of a real device. As adoption spreads beyond Safari and Apple’s ecosystem to other browsers and platforms, PATs could become a standard gate that makes synthetic traffic much harder to produce.
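At the wire level, PATs build on the Privacy Pass authentication scheme. The sketch below is a heavily simplified, conceptual server-side gate: it challenges clients that present no token and stubs out the cryptographic verification, which in reality happens against the token issuer's public key. Values and helpers here are placeholders for illustration only.

```python
import base64
import os

TOKEN_KEY = base64.b64encode(os.urandom(32)).decode()  # placeholder, not a real issuer key

def verify_token(auth_header: str) -> bool:
    # Stub: a real server verifies the token cryptographically against the issuer's key.
    return False

def handle_request(headers: dict) -> tuple:
    """Conceptual PAT-gated endpoint: returns (status code, response headers)."""
    auth = headers.get("Authorization", "")
    if auth.startswith("PrivateToken token=") and verify_token(auth):
        return 200, {}
    # No valid token: challenge the client. Only devices that can obtain
    # attestation from their platform (e.g. via a secure enclave) can
    # come back with a redeemable token.
    challenge = base64.b64encode(os.urandom(32)).decode()
    return 401, {
        "WWW-Authenticate": f'PrivateToken challenge="{challenge}", token-key="{TOKEN_KEY}"'
    }
```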

Several IP address providers will fail to respond to new proxy detection techniques and disappear as a result. What’s being detected is not the IP addresses themselves, but the proxies that serve them. We expect that at least one major anti-bot operator will deploy new detection measures all at once, blocking a large share of proxy traffic overnight. Providers that didn’t see it coming will find their networks suddenly unusable, and their business models with them.

Large pools of real devices will become necessary to access most sites at scale. We believe the future will require operators to build large pools of heterogeneous devices in order to remain stealthy. When fingerprinting, PATs, and proxy detection all converge, the only reliable way to look human will be to actually be running on real consumer hardware with genuine OS installations, real browsers, and authentic device attestation. This shifts the discipline from a purely software problem to a hardware logistics problem.

How we manage anti-bots at Stratalis

Anti-bot systems are largely ineffective at stopping a moderately determined bot author from getting data out of a site. But they are very good at making that access uneconomical at scale.

Our approach is one of surgical precision and depends on our customers’ requirements. The goal is to access data at a cost that makes the desired operation economically viable. You want to watch your competitors, but not at a budget that dwarfs any potential benefit.

The first layer of handling anti-bots is a robust infrastructure that does everything right: session management, cookie handling, TLS fingerprint consistency, proper header ordering, and realistic request timing. Getting these basics wrong is the fastest way to get flagged. Getting them right means most requests never trigger deeper inspection.
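A minimal sketch of that first layer, assuming the curl_cffi library for browser-consistent TLS fingerprints (any client able to present a coherent TLS and header profile would do): one persistent session per identity, headers matching the impersonated browser, and jittered pacing between requests.

```python
import random
import time

from curl_cffi import requests  # assumed dependency: impersonates browser TLS fingerprints

def fetch_all(urls: list) -> list:
    """Fetch a list of pages with one consistent browser identity."""
    # One session = one identity: cookies, connection reuse, and a single
    # TLS fingerprint for the whole visit, rather than a new one per request.
    session = requests.Session()
    pages = []
    for url in urls:
        resp = session.get(
            url,
            impersonate="chrome",  # TLS and HTTP/2 profile of a real Chrome build
            headers={"Accept-Language": "en-US,en;q=0.9"},
            timeout=30,
        )
        resp.raise_for_status()
        pages.append(resp.text)
        # Humans do not fire requests at machine-gun cadence.
        time.sleep(random.uniform(2.0, 6.0))
    return pages
```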

The second layer is treating requests to websites as a scarce resource to be optimized. Can we get the same data with fewer hits? Does our customer need to refresh all data at the same frequency, or is there an order of importance that allows us to deliver 95% of the value with one fifth of the traffic?
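As a toy illustration of that budgeting, the sketch below splits a catalogue into refresh tiers. The tier names and cadences are invented for the example, but the effect is the one described: most of the value for a fraction of the traffic.

```python
from datetime import datetime, timedelta

# Hypothetical tiers: refresh what moves the needle often, the long tail rarely.
REFRESH_INTERVAL = {
    "top_sellers": timedelta(days=1),
    "mid_catalogue": timedelta(days=3),
    "long_tail": timedelta(days=14),
}

def due_for_refresh(skus: list, now: datetime) -> list:
    """Return only the SKUs whose tier says they are stale enough to re-fetch."""
    return [
        sku for sku in skus
        if now - sku["last_fetched"] >= REFRESH_INTERVAL[sku["tier"]]
    ]

# Example: on a 20,000-SKU site where 2,000 SKUs drive most decisions,
# the daily crawl shrinks to a small fraction of a full refresh.
```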

The third layer becomes target-specific. An expert web scraper looks at the defences that are active, tests them, and looks for the cheapest array of techniques that are effective: proxy pool selection, headless vs headful browser, pool of host computers, and more. They also look for how to combine more expensive options with less expensive ones in the same mission.
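One way to picture that third layer is as an escalation ladder: each target starts on the cheapest configuration that might work and moves up a rung only when it gets blocked. The rungs and cost multipliers below are illustrative, not a fixed menu.

```python
from dataclasses import dataclass

@dataclass
class AccessProfile:
    proxy_pool: str       # "datacenter" | "residential" | "mobile"
    browser: str          # "none" (plain HTTP), "headless", "headful"
    relative_cost: float  # rough cost multiplier per request

# Cheapest first. Most targets never need the bottom rungs.
LADDER = [
    AccessProfile("datacenter", "none", 1.0),
    AccessProfile("datacenter", "headless", 3.0),
    AccessProfile("residential", "headless", 10.0),
    AccessProfile("residential", "headful", 25.0),
    AccessProfile("mobile", "headful", 60.0),
]

def next_profile(current):
    """Escalate one rung after a block; None means the ladder is exhausted."""
    if current is None:
        return LADDER[0]
    idx = LADDER.index(current)
    return LADDER[idx + 1] if idx + 1 < len(LADDER) else None
```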

At the same time, we take our R&D mission seriously and are constantly monitoring the anti-bot landscape to build more sophisticated responses.

Where this leaves you

If you run your own scraping, the maintenance cost is rising and will keep rising. Pipelines that ran reliably last year need more attention this year. The engineering effort required to stay ahead of detection is growing faster than most teams can absorb alongside their core work.

If you outsource scraping, the question is no longer whether your provider can get the data today. It’s whether they’ll still be able to next year, when PATs are widespread, proxy networks are thinning out, and fingerprinting has another generation of signals to check. The providers who treat scraping as a commodity will be the first to lose access. The ones who treat it as a discipline will adapt.

That’s the bet we’ve made at Stratalis, and it’s one we’re prepared to keep making.
