How to Identify and Stop Scrapers | F5 Labs

by nlqip
September 5, 2024

IP Infrastructure Analysis, Use of Hosting Infra or Corporate IP Ranges (Geo Location Matching)

Scrapers have to distribute their traffic via proxy networks or bot nets so as to spread their traffic over a large number of IP addresses and avoid IP-based rate limits that are used to block unwanted scraping. Because of this, scrapers often do things that make it easier to identify them. These tactics include:

Round Robin IP or UA Usage

A scraper may distribute traffic equally among a given list of IP addresses. As it is highly unusual for any IP addresses to have the same volume of traffic, this serves as a sign of automated activity. There can be exceptions to this rule, such as NATed Mobile IPs that use load balancing to distribute traffic across available IPs.

Use of Hosting IPs

Getting a large number of IPs can be costly. Cloud hosting companies like Amazon Web Services and Microsoft Azure are the easiest and cheapest source of large numbers of IP addresses. However, we do not expect legitimate human users to access Web, Mobile and API applications using hosting IPs. A lot of scrapers use hosting IP infrastructure to distribute their traffic and use their compute and storage facilities to run the scraper and store the large amounts of resulting data. Looking for hosting IPs in traffic logs can help to identify unwanted unidentified scrapers.

Use of Low Reputation IPs

IP intelligence can be used to identify low reputation IP addresses that have been used in automated attacks. There are several third parties that provide IP intelligence services, and enterprises can also build their own database. Scrapers tend to rent proxy networks and botnets to distribute their traffic. These networks typically have been seen elsewhere on the internet engaged in bad activity. If you see searches on your Web, Mobile and API applications from these low reputation IP addresses, that might be indicative of unwanted scraping especially if the volume of transactions from those IPs is high.

Use of International IPs That Do Not Match Expected User Locations

The botnets and proxy networks that scrapers use tend to include IP addresses from all over the world. Some do exist that allow users to specify what geographic location they want IPs from, but these tend to cost more. Scrapers therefore tend to use IP addresses from a large number of geographic locations, including locations that do not make sense for the target business. For example, a US based retailer that has stores only in the US and only delivers in the US will not expect to receive large numbers of product searches from international, non-US based IP addresses. A handful of requests may be from US based customers that have travelled or who are using VPN services, but thousands of requests from unlikely geographic regions is indicative of automated scrapers.

Conversion or Look–to-Book Analysis

Typically, scrapers make lots of requests but never actually purchase anything. This negatively affects conversion rate business metrics. This also gives a way to identify these scrapers, by slicing and dicing the traffic similarly to traffic pattern analysis, but with emphasis on conversion rates within traffic segments. Entities that have high volume and zero conversions are typically scrapers that are only pulling data with no intention of converting. Examples of this are insurance quote comparison bots that request large numbers of quotes but never purchase. The same is true for competitor scrapers. They will conduct large numbers of searches and product page views but will not make any purchases.

Not Downloading or Fetching Images and Dependencies but Just Data

Users on Web will typically access a web page via their web browser which downloads all the web content and renders it. The content downloaded includes fonts, images, videos, text, icons etc. Efficient scrapers only request the data they are after and do not request any of the other content that is needed to render the page in a browser. This exclusion of content from their requests makes it possible to identify scrapers by looking for entities in the logs that are only requesting valuable pieces of data but not loading any of the associated content that is needed to successfully render the web page.

Behavior/Session Analysis

Website owners can analyze the behavior of incoming requests to identify patterns associated with web/API scraping, such as sequential or repetitive requests, unusual session durations, high page views, and fast clicks. A scraper might also bypass earlier portions of a normal user flow to go directly to a request, so if session analysis is seeing higher traffic volume at only certain points in a normal flow, that could also be a sign of scrapers. If your site has a search interface, checking those logs can also reveal scraper activities.

Unwanted scraper activity can also show up as higher than expected bounce rates. A bounce is defined as a session that triggers only a single request to the Analytics server, for example, if a user opens a single page or makes a single request and then quits without triggering any other requests during that session. Scraper activity tends to exhibit this behavior across large numbers of requests and sessions. Bounce rate is a calculation of the percentage of all transactions or requests that are categorized as bounces. Analyzing endpoints with above average bounce rates might reveal endpoints targeted by scrapers. Looking at clusters of traffic with high bounce rates can help identify unwanted scrapers.

Analysis of device behavior can be conducted using cookies and device/TLS fingerprints to identify sources of anomalous or high-volume requests that are indicative of scraper activity.

Account Analysis

Some websites require users to be authenticated in order to scrape the required data. For these sites, analysis of account names and email addresses used will help to identify scrapers and even sometimes identify the entity behind the scraping activity. Some entities like penetration testing companies and IP enforcement companies, tend to use their official corporate email domains to create the user accounts that are used for scraping. We have successfully used this method to attribute large numbers of unidentified scrapers.

You can also identify scrapers through the use of fake accounts. F5 Labs previously published an article describing 8 ways to identify fake accounts. The use of fake accounts is typical for scrapers that need to be authenticated. They cannot use a single account to conduct their scraping due to per account limits on requests, hence they create a large number of fake accounts and distribute the requests among a large number of accounts. Identifying fake accounts that primarily make data requests will unearth unidentified scraping activity.

How to Manage Scrapers

Management can include a spectrum of actions ranging from doing nothing, to rate limiting or access limiting, all the way to outright blocking. There are various methods that can be used to achieve scraper management with varying levels of efficacy, as well as different pros and cons. In this section we will cover some of the top methods used to manage scrapers and compare and contrast them so you can find the method best suited for your particular scraper use case.

Robots.txt

This is one of the oldest scraper management methods on the internet. This involves a file on your site that is named “robots.txt”. This file contains rules that site administrators would like bots (including scrapers) to adhere to. These rules include which bots or scrapers are allowed and those that are not. They include limits to what those scrapers are allowed to access and parts of the website that should be out of bounds. This robots.txt file and its rules have no enforcement power and rely on the etiquette of the scrapers to obey. Frequently scrapers simply ignore these rules and scrape the data they want either way. It is important therefore to have other scraper management techniques that can enforce the compliance of scrapers.

Site, App and API Design to Limit Data Provided to Bare Minimum

One effective approach to manage scrapers is to remove access to the data that they are after. This is not always possible in all cases as the provision of that data may be central to the enterprise’s business. However, there are cases where this is a plausible solution. By designing the website, mobile app and API to limit or remove the data that is exposed, this can effectively reduce unwanted scraping activity.

CAPTCHA/reCAPTCHA

CAPTCHAs (including reCAPTCHA and other tests) are another method used to manage and mitigate scrapers. Suspected scraper bots are presented with a challenge designed to prove whether they are real humans or not. Failing this CAPTCHA challenge will result in mitigation or other outcome designed for scrapers and bots. Passing the test would generally mean that the user is likely human and would be granted access the required data. CAPTCHA challenges pose a significant amount of friction to legitimate human users. Businesses have reported significant decreases in conversion rates when CAPTCHA is applied. As a result, many are weary of using it. Over the years with improvements in optical character recognition, computer vision and most recently multi-modal artificial intelligence (AI), scrapers and other bots have gotten even better than humans at solving these CAPTCHA challenges. This renders this method ineffective against more sophisticated and motivated scrapers while successfully addressing the less sophisticated ones.

Human click farms are also used to successfully solve CAPTCHA challenges for scrapers. These click farms are typically located in low-cost regions of the world and can solve CATPCHAs for cheap. These click farms have APIs that can be called programmatically with scrapers supplying the unique ID of the CAPTCHA they need to solve. The API will then respond with the token generated when the CAPTCHA has been successfully solved. This token will allow the scraper to bypass the CAPTCHA challenge. Figure 11 below shows an example of these click farms charging as little as $1 for 1000 solved CAPTCHAs.

Source link
lol

IP Infrastructure Analysis, Use of Hosting Infra or Corporate IP Ranges (Geo Location Matching) Scrapers have to distribute their traffic via proxy networks or bot nets so as to spread their traffic over a large number of IP addresses and avoid IP-based rate limits that are used to block unwanted scraping. Because of this, scrapers…