Web scraping (also known as content scraping, web harvesting, web data mining or web data extraction) is the process of extracting specific data from a website and copying it.

I will give you tips on how to stop bots and scrapers from copying and stealing your website or blog content. I have been programming bots and scrapers for a long time, so the tips I give here come from my own experience. I have written a lot of proxy-list scrapers, and those sites generally have very good security; the most challenging bot I ever built was my Facebook bot. I will share the best protection techniques I have come up against. As usual, the goal is to stop about 95% of scrapers and bots from stealing your content; dedicated and very good programmers will always be able to create scrapers that get through.

In this post I won't write about the robots.txt file or any other technique that gives directions to well-behaved spiders and crawlers such as Googlebot, Bingbot or other legitimate search engines. Bad bots don't follow the protocols and rules: they ignore robots.txt and meta tags, visit any link they want, and copy anything they want.

The main difficulty in fighting scrapers and protecting your content is keeping the user experience intact. If you add too many captchas, security questions and similar hurdles, even legitimate users will have a hard time navigating and will eventually get bored with your website. Before fighting them, let me explain the main types of scrapers.

Scraper Types 

Spiders

Spiders are bots that recursively follow all the links on your website. Examples are the Google spider and the spiders of other search engines. Sometimes they target only specific pages and data.

Scripts and shells

Unix command-line tools such as wget can be used for scraping, and grep with regular expressions can pull data out of the downloaded pages. The same goes for PHP scripts. I also like to put HTML scrapers in this category, such as the ones based on Scrapy or Jsoup. They work by extracting data from the website based on the HTML structure of the page.

Screenscrapers

Screen scrapers open your website in a browser and then use JavaScript and Ajax to extract the required information. They are created with frameworks such as Selenium or PhantomJS. This kind of scraper is very difficult to fight because it behaves like a human: it executes all the scripts and downloads all the assets of a website, such as images and fonts. Some very advanced screen scrapers can take screenshots of your website and then use pattern recognition to extract the data. They are very rare and are used only for very valuable information. Still, I will show you how you can fight them off.

Webscraping Services

There are a lot of web-scraping services you can pay for online. They employ people whose job is to program scrapers for specific websites. Still, they use the same techniques as the other scrapers, so you can stop them most of the time.

Manual scrapers

There are a lot of people who don't value their time and copy-paste all day; the art of copy-pasting is spreading rapidly. In this category I also put people who load your website in frames. They can even inject scripts to change the design of your website and make it look different.

Ok, we now know about the different scraper types, but how can we detect and stop them?


Detecting And Preventing Scrapers

Robots.txt honeypot

Bad bots, scrapers and crawlers don't like to follow the rules. They ignore robots.txt, meta tags and any other directives. We can use this to our advantage by creating a honeypot. Add to your robots.txt file a link that scrapers shouldn't follow; in this example the link is called "nicecontent.php". The robots.txt file looks like this:

User-agent: *
Disallow: /nicecontent.php

The bad guys will follow the link. In the PHP file (or ASPX, or whatever server-side technology you use) you log the IP of the request. You build a table or text file with all the IPs that followed this link; all of those IPs belong to bad scrapers. The next step is obvious: block requests coming from these IPs.
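
A minimal sketch of what nicecontent.php could look like, assuming a plain PHP setup; the blocked_ips.txt file name is just a placeholder for whatever store you prefer:

<?php
// nicecontent.php - honeypot page disallowed in robots.txt.
// Any client requesting it has ignored robots.txt, so record its IP.
$ip = $_SERVER['REMOTE_ADDR'] ?? '';
file_put_contents(
    __DIR__ . '/blocked_ips.txt',
    $ip . ' ' . date('c') . PHP_EOL,
    FILE_APPEND | LOCK_EX
);

// Serve something harmless so the scraper doesn't get suspicious.
echo '<html><body><p>Nothing interesting here.</p></body></html>';

Your other pages (or the web server configuration) can then read that list and refuse requests from the recorded addresses.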

Monitor user activity

You should always keep an eye on your logs. If you see unusual activity, such as many requests in a short time, you have spotted a scraper. A good idea is to limit the number of pages a user can access within a time window; if the limit is exceeded, you can show a captcha. Keep in mind that you need to choose a sensible limit, otherwise even good users will get annoyed if the captcha shows up too often.

On the other hand, when the requests come in very fast over a short period of time, log the IP and block it. This is a very effective way to stop scrapers, because static IPs are expensive and scrapers aren't willing to buy many IPs just to copy your content.
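
A minimal sketch of such a page limit, assuming PHP 7.4+ with sessions enabled; the window, the threshold and the captcha.php page are example values and placeholders, not recommendations:

<?php
// rate_limit.php - include this at the top of every page.
session_start();

$window  = 300;  // seconds
$maxHits = 30;   // pages allowed per window (example value, tune for your traffic)

$now  = time();
$hits = $_SESSION['hits'] ?? [];

// Keep only the requests that happened inside the current window.
$hits   = array_filter($hits, fn($t) => $now - $t < $window);
$hits[] = $now;
$_SESSION['hits'] = $hits;

if (count($hits) > $maxHits) {
    // Too many pages too fast: show a captcha instead of the content,
    // or log $_SERVER['REMOTE_ADDR'] and block it outright.
    http_response_code(429);
    require __DIR__ . '/captcha.php';   // placeholder captcha page
    exit;
}

A session-based counter is easy to reset by dropping cookies, so for aggressive scrapers you would key the same counter on the IP address instead.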

Other things to monitor are the speed at which a user fills in forms, the browser resolution and the language. You can collect this information with JavaScript. This method catches a lot of scrapers because even if a scraper rotates proxies, it may still browse with the same resolution, user agent (browser type) and installed fonts, so you can detect it even behind a proxy.
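
One way to use those signals on the server is to hash them into a fingerprint and watch how many IPs share it. This is only a sketch: fingerprint.php and fingerprints.log are placeholder names, and it assumes a small client-side script posts the resolution and language (collected with screen.width, screen.height and navigator.language) to this endpoint:

<?php
// fingerprint.php - receives {"resolution": "...", "language": "..."} posted by the page's JS.
$data = json_decode(file_get_contents('php://input'), true) ?: [];

// Combine browser-level signals; the same scraper behind many proxies
// tends to keep the same fingerprint.
$fingerprint = sha1(implode('|', [
    $_SERVER['HTTP_USER_AGENT'] ?? '',
    $data['resolution'] ?? '',
    $data['language']   ?? '',
]));

// Append fingerprint + IP; a fingerprint that shows up with many different
// IPs in a short time is a scraper rotating proxies.
file_put_contents(
    __DIR__ . '/fingerprints.log',
    $fingerprint . ' ' . ($_SERVER['REMOTE_ADDR'] ?? '') . PHP_EOL,
    FILE_APPEND | LOCK_EX
);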

Check the request headers

If a request comes with an empty User-Agent header, log the IP and block it from accessing your website! There are also many online databases that list the User-Agent strings used by known scrapers; download them and block any request coming from a User-Agent associated with bad scrapers. All browsers, desktop or mobile, send a valid User-Agent header, and so do all legitimate search-engine spiders.

Check the Referer header, always! When I started coding scrapers I always sent requests with an empty Referer header, and my scrapers kept getting blocked. If the Referer is empty, or always the same (e.g. "www.google.com" or any other single site), you can bet it is a scraper. Real users show variety in the Referer header.
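
A minimal sketch of both header checks, assuming you have downloaded such a list into a bad_user_agents.txt file (the file name and the is_bad_user_agent() helper are just placeholders):

<?php
// header_check.php - include before rendering any page.
$ua      = trim($_SERVER['HTTP_USER_AGENT'] ?? '');
$referer = trim($_SERVER['HTTP_REFERER'] ?? '');
$ip      = $_SERVER['REMOTE_ADDR'] ?? '';

// Placeholder helper: match the UA against a downloaded blocklist, one entry per line.
function is_bad_user_agent(string $ua): bool {
    $blocklist = file(__DIR__ . '/bad_user_agents.txt', FILE_IGNORE_NEW_LINES) ?: [];
    foreach ($blocklist as $bad) {
        if ($bad !== '' && stripos($ua, $bad) !== false) {
            return true;
        }
    }
    return false;
}

if ($ua === '' || is_bad_user_agent($ua)) {
    // Empty or known-bad User-Agent: log the IP and refuse the request.
    file_put_contents(__DIR__ . '/blocked_ips.txt', $ip . PHP_EOL, FILE_APPEND | LOCK_EX);
    http_response_code(403);
    exit;
}

// The Referer check is softer: an always-empty Referer across many page views
// is suspicious, so record it and feed it into the activity monitoring above.
if ($referer === '') {
    file_put_contents(__DIR__ . '/empty_referer.log', $ip . PHP_EOL, FILE_APPEND | LOCK_EX);
}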

Randomize your HTML

This technique will stop most scrapers. When I scraped proxies from different proxy lists, they always served different HTML. This is very annoying for a scraper: if it copies everything below <div id="content">, and you change the id to id="random23847_content" (and to something else the next time), the scraper breaks. You can apply the same technique to class names; that option is more difficult because you have to keep the class names in sync with the CSS.
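
A minimal sketch of the id randomization in a PHP template; $article is a placeholder for your own data, and the prefix changes on every request so any selector hard-coded by a scraper stops matching:

<?php
// template.php - give structural ids a per-request random prefix.
$rand = 'r' . random_int(10000, 99999);
?>
<div id="<?= $rand ?>_content">
    <h1 id="<?= $rand ?>_title"><?= htmlspecialchars($article['title']) ?></h1>
    <div id="<?= $rand ?>_body"><?= $article['body'] ?></div>
</div>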

When I coded the bot for Facebook, they randomized everything! They responded to each request with random HTML ids, class names and even CSS. I ended up using an OCR technique to finish the bot, because I had to finish it fast.

Obfuscate the data

When you make Ajax requests, you can encrypt (or at least obfuscate) the response sent by the server and then decrypt it with JavaScript on the client. This will stop almost all scrapers from hitting your API endpoints directly to steal data. Only a dedicated scraper whose author reverse-engineers the obfuscation algorithm will be able to copy your content, and if you also change the algorithm frequently, the scrapers may get bored and let go of your data.
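
A minimal sketch of the server side, using AES via PHP's OpenSSL functions; the key and endpoint name are placeholders, and since the key ultimately lives in your client-side JavaScript this is obfuscation against casual scrapers, not real cryptographic protection:

<?php
// api/content.php - return the Ajax payload encrypted instead of as plain JSON.
$key  = 'example-key-change-me';   // placeholder shared secret, embedded in your JS
$data = json_encode(['title' => 'My article', 'body' => '...']);

$iv     = random_bytes(16);
$cipher = openssl_encrypt($data, 'aes-256-cbc', $key, OPENSSL_RAW_DATA, $iv);

header('Content-Type: application/json');
echo json_encode([
    'iv'      => base64_encode($iv),
    'payload' => base64_encode($cipher),
]);

The page's JavaScript then base64-decodes and decrypts the payload (for example with the Web Crypto API or a small crypto library) before injecting it into the DOM.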

Scraper poisoning

If you detect a scraper, instead of blocking it you can poison it. The idea is that if you simply block a scraper, its author may come back at you with another IP or another technique. Poisoning means feeding the scraper fake and useless data: it will keep working happily while stealing garbage.
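
A minimal sketch, reusing the blocked_ips.txt list built by the honeypot above; load_real_prices() is a placeholder for your real data source:

<?php
// poison.php - include where you render the valuable data.
$ip      = $_SERVER['REMOTE_ADDR'] ?? '';
$blocked = file(__DIR__ . '/blocked_ips.txt', FILE_IGNORE_NEW_LINES) ?: [];

$flagged = false;
foreach ($blocked as $line) {
    if ($ip !== '' && strpos($line, $ip) === 0) { $flagged = true; break; }
}

if ($flagged) {
    // Known scraper: keep the page working, but swap in useless numbers.
    $prices = array_map(fn() => rand(1, 999) . '.99', range(1, 20));
} else {
    $prices = load_real_prices();   // placeholder for your real data source
}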

Premium Content

Depending on your website's niche, content and type, a very good protection may be to show the content only to registered and logged-in users. This is the ultimate protection, because you have just one entry point to defend: the login page. You can use a captcha there and monitor all the activity. You should also use a captcha on the registration page, so fake users won't be able to register.
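
The gate itself can be as small as a session check at the top of every protected page; this sketch assumes your login page sets $_SESSION['user_id'], and login.php is a placeholder path:

<?php
// require_login.php - include at the top of every premium page.
session_start();
if (empty($_SESSION['user_id'])) {
    // Not logged in: send the visitor to the single, captcha-protected entry point.
    header('Location: /login.php');
    exit;
}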

The drawback of this method is that good spiders can't access the premium content either, so it won't be indexed by search engines.

Ajax and image-based websites

You can load the content of your page with Ajax. That way most scrapers won't be able to read and copy your content; only the screen-scraper type, which loads the website in a real browser and executes JavaScript, will still work. The drawback of this method is that Ajax-loaded content isn't indexed.

Image-based content doesn't get indexed either. Text rendered as an image is very difficult to steal, but search engines can't index it.
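
If you want to try it, rendering a short piece of text as an image takes only a few lines with PHP's GD extension (assumed to be enabled here; text_image.php is a placeholder name):

<?php
// text_image.php?text=... - render a short piece of text as a PNG.
$text = substr($_GET['text'] ?? 'example@example.com', 0, 64);

$img = imagecreatetruecolor(8 * strlen($text) + 10, 20);
$bg  = imagecolorallocate($img, 255, 255, 255);
$fg  = imagecolorallocate($img, 0, 0, 0);
imagefill($img, 0, 0, $bg);
imagestring($img, 3, 5, 3, $text, $fg);   // built-in GD font #3

header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);

You would then embed it as <img src="text_image.php?text=..."> wherever the plain text used to be.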

Other methods

Put watermarks inside your content. Put warnings on your terms-of-use page; sometimes people will get scared and stop copying your content once they are warned. You can also try to contact the websites that copy your content by email.

If you spot a page that is copying your content, you can also file a DMCA complaint with Google and get the copied pages removed from its search results.

Finally, you can pay for services that help protect your content, such as CloudFlare.