Web scraping (also known as content scraping, web harvesting, web data mining, or web data extraction) is the process of extracting specific data from a website and copying it elsewhere.
I will give you tips on how to stop bots and scrapers from copying and stealing your website or blog content. I have been programming bots and scrapers for a long time, so the tips I give here come from experience. I have programmed a lot of proxy scrapers, and proxy sites generally have very good security. The most challenging project was my Facebook bot. I will cover the best protection techniques I have faced. As usual, the goal is to stop about 95% of scrapers and bots from stealing your content; dedicated, skilled programmers will always be able to create scrapers that get through.
In this post I won't write about the robots.txt file or any other technique that gives directions to well-behaved spiders and crawlers such as Googlebot, Bingbot, or other legitimate search engines. Bad bots don't follow the protocols and rules: they ignore robots.txt and meta tags, visit any link they want, and copy anything they want.
The main difficulty in fighting scrapers while protecting your content is keeping the user experience intact. If we add too many captchas, security questions, and so on, even legitimate users will have a hard time navigating and will eventually get bored with your website. Before fighting them, I will explain the main types of scrapers.
Spiders
Spiders are bots that recursively follow all the links on your website. Examples are the Google spider and other search engine spiders. Sometimes they target only specific pages and data.
Scripts and shells
Unix command-line tools such as wget are often used for scraping, as is grep with regular expressions. The same goes for PHP scripts. I also like to put in this category HTML scrapers such as those built on Scrapy or Jsoup, which work by extracting data based on the HTML structure of the page.
Scraping services
There are a lot of web scraping services you can pay for online. They employ people whose job is to program scrapers for specific websites. Still, these scrapers use the same techniques as the others, so you can stop them most of the time.
Copy-pasters
There are a lot of people who don't value their time and copy-paste all day. The art of copy-pasting is advancing rapidly. This category also includes people who put your website in frames. They can even inject scripts to change the design of your website and make it look different.
Ok, we now know about the different scraper types, but how can we detect and stop them?
Detecting And Preventing Scrapers
Bad bots, scrapers, and crawlers don't like to follow the rules. They ignore robots.txt, meta tags, and every other convention. We can turn this to our advantage by creating a honeypot. Add to your robots.txt file a link that well-behaved crawlers will never follow. The link in this example is called "nicecontent.php". The robots.txt file looks like this:
User-agent: *
Disallow: /nicecontent.php
The bad guys will follow the link anyway. In that PHP file (or ASPX, or whatever server-side technology you use), log the IP of the request. Collect all the IPs that followed this link in a table or text file; every one of them belongs to a bad scraper. The next step is obvious: block requests from these IPs.
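The logic above can be sketched in a few lines. This is a minimal, framework-agnostic sketch in Python; the function name and in-memory set are illustrative, and a real site would persist the banned list and wire this into its request pipeline.

```python
# Honeypot sketch: any client requesting the URL disallowed in robots.txt
# gets its IP recorded, and all later requests from that IP are refused.

BANNED_IPS = set()  # illustrative; persist to a database or file in practice

def handle_request(path, client_ip):
    """Return an HTTP status code for an incoming request."""
    if path == "/nicecontent.php":
        # Only bots that ignore robots.txt ever reach this URL.
        BANNED_IPS.add(client_ip)
        return 403
    if client_ip in BANNED_IPS:
        return 403  # previously trapped scraper
    return 200      # normal visitor
```

Once an IP trips the trap, every subsequent request from it is denied, no matter which page it asks for.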
Monitor user activity
You should always keep an eye on your logs. If you see unusual activity, such as many requests in a short time, you have spotted a scraper. A good idea is to limit the number of pages a user can access within a time window. If the limit is exceeded, show a captcha. Keep in mind that you need to choose the limit carefully, because otherwise even good users will get annoyed if the captcha appears too often.
On the other hand, when the requests come in very fast over a short period, log the IP and block that IP address. This is a very effective way to stop scrapers, because static IPs are expensive and scrapers aren't willing to buy many IPs just to copy your content.
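A sliding-window counter captures both rules: a soft limit that triggers a captcha and a hard limit that blocks the IP. This is a minimal sketch; the window and thresholds are made-up numbers you would tune against your own traffic.

```python
import time
from collections import defaultdict, deque

# Illustrative limits: tune them so real readers rarely see the captcha.
WINDOW_SECONDS = 60
CAPTCHA_AFTER = 30    # requests per window before showing a captcha
BLOCK_AFTER = 120     # requests per window before blocking the IP

hits = defaultdict(deque)  # ip -> timestamps of recent requests

def check_rate(ip, now=None):
    """Return 'ok', 'captcha', or 'block' for this request."""
    now = time.time() if now is None else now
    q = hits[ip]
    q.append(now)
    # Drop timestamps that have slid out of the window.
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()
    if len(q) > BLOCK_AFTER:
        return "block"
    if len(q) > CAPTCHA_AFTER:
        return "captcha"
    return "ok"
```

A human clicking through pages stays well under the soft limit; a scraper firing requests every few milliseconds crosses the hard limit within seconds.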
Check the request headers
If a request comes with an empty User-Agent header, log the IP and block it from accessing your website! There are also many online databases of User-Agent strings used by known scrapers. Download them and block any request whose User-Agent matches a bad scraper. All browsers, desktop or mobile, send valid User-Agent headers, and so do all legitimate search engine spiders.
Check the Referer header too, always! When I started coding scrapers, I always sent requests with an empty Referer header, and my scrapers kept getting blocked. If the Referer header is empty, or is always the same (e.g. "www.google.com" or any other single site), the odds are very high that it is a scraper. Real users show variety in the Referer header.
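The User-Agent check can be a simple lookup. This sketch only covers the User-Agent rule; the bad-agent set below is a tiny placeholder for the published bad-bot lists mentioned above, and Referer checks would need per-client state (tracking whether one IP always sends the same value), so they are left out here.

```python
# Placeholder list; in practice, load a published bad-bot User-Agent database.
BAD_USER_AGENTS = {"python-requests", "curl", "scrapy"}

def suspicious_headers(headers):
    """Return True if the request headers look like a scraper's."""
    ua = headers.get("User-Agent", "").strip().lower()
    if not ua:
        return True  # empty User-Agent: real browsers always send one
    if any(bad in ua for bad in BAD_USER_AGENTS):
        return True  # matches a known scraping tool
    return False
```

Requests flagged by this check would then feed into the same IP logging and blocking used for the honeypot.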
Randomize your HTML
This technique will stop most scrapers. When I scraped proxies from different proxy lists, they served different HTML every time. This is very annoying for a scraper: if it copies everything below <div id="content">, and you change the id to id="random23847_content" (and to something else the next time), the scraper breaks. You can apply the same technique to class names, though that is harder because you have to keep the classes in sync with your CSS.
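One way to do this server-side is to rewrite id attributes with a per-response random prefix before sending the page. This is a minimal sketch using a regular expression; the function name is illustrative, and a real implementation would also rewrite any CSS or JavaScript that refers to those ids.

```python
import random
import re

def randomize_ids(html, seed=None):
    """Prefix every id attribute with a value that changes per response,
    so selectors hard-coded by a scraper (e.g. div#content) break on the
    next visit."""
    rng = random.Random(seed)  # seed only for reproducible demos
    prefix = "r%06d_" % rng.randrange(10**6)
    return re.sub(r'id="([^"]+)"',
                  lambda m: 'id="%s%s"' % (prefix, m.group(1)),
                  html)
```

A scraper that learned the selector `#content` yesterday finds `#r483920_content` today and comes up empty.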
When I coded the bot for Facebook, they randomized everything! They responded to each request with random HTML ids, class names, and even CSS. I ended up using an OCR technique to finish the bot, because I had to finish it fast.
Obfuscate the data
If you detect a scraper, instead of blocking it you can poison it. The idea is that if you block a scraper, its author may come back at you with another IP or another technique. Poisoning means feeding the scraper fake, useless data: the scraper keeps working, but everything it steals is worthless.
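A poisoning handler is just a fork in your normal response path. This is an illustrative sketch: the flagged-IP set and the fake-record generator are assumptions, and seeding the generator per IP keeps the fake data self-consistent so the scraper's author doesn't notice the switch.

```python
import random

FLAGGED_IPS = {"203.0.113.9"}  # illustrative: IPs identified as scrapers

def serve_listing(ip, real_rows):
    """Return the real data for normal visitors, fabricated rows for
    flagged scrapers."""
    if ip not in FLAGGED_IPS:
        return real_rows
    rng = random.Random(ip)  # stable per IP, so the fake data looks consistent
    return [{"name": "item-%d" % rng.randrange(9999),
             "price": round(rng.uniform(1, 500), 2)}
            for _ in real_rows]
```

The response has the same shape and size as the real one, so the scraper runs happily while collecting garbage.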
Require registration and login
Depending on your website's niche, content, and type, a very good protection can be to show content only to registered, logged-in users. This is the ultimate protection, because you have just one entry point to defend: the login page. You can use a captcha there and monitor all the activity. You should also use a captcha on the registration page, so fake users won't be able to register.
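The single-entry-point idea can be expressed as a wrapper around every content handler. This is a minimal sketch, assuming a hypothetical in-memory session store; the captcha and monitoring live on the login and registration pages, which are the only routes left unwrapped.

```python
SESSIONS = {"token-abc"}  # illustrative: tokens of currently logged-in users

def require_login(handler):
    """Wrap a content handler so only authenticated sessions see the page."""
    def wrapped(session_token, *args):
        if session_token not in SESSIONS:
            return (302, "/login")  # redirect anonymous visitors to log in
        return (200, handler(*args))
    return wrapped

@require_login
def article_page(slug):
    return "full article text for " + slug
```

Every protected page goes through the same gate, so all scraping attempts funnel into one place you can watch.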
The drawback of this method is that the premium content can't be accessed even by good spiders, so it won't be indexed by search engines.
Ajax and image-based websites
Content loaded via Ajax is harder to scrape with simple HTML tools, because the scraper has to execute JavaScript or reverse-engineer your API calls. Image-based content is very difficult to steal, but it doesn't get indexed either: search engines can't read text embedded inside an image.
If you spot a site that is copying your content, you can also file a DMCA complaint via Google and have the offending pages removed from its search results.
Finally, you can pay for services that protect your content, such as CloudFlare.