We have put server-wide protection in place, based on a list of unwanted bots. We noticed that most bots no longer respect robots.txt, so we analyze traffic logs regularly and reinforce our filters. Unwanted bots are now shut out by force!
You can complement this, if you wish, by setting up your own robots.txt, which remains useful for several other reasons...
On 09/30/2016, because of persistent spam traffic and attacks from China and Russia, we implemented IP blocking by country (not perfect, as attackers also use servers elsewhere), see: https://support.yoorshop.hosting/knowledgebase/2464/IP-block-by-country.html
For Prestashop, block by countries :
Results of our system in AWStats in cPanel (columns: HTTP status code, description, hits, percent, bandwidth):
444 | Unknown error | 4 751 | 2.9 % | 0
Whether or not your site is under attack, we offer various tools that you can use temporarily or permanently:
You can see your visits in real time in cPanel (with the paper_lantern theme): open the visitors section, click all, then use the settings button on the right to show the status/error codes: tick status, and untick URL and URL referrer if necessary, so that all status codes are visible. If you see 444, 429, 405 or 410 errors on legitimate pages/URLs, contact us so that we can look at what can be done. It is normal to see a list of errors in the AWStats of your cPanel: a lot of traffic can be blocked for specific reasons.
Note: to limit the crawl rate of bots, you can add this to your robots.txt, after the line of the robot concerned:
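The directive itself is missing from the line above. Assuming it is the widely used, non-standard Crawl-delay directive, the sketch below shows such an entry for one of the robots listed in this article and checks how Python's standard urllib.robotparser reads it. The 10-second value is only an illustration, and not every crawler honors this directive (Googlebot, for example, ignores it).

```python
from urllib.robotparser import RobotFileParser

# Assumption: the missing directive is Crawl-delay; 10 seconds is an
# arbitrary illustrative value, not a recommendation from this article.
ROBOTS_TXT = """\
User-agent: Sogou web spider
Crawl-delay: 10

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Delay (in seconds) that a compliant parser reads for the robot concerned.
delay = parser.crawl_delay("Sogou web spider")
print(delay)
```

Robots without a Crawl-delay line of their own get no delay from this file, so the directive has to be repeated for each robot you want to slow down.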
We use these error codes to monitor the fight:
- 206 : partial content served. It can be a consequence of an issue with our 410 rule below; in that case, contact us. Note that a certain number of 206 responses in your logs is perfectly normal: they are due to the loading of external resources.
- 405 : requests that look abnormal, of the following types: frame/XSS/SQL injection/HTTP.
- 410 : referrer URLs (Russian, Chinese and other websites) or keywords that are blocked. If some pages on your website do not display correctly, contact us; the issue can be confirmed by an increase in the number of 206 codes.
- 429 : the number of requests has been limited by our system on sensitive files such as wp-login.php, among others...
- 444 : used to stop bots and access attempts on sensitive files such as config.php files.
- 499 : the visitor or client did not answer the server request in time, so the connection is simply closed. It can be reopened the next second if it was a network issue. Excessive connections from one IP can cause this too...
- 503 : too many requests; this code can be used by anti-DDoS plugins, but also by our server security...
Depending on how much normal traffic your website gets, you may see 10-20% of traffic blocked, or as much as three times your legitimate traffic. That's life!
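To get a rough idea of this proportion on your own site, you can count the status codes in a raw access log. The sketch below is only an illustration: it assumes the common/combined log format, and the sample lines (documentation IP addresses, invented requests) are made up for the example and do not come from our servers.

```python
import re
from collections import Counter

# Status codes described in the list above. 206 is left out because a
# certain number of 206 responses is normal.
MONITORED_CODES = {405, 410, 429, 444, 499, 503}

# In the common/combined log format, the status code is the three-digit
# field right after the quoted request line.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_statuses(lines):
    """Count how many times each HTTP status code appears."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def monitored_share(counts):
    """Fraction of all requests that ended with a monitored code."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for code, n in counts.items() if code in MONITORED_CODES)
    return hits / total


# Hypothetical sample lines, for illustration only.
SAMPLE_LINES = [
    '203.0.113.7 - - [30/Sep/2016:10:00:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.8 - - [30/Sep/2016:10:00:01 +0000] "GET /config.php HTTP/1.1" 444 0 "-" "-"',
    '203.0.113.8 - - [30/Sep/2016:10:00:02 +0000] "POST /wp-login.php HTTP/1.1" 429 0 "-" "-"',
    '203.0.113.9 - - [30/Sep/2016:10:00:03 +0000] "GET /about HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

counts = count_statuses(SAMPLE_LINES)
share = monitored_share(counts)
print(counts, share)
```

To use it on a real log, pass an open file object instead of SAMPLE_LINES, for example count_statuses(open(path)) with the path of a log downloaded from cPanel.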
Read also:
You should always set up a robots.txt file: it helps limit bot traffic, and some bots do not have good intentions. They look for loopholes in your website and/or squat it, and therefore its resources. This can also help against small DDoS-type attacks...
Consult the documentation of your CMS, or other sources on the internet, to learn how to generate a good robots.txt file.
Also: https://support.google.com/webmasters/answer/6062598?hl=en
Afterwards, declare this file in your webmaster tools account.
We took as an example a site on our server, running the SPIP CMS, where we noticed high resource usage, so we analyzed its traffic with AWStats in cPanel to see whether it was normal/legitimate. The problem with this site is obvious, as shown in the screenshot: a range of IPs from Russia is responsible for all of it. This suggests that a robot from a single publisher is generating zombie traffic on this particular site.
As a precaution, we check the identity of the IP to avoid blocking one we would want to keep (if it belongs to a hosting company, it means bots):
To block all traffic coming from this IP range in cPanel (since several IPs are involved in this example), go to the IP blocker and enter an IP range that covers the IPs found in the stats. (Never enter large ranges, as this will slow down your website; contact us instead, and we will see what we can do at the nginx level.)
Do not overuse IP ranges and IP blocking: this can lower your website's performance.
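As an illustration, the smallest single range covering a few observed addresses can be computed with Python's standard ipaddress module. This is only a sketch: the addresses come from a documentation range rather than from the stats of the site above, and the /24 limit is an assumed threshold for "too big", not a rule of our platform.

```python
import ipaddress


def covering_network(addresses, min_prefix=24):
    """Return the smallest single network that contains all addresses.

    Raises ValueError if the result would be larger than /min_prefix
    (assumed limit; adjust to what you consider a "big" range).
    Expects addresses of one IP version only.
    """
    ips = [ipaddress.ip_address(a) for a in addresses]
    # Start from the single-address network of the lowest IP ...
    net = ipaddress.ip_network(str(min(ips)))
    # ... and widen it one bit at a time until every IP fits.
    while not all(ip in net for ip in ips):
        if net.prefixlen <= min_prefix:
            raise ValueError("addresses are too spread out for one small range")
        net = net.supernet()
    return net


# Hypothetical addresses (documentation range), for illustration only.
observed = ["203.0.113.12", "203.0.113.77", "203.0.113.200"]
block_range = covering_network(observed)
print(block_range)
```

If the function raises an error, the addresses do not fit in one small range, which is the case where contacting us is preferable to blocking in cPanel.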
Once done, you will see in the cPanel Errors section that traffic is now blocked.
Here is the first and most important part of the robots.txt file, which you can use on its own: the list of robots you can forbid from visiting your website:
User-agent: Antenne Hatena
User-agent: Black Hole
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-agent: DISCo Pump 3.1
User-agent: HTTrack 3.0
User-agent: Kenjin Spider
User-agent: LinkScan/8.1a Unix
User-agent: Mata Hari
User-agent: Microsoft URL Control - 5.01.4511
User-agent: Microsoft URL Control - 6.00.8169
User-agent: Mister PiX
User-agent: MS Search 4.0 Robot
User-agent: MS Search 5.0 Robot
User-agent: Offline Explorer
User-agent: QueryN Metasearch
User-agent: Sogou web spider
User-agent: The Intraformant
User-agent: URLy Warning
User-agent: Web Image Collector
User-agent: website extractor
User-agent: Website Quester
User-agent: Webster Pro
User-agent: Xenu Link Sleuth/1.3.8
Disallow: /
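A list like this only has an effect when the group of User-agent lines is followed by a rule such as Disallow: /. The sketch below, which assumes that rule and uses a shortened copy of the list, checks with Python's standard urllib.robotparser how a compliant parser reads it. It says nothing about robots that ignore robots.txt, which is why the server-side filters described above exist. The URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the list above, closed by an assumed "Disallow: /".
BLOCK_LIST = """\
User-agent: HTTrack 3.0
User-agent: Offline Explorer
User-agent: Sogou web spider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(BLOCK_LIST.splitlines())


def is_allowed(agent, url="https://example.com/"):
    """True if a compliant robot with this name may fetch the URL."""
    return parser.can_fetch(agent, url)


for agent in ("Offline Explorer", "Sogou web spider", "Googlebot"):
    print(agent, is_allowed(agent))
```

Robots that are not named in the list stay allowed, so search engines you want to keep are not affected by this first part.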
Here is the second part, as an example that you must adapt to your own website:
Having a website running with SSL at 100% is highly recommended :- a padlock in your URL...
This should be a basic rule known of all, but fact is that it is not... The php config file...
You can disable our Nginx security rules temporarily, or permanently : not recommended except for...
1. Automatic restore (Your data are also kept external server over last 15 days for shared...
This feature is enabled by default because it is the one that allows you to load images, for...