How to fight against junk bot traffic

We have put in place a server-wide protection with a list of undesired bots, because we noticed that most bots no longer respect robots.txt. We analyze traffic logs regularly and reinforce the filters: we now close the door to undesired bots by force!
You can complement this, if you wish, by setting up your own robots.txt, which is also useful for several other reasons...
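
For illustration only, here is a minimal sketch of how such server-level blocking can look in an nginx configuration (the bot names and the choice of code 444 are example assumptions, not our exact rules):

# Inside the server { } block: close the connection without any response
# when the User-Agent matches a blacklisted bot (example names only).
if ($http_user_agent ~* "(SemrushBot|HTTrack|EmailCollector)") {
    return 444;
}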


On 09/30/2016, given the persistence of Chinese and Russian spam traffic and attacks, we implemented IP blocking by country (not perfect, as these actors also use servers elsewhere), see:

https://support.yoorshop.hosting/knowledgebase/2464/IP-block-by-country.html
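
As an illustration only (not our actual configuration), country blocking at the server level can be sketched in nginx with the GeoIP module, assuming the module is installed and the database path below exists:

# http { } context: map each visitor's country code to a blocked flag.
geoip_country /usr/share/GeoIP/GeoIP.dat;  # path is an assumption
map $geoip_country_code $blocked_country {
    default 0;
    CN      1;  # China
    RU      1;  # Russia
}

# server { } context: close the connection for blocked countries.
if ($blocked_country) {
    return 444;
}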

For PrestaShop, to block by country:
https://mypresta.eu/modules/administration-tools/block-ip-free
For WordPress:
https://fr.wordpress.org/plugins/iq-block-country/
 


Whether your site is under attack or not, we offer various tools that you can use temporarily or permanently:
https://support.yoorshop.hosting/knowledgebase/3077/you-are-under-ddos-attack.html

Example of the results of our system, as seen in AWStats in cPanel:

444 (Unknown error): 4,751 hits (2.9%), 0 bandwidth


You can watch your visits in real time in cPanel (with the paper_lantern theme): in the Visitors section, click All, then use the settings button on the right to display the status/error codes (check Status, and uncheck URL and URL Referrer if necessary so that all status codes are visible). If you see 444, 429, 405 or 410 errors on legitimate pages/URLs, contact us so that we can look at what we can do... It is normal to see a list of errors in your AWStats in cPanel: a lot of traffic can be blocked for precise reasons.

Good to know: to limit the crawling rate of bots, you can add the following to your robots.txt after the line containing User-agent: *. It asks crawlers to wait 5 seconds between each page (note that not all crawlers honor Crawl-delay; Google's crawler, for example, ignores it):

User-agent: *
Crawl-delay: 5

We use these error codes to monitor the fight:
- 206: partial content displayed; an abnormal increase can be a consequence of an issue with our code 410 below, so contact us. Note that it is absolutely normal to have a certain number of 206 responses in your logs; this is due to the loading of external resources.
- 405: requests that look abnormal, of types such as frame/XSS/SQL injection/HTTP.
- 410: blocked URL referrers (Russian, Chinese and other websites) or blocked keywords. If some pages on your website don't display correctly, contact us; an issue can be confirmed by the presence of error code 206, which would increase.
- 429: the number of requests has been limited by our system on sensitive files like wp-login.php, but not only...
- 444: used to stop bots and attempts on sensitive files like config.php.
- 499: the visitor or client has not answered the server request in time, so the connection is simply closed. It can be re-opened the next second if it was a network issue. Excessive connections from one IP can also cause this...
- 503: too many requests; can be returned by anti-DDoS plugins, but also by our server security...
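
To illustrate how codes like 429 and 444 can be produced, here is a simplified nginx sketch (not our actual configuration; the zone name, rate and file paths are assumptions, and PHP handling is omitted):

# http { } context: allow roughly 1 request per second per IP.
limit_req_zone $binary_remote_addr zone=login:10m rate=1r/s;

# server { } context: rate-limit a sensitive URL, answering 429 beyond the limit.
location = /wp-login.php {
    limit_req zone=login burst=3 nodelay;
    limit_req_status 429;
}

# Refuse direct hits on configuration files outright.
location ~ /wp-config\.php {
    return 444;  # close the connection without a response
}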


Depending on whether your website has a lot of normal traffic or not, you may see 10-20% of traffic blocked, but the blocked traffic can also be triple the legitimate traffic you actually want; that's life!

To read:

https://support.yoorshop.hosting/knowledgebase/2464/IP-block-by-country.html

You should always set up a robots.txt file, which helps limit bot traffic on your website; some of these bots do not have good intentions: they look for loopholes on your website and/or squat your website, and therefore its resources. This can also prevent small DDoS-type attacks...
Consult the documentation for your CMS, or other sources on the internet, to learn how to generate a good robots.txt file.
See also: https://support.google.com/webmasters/answer/6062598?hl=en
Then declare this file in your webmaster tools account.

As an example, we took a site on our server using the SPIP CMS where we noticed a high level of resource usage, so we analyzed the traffic with AWStats in cPanel to see whether it was normal/legitimate. The problem with this site is obvious, as shown in the screenshot: a single IP range coming from Russia is responsible for all of it, which reveals that a robot from the same publisher is generating zombie traffic on this particular site.

Out of prudence, we check the identity of the IP, to avoid blocking an IP we would want to keep (if it belongs to a hosting company, it usually means bots):
https://apps.db.ripe.net/db-web-ui/#/query

To block all traffic coming from the IP range in cPanel (since this example involves multiple IPs), go to the IP Blocker and enter an IP range broad enough to cover what was found in the stats. (Never enter big ranges; this will slow down your website. Contact us instead, and we'll see what we can do at the nginx level.)
Don't overuse IP ranges or IP blocking; this can lower your website's performance.

Once done, you will see in the cPanel Errors section that traffic is now blocked.

188.143.232.0-188.143.232.255
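
In CIDR notation, this example range is 188.143.232.0/24. For illustration, at the nginx level an equivalent block would be a single deny rule (shown only as a sketch; the real rules are adapted case by case):

# Block the whole 188.143.232.0-188.143.232.255 range.
deny 188.143.232.0/24;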

How to stop junk traffic?
Here is the most important, primary part of the robots.txt file, which you can use on its own: the list of robots you can forbid from visiting your website:

User-agent: 008
User-agent: Alexibot
User-agent: AlvinetSpider
User-agent: Antenne Hatena
User-agent: ApocalXExplorerBot
User-agent: asterias
User-agent: BackDoorBot/1.0
User-agent: BizInformation
User-agent: Black Hole
User-agent: BlowFish/1.0
User-agent: BotALot
User-agent: BuiltBotTough
User-agent: Bullseye/1.0
User-agent: BunnySlippers
User-agent: Cegbfeieh
User-agent: CheeseBot
User-agent: CherryPicker
User-agent: CherryPickerElite/1.0
User-agent: CherryPickerSE/1.0
User-agent: CopyRightCheck
User-agent: cosmos
User-agent: Crescent
User-agent: Crescent Internet ToolPak HTTP OLE Control v.1.0
User-agent: DISCo Pump 3.1
User-agent: DittoSpyder
User-agent: dotbot
User-agent: EmailCollector
User-agent: EmailSiphon
User-agent: EmailWolf
User-agent: EroCrawler
User-agent: ExtractorPro
User-agent: Flamingo_SearchEngine
User-agent: Foobot
User-agent: Harvest/1.5
User-agent: hloader
User-agent: httplib
User-agent: HTTrack
User-agent: HTTrack 3.0
User-agent: humanlinks
User-agent: Igentia
User-agent: InfoNaviRobot
User-agent: JennyBot
User-agent: JikeSpider
User-agent: Kenjin Spider
User-agent: LexiBot
User-agent: libWeb/clsHTTP
User-agent: LinkextractorPro
User-agent: LinkScan/8.1a Unix
User-agent: LinkWalker
User-agent: lwp-trivial
User-agent: lwp-trivial/1.34
User-agent: Mata Hari
User-agent: Microsoft URL Control - 5.01.4511
User-agent: Microsoft URL Control - 6.00.8169
User-agent: MIIxpc
User-agent: MIIxpc/4.2
User-agent: Mister PiX
User-agent: MLBot
User-agent: moget
User-agent: moget/2.1
User-agent: MS Search 4.0 Robot
User-agent: MS Search 5.0 Robot
User-agent: Naverbot
User-agent: NetAnts
User-agent: NetAttache
User-agent: NetMechanic
User-agent: NICErsPRO
User-agent: Offline Explorer
User-agent: Openfind
User-agent: OpenindexSpider
User-agent: ProPowerBot/2.14
User-agent: ProWebWalker
User-agent: psbot
User-agent: QuepasaCreep
User-agent: QueryN Metasearch
User-agent: RepoMonkey
User-agent: RMA
User-agent: SemrushBot
User-agent: SightupBot
User-agent: SiteBot
User-agent: SiteSnagger
User-agent: SiteSucker
User-agent: Sogou web spider
User-agent: sosospider
User-agent: SpankBot
User-agent: spanner
User-agent: Speedy
User-agent: suggybot
User-agent: SuperBot
User-agent: SuperBot/2.6
User-agent: suzuran
User-agent: Szukacz/1.4
User-agent: Teleport
User-agent: Telesoft
User-agent: The Intraformant
User-agent: TheNomad
User-agent: TightTwatBot
User-agent: Titan
User-agent: toCrawl/UrlDispatcher
User-agent: TosCrawler
User-agent: True_Robot
User-agent: True_Robot/1.0
User-agent: turingos
User-agent: TurnitinBot
User-agent: UrlPouls
User-agent: URLy Warning
User-agent: VCI
User-agent: Web Image Collector
User-agent: WebAuto
User-agent: WebBandit
User-agent: WebBandit/3.50
User-agent: WebCopier
User-agent: webcopy
User-agent: WebEnhancer
User-agent: WebmasterWorldForumBot
User-agent: webmirror
User-agent: WebReaper
User-agent: WebSauger
User-agent: website extractor
User-agent: Website Quester
User-agent: Webster Pro
User-agent: WebStripper
User-agent: WebStripper/2.02
User-agent: WebZip
User-agent: wget
User-agent: WikioFeedBot
User-agent: WinHTTrack
User-agent: WWW-Collector-E
User-agent: Xenu Link Sleuth/1.3.8
User-agent: yacy
User-agent: yandex
User-agent: YRSPider
User-agent: Zeus
User-agent: Zookabot
Disallow: /
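
Note that all the User-agent lines above form a single group sharing the final Disallow: / rule, which tells each listed robot that it may not visit any page of the site.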

Here is the second part, as an example, which you must personalize according to your website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/ 
Disallow: /client.php
Sitemap: http://www.yoursite.com/sitemap.xml