Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

To Block or Not To Block That is The Question

 10:19 pm on Jan 25, 2014 (gmt 0)

Ezooms - moz.com
Exabot - exabot.com
InternetSeer - InternetSeer.com

Sends lots of IE6 bots

Should I just go ahead and GeoIP block Ukraine, Russia and China in htaccess?

Blocked, but how do I discourage them from even thinking about visiting?
AhrefsBot - ahrefs.com
Baiduspider - baidu.com
Mail.RU_Bot - go.mail.ru
YandexBot - yandex.ru



 11:39 pm on Jan 25, 2014 (gmt 0)

The answer of course is YES - block away!

Use a whitelisted robots.txt so you can tell all the rest that honor robots.txt to nicely go away.
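A whitelisting robots.txt might look like this (a minimal sketch; the crawlers named are just examples of bots you might choose to allow):

```
# Allow only the crawlers you actually want; an empty "Disallow:"
# means nothing is off-limits for that bot
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

# Everything else that honors robots.txt is told to go away
User-agent: *
Disallow: /
```

Bots that keep crawling after this have identified themselves as robots.txt violators, which is where the rougher measures below come in.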

Beyond that, you have to play rough.

These bots are like burglars: you can put locks on the doors to try to keep them out, but only until they find a new way in and steal all your stuff.

The only way I've found to potentially discourage them is to cloak evil, nasty pages of vile content, delivered only to those bots and created using every wrong thing you could possibly do to intentionally screw up SEO, AdSense, etc.: AdSense stop words, links to bad neighborhoods, keyword stuffing, profane language, tons of bad links, just all sorts of fun that, if unfiltered, would trash their site.

Basically, you have to do some really bad stuff to even get their attention and even then some of them don't care.

Most importantly, include details about their crawler UA, IP address, etc. in the cloaked content so when you find it you know exactly where it came from.
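As a sketch of the routing side of this (hypothetical filenames and UA list; trap.php stands in for whatever script generates the poison page with the requesting bot's User-Agent and IP written into it):

```apache
# Hypothetical sketch: route known scraper UAs to a cloaked poison page.
# trap.php is an assumed script that embeds the visitor's User-Agent
# and IP in the output so scraped copies can be traced back later.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (AhrefsBot|Ezooms|Baiduspider) [NC]
RewriteRule ^ /trap.php [L]
```

Ordinary visitors never match the condition, so they keep getting the real pages.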

Technically you've done nothing wrong: robots.txt told them to stay away, so if they picked up pages that caused them harm, it's their own fault. They were told to stay out and ignored the warning.

That's how people get shot when they ignore the NO TRESPASSING signs out in the rural areas where I grew up.

Same basic principle.

Just beware: amateurs playing with this stuff can inadvertently link things back to their own site, and your efforts to mess with the scrapers can backfire, leaving you with a bunch of junk you have to disavow in Google. Not recommended for the novice.


 12:19 am on Jan 26, 2014 (gmt 0)

Raise your hand if you saw the subtitle and thought it was a question about blocking Ukrainians. As in (a rule I actually use):

RewriteCond %{HTTP_REFERER} \.(ru|ua)(/|$) [NC]
RewriteCond %{HTTP_REFERER} !(google|yandex|mail)\.
RewriteRule (^|\.html|/)$ - [F]

If your site gets no legitimate human visitors from the region, leave out the second condition. (I currently bar mail.ru from images, but pages are no skin off my nose.)

In practice, most robots who send ru/ua fake referers come from blocked IPs. But the rule is a good backup.

how do I discourage them from even thinking about visiting

Can't be done. Most robots don't even modify their behavior based on response: if they've got 52 items on their shopping list, they're going to pick up 52 403s.

Does yandex not obey robots.txt on your site? Bummer. Otherwise you could put in a universal
User-Agent: Yandex
Disallow: /

and they'll never ask for anything else.


 12:43 am on Jan 26, 2014 (gmt 0)

digitalocean.com is already blocked due to bots, and the two below are really getting on my nerves too:
softlayer.com, wowrack.com

I'm also getting lots of visits from web hosts, maybe a new form of advertising their services, though they make no contact effort.

Bing Webmaster Tools has spotted these searches:
Traffic Details for Keyword black phantom fishing jig head "buy naltrexone"

NOTE: I have a fishing log with lures that contain the words phantom and jig.


 5:32 am on Jan 26, 2014 (gmt 0)

Ezooms - moz.com
Exabot - exabot.com

These two honor robots.txt.


 5:34 am on Jan 26, 2014 (gmt 0)

Should I just go ahead and GeoIP block Ukraine, Russia and China in htaccess?

You should deny any region (and/or Class A block) that is not beneficial to your website(s).
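In .htaccess that might look like this (a sketch using the Apache 2.2 syntax current at the time; 36.0.0.0/8 is a placeholder range, not a recommendation, and the GeoIP lines assume the mod_geoip module is installed):

```apache
# Deny an entire /8 ("Class A") block
Order Allow,Deny
Allow from all
Deny from 36.0.0.0/8

# Country-level blocking needs a GeoIP module; a common recipe:
# GeoIPEnable On
# SetEnvIf GEOIP_COUNTRY_CODE (CN|RU|UA) blockcountry
# Deny from env=blockcountry
```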


 6:48 am on Jan 26, 2014 (gmt 0)

I block China, but for eco/political reasons; stopping all the scrapers & hackers is just gravy.


 7:26 pm on Jan 26, 2014 (gmt 0)

You will see several comments over the past few years that say, "Block all server farms but drill holes for useful and genuine bots".

In my view the latter includes yandex, which is far less intrusive than google or bing. I allow the RU version since I suspect they may share results with the US version.


 11:28 pm on Jan 28, 2014 (gmt 0)

YMMV: It was sustained and intense crawling by Yandex that got me interested in bot blocking last year. Google and Bing have never been that bad on my site.

I have a handful of traps that block things that look like bots, and things that behave like bots, except for a short whitelist (and any that slip through the nets).
(About the only trap I have that hasn't caught Yandex is the one that watches for robots.txt violators.)
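One common trap of that last kind (a sketch; /decoy-dir/ is a made-up path): list a decoy directory in robots.txt that is linked nowhere a human would see.

```
# robots.txt decoy: honest crawlers will never request this path
User-agent: *
Disallow: /decoy-dir/
```

Any client that then fetches /decoy-dir/ has either ignored robots.txt or read it and crawled the disallowed path anyway, and its IP can be logged and blocked.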


 11:56 pm on Jan 28, 2014 (gmt 0)

@trintragula - Possibly a bad agent spoofing the Yandex UA?

FWIW - I have never seen Yandex behave badly, and it has always obeyed robots.txt. I actually get triple-digit human traffic from Yandex, occasionally more than from Bing.


 12:44 am on Jan 29, 2014 (gmt 0)

@trintragula - Possibly a bad agent spoofing the Yandex UA?

Always possible, but not in the event that first got my attention: AS13238 YANDEX Yandex LLC. They were taking about 10x the number of pages that google normally averages per day, and were keeping it up.
Baidu were also fairly leadfooted, and unlike Yandex they do seem to ignore robots.txt sometimes.
I don't block by country but I get very few forum members from Russia or China, so I've not whitelisted those search engines.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved