Identifying Whitelist Bots

Different approach

tangor

5:46 am on Oct 1, 2011 (gmt 0)

We can spin wheels forever with all the uglies out there, or we can list (identify) the bots to allow.

My list starts with:

bing.com (and the associated msnbot; also covers Yahoo)

... and what would you add to your whitelist robots.txt while disallowing all others?

This is not all that tongue-in-cheek... it's a serious query as to what is VALID these days. Seems like we are working hard instead of smart...

keyplyr

11:24 am on Oct 1, 2011 (gmt 0)

My whitelisted big 5 are:

Googlebot
Bingbot
Slurp
Yandex
Baiduspider

Including various alter egos, filtered by IP.

Note: Yandex & Baiduspider were added to my whitelist about a year ago and have steadily increased, bringing me valid traffic; YMMV.

I also allow a dozen lesser bot UAs (unfiltered) until they screw it up :)
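
For the curious, "filtered by IP" usually means forward-confirmed reverse DNS: resolve the IP to a hostname, check the hostname belongs to the engine, then resolve it forward again. A rough sketch in Python; the domain list is the commonly published one and the function name is only illustrative:

import socket

# Hostname suffixes the big crawlers resolve to (illustrative; adjust to taste).
CRAWLER_DOMAINS = {
    "Googlebot":   (".googlebot.com", ".google.com"),
    "bingbot":     (".search.msn.com",),
    "Slurp":       (".crawl.yahoo.net",),
    "YandexBot":   (".yandex.ru", ".yandex.net", ".yandex.com"),
    "Baiduspider": (".crawl.baidu.com",),
}

def is_genuine_crawler(ip, claimed_bot):
    """Forward-confirmed reverse DNS: IP -> hostname -> back to the same IP."""
    suffixes = CRAWLER_DOMAINS.get(claimed_bot)
    if not suffixes:
        return False
    try:
        host, _, _ = socket.gethostbyaddr(ip)               # reverse lookup
        if not host.endswith(suffixes):
            return False
        _, _, forward_ips = socket.gethostbyname_ex(host)   # forward lookup
        return ip in forward_ips                            # must round-trip
    except (socket.herror, socket.gaierror):
        return False

Cache the answers per IP; two DNS lookups on every request would be slow, so the check only needs to run when the UA claims to be one of the whitelisted bots.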

Staffa

11:31 am on Oct 1, 2011 (gmt 0)

For the moment I only whitelist the 3 main bots and no others, since the rest have nothing to offer.

However, I'm following the evolution (if any) of Yandex/Blekko. I can't see Yandex investing in Blekko out of the goodness of their heart, and believe it is more about (eventually) Yandex having a location outside of RU.

For the moment, the non-Russian version of Yandex is located in the wrong country to become popular (too many uglies coming from there). Working together with Blekko could be beneficial for both (though I haven't seen ScoutJet crawl for aaaaages).

enigma1

5:07 pm on Oct 1, 2011 (gmt 0)

So I was thinking, what is JS usage these days? 90%? 95%?

If I use some AJAX to do a callback to my server (it can be totally transparent to the bots), it will tell me whether or not the callback happened, and then I can perhaps start checking for bot specifics (agents, IPs, rDNS etc.).

Of course I'll always have to serve the first page, but I could still eliminate lots of useless traffic and largely automate such a mechanism.
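
Very roughly, the server side of that callback could be as simple as this (Flask is just a stand-in framework here, and all the names are only illustrative):

from flask import Flask, request

app = Flask(__name__)

executed_js = set()   # client IPs whose browsers ran the page's JS (use a real store)

@app.route("/beacon", methods=["POST"])
def beacon():
    # The served page calls this via fetch("/beacon", {method: "POST"}) after
    # load; anything that never executes JS never reaches it.
    executed_js.add(request.remote_addr)
    return "", 204

def needs_bot_checks(ip):
    # The first page is always served; on later requests, a client that never
    # fired the beacon is the one that gets the UA/IP/rDNS scrutiny.
    return ip not in executed_js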

dstiles

7:57 pm on Oct 1, 2011 (gmt 0)

Staffa - scoutjet came round here (UK) today. Yandex has a US address as well now (199.36.240.0 - 199.36.243.255) and I've seen their bot on 199.36.240.1 (also from a Turkish IP).

Enigma1 - why use JS? Few engines, as far as I can tell, do much with it apart from the invasive Google. Surely whitelisting the legit bots in robots.txt would do it, with .htaccess or similar for those bots that do not play nicely.

dstiles

8:04 pm on Oct 1, 2011 (gmt 0)

In September I allowed the following bots (2-letter codes are country of origin)...

cz:Seznam
fr:ExaLead
fr:Voila
gb:YellowPages
hk:Ichiro-Goo
jp:BaiduSpider
jp:Ichiro-Goo
jp:Yeti_Naver
nl:Plukkie
nl:Vagabondo_WiseGuys
nl:Wikimedia
ru:YandexBot
se:EntireWeb
tr:YandexBot
us:Ezine
us:Facebook
us:GoogleBot
us:LinkedIn
us:MsnBot
us:ScoutJet
us:ShopWiki
us:Tripadvisor
us:Twitter
us:Yahoo_Slurp
us:YandexBot

Staffa

10:48 pm on Oct 1, 2011 (gmt 0)

dstiles, thank you for the 199.36. Yandex address. I went through my log files of the last two months but nothing is showing yet, ditto for ScoutJet.

I'll keep watching ;o)

incrediBILL

12:44 am on Oct 2, 2011 (gmt 0)

I have about 10-15 things whitelisted (Yahoo, Bing, Ask, Google, Yandex, Facebook, etc.) and everything else slams into a brick wall without exception.

pageoneresults

12:52 am on Oct 2, 2011 (gmt 0)

# Whitelist

User-agent: Adsbot-Google
User-agent: BaiDuSpider
User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-Mobile
User-agent: Googlebot-News
User-agent: MSNBot
User-agent: MSNBot-Media
User-agent: MSNBot-News
User-agent: MSNPTC
User-agent: ScoutJet
User-agent: Slurp
User-agent: Teoma
User-agent: Yahoo-Blogs
User-agent: Yahoo-MMCrawler
User-agent: Yandex
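# All bots listed above: crawl everything except URLs containing a query string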
Disallow: /*?

# Blacklist

User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /

tangor

6:52 am on Oct 2, 2011 (gmt 0)

Thanks for participating. I can see that where we are located and what markets we serve can make a difference to our whitelists. But however that plays out per site/case, knowing it makes it easier to run the uglies out. :)

(403 is my friend these days, and the beauty is there aren't that many strings that have to be parsed!)
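
For illustration, the "not that many strings" part can be as small as this (a crude sketch of one way to wire up the 403, not anyone's actual rules; the substrings are assumptions):

# WSGI middleware: 403 anything that neither looks like a normal browser
# nor matches the short bot whitelist.
WHITELISTED_BOTS = ("googlebot", "bingbot", "msnbot", "slurp", "yandex", "baiduspider")
BROWSER_HINTS = ("mozilla", "opera")   # ordinary visitors

def ua_gate(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        allowed = ua.startswith(BROWSER_HINTS) or any(bot in ua for bot in WHITELISTED_BOTS)
        if not allowed:
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"403 Forbidden"]
        return app(environ, start_response)
    return middleware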

Pfui

3:08 pm on Oct 2, 2011 (gmt 0)

A postscript... There are so many fakers that whitelisting -- via .htaccess, via robots.txt -- based on UA alone is often inadequate. (Here's a faked Googlebot [projecthoneypot.org...] that's hit me from at least 3 IPs x3 months.)

dstiles

6:08 pm on Oct 2, 2011 (gmt 0)

I see some media bots listed by pageoneresults - I block all of those on almost all sites, since those sites are mostly non-image (i.e. info or sales).

Pfui - good point: I associate all bot UAs with IPs. This occasionally throws up an error on (e.g.) YellowPages, where they sometimes use a regular browser on the same IP, but if they can't get it right... I also get the occasional "bad" UA on other bot IPs (MSN and Google are rife here) but again they get rejected.
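
A sketch of what associating a bot UA with its IPs can look like; the only real range below is the Yandex US block quoted earlier in the thread, and the rest would come from published lists and your own logs:

import ipaddress

BOT_NETWORKS = {
    "yandex": [ipaddress.ip_network("199.36.240.0/22")],   # 199.36.240.0 - 199.36.243.255
    # "googlebot": [...], "bingbot": [...], and so on
}

def ua_matches_ip(user_agent, ip):
    """Reject requests where a known bot UA arrives from an unexpected address."""
    ua = user_agent.lower()
    addr = ipaddress.ip_address(ip)
    for keyword, networks in BOT_NETWORKS.items():
        if keyword in ua:
            return any(addr in net for net in networks)
    return True   # not a recognised bot UA; leave it to the other checks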