Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
white-listing http
wilderness

WebmasterWorld Senior Member wilderness us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4417953 posted 9:48 am on Feb 15, 2012 (gmt 0)

Might anybody have a list of User-Agent exceptions?

Is it just Google and Bing?

 

dstiles

WebmasterWorld Senior Member dstiles us a WebmasterWorld Top Contributor of All Time 5+ Year Member



 
Msg#: 4417953 posted 9:31 pm on Feb 15, 2012 (gmt 0)

I assume you mean bot UAs.

The list below is a very rough guide to what I allow but some bots are limited to certain versions (eg no media bots). Please don't ask what they all are - it's been a while and I'm no longer sure myself; some may not even exist now! Some were added at customers' requests (eg tripadvisor, linkedin). If anyone has any adverse experience or knowledge of them, I'd appreciate knowing.

Abrave
AnsearchBot
Apexoo
Ask Jeeves/Teoma
Baiduspider (Japan, not China)
Cabot/Nutch
Checklinks (pywikipedia)
Cityreview
Digg URI Canonicalizer
DuckDuckBot
Exabot
Ezine
Facebook share follower
Fluffy the spider
Galaxybot
Gallent
Gigabot
Google
Healthbot
HuaweiSymantecSpider
Jyxobot
KiwiStatus
LinkedInBot
MultiCrawler
MuscatFerret
Ocelli
Plukkie
Regiochannel
ScoutJet (Blekko)
SeznamBot
ShopWiki (only a few shopping sites)
Speedy Spider
StackRambler
SygolBot
Szukacz
Tripadvisor (UA is ^Firefox3$)
TwengaBot
Twitterbot
Vagabondo
VoilaBot
WebCorp
YandexBot
Yeti/Naver (NHN)
bing
doweb
facebookexternalhit
holmes
ichiro
voyager
wakamecrawler
yahoo
yell
zeusbot

wilderness

Msg#: 4417953 posted 10:18 pm on Feb 15, 2012 (gmt 0)

Many thanks.

lucy's "At Home with the robots" thread should be a good reference.

RewriteCond %{HTTP_USER_AGENT} http
# Allow google, bing, yandex, Baiduspider
# Negated conditions are ANDed (no [OR]): the rule fires only when the IP is in none of the ranges
RewriteCond %{REMOTE_ADDR} !^199\.21\.9[6-9]\.
# RewriteCond %{REMOTE_ADDR} !^(whatever IP range Baidu uses)
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteCond %{REMOTE_ADDR} !^65\.5[2-5]\.
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
# Presumed intent: forbid everything except these files
RewriteRule !^(robots\.txt|custom403\.html|landing-page\.file)$ - [F]

Anybody have more?

Note: if you blacklist the common term "spider", you'll need to add a line (or lines) excluding Baiduspider from that denial. I don't recall whether Yandex uses the term "crawler"; if so, then more exclusion lines are needed for that abused term.
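A rough sketch of the "spider" exclusion described in the note, for anyone who wants to see it spelled out. It assumes a plain substring block on "spider" with a single exception for Baiduspider; the catch-all rule and the robots.txt exemption are illustrative choices, not copied from anyone's live .htaccess.

```apache
RewriteEngine On
# Deny any UA containing "spider" (case-insensitive) ...
RewriteCond %{HTTP_USER_AGENT} spider [NC]
# ... unless the UA is Baiduspider (this line is the exclusion)
RewriteCond %{HTTP_USER_AGENT} !Baiduspider
# Leave robots.txt reachable; forbid everything else for these UAs
RewriteRule !^robots\.txt$ - [F]
```

A UA string is trivially faked, so an exclusion like this is probably best paired with an IP check on the excluded bot.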

keyplyr

WebmasterWorld Senior Member keyplyr us a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4417953 posted 11:59 pm on Feb 15, 2012 (gmt 0)



IMO the white-list will (should) be different for every site. While I allow a large portion of dstiles' list, I don't allow any Nutch, Twitter parasites or anything from Asia.

lucy24

WebmasterWorld Senior Member lucy24 us a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



 
Msg#: 4417953 posted 12:07 am on Feb 16, 2012 (gmt 0)

Don't recall if Yandex uses the "crawler" term, if so than more exclusion lines for that abused term.

In full:

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)

wilderness

Msg#: 4417953 posted 4:08 am on Feb 16, 2012 (gmt 0)

Here's the other Bing range, which I missed.
Add to the first line:

RewriteCond %{REMOTE_ADDR} !^157\.(5[4-9]|60)\.

I allowed Jeeves in the past; however, there was very little benefit for all their crawling.

Is Yahoo still crawling? I thought their spidering had been contracted to another company. I haven't seen them since reactivation.

I don't allow anything in dstiles' list except the two major SEs; in fact, many of these are listed in my UA blacklist.

keyplyr

Msg#: 4417953 posted 8:41 am on Feb 16, 2012 (gmt 0)

Is Yahoo still crawling?

Absolutely. Only their search index is supplied by Bing.

wilderness

Msg#: 4417953 posted 9:01 am on Feb 16, 2012 (gmt 0)

yahoo ;)

lucy24

Msg#: 4417953 posted 9:21 am on Feb 16, 2012 (gmt 0)

Absolutely. Only their search index is supplied by Bing.

Is there a vanilla YahooBot? During the month that I tracked all my robots, all I got was:

--a single visit from Yahoo! Slurp which slurped up a single page with all images
--recurring visits from YahooCacheSystem, picking up the front page + favicon

Not much to index there. I have to go back to December to find it asking for anything else-- including, ahem, robots.txt ;) Way back in November it slurped up a different picture-intensive page. But that's it.

keyplyr

Msg#: 4417953 posted 7:06 pm on Feb 16, 2012 (gmt 0)

Besides Slurp & YahooCacheSystem there are more than a few covert crawls, and it's anyone's guess what they're up to, but IMO the important thing is not to forget they're still in the game.

I see Slurp often. Like everything else, it depends on the site.

dstiles

Msg#: 4417953 posted 8:57 pm on Feb 16, 2012 (gmt 0)

Just picked up on HuaweiSymantecSpider from another thread hereabouts. I originally had it allowed but put it in the block list as well (which pre-empts the Allow). So - shouldn't be in my list above.
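For anyone wondering how a block list "pre-empts" an allow list: I haven't said how my lists are implemented, so take the following only as a sketch of the same ordering effect in mod_rewrite terms. The second UA in the allow rule is just a name from the list above, used for illustration.

```apache
RewriteEngine On
# Block list runs first: [F] returns 403 immediately,
# so the allow rule below is never evaluated for this UA
RewriteCond %{HTTP_USER_AGENT} HuaweiSymantecSpider
RewriteRule ^ - [F]
# Allow list runs later; its entry for the blocked UA is dead weight
RewriteCond %{HTTP_USER_AGENT} (HuaweiSymantecSpider|DuckDuckBot)
RewriteRule ^ - [L]
```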

Keyplyr - I almost never allow nutch but this one was an exception (the other was an old version of Yell). Cabot is the Amfibi SE's bot. To be fair, although Amfibi still exists I have no idea if it a) still crawls and b) still includes nutch in the UA. The twitter and facebook bots are VERY limited: to start with, they have to be IP-based and NOTHING from AWS. They exist only because a few customers have requested them.

Edit by dstiles: Just noticed at the foot of Amfibi's bot page: "Powered by Nutch". Whether it still includes that in the UA isn't clear.
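On the "IP-based" point for the twitter and facebook bots, a minimal mod_rewrite sketch of the idea might look like this. The range shown is a documentation-only placeholder (192.0.2.x), NOT a real Facebook or Twitter range; the actual ranges would have to be looked up and kept current.

```apache
RewriteEngine On
# UA claims to be one of the social bots ...
RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|Twitterbot)
# ... but the request is not from the permitted range
# (192.0.2. is a placeholder, substitute verified ranges)
RewriteCond %{REMOTE_ADDR} !^192\.0\.2\.
RewriteRule ^ - [F]
```

Listing only known-good ranges also covers the "nothing from AWS" requirement, since any address outside the list is refused.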


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved