Forum Moderators: Ocean10000 & incrediBILL & keyplyr

white-listing http

     
9:48 am on Feb 15, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


Might anybody have a list of User-Agent exceptions?

Is it just Google and Bing?
9:31 pm on Feb 15, 2012 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


I assume you mean bot UAs.

The list below is a very rough guide to what I allow but some bots are limited to certain versions (eg no media bots). Please don't ask what they all are - it's been a while and I'm no longer sure myself; some may not even exist now! Some were added at customers' requests (eg tripadvisor, linkedin). If anyone has any adverse experience or knowledge of them, I'd appreciate knowing.

Abrave
AnsearchBot
Apexoo
Ask Jeeves/Teoma
Baiduspider (Japan, not China)
Cabot/Nutch
Checklinks (pywikipedia)
Cityreview
Digg URI Canonicalizer
DuckDuckBot
Exabot
Ezine
Facebook share follower
Fluffy the spider
Galaxybot
Gallent
Gigabot
Google
Healthbot
HuaweiSymantecSpider
Jyxobot
KiwiStatus
LinkedInBot
MultiCrawler
MuscatFerret
Ocelli
Plukkie
Regiochannel
ScoutJet (Blekko)
SeznamBot
ShopWiki (only a few shopping sites)
Speedy Spider
StackRambler
SygolBot
Szukacz
Tripadvisor (UA is ^Firefox3$)
TwengaBot
Twitterbot
Vagabondo
VoilaBot
WebCorp
YandexBot
Yeti/Naver (NHN)
bing
doweb
facebookexternalhit
holmes
ichiro
voyager
wakamecrawler
yahoo
yell
zeusbot
10:18 pm on Feb 15, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


Many thanks.

lucy's "At Home with the robots" thread should be a good reference.

# Applies to user agents containing "http"; allows Google, Bing, Yandex, Baiduspider by IP.
RewriteCond %{HTTP_USER_AGENT} http
# The negated IP conditions must all hold, so they are ANDed (no [OR] flag).
RewriteCond %{REMOTE_ADDR} !^199\.21\.9[6-9]\.
# RewriteCond %{REMOTE_ADDR} !^(whatever IP Baidu uses)
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteCond %{REMOTE_ADDR} !^65\.5[2-5]\.
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteRule ^(robots\.txt|custom403\.html|landing-page\.file)$ - [L]

Anybody have more?

Note: if you black-list the common term "spider", you'll need a line (or lines) excluding Baiduspider from that denial. I don't recall whether Yandex uses the "crawler" term; if so, then more exclusion lines are needed for that abused term.
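
A rough sketch of that "spider" exclusion, assuming a mod_rewrite setup like the one above. The [F] denial and the exempted file names are assumptions for illustration, not rules anyone in this thread has posted:

```apache
# Deny any user agent containing "spider" (case-insensitive) ...
RewriteCond %{HTTP_USER_AGENT} spider [NC]
# ... unless it identifies itself as Baiduspider.
RewriteCond %{HTTP_USER_AGENT} !Baiduspider
# Assumed exemptions: robots.txt and the custom 403 page stay reachable.
RewriteRule !^(robots\.txt|custom403\.html)$ - [F]
```

A user-agent string is trivially forged, so an exclusion like this is normally paired with an IP check for the bot being let through.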
11:59 pm on Feb 15, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8650
votes: 385




IMO the white-list will (should) be different for every site. While I allow a large portion of dstiles' list, I don't allow any Nutch, Twitter parasites or anything from Asia.
12:07 am on Feb 16, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13744
votes: 462


I don't recall whether Yandex uses the "crawler" term; if so, then more exclusion lines are needed for that abused term.

In full:

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)
4:08 am on Feb 16, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


Here's the other Bing range, which I missed.
Add to the first line:

RewriteCond %{REMOTE_ADDR} !^157\.(5[4-9]|60)\.
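
One way to avoid chaining negated conditions at all is to fold the ranges into a single pattern. Negated conditions joined with [OR] match every request, because no address can sit in all the ranges at once. What follows is only a sketch: it uses the ranges posted in this thread, leaves Baidu out because no range was given, and assumes a [F] denial with hypothetical exempted files:

```apache
# User agent advertises a URL ("http") ...
RewriteCond %{HTTP_USER_AGENT} http
# ... and the request does not come from any range posted in this thread.
RewriteCond %{REMOTE_ADDR} !^(66\.249\.(6[4-9]|[78][0-9]|9[0-5])|207\.46|65\.5[2-5]|157\.(5[4-9]|60)|199\.21\.9[6-9])\.
# Assumed action: deny everything except robots.txt and the custom 403 page.
RewriteRule !^(robots\.txt|custom403\.html)$ - [F]
```

A single alternation keeps the logic in one place: a new range is one more branch inside the parentheses, with no flags to get wrong.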

I allowed Jeeves in the past; however, there was very little benefit for all their crawling.

Is Yahoo still crawling? I thought their spidering had been contracted to another party. I haven't seen them since reactivation.

I don't allow anything in dstiles' list except the two major SEs; in fact, many of these are listed in my UA blacklist.
8:41 am on Feb 16, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8650
votes: 385


Is Yahoo still crawling?

Absolutely. Only their search index is supplied by Bing.
9:01 am on Feb 16, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


yahoo ;)
9:21 am on Feb 16, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13744
votes: 462


Absolutely. Only their search index is supplied by Bing.

Is there a vanilla YahooBot? During the month that I tracked all my robots, all I got was:

--a single visit from Yahoo! Slurp which slurped up a single page with all images
--recurring visits from YahooCacheSystem, picking up the front page + favicon

Not much to index there. I have to go back to December to find it asking for anything else-- including, ahem, robots.txt ;) Way back in November it slurped up a different picture-intensive page. But that's it.
7:06 pm on Feb 16, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:8650
votes: 385


Besides Slurp & YahooCacheSystem there are more than a few covert crawls, and what they're up to is anyone's guess, but IMO the important thing is not to forget they're still in the game.

I see Slurp often. Like everything else, it depends on the site.
8:57 pm on Feb 16, 2012 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3148
votes: 4


Just picked up on HuaweiSymantecSpider from another thread hereabouts. I originally had it allowed but put it in the block list as well (which pre-empts the Allow). So - shouldn't be in my list above.

Keyplyr - I almost never allow nutch but this one was an exception (the other was an old version of Yell). Cabot is the Amfibi SE's bot. To be fair, although Amfibi still exists I have no idea if it a) still crawls and b) still includes nutch in the UA. The twitter and facebook bots are VERY limited: to start with, they have to be IP-based and NOTHING from AWS. They exist only because a few customers have requested them.

Edit by dstiles: Just noticed at the foot of Amfibi's bot page: "Powered by Nutch". Whether it still includes that in the UA isn't clear.
 
