
Forum Moderators: Ocean10000 & incrediBILL


white-listing http

     

wilderness

9:48 am on Feb 15, 2012 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Might anybody have a list of User-Agent exceptions?

Is it just Google and Bing?

dstiles

9:31 pm on Feb 15, 2012 (gmt 0)

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



I assume you mean bot UAs.

The list below is a very rough guide to what I allow but some bots are limited to certain versions (eg no media bots). Please don't ask what they all are - it's been a while and I'm no longer sure myself; some may not even exist now! Some were added at customers' requests (eg tripadvisor, linkedin). If anyone has any adverse experience or knowledge of them, I'd appreciate knowing.

Abrave
AnsearchBot
Apexoo
Ask Jeeves/Teoma
Baiduspider (Japan, not China)
Cabot/Nutch
Checklinks (pywikipedia)
Cityreview
Digg URI Canonicalizer
DuckDuckBot
Exabot
Ezine
Facebook share follower
Fluffy the spider
Galaxybot
Gallent
Gigabot
Google
Healthbot
HuaweiSymantecSpider
Jyxobot
KiwiStatus
LinkedInBot
MultiCrawler
MuscatFerret
Ocelli
Plukkie
Regiochannel
ScoutJet (Blekko)
SeznamBot
ShopWiki (only a few shopping sites)
Speedy Spider
StackRambler
SygolBot
Szukacz
Tripadvisor (UA is ^Firefox3$)
TwengaBot
Twitterbot
Vagabondo
VoilaBot
WebCorp
YandexBot
Yeti/Naver (NHN)
bing
doweb
facebookexternalhit
holmes
ichiro
voyager
wakamecrawler
yahoo
yell
zeusbot

wilderness

10:18 pm on Feb 15, 2012 (gmt 0)




many thanks.

lucy's "At Home with the robots" thread should be a good reference.

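# Match UAs containing "http" (typically a bot's info URL)
RewriteCond %{HTTP_USER_AGENT} http
# Allow google, bing, yandex, Baiduspider by IP.
# Negated conditions must be ANDed: with [OR] the set is always true.
RewriteCond %{REMOTE_ADDR} !^199\.21\.9[6-9]\.
# RewriteCond %{REMOTE_ADDR} !(whatever IP range Baidu uses)
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteCond %{REMOTE_ADDR} !^65\.5[2-5]\.
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
# Requests for these files pass through unchanged; [L] stops further rule processing
RewriteRule ^(robots\.txt|custom403\.html|landing-page\.file)$ - [L]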

anybody have more?

Note: if you black-list the common term "spider", you'll need a line (or lines) excluding Baiduspider from that denial. I don't recall whether Yandex uses the "crawler" term; if so, then more exclusion lines are needed for that abused term.
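
For anyone wanting to see the "spider" exclusion spelled out, it might look something like the lines below. This is only a sketch: it assumes mod_rewrite is available and that a plain 403 is the response you want. It also trusts the UA string, which is easily forged, so a real setup would probably pair it with an IP check for Baidu.

```apache
# Sketch only: deny any UA containing "spider" unless it also contains "Baiduspider".
RewriteEngine On
# [NC] makes the match case-insensitive
RewriteCond %{HTTP_USER_AGENT} spider [NC]
# Consecutive RewriteCond lines are ANDed by default, which carves out the exception
RewriteCond %{HTTP_USER_AGENT} !Baiduspider [NC]
# "-" leaves the URL untouched; [F] returns 403 Forbidden
RewriteRule .* - [F]
```

The same pattern would apply to "crawler" or any other abused term: one condition for the term, then one negated condition per exception, with no [OR] between them.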

keyplyr

11:59 pm on Feb 15, 2012 (gmt 0)

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month





IMO the white-list will (should) be different for every site. While I allow a large portion of dstiles' list, I don't allow any Nutch, Twitter parasites or anything from Asia.

lucy24

12:07 am on Feb 16, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



I don't recall whether Yandex uses the "crawler" term; if so, then more exclusion lines are needed for that abused term.

In full:

Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots)

wilderness

4:08 am on Feb 16, 2012 (gmt 0)




Here's the other Bing range, which I missed.
Add to the first line:

RewriteCond %{REMOTE_ADDR} !^157\.(5[4-9]|60)\.

I allowed Jeeves in the past; however, there was very little benefit for all their crawling.

Is Yahoo still crawling? I thought their spidering had been contracted out to another party? I haven't seen them since reactivation.

I don't allow anything in dstiles' list except the two major SEs; in fact, many of these are listed in my UA blacklist.

keyplyr

8:41 am on Feb 16, 2012 (gmt 0)




Is Yahoo still crawling?

Absolutely. Only their search index is supplied by Bing.

wilderness

9:01 am on Feb 16, 2012 (gmt 0)




yahoo ;)

lucy24

9:21 am on Feb 16, 2012 (gmt 0)




Absolutely. Only their search index is supplied by Bing.

Is there a vanilla YahooBot? During the month that I tracked all my robots, all I got was:

--a single visit from Yahoo! Slurp which slurped up a single page with all images
--recurring visits from YahooCacheSystem, picking up the front page + favicon

Not much to index there. I have to go back to December to find it asking for anything else-- including, ahem, robots.txt ;) Way back in November it slurped up a different picture-intensive page. But that's it.

keyplyr

7:06 pm on Feb 16, 2012 (gmt 0)




Besides Slurp & YahooCacheSystem there are more than a few covert crawls, and it's anyone's guess what they're up to, but IMO the important thing is not to forget they're still in the game.

I see Slurp often. Like everything else, it depends on the site.

dstiles

8:57 pm on Feb 16, 2012 (gmt 0)




Just picked up on HuaweiSymantecSpider from another thread hereabouts. I originally had it allowed but put it in the block list as well (which pre-empts the Allow). So - shouldn't be in my list above.

keyplyr - I almost never allow nutch but this one was an exception (the other was an old version of Yell). Cabot is the Amfibi SE's bot. To be fair, although Amfibi still exists I have no idea if it a) still crawls and b) still includes nutch in the UA. The twitter and facebook bots are VERY limited: to start with, they have to be IP-based and NOTHING from AWS. They exist only because a few customers have requested them.

Edit by dstiles: Just noticed at the foot of Amfibi's bot page: "Powered by Nutch". Whether it still includes that in the UA isn't clear.
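
For readers on Apache, the IP-based limit on the twitter and facebook bots described above could be sketched roughly as follows. The address range is an assumption for illustration only: 192.0.2. is a reserved documentation block, not a real Facebook or Twitter range, so substitute whatever ranges you have verified yourself. It also assumes mod_rewrite and a 403 response.

```apache
# Sketch only. 192.0.2. is a documentation range used as a placeholder,
# NOT a real Facebook or Twitter address block; substitute verified ranges.
RewriteEngine On
# Applies only to requests claiming to be one of these bots
RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|Twitterbot) [NC]
# Deny unless the request comes from the (hypothetical) approved range
RewriteCond %{REMOTE_ADDR} !^192\.0\.2\.
RewriteRule .* - [F]
```

With more than one approved range, add one negated REMOTE_ADDR line per range and leave off [OR], so that a request is denied only when it matches none of them.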
 
