Forum Moderators: open
Has anyone else seen it and have any information about what it does? I looked in Google and found a suggestion it might belong to Snap search [snap.com...]
Agent: Snapbot/1.0
IP: 66.234.139.#*$!
Back in 2005, this bot called itself "snap.com beta crawler v0" [webmasterworld.com].
The IP range is not subdelegated and reverse DNS lookups fail. So it's not from Snap.com / Idealab [psychedelix.com]?!
I'm feeling very uneasy about an unknown organisation taking snapshots of all my web pages, so I'll be denying it asap.
I checked the logs of the one site which does have a link to the new site mentioned in my first message and that is where the bot found the link.
malachite: In both sites I've seen it visit, it asked for robots.txt, but of course I have no way of knowing if it would obey one, as I have never seen or heard of the bot previously.
<makes a mental note to never replace IP numbers with three x's - it gets changed into looking like an obfuscated swear word :o >
We shall see if it intentionally tries to crawl forbidden robots.txt paths...
66.234.139.202 - - [22/May/2006:17:47:25 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"
66.234.139.209 - - [22/May/2006:18:15:59 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"
66.234.139.210 - - [23/May/2006:18:45:18 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.220 - - [24/May/2006:12:08:42 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.220 - - [24/May/2006:12:37:06 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.194 - - [24/May/2006:15:55:11 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.205 - - [25/May/2006:12:14:14 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.194 - - [25/May/2006:19:46:33 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.214 - - [30/May/2006:18:44:26 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.212 - - [30/May/2006:21:07:00 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.209 - - [30/May/2006:21:33:45 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.197 - - [31/May/2006:00:11:19 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.216 - - [01/Jun/2006:15:57:46 -0700] "GET /robots.txt HTTP/1.0" 403 772 "-" "Snapbot/1.0"
66.234.139.204 - - [01/Jun/2006:15:57:47 -0700] "GET /directory/file.html HTTP/1.0" 403 772 "-" "Snapbot/1.0"
FWIW, WHOIS e-mail addresses show the IPs belong to BBCOM via bb.com and bb2.net [bb2.net]; the latter routes back to bb.com, out of Backbone Communications in LA.
IP range: 66.234.128.0 to 66.234.255.255
Snap.com [whois.domaintools.com] is in Pasadena.
IP range: 66.148.0.0 to 66.151.255.255
I recommend that you always allow access to robots.txt and to your custom 403 error page (if you use one).
If you have a 'pest' UA or IP address that repeatedly fetches robots.txt, then you can feed it a smaller 'cloaked version' containing only
User-agent: *
Disallow: /
to minimize wasted bandwidth.
Jim
One second after I banned this bot for only hitting robots.txt over and over again (where it admittedly did heed a blanket Disallow), it tried to snag a deep file specified in robots.txt as off limits.
Pfui,
I denied this range on April 23 of this year.
The pattern would mean nothing to anybody but myself; however, the visits are just not consistent with my web pages.
66.234.139.209 - - [18/Apr/2006:11:55:57 -0700] "GET /myFolder/SubFolderwinn/mypgae.html HTTP/1.1" 200 11607 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
The word "Linux" gets my attention rather fast, since I only have an occasional visitor (again, my visitor patterns) with Linux in the UA.
Don
I haven't seen it make consecutive requests for robots.txt; maybe I've just been lucky. But the bot has seemed to honour the exclusions, with three exceptions, as follows.
In one instance it asked for the stylesheet, banner graphic and favicon this way:
66.234.139.198 - - [02/Jun/2006:22:58:37 +1000] "GET /favicon.ico HTTP/1.1" 200 2238 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
All those files are disallowed in robots.txt. But as it wasn't technically the bot asking for them, I guess I can't complain :-/
The range of IPs I've seen the bot coming from are: 66.234.139.194 to 66.234.139.220.
Anyway, now it will be feeding on a diet of 403s.
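For anyone wanting to do the same, here's a minimal .htaccess sketch (Apache 1.3/2.x mod_access plus mod_setenvif syntax; the netblock is the 66.234.139.x range observed above, and the User-Agent match is belt-and-braces in case the bot moves):

```apache
# Flag anything still announcing itself as Snapbot
SetEnvIf User-Agent "^Snapbot" bad_bot
# Deny the observed netblock and the flagged UA; everyone else
# is unaffected
Order Allow,Deny
Allow from all
Deny from 66.234.139.
Deny from env=bad_bot
```

On Apache 2.4 the same effect would be expressed with Require directives instead, but the Order/Allow/Deny form matches what was current at the time of this thread.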
But I still don't know what the heck Psycheclone is... I swear.
Have a look at this thread ;)
[webmasterworld.com...]
If you have a 'pest' UA or IP address that repeatedly fetches robots.txt, then you can feed it a smaller 'cloaked version' containing only
User-agent: *
Disallow: /
to minimize wasted bandwidth.
How do you mean, cloaked version?
Should I just do a wildcard in the robots.txt file for 5 minutes, or actually specify the bot?
At least Psycheclone seems to have slung its hook for now!
<sigh>It's only recently I've started trying to learn about search engines/spiders and who's good v who's bad. I guess I've got a lot to get my head around and some fun (?) times ahead </sigh>
FWIW, bandwidth-wise, I'm now sending a really, really small 'page' -- a custom text 403 via .htaccess and the ErrorDocument "Text Method [bignosebird.com]" (see also: Apache's specifics [httpd.apache.org]). Nifty, quick and no rewriting. Note: must be over 512 bytes to get around Explorer's custom error-page default.
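For reference, the 'Text Method' boils down to a one-line ErrorDocument directive; a sketch (the message text here is mine):

```apache
# Serve a literal text string as the 403 body instead of a file
# or URL -- nothing extra to fetch, only a few bytes on the wire.
# (Apache 2.2+ syntax; in 1.3/2.0 the text starts with a single
# unclosed leading quote instead of being fully quoted.)
ErrorDocument 403 "Access denied."
```

Remember the caveat above: if the total response body is under roughly 512 bytes, Internet Explorer substitutes its own 'friendly' error page.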
malachite, bot-watching/catching, like the Internet, is an addiction for which there is no cure:) Welcome to the club!
To reduce bandwidth wasted on feeding a large robots.txt file to bad or useless bots, you can do this (on Apache):
New plain-text file named "bad-bots.txt":
User-agent: *
Disallow: /
In top-level .htaccess:
# Rewrite known bad-bots to smaller robots.txt file
RewriteCond %{HTTP_USER_AGENT} ^Snapbot/
RewriteRule ^robots\.txt$ /bad-bots.txt [L]
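Since the same netblock also shows up in this thread crawling under a browser User-Agent, a variant keyed on address rather than UA might be worth considering; a sketch under that assumption:

```apache
# Serve the minimal robots.txt to the whole observed netblock,
# regardless of what User-Agent it claims to be
RewriteCond %{REMOTE_ADDR} ^66\.234\.139\.
RewriteRule ^robots\.txt$ /bad-bots.txt [L]
```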
I've already posted about serving a tiny 403 response to known pests based on the same principles, so I won't repeat that here.
Jim
Oh, and FWIW, I didn't miss your point about 403s and robots.txt. Actually, I fundamentally agree because my robots.txt files are basically open to all but known abuser types or increasingly suspect repeaters. ("Pest" is almost too innocuous in my book for the majority of automatons I 403-block.)
66.234.139.210 - - [05/Jun/2006:00:59:10 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.194 - - [05/Jun/2006:01:46:58 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.209 - - [05/Jun/2006:02:14:26 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
38.98.19.115 - - [05/Jun/2006:12:06:25 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
38.98.19.116 - - [05/Jun/2006:14:44:35 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
(9-plus times/day. I feel a firewall rule coming on...)
The range 65.38.102.0/24 identifies itself as:
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
Screen shots made from those IPs show up in Snap, so I'm positive: the site has a screen shot of my bot blocker's error message, containing that IP and user agent, on the page.
At this point I stopped chasing individual robots (a waste of time); instead I'm blocking out complete hosting server farms and opening up holes for allowed bots to access the site. Blocking BBCOM completely is how I stumbled on the Linux Firefox crawling from BBCOM, as well as a bunch of other junk.
The only innocent victims of my methodology are people working in these hosting farms, who are locked out if they try to access my site from inside their IP block; but that's minimal, and worth the risk versus all the time wasted otherwise.
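The block-the-farms-and-punch-holes approach can be sketched in .htaccess like this (the denied netblocks are just the ones mentioned in this thread, and the allowed address is purely a placeholder for a crawler you trust):

```apache
Order Deny,Allow
# Deny entire hosting/colo netblocks (ranges from this thread)
Deny from 66.234.128.0/17
Deny from 38.98.19.
Deny from 65.38.102.
# Re-open a hole for a trusted crawler inside a denied range --
# this address is a made-up placeholder, not a real bot IP
Allow from 66.234.200.25
```

With Order Deny,Allow, the Deny directives are evaluated first and a matching Allow overrides them, which is what lets the hole punch through; anything matching neither list is allowed by default.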
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1
-- and with a visiting pattern that looks very iffy.
P.S.
Regardless, Snap.com has a screenshot of my main site (can't tell date), in obvious defiance of robots.txt.
Here are a couple of blocks definitely used by SnapBot to take pictures:
65.38.102.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
66.234.139.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
I know this for a fact, as I found a picture of my error message about their site being banned, with their IP number embedded in it, from the 65.38.102. block; the others are known Snap IPs.
Now it's suddenly starting to make a lot of sense what's going on with all the groups of Linux Firefox I'm seeing hit my site.
65.38.102.147 - - [27/Jun/2006:12:29:12 -0700] "GET /myfoler/mypage.html HTTP/1.1" 403 - "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
Go to their search.
Bring up one of your pages and select "view this site".
Then look at your logs.