

Snapbot

Anyone know what it is?

   
4:20 am on Jun 2, 2006 (gmt 0)

5+ Year Member



I've never seen this before, but it has just completely spidered an obscure site, which has only been live about one month and only has one incoming link.

Has anyone else seen it, and does anyone have information about what it does? I looked in Google and found a suggestion that it might belong to Snap search [snap.com...]

Agent: Snapbot/1.0
IP: 66.234.139.#*$!

6:21 pm on Jun 2, 2006 (gmt 0)

5+ Year Member



Snapbot seems to be a khtml2png-based screenshot engine [dev.upian.com].

Back in 2005, this bot called itself "snap.com beta crawler v0" [webmasterworld.com].

The IP range is not subdelegated and reverse DNS lookups fail. It's not from Snap.com / Idealab [psychedelix.com]?!

7:54 pm on Jun 2, 2006 (gmt 0)

5+ Year Member



Turned up on one of my sites yesterday, and on two more today. Can't find any real info on it, so I denied it. It doesn't seem to visit robots.txt; it just dives straight in at random, one page per IP.
8:10 pm on Jun 2, 2006 (gmt 0)

5+ Year Member



Thanks for the info, thetrasher.

I'm feeling very uneasy about an unknown organisation taking snapshots of all my web pages, so I'll be denying it asap.

I checked the logs of the one site that does have a link to the new site mentioned in my first message, and that is indeed where the bot found the link.

malachite: On both sites I've seen it visit, it asked for robots.txt, but of course I have no way of knowing whether it would obey one, as I had never seen or heard of the bot previously.

<makes a mental note to never replace IP numbers with three x's - it gets changed into looking like an obfuscated swear word :o >

1:12 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



One second after I banned this bot for only hitting robots.txt over and over again (where it admittedly did heed a blanket Disallow), it then tried to snag a deep file specified in robots.txt as off limits.

We shall see if it intentionally tries to crawl forbidden robots.txt paths...

66.234.139.202 - - [22/May/2006:17:47:25 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"
66.234.139.209 - - [22/May/2006:18:15:59 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"
66.234.139.210 - - [23/May/2006:18:45:18 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.220 - - [24/May/2006:12:08:42 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.220 - - [24/May/2006:12:37:06 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.194 - - [24/May/2006:15:55:11 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.205 - - [25/May/2006:12:14:14 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.194 - - [25/May/2006:19:46:33 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.214 - - [30/May/2006:18:44:26 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.212 - - [30/May/2006:21:07:00 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.209 - - [30/May/2006:21:33:45 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.197 - - [31/May/2006:00:11:19 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.216 - - [01/Jun/2006:15:57:46 -0700] "GET /robots.txt HTTP/1.0" 403 772 "-" "Snapbot/1.0"
66.234.139.204 - - [01/Jun/2006:15:57:47 -0700] "GET /directory/file.html HTTP/1.0" 403 772 "-" "Snapbot/1.0"

FWIW, WHOIS e-mail addresses show the IPs belong to BBCOM via bb.com and bb2.net [bb2.net]; the latter routes back to bb.com, out of Backbone Communications in LA. IP range: 66.234.128.0 to 66.234.255.255.

Snap.com [whois.domaintools.com] is in Pasadena. IP range: 66.148.0.0 to 66.151.255.255.

1:40 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Just remember that it is standard practice to interpret an inaccessible robots.txt file as meaning that the site may be spidered without restriction. In other words, if robots.txt cannot be accessed, then spiders act as if robots.txt was non-existent or blank.

I recommend that you always allow access to robots.txt and to your custom 403 error page (if you use one).

If you have a 'pest' UA or IP address that repeatedly fetches robots.txt, then you can feed it a smaller 'cloaked version' containing only


User-agent: *
Disallow: /

to minimize wasted bandwidth.

Jim

1:47 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



One second after I banned this bot for only hitting robots.txt over and over again (where it admittedly did heed a blanket Disallow), it then tried to snag a deep file specified in robots.txt as off limits.

We shall see if it intentionally tries to crawl forbidden robots.txt paths...

66.234.139.202 - - [22/May/2006:17:47:25 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"

Pfui,
I denied this range on April 23 of this year.

The pattern would mean nothing to anybody but myself; however, the visits are just not consistent with my web pages.

66.234.139.209 - - [18/Apr/2006:11:55:57 -0700] "GET /myFolder/SubFolderwinn/mypage.html HTTP/1.1" 200 11607 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

the word "Linux" gets my attention rather fast, only having an occasional visitor (again my visitor patterns) with Linux in the UA.

Don

1:53 am on Jun 3, 2006 (gmt 0)

5+ Year Member



It is one heck of a busy bot! I've been checking the logs of all our sites, and so far it has only missed one out of sixteen in the last 20 hours, and most are not linked to each other, nor even on related topics.

I haven't seen it make consecutive requests for robots.txt; maybe I've just been lucky. But the bot has seemed to honour the exclusions, with three exceptions, as follows.

In one instance it asked for the stylesheet, banner graphic and favicon this way:

66.234.139.198 - - [02/Jun/2006:22:58:37 +1000] "GET /favicon.ico HTTP/1.1" 200 2238 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

All those files are disallowed in robots.txt. But as it wasn't technically the bot asking for them, I guess I can't complain :-/

The range of IPs I've seen the bot coming from is 66.234.139.194 to 66.234.139.220.

Anyway, now it will be feeding on a diet of 403s.
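
For anyone wanting to serve the same diet, here is a minimal .htaccess sketch (assuming Apache with mod_access; the /27 is just a convenient rounding of the range above, so it also covers .192-.193 and .221-.223):

# Deny the observed Snapbot range, 66.234.139.194-220
# (66.234.139.192/27 spans .192-.223, slightly wider than reported)
Order Allow,Deny
Allow from all
Deny from 66.234.139.192/27

# Keep robots.txt readable, per jdmorgan's advice above
<Files "robots.txt">
Order Deny,Allow
Allow from all
</Files>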

3:56 am on Jun 3, 2006 (gmt 0)

5+ Year Member



I just got hit by them, too.
66.234.139.207
66.234.139.219
66.234.139.196
66.234.139.202
66.234.139.199
66.234.139.212
66.234.139.215
66.234.139.201

-Wonder why they hit from so many IPs

5:14 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Odd, I tried to post about this two days ago in this forum but it never showed up. I have Snapbot and another one called Psycheclone, which had an address of 208.66.195.8.

Snapbot hit my site with these...
66.234.139.212
66.234.139.109
66.234.139.203
66.234.139.217

5:28 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



After my last post I looked at all the posts in this thread and noticed the domain in the first one. I went out to smoke, and it occurred to me that a week ago I went to the site and did a search for results containing my site. I think that triggered the bot to come take a look.
But I still don't know what the heck Psycheclone is... I swear.
5:57 am on Jun 3, 2006 (gmt 0)

5+ Year Member



But I still don't know what the heck Psycheclone is... I swear.

Have a look at this thread ;)

[webmasterworld.com...]

8:06 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Thanks for the link, yet another thread I feel at home on.
If you have a 'pest' UA or IP address that repeatedly fetches robots.txt, then you can feed it a smaller 'cloaked version' containing only

User-agent: *
Disallow: /

to minimize wasted bandwidth.

How do you mean, a 'cloaked version'?

Should I just do a wildcard in the robots.txt file for 5 minutes, or actually specify the bot?

5:18 pm on Jun 3, 2006 (gmt 0)

5+ Year Member



Persistent little bar-steward, this one. After denying a whole load of IPs yesterday, it's hit me again today with another lot, which will be eating 403s from now on.

At least Psycheclone seems to have slung its hook for now!

<sigh>It's only recently I've started trying to learn about search engines/spiders and who's good v who's bad. I guess I've got a lot to get my head around and some fun (?) times ahead </sigh>

5:30 pm on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



youfoundjake, I share your question. Jim, if selective cloaking means using cookies somehow, do you have any site pointers for info? I've looked into cookies + robots.txt before and never quite understood the how-to. (And if cookies, would you as a mod prefer I started a new topic in robots.txt [webmasterworld.com]?)

FWIW, bandwidth-wise, I'm now sending a really, really small 'page': a custom text 403 via .htaccess and the ErrorDocument "Text Method [bignosebird.com]" (see also: Apache's specifics [httpd.apache.org]). Nifty, quick and no rewriting. Note: it must be over 512 bytes to get around Internet Explorer's "friendly error message" default.
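
(For anyone who hasn't seen it, the "text method" is just ErrorDocument pointing at a quoted string instead of a file or URL. A minimal sketch, assuming Apache 2.x; in 1.3 a single leading quote marks the rest of the line as literal message text:)

# Serve a tiny literal string as the 403 body: no file, no rewriting
ErrorDocument 403 "Forbidden."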

malachite, bot-watching/catching, like the Internet, is an addiction for which there is no cure:) Welcome to the club!

9:38 pm on Jun 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Amen to that Pfui!

Here's another IP Address to add for this bot:
66.234.139.218

It read but did not respect robots.txt.

10:04 pm on Jun 4, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



My point above, which seems to have been missed, is that if you serve a 403 in response to a request for robots.txt, then the requesting robot can and should feel free to spider your entire site. That is the default behaviour of robots.txt; if it's missing, blank, or inaccessible, then robots are welcome. So be careful.

To reduce bandwidth wasted on feeding a large robots.txt file to bad or useless bots, you can do this (on Apache):

New plain-text file named "bad-bots.txt":

User-agent: *
Disallow: /

In top-level .htaccess:

# Rewrite known bad-bots to smaller robots.txt file
RewriteCond %{HTTP_USER_AGENT} ^Snapbot/
RewriteRule ^robots\.txt$ /bad-bots.txt [L]

I've already posted about serving a tiny 403 response to known pests based on the same principles, so I won't repeat that here.
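
(The gist of it, as a minimal sketch rather than the exact rules from that earlier post; this assumes Apache with mod_rewrite, and uses the bot from this thread as the example pest:)

# Force a 403 for the pest UA on everything except robots.txt
# (kept readable, as recommended above) and the small bad-bots.txt it is fed
RewriteCond %{HTTP_USER_AGENT} ^Snapbot/
RewriteRule !^(robots\.txt|bad-bots\.txt)$ - [F]

# Pair it with a tiny 403 body so each denial costs almost no bandwidth
ErrorDocument 403 "Forbidden."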

Jim

11:20 pm on Jun 4, 2006 (gmt 0)

5+ Year Member



Here's another IP Address to add for this bot:
66.234.139.218

That fits into the range that I reported above.

I've noted it coming from virtually every IP between 66.234.139.194 and 66.234.139.220.

If anyone sees it coming from an IP outside that range please let us know.

1:33 am on Jun 5, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Thanks for the info, Jim! I'm relieved to learn it's not a cookies-based thing:) And your technique will work well for me because it spares me sending ~15k of robots.txt thousands of times (in addition to access_ and rewrite_ logging) to bots/apps/UAs/IPs/ISPs I don't want visiting in the first place.

Oh, and FWIW, I didn't miss your point about 403s and robots.txt. Actually, I fundamentally agree because my robots.txt files are basically open to all but known abuser types or increasingly suspect repeaters. ("Pest" is almost too innocuous in my book for the majority of automatons I 403-block.)

8:35 pm on Jun 5, 2006 (gmt 0)

10+ Year Member



38.98.19.83-126

A Performance Systems International Inc. range.

10:57 pm on Jun 5, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



And now (drum roll) two network blocks in one day --

66.234.139.210 - - [05/Jun/2006:00:59:10 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.194 - - [05/Jun/2006:01:46:58 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.209 - - [05/Jun/2006:02:14:26 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
38.98.19.115 - - [05/Jun/2006:12:06:25 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
38.98.19.116 - - [05/Jun/2006:14:44:35 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"

(9-plus times/day. I feel a firewall rule coming on...)

12:05 am on Jun 8, 2006 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



They also hit your site as Firefox on Linux, which I'm assuming is being used to make those screen shots.

The range 65.38.102.0/24 identifies itself as:
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

Screen shots made from those IPs show up in Snap, so I'm positive: the site has a screen shot of my bot blocker's error message, containing that IP and user agent, on the page.

At this point I've stopped chasing individual robots (a waste of time); instead I'm blocking out complete hosting server farms and opening up holes for allowed bots to access the site. Blocking BBCOM completely is how I stumbled on the Linux Firefox crawling coming from BBCOM, as well as a bunch of other junk.

The only innocent victims of my methodology are people working in those hosting farms, who are locked out if they try to access my site from inside their IP block, but that's minimal and worth the risk vs. all the time wasted otherwise.
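
(A minimal mod_rewrite sketch of the farm-blocking idea, assuming Apache; the range is the BBCOM allocation posted earlier in this thread, and the "hole" shown in the comment is purely illustrative:)

# Deny the whole BBCOM allocation, 66.234.128.0 - 66.234.255.255
RewriteCond %{REMOTE_ADDR} ^66\.234\.(12[89]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])\.
# Holes for bots you trust go here as extra negated conditions, e.g.
# RewriteCond %{HTTP_USER_AGENT} !SomeGoodBot (hypothetical name)
RewriteRule .* - [F]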

12:22 am on Jun 8, 2006 (gmt 0)

10+ Year Member



I know that they do use the Linux UA for the screen shots. I show their UA back to them when they eat my 403, and that is what I see when I "view" my site on theirs.
12:34 am on Jun 8, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



BILL, does that CIDR pertain at all to v***center.com? Because I just grepped for the IP and found a handful of "ip-65-38-102-*." variations, ALL using --

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1

-- and with a visiting pattern that looks very iffy.

P.S. Regardless, Snap.com has a screenshot of my main site (can't tell the date), in obvious defiance of robots.txt.

12:57 am on Jun 9, 2006 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Ah, well, Snapbot appears to have multiple data centers for both the crawler and the service making the images.

Here's a couple of blocks definitely used by SnapBot to take pictures:

65.38.102.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

38.98.19.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

66.234.139.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

I know this for a fact, as I found a picture of my error message about their site being banned, with their IP number from the 65.38.102 block embedded in it; the others are known Snap IPs.

Now what's going on with all the groups of Linux Firefox hits on my site is suddenly starting to make a lot of sense.

7:34 pm on Jun 15, 2006 (gmt 0)

5+ Year Member



# Rewrite known bad-bots to smaller robots.txt file
RewriteCond %{HTTP_USER_AGENT} ^Snapbot/
RewriteRule ^robots\.txt$ /bad-bots.txt [L]

Sorry, do we just disallow Snapbot in bad-bots.txt?

User-agent: *
Disallow: /

10:42 pm on Jun 18, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Added this to my restrictions, but it went ahead and jumped a page anyway.
5:51 pm on Jun 27, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



So is this a bad bot or a good bot? This bugger has been crawling like crazy recently. Can anyone tell me what company it's from?
7:41 pm on Jun 27, 2006 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



The following is SNAP also.

65.38.102.147 - - [27/Jun/2006:12:29:12 -0700] "GET /myfoler/mypage.html HTTP/1.1" 403 - "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

Go to their search.
Bring up one of your pages and select "view this site".
Then look at your logs.

12:53 am on Jun 28, 2006 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Already covered in msg #35: three data centers using SnapBot to crawl and Linux Firefox to make the screen shots, and it all shows up on the same IPs.