

Snapbot

Anyone know what it is?

   
4:20 am on Jun 2, 2006 (gmt 0)

5+ Year Member



I've never seen this before, but it has just completely spidered an obscure site, which has only been live about one month and only has one incoming link.

Has anyone else seen it, and does anyone have information about what it does? I looked in Google and found a suggestion that it might belong to Snap search [snap.com...]

Agent: Snapbot/1.0
IP: 66.234.139.#*$!

6:21 pm on Jun 2, 2006 (gmt 0)

5+ Year Member



Snapbot seems to be a khtml2png-based screenshot engine [dev.upian.com].

Back in 2005, this bot called itself "snap.com beta crawler v0" [webmasterworld.com].

The IP range is not subdelegated and reverse DNS lookups fail. It's not from Snap.com / Idealab [psychedelix.com]?!

7:54 pm on Jun 2, 2006 (gmt 0)

5+ Year Member



Turned up on one of my sites yesterday, and on two more today. Can't find any real info on it, so I denied it. It doesn't seem to visit robots.txt; it just dives straight in at random, one page per IP.
8:10 pm on Jun 2, 2006 (gmt 0)

5+ Year Member



Thanks for the info, thetrasher.

I'm feeling very uneasy about an unknown organisation taking snapshots of all my web pages, so I'll be denying it asap.

I checked the logs of the one site that does have a link to the new site mentioned in my first message, and that is indeed where the bot found the link.

malachite: On both sites I've seen it visit, it asked for robots.txt, but of course I have no way of knowing whether it would obey one, as I had never seen or heard of the bot previously.

<makes a mental note to never replace IP numbers with three x's - it gets changed into looking like an obfuscated swear word :o >

1:12 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



One second after I banned this bot for only hitting robots.txt over and over again (where it admittedly did heed a blanket Disallow), it then tried to snag a deep file specified in robots.txt as off limits.

We shall see if it intentionally tries to crawl forbidden robots.txt paths...

66.234.139.202 - - [22/May/2006:17:47:25 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"
66.234.139.209 - - [22/May/2006:18:15:59 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"
66.234.139.210 - - [23/May/2006:18:45:18 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.220 - - [24/May/2006:12:08:42 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.220 - - [24/May/2006:12:37:06 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.194 - - [24/May/2006:15:55:11 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.205 - - [25/May/2006:12:14:14 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.194 - - [25/May/2006:19:46:33 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.214 - - [30/May/2006:18:44:26 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.212 - - [30/May/2006:21:07:00 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.209 - - [30/May/2006:21:33:45 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.197 - - [31/May/2006:00:11:19 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.216 - - [01/Jun/2006:15:57:46 -0700] "GET /robots.txt HTTP/1.0" 403 772 "-" "Snapbot/1.0"
66.234.139.204 - - [01/Jun/2006:15:57:47 -0700] "GET /directory/file.html HTTP/1.0" 403 772 "-" "Snapbot/1.0"

FWIW, WHOIS e-mail addresses show the IPs belong to BBCOM via bb.com and bb2.net [bb2.net]; the latter routes back to bb.com, out of Backbone Communications in LA. IP range: 66.234.128.0 to 66.234.255.255.

Snap.com [whois.domaintools.com] is in Pasadena. IP range: 66.148.0.0 to 66.151.255.255.

1:40 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Just remember that it is standard practice to interpret an inaccessible robots.txt file as meaning that the site may be spidered without restriction. In other words, if robots.txt cannot be accessed, then spiders act as if robots.txt was non-existent or blank.

I recommend that you always allow access to robots.txt and to your custom 403 error page (if you use one).

If you have a 'pest' UA or IP address that repeatedly fetches robots.txt, then you can feed it a smaller 'cloaked version' containing only


User-agent: *
Disallow: /

to minimize wasted bandwidth.

Jim

1:47 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



One second after I banned this bot for only hitting robots.txt over and over again (where it admittedly did heed a blanket Disallow), it then tried to snag a deep file specified in robots.txt as off limits.

We shall see if it intentionally tries to crawl forbidden robots.txt paths...

66.234.139.202 - - [22/May/2006:17:47:25 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"

Pfui,
I denied this range on April 23 of this year.

The pattern would mean nothing to anybody but myself; however, the visits are just not consistent with my web pages.

66.234.139.209 - - [18/Apr/2006:11:55:57 -0700] "GET /myFolder/SubFolderwinn/mypage.html HTTP/1.1" 200 11607 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

the word "Linux" gets my attention rather fast, only having an occasional visitor (again my visitor patterns) with Linux in the UA.

Don

1:53 am on Jun 3, 2006 (gmt 0)

5+ Year Member



It is one heck of a busy bot! I've been checking the logs of all our sites, and so far it has only missed one out of sixteen in the last 20 hours, and most are not linked to each other, nor even on related topics.

I haven't seen it make consecutive requests for robots.txt; maybe I've just been lucky. But the bot has seemed to honour the exclusions, with three exceptions, as follows.

In one instance it asked for the stylesheet, banner graphic and favicon this way:

66.234.139.198 - - [02/Jun/2006:22:58:37 +1000] "GET /favicon.ico HTTP/1.1" 200 2238 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

All those files are disallowed in robots.txt. But as it wasn't technically the bot asking for them, I guess I can't complain :-/

The range of IPs I've seen the bot coming from is 66.234.139.194 to 66.234.139.220.

Anyway, now it will be feeding on a diet of 403s.
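
For anyone wanting to serve the same diet, here is a minimal .htaccess sketch (assuming Apache with mod_access; the /27 is just a convenient rounding of the range above, so it also covers .192-.193 and .221-.223):

# Deny the observed Snapbot range, 66.234.139.194-220
# (66.234.139.192/27 spans .192-.223, slightly wider than reported)
Order Allow,Deny
Allow from all
Deny from 66.234.139.192/27

# Keep robots.txt readable, per jdmorgan's advice above
<Files "robots.txt">
Order Deny,Allow
Allow from all
</Files>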

3:56 am on Jun 3, 2006 (gmt 0)

5+ Year Member



I just got hit by them, too.
66.234.139.207
66.234.139.219
66.234.139.196
66.234.139.202
66.234.139.199
66.234.139.212
66.234.139.215
66.234.139.201

-Wonder why they hit from so many IPs

5:14 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Odd, I tried to post about this two days ago in this forum but it never showed up. I have Snapbot and another one called Psycheclone, which had an address of 208.66.195.8.

Snapbot hit my site with these...
66.234.139.212
66.234.139.109
66.234.139.203
66.234.139.217

5:28 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



After my last post I looked at all the posts in this thread and noticed the domain in the first one. I went out to smoke, and it occurred to me that a week ago I went to the site and did a search for results containing my site. I think that triggered the bot to come take a look.
But I still don't know what the heck Psycheclone is... I swear.
5:57 am on Jun 3, 2006 (gmt 0)

5+ Year Member



But I still don't know what the heck Psycheclone is... I swear.

Have a look at this thread ;)

[webmasterworld.com...]

8:06 am on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Thanks for the link, yet another thread I feel at home on.
If you have a 'pest' UA or IP address that repeatedly fetches robots.txt, then you can feed it a smaller 'cloaked version' containing only

User-agent: *
Disallow: /

to minimize wasted bandwidth.

How do you mean, a 'cloaked version'?

Should I just do a wildcard in the robots.txt file for 5 minutes, or actually specify the bot?

5:18 pm on Jun 3, 2006 (gmt 0)

5+ Year Member



Persistent little bar-steward, this one. After denying a whole load of IPs yesterday, it's hit me again today with another lot, which will be eating 403s from now on.

At least Psycheclone seems to have slung its hook for now!

<sigh>It's only recently I've started trying to learn about search engines/spiders and who's good v who's bad. I guess I've got a lot to get my head around and some fun (?) times ahead </sigh>

5:30 pm on Jun 3, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



youfoundjake, I share your question. Jim, if selective cloaking means using cookies somehow, do you have any site pointers for info? I've looked into cookies + robots.txt before and never quite understood the how-to. (And if cookies, would you as a mod prefer I started a new topic in robots.txt [webmasterworld.com]?)

FWIW, bandwidth-wise, I'm now sending a really, really small 'page': a custom text 403 via .htaccess and the ErrorDocument "Text Method [bignosebird.com]" (see also: Apache's specifics [httpd.apache.org]). Nifty, quick and no rewriting. Note: it must be over 512 bytes to get around Internet Explorer's "friendly error message" default.
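
(For anyone who hasn't seen it, the "text method" is just ErrorDocument pointing at a quoted string instead of a file or URL. A minimal sketch, assuming Apache 2.x; in 1.3 a single leading quote marks the rest of the line as literal message text:)

# Serve a tiny literal string as the 403 body: no file, no rewriting
ErrorDocument 403 "Forbidden."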

malachite, bot-watching/catching, like the Internet, is an addiction for which there is no cure:) Welcome to the club!

9:38 pm on Jun 4, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Amen to that Pfui!

Here's another IP Address to add for this bot:
66.234.139.218

It read but did not respect robots.txt.

10:04 pm on Jun 4, 2006 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



My point above, which seems to have been missed, is that if you serve a 403 in response to a request for robots.txt, then the requesting robot can and should feel free to spider your entire site. That is the default behaviour of robots.txt; if it's missing, blank, or inaccessible, then robots are welcome. So be careful.

To reduce bandwidth wasted on feeding a large robots.txt file to bad or useless bots, you can do this (on Apache):

New plain-text file named "bad-bots.txt":

User-agent: *
Disallow: /

In top-level .htaccess:

# Rewrite known bad-bots to smaller robots.txt file
RewriteCond %{HTTP_USER_AGENT} ^Snapbot/
RewriteRule ^robots\.txt$ /bad-bots.txt [L]

I've already posted about serving a tiny 403 response to known pests based on the same principles, so I won't repeat that here.
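
(The gist of it, as a minimal sketch rather than the exact rules from that earlier post; this assumes Apache with mod_rewrite, and uses the bot from this thread as the example pest:)

# Force a 403 for the pest UA on everything except robots.txt
# (kept readable, as recommended above) and the small bad-bots.txt it is fed
RewriteCond %{HTTP_USER_AGENT} ^Snapbot/
RewriteRule !^(robots\.txt|bad-bots\.txt)$ - [F]

# Pair it with a tiny 403 body so each denial costs almost no bandwidth
ErrorDocument 403 "Forbidden."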

Jim

11:20 pm on Jun 4, 2006 (gmt 0)

5+ Year Member



Here's another IP Address to add for this bot:
66.234.139.218

That fits into the range that I reported above.

I've noted it coming from virtually every IP between 66.234.139.194 and 66.234.139.220.

If anyone sees it coming from an IP outside that range please let us know.

1:33 am on Jun 5, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Thanks for the info, Jim! I'm relieved to learn it's not a cookies-based thing:) And your technique will work well for me because it spares me sending ~15k of robots.txt thousands of times (in addition to access_ and rewrite_ logging) to bots/apps/UAs/IPs/ISPs I don't want visiting in the first place.

Oh, and FWIW, I didn't miss your point about 403s and robots.txt. Actually, I fundamentally agree because my robots.txt files are basically open to all but known abuser types or increasingly suspect repeaters. ("Pest" is almost too innocuous in my book for the majority of automatons I 403-block.)

8:35 pm on Jun 5, 2006 (gmt 0)

10+ Year Member



38.98.19.83-126

A Performance Systems International Inc. range.

10:57 pm on Jun 5, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



And now (drum roll) two network blocks in one day --

66.234.139.210 - - [05/Jun/2006:00:59:10 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.194 - - [05/Jun/2006:01:46:58 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.209 - - [05/Jun/2006:02:14:26 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
38.98.19.115 - - [05/Jun/2006:12:06:25 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
38.98.19.116 - - [05/Jun/2006:14:44:35 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"

(9-plus times/day. I feel a firewall rule coming on...)

12:05 am on Jun 8, 2006 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



They also hit your site as Firefox on Linux, which I'm assuming is being used to make those screen shots.

The range 65.38.102.0/24 identifies itself as:
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

Screen shots made from those IPs show up in Snap, so I'm positive: the site has a screen shot of my bot blocker's error message, containing that IP and user agent, on the page.

At this point I've stopped chasing individual robots (a waste of time); instead I'm blocking out complete hosting server farms and opening up holes for allowed bots to access the site. Blocking BBCOM completely is how I stumbled on the Linux Firefox crawling coming from BBCOM, as well as a bunch of other junk.

The only innocent victims of my methodology are people working in those hosting farms, who are locked out if they try to access my site from inside their IP block, but that's minimal and worth the risk vs. all the time wasted otherwise.
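
(A minimal mod_rewrite sketch of the farm-blocking idea, assuming Apache; the range is the BBCOM allocation posted earlier in this thread, and the "hole" shown in the comment is purely illustrative:)

# Deny the whole BBCOM allocation, 66.234.128.0 - 66.234.255.255
RewriteCond %{REMOTE_ADDR} ^66\.234\.(12[89]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])\.
# Holes for bots you trust go here as extra negated conditions, e.g.
# RewriteCond %{HTTP_USER_AGENT} !SomeGoodBot (hypothetical name)
RewriteRule .* - [F]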

12:22 am on Jun 8, 2006 (gmt 0)

10+ Year Member



I know that they do use the Linux UA for the screen shots. I show their UA back to them when they eat my 403, and that is what I see when I "view" my site on theirs.
12:34 am on Jun 8, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



BILL, does that CIDR pertain at all to v***center.com? Because I just grepped for the IP and found a handful of "ip-65-38-102-*." variations, ALL using --

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1

-- and with a visiting pattern that looks very iffy.

P.S. Regardless, Snap.com has a screenshot of my main site (can't tell the date), in obvious defiance of robots.txt.

12:57 am on Jun 9, 2006 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Ah, well, Snapbot appears to have multiple data centers for both the crawler and the service making the images.

Here's a couple of blocks definitely used by SnapBot to take pictures:

65.38.102.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

38.98.19.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

66.234.139.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

I know this for a fact, as I found a picture of my error message about their site being banned, with their IP number from the 65.38.102 block embedded in it; the others are known Snap IPs.

Now what's going on with all the groups of Linux Firefox hits on my site is suddenly starting to make a lot of sense.

7:34 pm on Jun 15, 2006 (gmt 0)

5+ Year Member



# Rewrite known bad-bots to smaller robots.txt file
RewriteCond %{HTTP_USER_AGENT} ^Snapbot/
RewriteRule ^robots\.txt$ /bad-bots.txt [L]

Sorry, do we just disallow Snapbot in bad-bots.txt?

User-agent: *
Disallow: /

10:42 pm on Jun 18, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Added this to my restrictions, but it went ahead and jumped a page anyway.
5:51 pm on Jun 27, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



So is this a bad bot or a good bot? This bugger has been crawling like crazy recently. Can anyone tell me what company it's from?
7:41 pm on Jun 27, 2006 (gmt 0)

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



The following is SNAP also.

65.38.102.147 - - [27/Jun/2006:12:29:12 -0700] "GET /myfoler/mypage.html HTTP/1.1" 403 - "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

Go to their search.
Bring up one of your pages and select "view this site".
Then look at your logs.

12:53 am on Jun 28, 2006 (gmt 0)

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Already covered in msg #35: three data centers using SnapBot to crawl and Linux Firefox to make the screen shots, and it all shows up on the same IPs.