Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 33-message thread spans 2 pages; this is page 1.
Snapbot
Anyone know what it is?
Mokita




msg:395126
 4:20 am on Jun 2, 2006 (gmt 0)

I've never seen this before, but it has just completely spidered an obscure site, which has only been live about one month and only has one incoming link.

Has anyone else seen it and have any information about what it does? I looked in Google and found a suggestion it might belong to Snap search [snap.com...]

Agent: Snapbot/1.0
IP: 66.234.139.#*$!

 

thetrasher




msg:395127
 6:21 pm on Jun 2, 2006 (gmt 0)

Snapbot seems to be a khtml2png based screenshot engine [dev.upian.com].

Back in 2005, this bot called itself "snap.com beta crawler v0" [webmasterworld.com].

IP range is not subdelegated and reverse DNS lookups fail. It's not from Snap.com / Idealab [psychedelix.com]?!

malachite




msg:395128
 7:54 pm on Jun 2, 2006 (gmt 0)

Turned up on one of my sites yesterday, and on two more today. Can't find any real info on it, so denied it. It doesn't seem to visit robots.txt, just dives straight in at random, one page per IP.

Mokita




msg:395129
 8:10 pm on Jun 2, 2006 (gmt 0)

Thanks for the info thetrasher.

I'm feeling very uneasy about an unknown organisation taking snapshots of all my web pages, so I'll be denying it asap.

I checked the logs of the one site which does have a link to the new site mentioned in my first message and that is where the bot found the link.

malachite: In both sites I've seen it visit, it asked for robots.txt, but of course I have no way of knowing if it would obey one, as I have never seen or heard of the bot previously.

<makes a mental note to never replace IP numbers with three x's - it gets changed into looking like an obfuscated swear word :o >

Pfui




msg:395130
 1:12 am on Jun 3, 2006 (gmt 0)

One second after I banned this bot for only hitting robots.txt over and over again (where it admittedly did heed a blanket Disallow), it then tried to snag a deep file specified in robots.txt as off limits.

We shall see if it intentionally tries to crawl forbidden robots.txt paths...

66.234.139.202 - - [22/May/2006:17:47:25 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"
66.234.139.209 - - [22/May/2006:18:15:59 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"
66.234.139.210 - - [23/May/2006:18:45:18 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.220 - - [24/May/2006:12:08:42 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.220 - - [24/May/2006:12:37:06 -0700] "GET /robots.txt HTTP/1.0" 200 8632 "-" "Snapbot/1.0"
66.234.139.194 - - [24/May/2006:15:55:11 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.205 - - [25/May/2006:12:14:14 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.194 - - [25/May/2006:19:46:33 -0700] "GET /robots.txt HTTP/1.0" 200 8677 "-" "Snapbot/1.0"
66.234.139.214 - - [30/May/2006:18:44:26 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.212 - - [30/May/2006:21:07:00 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.209 - - [30/May/2006:21:33:45 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.197 - - [31/May/2006:00:11:19 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.216 - - [01/Jun/2006:15:57:46 -0700] "GET /robots.txt HTTP/1.0" 403 772 "-" "Snapbot/1.0"
66.234.139.204 - - [01/Jun/2006:15:57:47 -0700] "GET /directory/file.html HTTP/1.0" 403 772 "-" "Snapbot/1.0"
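Sifting logs like the ones above for a given bot is easy to script; a minimal sketch in Python (assuming Apache combined-format lines shaped like those shown):

```python
import re
from collections import Counter

# Capture the client IP and user agent from an Apache combined-log line
LOG_RE = re.compile(r'^(\S+) .* "GET /robots\.txt [^"]*" \d+ \d+ "[^"]*" "([^"]*)"')

def robots_hits(lines):
    """Count robots.txt fetches per IP for the Snapbot UA."""
    hits = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group(2).startswith("Snapbot/"):
            hits[m.group(1)] += 1
    return hits

sample = [
    '66.234.139.202 - - [22/May/2006:17:47:25 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"',
    '66.234.139.202 - - [22/May/2006:18:15:59 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"',
]
print(robots_hits(sample))  # Counter({'66.234.139.202': 2})
```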

FWIW, WHOIS e-mail addresses show the IPs belong to BBCOM via bb.com and bb2.net [bb2.net]; the latter routes back to bb.com, out of Backbone Communications in LA. IP range: 66.234.128.0 to 66.234.255.255.

Snap.com [whois.domaintools.com] is in Pasadena. IP range: 66.148.0.0 to 66.151.255.255.
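To check whether a logged IP falls inside either reported allocation, Python's standard `ipaddress` module is enough; a small sketch using the ranges quoted above:

```python
import ipaddress

def in_range(ip, start, end):
    """True if ip lies in the inclusive range [start, end]."""
    addr = ipaddress.ip_address(ip)
    return ipaddress.ip_address(start) <= addr <= ipaddress.ip_address(end)

# The BBCOM / Backbone Communications range reported above
print(in_range("66.234.139.202", "66.234.128.0", "66.234.255.255"))  # True
# Snap.com's own (Pasadena) allocation -- the bot is NOT crawling from it
print(in_range("66.234.139.202", "66.148.0.0", "66.151.255.255"))    # False
```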

jdMorgan




msg:395131
 1:40 am on Jun 3, 2006 (gmt 0)

Just remember that it is standard practice to interpret an inaccessible robots.txt file as meaning that the site may be spidered without restriction. In other words, if robots.txt cannot be accessed, then spiders act as if robots.txt was non-existent or blank.

I recommend that you always allow access to robots.txt and to your custom 403 error page (if you use one).

If you have a 'pest' UA or IP address that repeatedly fetches robots.txt, then you can feed it a smaller 'cloaked version' containing only

User-agent: *
Disallow: /

to minimize wasted bandwidth.

Jim

wilderness




msg:395132
 1:47 am on Jun 3, 2006 (gmt 0)

One second after I banned this bot for only hitting robots.txt over and over again (where it admittedly did heed a blanket Disallow), it then tried to snag a deep file specified in robots.txt as off limits.

We shall see if it intentionally tries to crawl forbidden robots.txt paths...

66.234.139.202 - - [22/May/2006:17:47:25 -0700] "GET /robots.txt HTTP/1.0" 200 8551 "-" "Snapbot/1.0"

Pfui,
I denied this range on April 23 of this year.

The pattern would mean nothing to anybody but myself; however, the visits are just not consistent with my web pages.

66.234.139.209 - - [18/Apr/2006:11:55:57 -0700] "GET /myFolder/SubFolderwinn/mypgae.html HTTP/1.1" 200 11607 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

The word "Linux" gets my attention rather fast, as I only have an occasional visitor (again, my visitor patterns) with Linux in the UA.

Don

Mokita




msg:395133
 1:53 am on Jun 3, 2006 (gmt 0)

It is one heck of a busy bot! I've been checking the logs of all our sites and so far it has only missed one out of sixteen in the last 20 hours, and most are not linked to each other, nor even on related topics.

I haven't seen it make consecutive requests for robots.txt; maybe I've just been lucky. But the bot has seemed to honour the exclusions, with three exceptions, as follows.

In one instance it asked for the stylesheet, banner graphic and favicon this way:

66.234.139.198 - - [02/Jun/2006:22:58:37 +1000] "GET /favicon.ico HTTP/1.1" 200 2238 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

All those files are disallowed in robots.txt. But as it wasn't technically the bot asking for them, I guess I can't complain :-/

The range of IPs I've seen the bot coming from is 66.234.139.194 to 66.234.139.220.

Anyway, now it will be feeding on a diet of 403s.
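For anyone wanting to do the same in Apache, a minimal .htaccess sketch covering exactly that range (assuming mod_setenvif is available; adapt the UA and range to your own logs):

```apache
# Flag the Snapbot UA and the observed 66.234.139.194-220 range
SetEnvIfNoCase User-Agent "^Snapbot/" bad_bot
SetEnvIf Remote_Addr "^66\.234\.139\.(19[4-9]|2[01][0-9]|220)$" bad_bot

# Apache 1.3/2.x style allow/deny
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```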

fusion5




msg:395134
 3:56 am on Jun 3, 2006 (gmt 0)

I just got hit by them, too.
66.234.139.207
66.234.139.219
66.234.139.196
66.234.139.202
66.234.139.199
66.234.139.212
66.234.139.215
66.234.139.201

-Wonder why they hit from so many IPs

youfoundjake




msg:395135
 5:14 am on Jun 3, 2006 (gmt 0)

Odd, I tried to post about this two days ago in this forum but it never showed up. I have Snapbot and another one called Psycheclone, which had an address of 208.66.195.8.

snapbot hit my site with these...
66.234.139.212
66.234.139.109
66.234.139.203
66.234.139.217

youfoundjake




msg:395136
 5:28 am on Jun 3, 2006 (gmt 0)

After my last post I looked at all the posts in this thread and noticed the domain in the first one. Went out to smoke, and it occurred to me that a week ago I went to the site and did a search for results containing my site. I think that triggered the bot to come take a look.
but i still don't know what the heck psycheclone is... i swear

Mokita




msg:395137
 5:57 am on Jun 3, 2006 (gmt 0)

but i still don't know what the heck psycheclone is... i swear

Have a look at this thread ;)

[webmasterworld.com...]

youfoundjake




msg:395138
 8:06 am on Jun 3, 2006 (gmt 0)

Thanks for the link; yet another thread I feel at home on.
If you have a 'pest' UA or IP address that repeatedly fetches robots.txt, then you can feed it a smaller 'cloaked version' containing only

User-agent: *
Disallow: /

to minimize wasted bandwidth.

How do you mean, cloaked version?

Should I just do a wildcard in the robots.txt file for five minutes, or actually specify the bot?

malachite




msg:395139
 5:18 pm on Jun 3, 2006 (gmt 0)

Persistent little bar-steward, this one. After denying a whole load of IPs yesterday, it's hit me again today with another lot, which will be eating 403s from now on.

At least Psycheclone seems to have slung its hook for now!

<sigh>It's only recently I've started trying to learn about search engines/spiders and who's good v who's bad. I guess I've got a lot to get my head around and some fun (?) times ahead </sigh>

Pfui




msg:395140
 5:30 pm on Jun 3, 2006 (gmt 0)

youfoundjake, I share your question. Jim, if selective cloaking means using cookies somehow, do you have any site pointers for info? I've looked into cookies + robots.txt before and never quite understood the how-to. (And if cookies, would you as a mod prefer I started a new topic in robots.txt [webmasterworld.com]?)

FWIW, bandwidth-wise, I'm now sending a really, really small 'page' -- a custom text 403 via .htaccess and the ErrorDocument "Text Method [bignosebird.com]" (see also: Apache's specifics [httpd.apache.org]). Nifty, quick and no rewriting. Note: must be over 512 bytes to get around Internet Explorer's 'friendly error page' default.
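For reference, the "text method" is Apache's inline form of ErrorDocument: a quoted string instead of a file path. A sketch (the wording is arbitrary):

```apache
# Inline 403 body: a quoted string instead of a file path means Apache
# serves the text itself, so no extra file hit and no rewriting needed.
# Caveat: IE substitutes its own "friendly" page for short error bodies
# (roughly under 512 bytes), so pad the string if human visitors matter.
ErrorDocument 403 "Access denied."
```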

malachite, bot-watching/catching, like the Internet, is an addiction for which there is no cure:) Welcome to the club!

GaryK




msg:395141
 9:38 pm on Jun 4, 2006 (gmt 0)

Amen to that Pfui!

Here's another IP Address to add for this bot:
66.234.139.218

It read but did not respect robots.txt.

jdMorgan




msg:395142
 10:04 pm on Jun 4, 2006 (gmt 0)

My point above, which seems to have been missed, is that if you serve a 403 in response to a request for robots.txt, then the requesting robot can and should feel free to spider your entire site. That is the default behaviour of robots.txt; if it's missing, blank, or inaccessible, then robots are welcome. So be careful.

To reduce bandwidth wasted on feeding a large robots.txt file to bad or useless bots, you can do this (on Apache):

New plain-text file named "bad-bots.txt":
User-agent: *
Disallow: /

In top-level .htaccess:
# Rewrite known bad-bots to smaller robots.txt file
RewriteCond %{HTTP_USER_AGENT} ^Snapbot/
RewriteRule ^robots\.txt$ /bad-bots.txt [L]

I've already posted about serving a tiny 403 response to known pests based on the same principles, so I won't repeat that here.
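In effect, the rules above pick which robots.txt body a client sees based on its User-Agent; the same decision, sketched in Python purely as illustration:

```python
import re

REAL_ROBOTS = "User-agent: *\nDisallow: /private/\n"  # your normal (larger) file
BAD_BOTS = "User-agent: *\nDisallow: /\n"             # the tiny blanket version

def robots_body(user_agent):
    """Mirror the RewriteCond above: UAs starting with 'Snapbot/' get the blanket file."""
    if re.match(r"Snapbot/", user_agent):
        return BAD_BOTS
    return REAL_ROBOTS

print(robots_body("Snapbot/1.0") == BAD_BOTS)        # True
print(robots_body("Googlebot/2.1") == REAL_ROBOTS)   # True
```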

Jim

Mokita




msg:395143
 11:20 pm on Jun 4, 2006 (gmt 0)

Here's another IP Address to add for this bot:
66.234.139.218

That fits into the range that I reported above.

I've noted it coming from virtually every IP between 66.234.139.194-220

If anyone sees it coming from an IP outside that range please let us know.

Pfui




msg:395144
 1:33 am on Jun 5, 2006 (gmt 0)

Thanks for the info, Jim! I'm relieved to learn it's not a cookies-based thing:) And your technique will work well for me because it spares me sending (in addition to access_ and rewrite_ logging) ~15k of robots.txt thousands of times to bots/apps/UAs/IPs/ISPs I don't want visiting in the first place.

Oh, and FWIW, I didn't miss your point about 403s and robots.txt. Actually, I fundamentally agree because my robots.txt files are basically open to all but known abuser types or increasingly suspect repeaters. ("Pest" is almost too innocuous in my book for the majority of automatons I 403-block.)

fiestagirl




msg:395145
 8:35 pm on Jun 5, 2006 (gmt 0)

38.98.19.83-126

A Performance Systems International Inc. range.

Pfui




msg:395146
 10:57 pm on Jun 5, 2006 (gmt 0)

And now (drum roll) two network blocks in one day --

66.234.139.210 - - [05/Jun/2006:00:59:10 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.194 - - [05/Jun/2006:01:46:58 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
66.234.139.209 - - [05/Jun/2006:02:14:26 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
38.98.19.115 - - [05/Jun/2006:12:06:25 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"
38.98.19.116 - - [05/Jun/2006:14:44:35 -0700] "GET /robots.txt HTTP/1.0" 200 9770 "-" "Snapbot/1.0"

(9-plus times/day. I feel a firewall rule coming on...)

incrediBILL




msg:395147
 12:05 am on Jun 8, 2006 (gmt 0)

They also hit your site as Firefox on Linux, which I'm assuming is being used to make those screen shots.

The range 65.38.102.0/24 identifies itself as:
"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

Screen shots made from those IPs show up in Snap, so I'm positive: the site has a screen shot of my bot blocker's error message, containing that IP and user agent, on the page.

At this point I've stopped chasing individual robots (a waste of time); instead I'm blocking out complete hosting server farms and opening up holes for allowed bots to access the site. Blocking BBCOM completely is how I stumbled on the Linux Firefox crawling from BBCOM, as well as a bunch of other junk.

The only innocent victims of my methodology are people working in these hosting farms, who are locked out if they try to access my site from inside their IP block; but that's minimal, and worth the risk vs. all the time wasted otherwise.

fiestagirl




msg:395148
 12:22 am on Jun 8, 2006 (gmt 0)

I know that they use the Linux UA for the screen shots. I show their UA to them when they eat my 403, and that is what I see when I "view" my site on theirs.

Pfui




msg:395149
 12:34 am on Jun 8, 2006 (gmt 0)

BILL, does that CIDR pertain at all to v***center.com because I just grepped for the IP and found a handful of "ip-65-38-102-*." variations, ALL using --

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1

-- and with a visiting pattern that looks very iffy.

P.S. Regardless, Snap.com has a screenshot of my main site (can't tell the date), in obvious defiance of robots.txt.

incrediBILL




msg:395150
 12:57 am on Jun 9, 2006 (gmt 0)

Ah, well, Snapbot appears to have multiple data centers for both the crawler and the service making the images.

Here are a couple of blocks definitely used by Snapbot to take pictures:

65.38.102.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

38.98.19.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

66.234.139.nnn "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

I know this for a fact as I found a picture of my error message about their site being banned with their IP number embedded in it from the 65.38.102. block and the others are known Snap IPs.

Now it's suddenly starting to make a lot of sense what's going on with all the groups of Linux Firefox I'm seeing hit my site.
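Those three blocks lend themselves to a simple membership test; a Python sketch (the /24 boundaries are an assumption read off the .nnn notation above, so verify against your own logs):

```python
import ipaddress

# /24s inferred from the blocks listed above (assumed boundaries)
BLOCKED = [ipaddress.ip_network(c)
           for c in ("65.38.102.0/24", "38.98.19.0/24", "66.234.139.0/24")]

def is_blocked(ip):
    """True if ip falls inside any blocked CIDR."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED)

print(is_blocked("65.38.102.147"))  # True
print(is_blocked("203.0.113.9"))    # False
```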

swapshop




msg:395151
 7:34 pm on Jun 15, 2006 (gmt 0)

# Rewrite known bad-bots to smaller robots.txt file
RewriteCond %{HTTP_USER_AGENT} ^Snapbot/
RewriteRule ^robots\.txt$ /bad-bots.txt [L]

Sorry, do we just disallow Snapbot in bad-bots.txt?

User-agent: *
Disallow: /

youfoundjake




msg:395152
 10:42 pm on Jun 18, 2006 (gmt 0)

Added this to my restrictions, but it went ahead and jumped a page anyway.

trinorthlighting




msg:395153
 5:51 pm on Jun 27, 2006 (gmt 0)

So is this a bad bot or a good bot? This bugger has been crawling like crazy recently. Can anyone tell me what company it's from?

wilderness




msg:395154
 7:41 pm on Jun 27, 2006 (gmt 0)

The following is SNAP also.

65.38.102.147 - - [27/Jun/2006:12:29:12 -0700] "GET /myfoler/mypage.html HTTP/1.1" 403 - "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"

Go to their search.
Bring up one of your pages and select "view this site".
Then look at your logs.

incrediBILL




msg:395155
 12:53 am on Jun 28, 2006 (gmt 0)

Already covered in msg #35: three data centers using Snapbot to crawl and Linux Firefox to make screen shots, and it all shows up on the same IPs.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved