Thunderstone spider violating robots.txt?

Skimmer_lid

6:06 pm on Apr 10, 2001 (gmt 0)

Can someone tell me what's going on with the Thunderstone spider?

I found this line in my logs:

208.51.0.74 - - [09/Apr/2001:18:29:36 -0400] "GET /dummy/index.html HTTP/1.0" 200 1832 "-" "Mozilla/2.0 (compatible; T-H-U-N-D-E-R-S-T-O-N-E)"

What's interesting here is that the "dummy" directory is a 'spider trap'
and is listed ONLY in the robots.txt "do not enter" list. In other words,
the only way the spider could know about that directory is by reading, and
then violating, the "robots.txt" file.

I've seen this before, and I'm inclined to block both the user agent and
the Thunderstone IP block (208.51.0.0 - 208.51.3.255).
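For anyone wanting to test such a block before putting it in a server config, here is a minimal sketch in Python. It assumes the IP range and user-agent string quoted in this post. The function name `is_blocked` and the `ua_token` parameter are made up for illustration, and nothing here reflects how any particular web server implements blocking.

```python
import ipaddress

# Range quoted in the post: 208.51.0.0 - 208.51.3.255
first = ipaddress.IPv4Address("208.51.0.0")
last = ipaddress.IPv4Address("208.51.3.255")

# Collapse the range into the smallest set of CIDR blocks.
blocks = list(ipaddress.summarize_address_range(first, last))

def is_blocked(ip, user_agent, ua_token="T-H-U-N-D-E-R-S-T-O-N-E"):
    """Hypothetical helper: True if the request matches the range or the UA token."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in blocks) or ua_token in user_agent

print(blocks)
print(is_blocked("208.51.0.74", "Mozilla/2.0 (compatible; T-H-U-N-D-E-R-S-T-O-N-E)"))
```

The range collapses to a single /22, which is usually the easier form to paste into a deny rule.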

Does anyone know more?

Ben

jeremy goodrich

7:06 pm on Apr 10, 2001 (gmt 0)

Welcome to WebmasterWorld, Ben. This could have happened for a number of reasons:

The robots.txt file could be configured improperly. There is a tool at [searchengineworld.com...] for checking your syntax.
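As a rough local check (not the tool linked above), Python's standard `urllib.robotparser` can show how one parser reads a robots.txt file. This is only a sketch: the file contents below are hypothetical, "/dummy/" stands in for the real trap directory, and `is_allowed` is a made-up helper name. A given spider may parse the file differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; "/dummy/" is a placeholder directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dummy/
"""

def is_allowed(robots_txt, user_agent, path):
    """Return True if this parser thinks user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

print(is_allowed(ROBOTS_TXT, "T-H-U-N-D-E-R-S-T-O-N-E", "/dummy/index.html"))
print(is_allowed(ROBOTS_TXT, "T-H-U-N-D-E-R-S-T-O-N-E", "/index.html"))
```

If the first call reports the trap path as allowed, the Disallow line is probably not being read the way you intended.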

Also, if the spider had visited your site before and already knew about the directory, and you then blocked it through robots.txt, the spider might not have "known" that the directory was now forbidden to it.

This spider powers the Dogpile directory, which is a handy place to be listed, since Dogpile generates a lot of referrals. You can read about the Dogpile web directory by following this link: [dpcatalog.dogpile.com...]

The T-H-U-N-D-E-R-S-T-O-N-E spider is made by a company called Thunderstone. You can visit their page here: [thunderstone.com...]

I looked through their information, and I didn't see anything readily available about whether their product obeys robots.txt. You might try contacting Thunderstone, since this sounds like it could be a serious issue affecting the entire webmaster community.

Hope this helps. Keep us posted.

Skimmer_lid

9:04 pm on Apr 10, 2001 (gmt 0)

I'll double-check my robots.txt syntax, but . . .

To date, the only accesses to that directory have been from email-harvesting spam spiders, an image copyright spider, Thunderstone, and some script kiddies. None of the major search engine bots (Google, FAST, AV, etc.) have ever touched it.

FWIW, I'll explain how I've set it up: others here might be interested in the results.

=> create a weirdly named directory
=> refer to it and BLOCK it in robots.txt
=> create a weirdly named page inside the directory.
=> put a clear 1 pixel gif somewhere out of sight on a fairly prominent page, and link it to the weirdly named page.
=> never, NEVER refer to it in any other way (as you might guess, the actual directory name is not the "dummy" reported above).
=> search your logs for accesses to the directory. Basically, all accesses are suspicious, since there is no legitimate way to get there.
=> you can use SSIs and JavaScript to create a browser-crashing, info-gathering page that will maximize the info you collect about the occasional script kiddie or competitor who's snooping your page code. I caught a former subscriber turned wannabe competitor (a senior network engineer at a MAJOR computer company!) poking around that way.
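The log-search step above could be sketched in Python along these lines. It assumes the log uses the Apache "combined" format, as the sample line in the first post appears to, and that "/dummy/" stands in for the real trap directory. `LOG_PATTERN` and `trap_hits` are names made up for this example.

```python
import re

# Apache "combined" log format (assumed from the sample line in this thread).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def trap_hits(lines, trap_prefix="/dummy/"):
    """Return (ip, time, agent) for every request under the trap directory."""
    hits = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("path").startswith(trap_prefix):
            hits.append((m.group("ip"), m.group("time"), m.group("agent")))
    return hits

# The log line quoted earlier in the thread.
sample = [
    '208.51.0.74 - - [09/Apr/2001:18:29:36 -0400] '
    '"GET /dummy/index.html HTTP/1.0" 200 1832 "-" '
    '"Mozilla/2.0 (compatible; T-H-U-N-D-E-R-S-T-O-N-E)"'
]
for ip, when, agent in trap_hits(sample):
    print(ip, when, agent)
```

To run it against a real log, pass an open file object as `lines`. Lines that do not match the assumed format are skipped silently.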

Ben