"Phantom" bot

help in identifying invisible bot

         

bobmark

7:27 pm on Mar 10, 2003 (gmt 0)

10+ Year Member



I have a weird problem.
My pages translate into 6 languages. In the past, several unruly bots that followed every link would trigger 6 translations of every page, which caused problems with my translation service because these bots would request thousands of translations per hour.
However, I could always find and ban the bots, as my logs would show an initial request for widgets.html followed by the 6 translation requests (the translation GETs show only the translation service's IP address).
This time I am stumped. I can see the six translation requests - in some cases for pages temporarily rendered inactive on my site by a "noindex,nofollow" robots meta tag - BUT I have no corresponding log entry for the initial page request by the bot.
Apparently something is crawling my site but no log entries are written.
Anybody know how to solve this?
Mark

weesnich

8:50 pm on Mar 10, 2003 (gmt 0)

10+ Year Member



Perhaps a cache-related problem?

For example, a bot run by an AOL user gets your main page from cache because it was visited recently, but your other pages aren't cached. Or the bot gets its starting point from the Google cache, and you'll never see this since it requests no images and sends no referrers (this should only happen if Google has not yet found your noindex,nofollow tag).

How about adding some bot-trap links that are invisible to normal visitors but will be followed by robots? This might give you a clear signal that spidering is going on.
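As a rough illustration of the bot-trap idea, here is a minimal sketch that scans access-log lines for requests to a trap page and counts them per IP. It assumes the server writes Apache Common Log Format; the trap URL `/bot-trap.html`, the function name `trap_hits` and the sample log lines are all made up for the example and are not from the site discussed here.

```python
import re

# Assumes Apache Common Log Format; adjust the pattern for other formats.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" \d{3} \S+'
)

TRAP_PATH = "/bot-trap.html"  # hypothetical URL, linked invisibly from real pages


def trap_hits(lines, trap_path=TRAP_PATH):
    """Return {ip: hit_count} for every client that requested the trap page."""
    hits = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        path = match.group("path").split("?", 1)[0]  # ignore any query string
        if path == trap_path:
            ip = match.group("ip")
            hits[ip] = hits.get(ip, 0) + 1
    return hits


# Invented sample lines using documentation-only IP ranges.
sample = [
    '192.0.2.10 - - [10/Mar/2003:19:27:00 +0000] "GET /widgets.html HTTP/1.0" 200 5120',
    '192.0.2.10 - - [10/Mar/2003:19:27:02 +0000] "GET /bot-trap.html HTTP/1.0" 200 312',
    '198.51.100.7 - - [10/Mar/2003:19:30:41 +0000] "GET /widgets.html HTTP/1.0" 200 5120',
]
print(trap_hits(sample))
```

Any IP that shows up in the result followed a link no human visitor should see, which makes it a candidate for banning. Note that a trap only catches clients that actually hit the server; a bot working from a cached copy would not appear.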

Let's see what ideas the fellow members have...

bobmark

2:53 am on Mar 11, 2003 (gmt 0)

10+ Year Member



Thanks for your help, weesnich.
Google has found the noindex,nofollow so that isn't a factor.
I spent about an hour or more going through the log line by line and it is absolutely weird.
I have translation requests for pages with zero page requests - i.e. the ones that are not indexed - so, strange as it seems, it is like a bot that writes no line to the log file. It is like I am seeing its shadow in the translation requests.
Because of that, it's hard to see how a bot trap would help me. In essence my non-indexed pages ARE a bot trap but the only thing I could block from them is my translation service which obviously I don't want to do. Honestly there is no log entry for any page request on some of these pages other than the GET from the translation service which is only triggered if the link to it is clicked.
I agree with you on the cache thing, but am at a loss to see how someone could, apparently at will, work from a cache of all - or at least most - of my pages.
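The line-by-line check described above can be roughed out in code. This is only a sketch under assumptions: the log has already been reduced to (ip, path) pairs, the translation service always connects from one known IP (here the placeholder `203.0.113.5`), and the name `orphan_translations` is invented for the example.

```python
def orphan_translations(entries, translator_ip):
    """Find pages fetched by the translation service but by nobody else.

    entries: iterable of (ip, path) pairs taken from the access log.
    Returns {path: number_of_translation_requests} for the orphaned pages.
    """
    entries = list(entries)  # allow two passes even if given a generator
    # Pages that at least one non-translator client requested directly.
    direct = {path for ip, path in entries if ip != translator_ip}
    orphans = {}
    for ip, path in entries:
        if ip == translator_ip and path not in direct:
            orphans[path] = orphans.get(path, 0) + 1
    return orphans


TRANSLATOR_IP = "203.0.113.5"  # placeholder for the translation service's IP

# Invented data: widgets.html has a direct request, hidden.html does not.
log_entries = [
    ("192.0.2.10", "/widgets.html"),
    (TRANSLATOR_IP, "/widgets.html"),
    (TRANSLATOR_IP, "/hidden.html"),
    (TRANSLATOR_IP, "/hidden.html"),
]
print(orphan_translations(log_entries, TRANSLATOR_IP))
```

Pages in the result are the "shadow" cases: translated on request, yet never requested directly in the period the log covers. The sketch ignores timing, so a direct request from weeks earlier in the same log would hide an orphan.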

bobmark

3:58 pm on Mar 11, 2003 (gmt 0)

10+ Year Member



Sorry to keep replying to my own post, but it is too late to edit my previous one.
All my hosting company and I can think of is that someone cached all the pages and is now crawling their cache. I have Net Vampire and all the usual leech programs banned, but obviously it could be a new one (or at least new to me).
When you think about it, it is a smart idea for an email harvester or other malicious bot: do unobtrusive GETs over a fairly long period, cache the pages, then hit the cached copies in a rush and leave no trace (I would have to go back through previous logs and guess/detect the initial GETs).
This is new to me but maybe not to others on here.
If I find the IP address, I'll post it here for banning.
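One way to go back through previous logs for the initial GETs is to look for clients that, over the whole period, fetched an unusually large share of the site's pages. The sketch below assumes the old logs are already reduced to (ip, path) pairs; the function name `broad_crawlers`, the 0.8 threshold and the sample data are arbitrary choices for illustration, not a tested detection rule.

```python
from collections import defaultdict


def broad_crawlers(entries, min_fraction=0.8, ignore=()):
    """Return {ip: fraction} for clients that fetched most distinct pages.

    entries: iterable of (ip, path) pairs from the older logs.
    min_fraction: share of all distinct pages a client must have fetched.
    ignore: IPs to leave out, e.g. the translation service.
    """
    pages_by_ip = defaultdict(set)
    all_pages = set()
    for ip, path in entries:
        all_pages.add(path)
        if ip not in ignore:
            pages_by_ip[ip].add(path)
    if not all_pages:
        return {}
    suspects = {}
    for ip, pages in pages_by_ip.items():
        fraction = len(pages) / len(all_pages)
        if fraction >= min_fraction:
            suspects[ip] = fraction
    return suspects


# Invented data: one client slowly collects every page, another reads one.
old_entries = [
    ("192.0.2.77", "/a.html"),
    ("192.0.2.77", "/b.html"),
    ("192.0.2.77", "/c.html"),
    ("192.0.2.77", "/d.html"),
    ("198.51.100.7", "/a.html"),
]
print(broad_crawlers(old_entries))
```

Legitimate search-engine spiders will also show up in such a list, so the result is a set of candidates to inspect by hand, not a ban list.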