It then made a request for a search query
GET /search/search.pl?Realm=All&Match=0 &Terms=Philippines+%7c+Tanikalang+Ginto+%7c+filipinolinks.com &nocpp=1 &maxhits=10 &Rank=111 HTTP/1.0" 200 14252 "-" "ia_archiver"
On my results pages there is a link saying "similar", and if a user clicks on this link it will do another search for the title of the site that the link refers to. ia_archiver followed each of these links, then each of the links on the new SERP, and was carrying out literally loads of searches, all for fairly long search queries. There were over 1000 searches made from this one IP address, and all were automated by ia_archiver. I need to ban this one... any tips?
It is still hammering me as we speak.
The first thing it did was request robots.txt. I have a robots.txt file there, but the server still responded with a 404...
GET /robots.txt HTTP/1.0" 404 281 "-" "ia_archiver"
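A rough way to confirm this from the logs is to tally requests per user agent and note what status each agent got for robots.txt. The sketch below is only an illustration: it assumes log lines end the way the two quoted in this thread do (request, status, size, referer, user agent), and the names `LOG_TAIL` and `summarize` are made up for the example. The last two sample lines are invented, not taken from the real log.

```python
import re
from collections import Counter

# Matches the tail of a log line in the shape quoted in this thread, e.g.
#   GET /robots.txt HTTP/1.0" 404 281 "-" "ia_archiver"
# (Assumption: the real log uses this layout on every line.)
LOG_TAIL = re.compile(
    r'(?P<method>[A-Z]+) (?P<path>.+?) HTTP/\d\.\d" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def summarize(lines):
    """Return (hits per user agent, robots.txt statuses per user agent)."""
    hits = Counter()
    robots_status = {}
    for line in lines:
        m = LOG_TAIL.search(line)
        if not m:
            continue  # skip lines that do not fit the assumed layout
        agent = m.group("agent")
        hits[agent] += 1
        # Compare the path without its query string.
        if m.group("path").split("?", 1)[0] == "/robots.txt":
            robots_status.setdefault(agent, []).append(int(m.group("status")))
    return hits, robots_status

sample = [
    'GET /robots.txt HTTP/1.0" 404 281 "-" "ia_archiver"',
    # Shortened, illustrative search request:
    'GET /search/search.pl?Realm=All&Match=0 HTTP/1.0" 200 14252 "-" "ia_archiver"',
    # Invented line from a different, hypothetical user agent:
    'GET /index.html HTTP/1.1" 200 5120 "-" "SomeBrowser/1.0"',
]

hits, robots_status = summarize(sample)
for agent, count in hits.most_common():
    print(agent, count, robots_status.get(agent, []))
```

Run against the real access log (for example by passing an open file object to `summarize`), this would show at a glance how many requests came from ia_archiver and whether it ever received anything other than a 404 for robots.txt.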
Just a point of clarification first...
Your server responded with a 404-Not Found to ia's request for robots.txt... Why?
Make sure you are not blocking access to robots.txt (or to all files) for requests with a blank referer, a certain IP address range, or some other condition that would cause your server to send a 404 to the 'bot.
IA archiver will index everything it can find, so you may want to intervene and fix the 404 on robots.txt, or just block ia altogether until it quits trying to request files. It'll be back next week or next month to try again anyway, by which time you can resolve the robots.txt problem and either block it or let it into certain areas as desired.
Man, I don't know how big of a "space" a search engine represents to an archiver, but I'll bet you've eaten a big chunk of disk at IA already!
Jim
The robots.txt in question bans ia_archiver. Is this possibly their way of ignoring being barred?
I also don't see why it should arrive with a query. My site has no links that point directly to a query. I also took another look, and it was requesting multiple queries before it started following links.
I also don't recognise their IP address as being owned by Alexa or the Wayback Machine?
209.237.238.164
I don't see why your server responded with a 404-Not Found either - but it did! So, not finding a robots.txt file, the user agent assumed your site was "wide open" and started requesting files.
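To illustrate the difference, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not how ia_archiver itself works; it only shows how a well-behaved client would read the two situations. The rule text is an assumed example of a robots.txt that bans ia_archiver, since the actual file was not posted, and the 404 case is modelled as an empty rule set.

```python
from urllib.robotparser import RobotFileParser

SEARCH_URL = "/search/search.pl?Realm=All&Match=0"

# Case 1: the bot actually receives a robots.txt that bans it.
# (Assumed contents; the real file was not shown in the thread.)
served = RobotFileParser()
served.parse([
    "User-agent: ia_archiver",
    "Disallow: /",
])

# Case 2: the server answers 404, so the bot has no rules at all.
# Modelled here as parsing an empty file (an assumption for illustration).
missing = RobotFileParser()
missing.parse([])

banned_when_served = not served.can_fetch("ia_archiver", SEARCH_URL)
allowed_when_missing = missing.can_fetch("ia_archiver", SEARCH_URL)
others_unaffected = served.can_fetch("SomeOtherBot", SEARCH_URL)

print("banned when robots.txt is served:", banned_when_served)
print("allowed when robots.txt is missing:", allowed_when_missing)
print("other agents unaffected by the ban:", others_unaffected)
```

So the ban only takes effect if the file is actually delivered to the bot; with a 404 in the way, nothing in it is ever seen.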
I traced the block of IP addresses that includes the one you cited, and I have seen IA requests from that address block before. But I can't say for sure if they were spider requests or Wayback Machine-driven requests. As I recall (and I'm pretty sure of this), the Wayback Machine does not cache everything from your site, and a visit to your cached page will often result in requests for images, etc. from that IP block.
You might want to visit Wayback, and see if you can actually do a search through their cached version of your search page, or if something breaks when you try it. This will also allow you to confirm the IP address you saw. If it results in requests to your site, you'll see it in your logs.
Hope this helps,
Jim
"We're sorry, this robots.txt does NOT validate.
Warnings Detected: 8
Errors Detected: 5"
My vote is to block them before they have a chance to PROFIT from selling info about me in the future.
[webmasterworld.com...]