I am getting hit really hard...


ia_archiver

mack

3:58 am on Sep 23, 2002 (gmt 0)

My site is a small-scale search engine. Today I received a visit from ia_archiver, but with a difference.
The first thing it did was request robots.txt. I have a robots.txt file there, but the server still responded with a 404...
GET /robots.txt HTTP/1.0" 404 281 "-" "ia_archiver"

It then made a request for a search query:
GET /search/search.pl?Realm=All&Match=0 &Terms=Philippines+%7c+Tanikalang+Ginto+%7c+filipinolinks.com &nocpp=1 &maxhits=10 &Rank=111 HTTP/1.0" 200 14252 "-" "ia_archiver"

On my results pages there is a link labeled "similar"; if a user clicks it, it runs another search for the title of the site that the link refers to. The bot followed each of these links, then each of the links on the new SERP, and carried out literally loads of searches, all for fairly long queries. There were over 1000 searches made from this one IP address, all automated by ia_archiver. I need to ban this one... any tips?

It is still hammering me as we speak.

jdMorgan

4:41 am on Sep 23, 2002 (gmt 0)

The first thing it did was request robots.txt. I have a robots.txt file there, but the server still responded with a 404...
GET /robots.txt HTTP/1.0" 404 281 "-" "ia_archiver"

Just a point of clarification first...
Your server responded with a 404-Not Found to ia's request for robots.txt... Why?
Make sure you are not blocking access to robots.txt (or to all files) for requests with a blank referer, a certain IP address range, or some other condition that would cause your server to send a 404 to the 'bot.
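One way to test this is to request robots.txt while sending the same User-Agent string the bot used, then compare the status with what a browser-style request gets. The Python sketch below is only an illustration: the helper names `build_robots_request` and `robots_status` and the `http://www.example.com` address are made up for the example, and the request goes out as HTTP/1.1 rather than the HTTP/1.0 seen in the log.

```python
import urllib.error
import urllib.request


def build_robots_request(base_url, user_agent="ia_archiver"):
    """Build a GET request for /robots.txt that sends the given User-Agent."""
    url = base_url.rstrip("/") + "/robots.txt"
    # No Referer header is added, which matches the "-" referer in the log.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def robots_status(base_url, user_agent="ia_archiver", timeout=10):
    """Return the HTTP status code the server sends for /robots.txt."""
    request = build_robots_request(base_url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses are raised as HTTPError; the code is what we want.
        return error.code
```

Calling `robots_status("http://www.example.com", "ia_archiver")` and then again with a browser-style User-Agent would show whether the server treats the two differently. If both return 200, the 404 may depend on the requesting IP address instead, which this check cannot reproduce.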

IA archiver will index everything it can find, so you may want to intervene and fix the 404 on robots.txt, or just block ia altogether until it quits trying to request files. It'll be back next week or next month to try again anyway, by which time you can resolve the robots.txt problem and either block it or let it into certain areas as desired.
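As a rough sketch of "block it or let it into certain areas", the rules below shut ia_archiver out completely and keep every other robot out of the search script. The `/search/` path is taken from the logged request; the rule set itself and the helper name `allowed` are assumptions for illustration, not the poster's actual file. Python's standard `urllib.robotparser` is used here only to check how a well-behaved robot would read the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, held in a string for the example.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /search/
"""


def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if the rules permit user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

With these rules, `allowed("ia_archiver", "/search/search.pl")` is False, while other robots can still reach ordinary pages but not the search results. None of this helps unless the robot actually receives the file, so the 404 has to be fixed first.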

Man, I don't know how big of a "space" a search engine represents to an archiver, but I'll bet you've eaten a big chunk of disk at IA already!

Jim

mack

4:53 am on Sep 23, 2002 (gmt 0)

Don't see why it would have been a 404... I can access it from my browser. I also ran my bot on it today and it responded OK.

The robots.txt in question bans ia_archiver. Is this possibly their way of ignoring being barred?

I also don't see why it should arrive with a query. My site has no links that point directly to a query. I also took another look, and it was requesting multiple queries before it started following links.

I also don't recognise their IP address as being owned by Alexa or the Wayback Machine:

209.237.238.164

jdMorgan

5:07 am on Sep 23, 2002 (gmt 0)

mack,

I don't see why your server responded with a 404-Not Found either - but it did! So, not finding a robots.txt file, the user agent assumed your site was "wide open" and started requesting files.

I traced the block of IP addresses that includes the one you cited, and I have seen IA requests from that address block before. But I can't say for sure if they were spider requests or Wayback Machine-driven requests. As I recall (pretty sure, though), the Wayback Machine does not cache everything from your site - and a visit to your cached page will often result in requests for images, etc. from that IP block.

You might want to visit Wayback, and see if you can actually do a search through their cached version of your search page, or if something breaks when you try it. This will also allow you to confirm the IP address you saw. If it results in requests to your site, you'll see it in your logs.
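To pick those requests out of the logs, a small script along these lines could count hits per IP address. It is a sketch that assumes the server writes Apache combined-format lines; the log excerpts quoted in this thread are partial, so the sample lines below (timestamps, the second visitor at 192.0.2.10) are invented for illustration, and the name `count_hits` is hypothetical.

```python
import re
from collections import Counter

# Assumed Apache "combined" layout:
# ip ident user [time] "request" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Invented sample lines; only the IP, paths, and agent come from the thread.
SAMPLE_LINES = [
    '209.237.238.164 - - [23/Sep/2002:00:00:01 +0000] '
    '"GET /robots.txt HTTP/1.0" 404 281 "-" "ia_archiver"',
    '209.237.238.164 - - [23/Sep/2002:00:00:02 +0000] '
    '"GET /search/search.pl?Realm=All HTTP/1.0" 200 14252 "-" "ia_archiver"',
    '192.0.2.10 - - [23/Sep/2002:00:00:03 +0000] '
    '"GET / HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
]


def count_hits(lines, agent=None):
    """Count requests per IP, optionally only for a matching User-Agent."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        if agent is not None and agent.lower() not in match.group("agent").lower():
            continue
        hits[match.group("ip")] += 1
    return hits
```

Run against the real access log (for example `count_hits(open("access_log"), agent="ia_archiver")`, where the file name is a placeholder), this would show how many requests each address made under that User-Agent.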

Hope this helps,
Jim

maccas

5:14 am on Sep 23, 2002 (gmt 0)

Doesn't answer the 404, but:

"We're sorry, this robots.txt does NOT validate.
Warnings Detected: 8
Errors Detected: 5"

[searchengineworld.com...]

mack

5:49 am on Sep 23, 2002 (gmt 0)

OK, Google is about to update, which means the big crawl is about to happen, so I have pulled all my robots.txt files down. Thanks for pointing out the errors; I don't want to take the chance of chasing Google away. I will fix those and replace them ASAP.

martin

4:18 pm on Sep 23, 2002 (gmt 0)

Are keys really case sensitive? Like, User-Agent is not a valid key but User-agent is?

/edit
My mistake for not reading the docs and blaming the validator for this :-((

toolman

4:27 pm on Sep 23, 2002 (gmt 0)

Alexa is a leech. Besides, they are collecting information about you without your consent, just like the credit reporting agencies.

My vote is to block them before they have a chance to PROFIT from selling info about me in the future.

[webmasterworld.com...]