my bandwidth being used up by 'live' crawler

Forum Moderators: mack

redirect seems to cause problem

mjb2

7:25 pm on Nov 29, 2007 (gmt 0)

I am getting a lot of repeated reads (hundreds) of particular files on my site from a crawler (or spider, or whatever the correct name is) identifying itself as 'live'.

This appears to happen because I have many files redirecting to the file being hit like this:

There are multiple files redirecting to the file 'books.htm', but when the crawler is redirected to the file in this way it does not appear to remember that it has been redirected there before, so it reads it again. As a result, the file 'books.htm' is getting hundreds of hits.

This is using up a lot of bandwidth. Is there any way to stop it?

Martin

jonrichd

11:40 pm on Dec 1, 2007 (gmt 0)

If indeed your theory is correct, then one way to stop it would be to use robots.txt to tell search engine spiders not to crawl the file(s) that are getting redirected to books.htm. Since the files have no content anyway, I don't see any harm in dropping them from the index.
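As a rough illustration of the robots.txt idea, here is a small Python sketch that checks a set of rules with the standard library's robots.txt parser before you upload them. The paths and the "ExampleBot" user agent are made up for the example; they are not the real layout of the site or the real name of the crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. The two stub paths are invented
# examples of files that only redirect to a shared books.htm.
ROBOTS_TXT = """\
User-agent: *
Disallow: /history/romans/books.htm
Disallow: /history/greeks/books.htm
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if a crawler that obeys these rules may fetch `path`."""
    return parser.can_fetch(agent, path)


parser = build_parser(ROBOTS_TXT)
# The redirect stub is blocked, the real book list is still crawlable.
print(is_allowed(parser, "/history/romans/books.htm"))
print(is_allowed(parser, "/history/books.htm"))
```

Note that this only helps with crawlers that honour robots.txt, and the rules have to name the stub files without also matching the page they redirect to.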

Another possibility would be to use a 301 redirect instead of the meta refresh to tell robots that the content has moved permanently to the new URL. I'm not sure how the Live crawler interprets meta refreshes, but at one point, they were frowned upon as a possible spamming technique.
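To show what a 301 looks like compared with a meta refresh, here is a minimal Python sketch using the standard library's http.server. It is only an illustration under assumptions: the REDIRECTS table and its paths are hypothetical, and on a real site you would normally configure the redirect in the web server itself rather than run a script like this.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler

# Hypothetical map from redirect-only URLs to the page that holds the content.
REDIRECTS = {
    "/history/romans/books.htm": "/history/books.htm",
}


def redirect_for(path):
    """Return (status, headers) for a known stub URL, or None otherwise."""
    target = REDIRECTS.get(path)
    if target is None:
        return None
    # 301 tells the client the content has moved permanently to `target`.
    return HTTPStatus.MOVED_PERMANENTLY, {"Location": target}


class RedirectHandler(BaseHTTPRequestHandler):
    """Answers stub URLs with a 301 and everything else with a 404."""

    def do_GET(self):
        result = redirect_for(self.path)
        if result is None:
            self.send_error(HTTPStatus.NOT_FOUND)
            return
        status, headers = result
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()

# To try it locally (an assumption about how you would run it):
#   from http.server import HTTPServer
#   HTTPServer(("127.0.0.1", 8000), RedirectHandler).serve_forever()
```

The point is that the redirect is carried in the HTTP status line and Location header, so the robot never has to download and parse an HTML page to discover it.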

The third thing to do is to make sure you have no internal links to the files you are using meta refreshes on, and then just delete them. The only reason you might not want to do this would be if there were external links pointing to these files, and in this case, I think the 301 redirect would be better.

mjb2

4:07 pm on Dec 2, 2007 (gmt 0)

Thanks for the reply. I realise what I'm doing is not ideal, but as far as I could see it is valid. It's been like that for many years, and no other spiders have had problems with it until I started getting massive hits, apparently from 'live', starting about 3 months ago.

The reason I did it that way is that I have many thousands of HTML pages arranged in quite a steep hierarchy. Each page has a 'further reading' link built into the page template, and this link goes to a books.htm page in the same directory. If I don't have any books for that page, I want it to redirect to the books.htm page at the next level down the directory structure (that is, the parent directory), so I just put in a books.htm file which redirects to ../books.htm.

That way, when I get a book list for a particular page, I just have to replace the one redirect file with a page containing the book list. All the alternatives I could think of seem to involve either:
1) updating all the incoming links each time I make a change.
2) maintaining a separate file with all the pages.
3) writing some sort of script to look down the tree for the first books.htm
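
For reference, option 3 could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name nearest_book_list is made up, it assumes the redirect stubs have been deleted so that any books.htm found is a real book list, and it assumes the script can read the site's directory tree directly.

```python
from pathlib import Path
from typing import Optional


def nearest_book_list(page_dir: Path, site_root: Path,
                      name: str = "books.htm") -> Optional[Path]:
    """Return the closest `name` in page_dir or one of its ancestors.

    The search starts in page_dir and moves one directory toward
    site_root at a time, stopping at site_root. Returns None if no
    file called `name` is found on the way.
    """
    current = page_dir.resolve()
    root = site_root.resolve()
    if current != root and root not in current.parents:
        raise ValueError("page_dir must be inside site_root")
    while True:
        candidate = current / name
        if candidate.is_file():
            return candidate
        if current == root:
            return None
        current = current.parent
```

A page template or build step could call this once per directory to decide where the 'further reading' link should point, which avoids both the redirect files and a hand-maintained list.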

I wanted to avoid these things, and it's annoying if I have to change the whole of my site because of a bug in the 'live' spider.

Martin