
Bing Search Engine News Forum

    
my bandwidth being used up by 'live' crawler
redirect seems to cause problem
mjb2




msg:3516457
 7:25 pm on Nov 29, 2007 (gmt 0)

I am getting hundreds of repeated reads of particular files on my site from a crawler (or spider, or whatever the correct name is) identifying itself as 'live'.

This appears to happen because I have many files that redirect to the file being hit, like this:

<html>
<head>
<meta http-equiv="refresh" content="0;URL=../books.htm">
</head>
<body bgcolor="#FFFFFF" text="#000000">
</body>
</html>

There are multiple files redirecting to the file 'books.htm', but when the crawler is redirected to it in this way it does not appear to remember that it has been redirected there before, so it reads the file again. As a result, 'books.htm' is getting hundreds of hits.

This is using up a lot of bandwidth. Is there any way to stop it?

Martin

 

jonrichd




msg:3518376
 11:40 pm on Dec 1, 2007 (gmt 0)

If indeed your theory is correct, then one way to stop it would be to use robots.txt to tell search engine spiders not to crawl the file(s) that redirect to books.htm. Since the files have no content anyway, I don't see any harm in dropping them from the index.
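
As a rough illustration of the robots.txt idea, here is a minimal Python sketch that builds Disallow rules for a list of stub files and checks them with the standard-library parser. The paths and the helper name `build_robots_txt` are made up for the example; a blanket rule on every books.htm would also hide the real book lists, which is why the sketch lists stubs one by one.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(stub_paths):
    """Return robots.txt text that disallows each given URL path for all crawlers."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in sorted(stub_paths)]
    return "\n".join(lines) + "\n"


# Hypothetical stub locations, for illustration only.
stubs = ["/history/rome/books.htm", "/history/rome/army/books.htm"]
robots_txt = build_robots_txt(stubs)

# Check the generated rules with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
```

Calling `parser.can_fetch("msnbot", "/history/rome/books.htm")` should then report that the stub is blocked, while a real list such as `/history/books.htm` stays crawlable. The `msnbot` token is an assumption about how the Live crawler identifies itself; the rules above apply to every user agent anyway.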

Another possibility would be to use a 301 redirect instead of the meta refresh to tell robots that the content has moved permanently to the new URL. I'm not sure how the Live crawler interprets meta refreshes, but at one point, they were frowned upon as a possible spamming technique.
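
If the site happens to run on Apache (an assumption, the server is not named in this thread), the 301 rules could be generated rather than written by hand. The sketch below scans a local copy of the site for books.htm files that contain a meta refresh and prints one mod_alias `Redirect 301` line for each. The function names and the regular expression are illustrative only, and the pattern is written for the stub shown in the first post, not for every possible meta tag.

```python
import re
from pathlib import Path

# Matches a meta refresh tag and captures the target after "URL=".
META_REFRESH = re.compile(
    r'<meta\s+http-equiv=["\']refresh["\']\s+content=["\']\s*\d+\s*;\s*URL=([^"\']+)["\']',
    re.IGNORECASE,
)


def refresh_target(html):
    """Return the meta refresh target found in html, or None if there is none."""
    match = META_REFRESH.search(html)
    return match.group(1).strip() if match else None


def redirect_rules(site_root):
    """Yield one Apache 'Redirect 301' line per stub books.htm under site_root."""
    root = Path(site_root).resolve()
    for stub in sorted(root.rglob("books.htm")):
        target = refresh_target(stub.read_text(errors="ignore"))
        if target is None:
            continue  # a real book list, not a redirect stub
        # Resolve the relative target against the stub's own directory.
        dest = (stub.parent / target).resolve()
        try:
            new_url = "/" + dest.relative_to(root).as_posix()
        except ValueError:
            continue  # target points outside the site root; skip it
        old_url = "/" + stub.relative_to(root).as_posix()
        yield f"Redirect 301 {old_url} {new_url}"
```

Running `for rule in redirect_rules("/path/to/site"): print(rule)` would give lines to paste into the server configuration. Note that a stub pointing at another stub still produces a chain of redirects, one hop per directory level.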

The third thing to do is to make sure you have no internal links to the files you are using meta refreshes on, and then just delete them. The only reason you might not want to do this would be if there were external links pointing to these files, and in this case, I think the 301 redirect would be better.

mjb2




msg:3518726
 4:07 pm on Dec 2, 2007 (gmt 0)

Thanks for the reply. I realise what I'm doing is not ideal, but as far as I could see it is valid. It's been like that for many years, and no other spiders had problems with it until I started getting massive hits, apparently from 'live', about 3 months ago.

The reason I did it that way is that I have many thousands of HTML pages arranged in quite a steep hierarchy. Each page has a 'further reading' link built into the page template, and this link goes to a books.htm page in the same directory. If I don't have any books for that page, I want it to redirect to the books.htm page at the next level down the directory structure (that is, in the parent directory). So I just put in a books.htm file which redirects to ../books.htm.

That way, when I get a book list for a particular page, I just have to replace the one redirect file with a page containing the book list. All the alternatives I could think of seem to involve one of the following:
1) updating all the incoming links each time I make a change,
2) maintaining a separate file with all the pages, or
3) writing some sort of script to look down the tree (towards the root) for the first books.htm.
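
For reference, option 3 could be sketched roughly as below in Python. It assumes the pages are produced by some build step that can call a helper when filling in the 'further reading' link; the function name `nearest_books_page` and its parameters are hypothetical.

```python
from pathlib import Path


def nearest_books_page(page_dir, site_root, name="books.htm"):
    """Return the closest existing `name` at or above page_dir, or None.

    Checks page_dir first, then each parent directory, stopping at site_root.
    """
    root = Path(site_root).resolve()
    current = Path(page_dir).resolve()
    if current != root and root not in current.parents:
        raise ValueError(f"{current} is not inside {root}")
    while True:
        candidate = current / name
        if candidate.is_file():
            return candidate
        if current == root:
            return None  # reached the top of the site without finding a list
        current = current.parent
```

With a helper like this, only real book lists would need to exist on disk, and the template link would point straight at the nearest one instead of at a redirect file.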

I wanted to avoid these things, and it's annoying if I have to change the whole of my site because of a bug in the 'live' spider.

Martin
