Spider behavior on a 301 redirect

pardo

9:51 am on Jan 22, 2003 (gmt 0)

I hope this is the right forum for this question. We have some pages, ../widgets.html, which have now been renamed to ../category-subcategory-widgets.html.

The pages are indexed in several search engines (SEs). A 301 redirect will send users to the new, correct URL, but what happens to the SEs' indexes in that case? Will they spider the right/new pages next crawl?

hakre

9:55 am on Jan 22, 2003 (gmt 0)

they will, too. they'll replace the old url with the new one. you did it the right way, pardo!
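For readers who want to see what the redirect amounts to, here is a minimal sketch in Python. It is not the poster's actual server configuration, which the thread does not show. The absolute paths and the `respond` helper are assumptions made up for illustration, based on the file names mentioned above.

```python
from http import HTTPStatus

# Hypothetical mapping of old paths to new ones (illustrative names only).
REDIRECTS = {
    "/widgets.html": "/category-subcategory-widgets.html",
}

def respond(path):
    """Return (status_code, headers) for a requested path."""
    new_path = REDIRECTS.get(path)
    if new_path is not None:
        # 301 Moved Permanently: the Location header carries the new URL.
        return HTTPStatus.MOVED_PERMANENTLY.value, {"Location": new_path}
    # Anything else is served normally in this sketch.
    return HTTPStatus.OK.value, {}
```

A client that follows the HTTP standard should read the Location header and, because the move is permanent, use the new URL for future requests.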

pendanticist

12:41 pm on Jan 22, 2003 (gmt 0)

(Funny thing to see this subject posted here this morning because I was contemplating posting myself. Only, I'da called it "Stupid Bots/Spiders" - "Why Can't They Learn?")

Will they spider the right/new pages next crawl?

In an ideal world, yes. In the real world, not necessarily.

Ex:

####################### - - [20/Jan/2003:05:45:17 -0800] "GET /1AB1inclusion.html HTTP/1.0" 301 262 "-" "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http*//www.inktomi.com/slurp.html)"

I see that one every week. Week after week, after week, after....well you get the idea.

(I know what some would say in response to this: "Notify all those who link to the old data" - I've done that ... many times and to no avail. Either the sites are pitifully maintained or somewhat automated, the owner has died, or the e-mail account has been abandoned.)

####################### - - [21/Jan/2003:22:38:02 -0800] "GET /map.htm HTTP/1.0" 404 2133 "-" "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http*//www.inktomi.com/slurp.html)"

That one has always been a true classic.

Never had a file named 'maps', much less .htm.

Maps maybe, but not maps.

html maybe, but not htm.

################# - - [20/Jan/2003:10:44:50 -0800] "GET /1Encyclo.html HTTP/1.1" 403 223 "-" "Mozilla/4.0 (compatible; grub-client-1.0.6; Crawl your own stuff with http*//grub.org)"
################# - - [20/Jan/2003:10:44:51 -0800] "GET /1Encyclo.html HTTP/1.1" 403 223 "-" "Mozilla/4.0 (compatible; grub-client-1.0.6; Crawl your own stuff with http*//grub.org)"

Apparently this one is so stupid it had to go back one second later to see if it really saw what it thought it saw ... see? <- A bit of levity on my part. :)

Seriously though, you'd think when force-fed a 403 the 'crawler' would either self-update or a 'human' would see and understand what's going on and remove the cause of the 403. In this case: the domain whose Webmaster has said: "no-no-no!".

Here's an MSN query that constantly results in a 301.

http*//search.msn.com/results.asp?RS=CHECKED&FORM=MSNH&v=1&q=whatever+the+search+string+is+it+makes+no+difference

Knock! Knock! Hellllllooooooooooooo MSN! Anyone home?

As you can see below, Ask Jeeves/Teoma needs a little work too.

################# - - [21/Jan/2003:10:43:18 -0800] "GET /1PRezzies.html HTTP/1.0" 301 259 "-" "Mozilla/2.0 (compatible; Ask Jeeves/Teoma)"
################# - - [21/Jan/2003:10:43:33 -0800] "GET /Presidential.html HTTP/1.0" 200 8727 "-" "Mozilla/2.0 (compatible; Ask Jeeves/Teoma)"

Again, week after week, the same thing - endlessly droning on and on but not quite 'learning' the file has been moved.
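One way to put numbers on this kind of repeat behavior is to tally 301 responses per user agent and path. The sketch below assumes the log lines follow the combined log format seen in the excerpts above; the regular expression and the `count_redirect_hits` name are illustrative, not taken from any tool mentioned in the thread.

```python
import re
from collections import Counter

# Pattern for one combined-log-format line, as in the excerpts above.
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

def count_redirect_hits(lines):
    """Count 301 responses, keyed by (user agent, requested path)."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        # Lines that do not parse are skipped rather than guessed at.
        if match and match.group("status") == "301":
            hits[(match.group("agent"), match.group("path"))] += 1
    return hits
```

Fed several weeks of logs, a key whose count keeps climbing points to a spider that is still asking for the old URL instead of the new one.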

I've seen this for the last couple of months and I'll continue to see it until they 'educate' bots/spiders thereby empowering them with the ability to 'learn', because obviously (to me) they ain't learnt nuthin yet.

Maybe what we're talking about here is whose responsibility it is to fix re-directed links within databases: the Webmaster, or the technology perusing the 'Net?

Given the scope and fluid nature of both the Internet and Internet technology, I tend to think it is technology's responsibility to 'correct' the things I've indicated (by virtue of 301 re-directs) in need of correction.

Otherwise, the Webmaster is forced to meticulously track down all those sources maintaining old files/pathways within their databases.

Oh, and let's not forget Bookmark sites or those who bookmark specific deep-links. We don't always validate our bookmarks as often as we should, maybe because we just don't get to those sites all that often. Somehow or other, I simply don't see myself notifying every individual browser owner on the Internet that the file/pathway has changed.

Mercator and Google seem to be the only 'smart' ones out there. They've 'updated' their databases shortly after I changed my file/pathways (better than seven months ago) and have been calling for the newer file/pathways ever since.

There is no question the 301 re-direct is the best way to go initially. True, you do have to notify lots and lots of people when you do this. But, then again, I'm still stuck with 'stupid' bots that can't 'read'.

In the Internet World, I'm "smaller than a nick on the neck of a gnat" (Wallace Beery, in a movie circa 1939), yet (according to many) I am solely responsible for individually notifying every potential visitor who may have ever bookmarked my site, and every Search Engine and database on the Internet, of the current status of files/pathways that have changed within my domain. I should be able to handle that with 301 re-directs.

I do not, however, feel it is up to me to manually track down all humans and bots/spiders who can't, or for whatever reason don't, understand the file has moved or the pathway has changed!

My Grandfather used to always say that one should "Learn something new every day" and that if they didn't they "Just weren't listening".

Pendanticist.

hakre

4:23 pm on Jan 22, 2003 (gmt 0)

whoa pendanticist, that's a big shot. ;)

you're right, i had not analyzed this properly, but to tell the truth, it's not a webmaster's responsibility if clients connecting to your webserver do not understand the http standard.

i know it's hard when a bird crashes against a window because it thinks it can fly through, but then you put a sticker on it and the problem is solved. if a bird crashes into your window again, that is hard, but i think you won't remove the window from the room.

weesnich

10:54 pm on Jan 22, 2003 (gmt 0)

Building a nice custom-404-page to collect all those misled human visitors and show them a way to find your content is all you could do.
I think you should not worry about bots - they are just bots and are not supposed to feel frustrated. They will even come back for another 404 next month. :-)

pendanticist

5:02 am on Jan 25, 2003 (gmt 0)

Building a nice custom-404-page to collect all those misled human visitors and show them a way to find your content is all you could do.

Not sure to whom you are responding, but I've had a custom 404 page for many months now.

I think you should not worry about bots - they are just bots and are not supposed to feel frustrated. They will even come back for another 404 next month. :-)

Bold emphasis added by me.

That's exactly my point.

Stupid Bots/Spiders [webmasterworld.com] hog more and more of our bandwidth with each succeeding visit.

Why should the domain owner be forced to pay the cost when we can develop strategies [webmasterworld.com] to stop them before they hog the bandwidth?
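As a rough illustration of one such strategy, the sketch below refuses requests by User-Agent substring, in the spirit of the 403 responses shown in the logs earlier in the thread. The deny list and the `status_for_agent` helper are hypothetical; the linked threads may describe different techniques, and a real site would more likely do this in the web server's own configuration.

```python
# Hypothetical deny list: substrings of User-Agent headers to refuse.
DENIED_AGENT_SUBSTRINGS = ("grub-client",)

def status_for_agent(user_agent):
    """Return 403 for a denied agent, 200 otherwise."""
    agent = user_agent.lower()
    if any(fragment in agent for fragment in DENIED_AGENT_SUBSTRINGS):
        # Forbidden: the bot still makes the request but gets no page body.
        return 403
    return 200
```

A refused request still costs a hit, but the response is small (223 bytes for the 403s logged above, against 8727 bytes for a full page in the same logs).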

One thing is certain - these malicious, bandwidth-hogging spider/bots will only grow in numbers as they continue sucking down all the pages of the Internet with ever increasing voracity.

Pendanticist.