Googlebot: ">" at the end of the URL causing 301s

luma

2:48 am on Aug 5, 2002 (gmt 0)

Googlebot visited on 04/Aug/2002 and caused the following two entries in my logs:

"GET /widgets/bue_widgets/index.html%3E HTTP/1.0" 301 257
"GET /widgets/%3E HTTP/1.0" 301 234

"%3E" is the percent-encoded form of the greater-than sign ">" (ASCII char 62, hex 3E). AFAIK, a raw ">" is not a valid character in a URL. Why is Googlebot trying to access something that obviously can't exist?
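For anyone who wants to check their own logs for the same thing, here is a rough Python sketch. It assumes log lines shaped like the two entries above (request, status, size); the pattern name, the function name, and the third sample line are made up for illustration.

```python
import re
from urllib.parse import unquote

# Matches the request/status/size portion of log lines like the ones above.
REQUEST_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) (?P<size>\d+)'
)

def suspicious_requests(lines):
    """Yield (raw_path, decoded_path, status) for paths ending in an encoded '>'."""
    for line in lines:
        match = REQUEST_PATTERN.search(line)
        if match and match.group("path").upper().endswith("%3E"):
            path = match.group("path")
            yield path, unquote(path), int(match.group("status"))

log_lines = [
    '"GET /widgets/bue_widgets/index.html%3E HTTP/1.0" 301 257',
    '"GET /widgets/%3E HTTP/1.0" 301 234',
    '"GET /widgets/ HTTP/1.0" 200 5120',  # hypothetical line, added for contrast
]

hits = list(suspicious_requests(log_lines))
for raw, decoded, status in hits:
    print(status, raw, "->", decoded)
```

Only the first two lines are reported; the decoded paths end in a literal ">".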

Both were redirected to the correct files. These files are already indexed. Shouldn't the crawlers be smart enough to discard the ">"? I sometimes write URLs like this <http://www.domain.com/widgets/> in emails or newsgroup postings and I never considered this a problem.

Did anyone else notice this?

mbauser2

3:14 am on Aug 5, 2002 (gmt 0)

Googlebot visited on 04/Aug/2002 and caused the following two entries in my logs:
"GET /widgets/bue_widgets/index.html%3E HTTP/1.0" 301 257
"GET /widgets/%3E HTTP/1.0" 301 234
"%3E" translates to the greater than sign ">" (ASCII char 62). afaik, this is not a valid character in an URL. Why is Googlebot trying to access something that obviously can't exist?

Hold up, there. "Illegal character in a URL", "illegal character in a file system", and "can't exist" are three different concepts:

1) Some filesystems do allow ">" in file names (most Unix ones, for instance), although using it would be silly. Nonetheless, a URL containing ">" could exist somewhere.

2) Good search engine robots do their best to convert URL-illegal characters to legal ones, because engines don't want to throw out pages for trivial mistakes. Remember, "~" was classed as unsafe in the original URL spec (RFC 1738), and it's been used in millions of URLs.

3) "%3E" is URL-legal, and any URL using it is technically fine. That's why there's a conversion scheme (percent-encoding) in the URL spec: so illegal characters can be written legally.
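The conversion in point 3 is easy to try out. Here is a minimal sketch using Python's standard urllib.parse; it only illustrates the encoding scheme and says nothing about how Googlebot implements it. The path is the one from the log entries.

```python
from urllib.parse import quote, unquote

raw_path = "/widgets/>"       # contains a URL-illegal ">"

# quote() percent-encodes characters that may not appear raw in a path;
# "/" is left alone by default.
encoded = quote(raw_path)     # '/widgets/%3E'

# unquote() reverses the conversion.
decoded = unquote(encoded)    # '/widgets/>'

# "%3E" is just the character's ASCII code written in hex.
assert ord(">") == 0x3E == 62

print(encoded, decoded)
```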

I sometimes write URL like this <http://www.domain.com/widgets/> in emails or newsgroup postings and I never considered this a problem.

I doubt Googlebot is grabbing those bad URLs from newsgroup postings. It's probably getting them from somebody who's screwed up a hyperlink to you somewhere on the Web by cutting-and-pasting from a newsgroup or e-mail message. If that's the case, Googlebot did the right thing here, and converted the URL according to spec.

jdMorgan

3:19 am on Aug 5, 2002 (gmt 0)

luma,

Have you tried a search on Google (or other SEs) to see who links to those /> URLs yet?
Might be worth a shot. Maybe you can get them corrected at the source of the problem.

Jim

Hemsell

10:16 am on Aug 5, 2002 (gmt 0)

(Edited) After re-reading your post, I notice that you were not trying to make an <a> tag (doh). Perhaps Googlebot is just that forgiving in thinking you were?

To the best of my knowledge, you need a space between your URL and the "/>" if you want to close the tag without a separate closing tag,
like <a href=blah.com/ />
or <p />

I read that here [w3schools.com...]
Yes, the page does refer to XHTML, but XHTML is a super-strict version of HTML. The same rules apply, though.

(Below is pure speculation.)
Googlebot sees it as part of your unclosed <a tag.
Best bet would be <a href="gggggg/"></a>
Second best would be <a href="gggggggg/" />

Though I see no point in an <a> tag that does not "turn something on",
like <a href="ggggg/">Text</a>

ciml

11:30 am on Aug 5, 2002 (gmt 0)

I agree with Jim: Google probably found that link somewhere. I wonder if someone linked to you using a WYSIWYG-type HTML generator that escaped his ">" character for him?

It might not be easy to track it down. Hopefully Google will continue to list the correct addresses for your resources.

luma

6:02 pm on Aug 5, 2002 (gmt 0)

Okay, found it. :) It was a posting (not from me) to a PHP-based news board on May 21st. I'll contact them and ask them to change the links. BTW, the link looks like this:

&lt;<a href="http://www.domain.com/widgets/index.html&gt">
http*//www.domain.com/widgets/index.html&gt</a>;

The BB software split the escaped ">" (&gt;): it put "&gt" inside the anchor and the semicolon after it. Oh well, ...
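For the curious, here is a guess at how that can happen, as a small Python sketch. I have not seen the board's code: the escape-then-autolink order and the URL pattern (which stops at ";") are assumptions, and the function names are made up. It does reproduce the markup above and the kind of "%3E" request that showed up in my logs.

```python
import html
import re
from urllib.parse import quote, urlsplit

# Hypothetical autolinker rule: a URL runs until whitespace, a quote, or ";".
URL_PATTERN = re.compile(r'https?://[^\s";]+')

def autolink(post_text: str) -> str:
    """Escape the post first, then wrap anything URL-like in an anchor."""
    escaped = html.escape(post_text, quote=False)  # ">" becomes "&gt;"
    return URL_PATTERN.sub(
        lambda m: f'<a href="{m.group(0)}">{m.group(0)}</a>', escaped
    )

def requested_path(href: str) -> str:
    """What a crawler might request: decode entities, then percent-encode."""
    decoded = html.unescape(href)  # "&gt" without ";" still decodes to ">"
    return quote(urlsplit(decoded).path)

post = "<http://www.domain.com/widgets/index.html>"
linked = autolink(post)
href = re.search(r'href="([^"]*)"', linked).group(1)
path = requested_path(href)

print(linked)
print(path)
```

The pattern swallows "&gt" because nothing in it is a stop character, and only the ";" is left outside the anchor.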

Thanks for your comments and my apologies to Googlebot [googlebot.com].

PS: http*// stands for http://

ciml

6:15 pm on Aug 5, 2002 (gmt 0)

Well done luma, it can be hard to find something like that.

Hopefully there won't be a problem with that listing in the next update. If there is, at least you know it'll be fixed in the one after.