Googlebot visited on 04/Aug/2002 and caused the following two entries in my logs:
"GET /widgets/bue_widgets/index.html%3E HTTP/1.0" 301 257
"GET /widgets/%3E HTTP/1.0" 301 234
"%3E" translates to the greater-than sign ">" (ASCII char 62). AFAIK, this is not a valid character in a URL. Why is Googlebot trying to access something that obviously can't exist?
Both requests were redirected to the correct files. These files are already indexed. Shouldn't the crawlers be smart enough to discard the ">"? I sometimes write URLs like this <http://www.domain.com/widgets/> in emails or newsgroup postings and I never considered this a problem.
Did anyone else notice this?
Hold up, there. 'illegal character in a URL', 'illegal character in a file system', and "can't exist" are three different concepts:
1) There might actually be a filesystem that allows ">" in file names, although that would be silly. Nonetheless, a URL containing ">" could exist somewhere.
2) Good search engine robots do their best to convert URL-illegal characters to legal ones, because engines don't want to throw out pages for trivial mistakes. Remember, "~" was classed as unsafe in the original URL spec (RFC 1738), and it's been used in millions of URLs.
3) "%3E" is URL-legal, and any URL using it is technically fine. That's why the URL spec includes an escaping scheme: so that illegal URLs can be made legal.
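For anyone who wants to see the escaping scheme from point 3 in action, here is a minimal Python sketch. The helper name escape_path is made up for illustration, and this is only a guess at the general idea, not Googlebot's actual code.

```python
from urllib.parse import quote, unquote


def escape_path(raw_path: str) -> str:
    """Hypothetical helper: percent-encode URL-illegal characters in a path.

    "/" is kept as the path separator, and "%" is kept so that escapes
    already present (such as %3E) are not double-encoded into %253E.
    The trade-off is that a stray literal "%" is left alone too.
    """
    return quote(raw_path, safe="/%")


# "%3E" is a percent sign followed by two hex digits: 0x3E == 62 == ">"
code_point = int("3E", 16)

decoded = unquote("/widgets/%3E")          # escape back to the raw character
escaped = escape_path("/widgets/>")        # raw character to escape
unchanged = escape_path("/widgets/%3E")    # already legal, left as-is

print(code_point, chr(code_point))
print(decoded)
print(escaped)
print(unchanged)
```

The point is that "/widgets/>" and "/widgets/%3E" name the same thing; only the second form is legal on the wire, which is why it is the one that shows up in the log.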
I sometimes write URLs like this <http://www.domain.com/widgets/> in emails or newsgroup postings and I never considered this a problem.
I doubt Googlebot is grabbing those bad URLs from newsgroup postings. It's probably getting them from somebody who's screwed up a hyperlink to you somewhere on the Web by cutting-and-pasting from a newsgroup or e-mail message. If that's the case, Googlebot did the right thing here, and converted the URL according to spec.
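To illustrate how a cut-and-paste could pick up the stray ">", here is a rough Python sketch comparing two ways of pulling a URL out of a plain-text message. Both regular expressions and the sample message are assumptions made up for this example; nobody outside Google knows what their extraction actually looks like.

```python
import re

# Treats <...> as delimiters around the URL and drops them.
BRACKETED_URL = re.compile(r"<(https?://[^<>\s]+)>")

# Naive approach: everything up to the next whitespace is "the URL".
NAIVE_URL = re.compile(r"https?://\S+")

message = "See <http://www.domain.com/widgets/> for details."

careful = BRACKETED_URL.findall(message)
naive = NAIVE_URL.findall(message)

print(careful)  # the URL without the brackets
print(naive)    # the URL with the closing ">" still attached
```

A person (or a tool) doing the naive version and pasting the result into an href would produce exactly the kind of link that ends in ">".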
To the best of my knowledge, you need a space between your URL and the closing "/>" if you want to end the tag without a separate </> closing tag
like <a href=blah.com/ />
or <p />
I read that here [w3schools.com...]
Yes, the page does refer to XHTML, but XHTML is a super strict version of HTML. The same rules apply, though.
(below is pure speculation)
Googlebot sees it as part of your unclosed <a tag
best bet would be an <a href="gggggg/"></a>
second best would be <a href="gggggggg/" />
Though I see no point in an <a> tag that does not "turn something on"
like <a href="ggggg/">Text</a>
&lt;<a href="http://www.domain.com/widgets/index.html&gt">
http*//www.domain.com/widgets/index.html&gt</a>;
The BB software split the "&gt;" and put "&gt" inside the anchor and the semicolon after it. Oh well, ...
Thanks for your comments and my apologies to Googlebot [googlebot.com].
PS: http*// stands for http://
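For the curious, the BB software behaviour described above can be reproduced with a short Python sketch. Everything here is an assumption: bb_render and its regular expression are hypothetical stand-ins for whatever the board software really does, and the final step only imitates how a crawler might escape the resulting path.

```python
import html
import re
from urllib.parse import quote, urlsplit

# Hypothetical auto-linker: a URL runs until whitespace or a semicolon.
NAIVE_LINK = re.compile(r"https?://[^\s;]+")


def bb_render(post_text: str) -> str:
    """Escape the post first, then wrap anything URL-like in an anchor."""
    escaped = html.escape(post_text, quote=False)
    return NAIVE_LINK.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), escaped
    )


rendered = bb_render("<http://www.domain.com/widgets/index.html>")

# What a lenient HTML consumer would make of the href: "&gt" without
# its semicolon is still decoded to ">".
href = re.search(r'href="([^"]*)"', rendered).group(1)
target = html.unescape(href)

# Escaping the path the way a careful robot might before requesting it.
request_path = quote(urlsplit(target).path, safe="/%")

print(rendered)
print(target)
print(request_path)
```

Because the escaping happens before the link detection, the entity for ">" gets cut in two, the anchor ends up pointing at a URL with ">" on the end, and the robot dutifully requests it as "%3E".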