
Sitemaps, Meta Data, and robots.txt Forum

    
Google site map issues when URL contains %3A
Google is not properly following site map URLs containing a %3A
KenB

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4052903 posted 2:49 pm on Jan 2, 2010 (gmt 0)

I've discovered that Google's bots seem to be unable to properly follow URLs that encode colons (':') as %3A. It seems that Google insists on replacing the '%3A' with a ':' before following the link. This creates concatenation problems, as I've found other cases where links could not be properly followed by others if they use a ':' instead of '%3A'.

For example, if the urlencoded URL http://example.com/foo/widgets%3A%20blue.html was in a site map, Google would follow it as http://example.com/foo/widgets:%20blue.html.
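As a rough illustration of the two forms described above, here is a minimal Python sketch. The function names are made up for this example, and the "colon left literal" variant only mimics the rewriting observed in this thread; it is not a statement about how Googlebot is implemented.

```python
from urllib.parse import quote, unquote

def encode_strict(name):
    # Percent-encode everything outside the unreserved set, including ':'.
    return quote(name, safe="")

def encode_colon_literal(name):
    # Same encoding, but ':' is left as a literal character.
    return quote(name, safe=":")

name = "widgets: blue.html"
sitemap_form = encode_strict(name)
# Decode and re-encode with ':' kept literal, mimicking the observed rewrite.
crawler_form = encode_colon_literal(unquote(sitemap_form))

print(sitemap_form)  # widgets%3A%20blue.html
print(crawler_form)  # widgets:%20blue.html
```

Both strings decode to the same file name, which is why a server that treats them as different URLs ends up with two addresses for one page.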

To prevent duplicate content penalties and to consolidate all variants of a page onto a single URL, I had coded a 301 redirect from URI requests containing ':' to URIs using '%3A'. This threw Google into a circular redirect, as Google's bot would still make its request using ':'.
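The loop can be reproduced in a small simulation. This is only a sketch: `server_redirect` stands in for the 301 rule described above, and `crawler_normalize` encodes the assumption (taken from this post, not from any Google documentation) that the client turns '%3A' back into ':' before every request.

```python
def server_redirect(path):
    """Hypothetical 301 rule: a literal ':' redirects to the '%3A' form."""
    if ":" in path:
        return path.replace(":", "%3A")
    return None  # no redirect, the page would be served

def crawler_normalize(path):
    """Assumed client behaviour: '%3A' is decoded to ':' before requesting."""
    return path.replace("%3A", ":")

def follow(path, max_hops=5):
    """Return (requested paths, True if the hop limit was hit)."""
    hops = []
    current = crawler_normalize(path)
    for _ in range(max_hops):
        hops.append(current)
        target = server_redirect(current)
        if target is None:
            return hops, False
        # The redirect target is normalized again, undoing the redirect.
        current = crawler_normalize(target)
    return hops, True

hops, looped = follow("/foo/widgets%3A%20blue.html")
print(looped)  # True
print(hops[0])  # /foo/widgets:%20blue.html
```

Every hop requests the same ':' form, so the redirect never settles, which matches the circular redirect described above.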

My method for dealing with this issue has been to stop redirecting requests with URIs containing ':' to URIs using '%3A' and instead to put the following in my HTML head:
<link rel="canonical" href="http://example.com/foo/widgets%3A%20blue.html">
Where the URL above is replaced with a properly urlencoded reference for the page in question.
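For anyone generating that tag in code, a minimal Python sketch might look like this. `canonical_link` and its parameters are hypothetical names for this example; the base URL and file name are the ones from the post.

```python
from html import escape
from urllib.parse import quote

def canonical_link(base, name):
    # Percent-encode the file name fully (':' becomes %3A, space becomes %20).
    href = base.rstrip("/") + "/" + quote(name, safe="")
    # Escape the URL for use inside an HTML attribute value.
    return f'<link rel="canonical" href="{escape(href, quote=True)}">'

print(canonical_link("http://example.com/foo", "widgets: blue.html"))
# <link rel="canonical" href="http://example.com/foo/widgets%3A%20blue.html">
```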

 

goodroi

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



 
Msg#: 4052903 posted 5:35 pm on Jan 2, 2010 (gmt 0)

nice tip and a good brainteaser for the next pubcon :)

KenB

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4052903 posted 6:08 pm on Jan 2, 2010 (gmt 0)

Good brain teaser for Pubcon maybe, but right now Webmaster Tools (WMT) is throwing 350 "not followed" errors due to Googlebot's inability to follow my redirects. :(

Now I have to wait a week or two for those errors to clear themselves out of my WMT crawl error log. grrrr....

jdMorgan

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 4052903 posted 6:48 pm on Jan 2, 2010 (gmt 0)

The take-home lesson here is to avoid "special characters" in the URL-path-part of a URL. You can use these special characters more freely in the query string part (always-encoded if need be), but the URL-path-part has many more restrictions [tools.ietf.org].
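One way to act on this is to scan URL paths for segments that use anything beyond a conservative character set. The sketch below is an assumption-laden example: it treats only the RFC 3986 "unreserved" characters (letters, digits, '-', '.', '_', '~') as trouble-free, which is stricter than the RFC itself requires, and `risky_segments` is a name invented here.

```python
import re
from urllib.parse import urlsplit, unquote

# Conservative choice: only RFC 3986 "unreserved" characters count as safe.
SAFE_SEGMENT = re.compile(r"^[A-Za-z0-9._~-]*$")

def risky_segments(url):
    """Return the path segments that contain characters outside the safe set."""
    path = urlsplit(url).path
    return [seg for seg in path.split("/")
            if not SAFE_SEGMENT.match(unquote(seg))]

print(risky_segments("http://example.com/foo/widgets%3A%20blue.html"))
# ['widgets%3A%20blue.html']
print(risky_segments("http://example.com/foo/widgets-blue.html"))
# []
```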

Like it or not we are *not* free to use 'just any' characters anywhere we want in URIs.

Jim

KenB

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4052903 posted 8:28 pm on Jan 2, 2010 (gmt 0)

This is a really old part of my site. Basically it is using some rewrite rules to convert the URL into a query string. For instance behind the scenes, http://example.com/foo/widgets%3A%20blue.html is handled as http://example.com/index.html?foo=widgets%3A%20blue
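The mapping described above can be sketched in Python for illustration. The real site uses rewrite rules on the web server; the pattern and the `rewrite` function below are assumptions made up to show the idea, not the actual rules.

```python
import re

# Hypothetical pattern: /foo/<name>.html is served by /index.html?foo=<name>
RULE = re.compile(r"^/foo/(.+)\.html$")

def rewrite(path):
    """Map a public path to the internal query-string form, if it matches."""
    match = RULE.match(path)
    if match is None:
        return path  # not covered by the rule, leave untouched
    return f"/index.html?foo={match.group(1)}"

print(rewrite("/foo/widgets%3A%20blue.html"))
# /index.html?foo=widgets%3A%20blue
```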

It is actually a chemical database, with each chemical name being the "filename". This led to some really messed up file names, but like I said this section of my website is almost ten years old so there is a limit to what I can do to fix things without taking some serious SERP hits.

Remember, way back when I added this section of my site, query strings weren't treated as nicely by search engines as "real" web pages were, so it was important to rewrite queries into the main URI. Even today there is debate whether one should rewrite queries into the main URI.

eeek

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4052903 posted 3:51 am on Jan 8, 2010 (gmt 0)

Basically it is using some rewrite rules to convert the URL into a query string

Have you considered using pathinfo instead of a query string?

KenB

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4052903 posted 4:07 am on Jan 8, 2010 (gmt 0)

Pathinfo wouldn't resolve my stupid encoded character issue and the query strings work just fine. They are hidden from users since I'm using RewriteRule and any changes to the design of the site in this matter would be drastic beyond measure.

As they say, don't fix what ain't broke. The stupid %3A issue was a break that had to be fixed, but that wasn't a RewriteRule problem.

speedshopping

5+ Year Member



 
Msg#: 4052903 posted 5:30 pm on Feb 21, 2010 (gmt 0)

Hi,

We had 110,000 keywords that looked like this:

www.domain.com/keyword A/

You will notice it has a space within the keyword part of the URL.

After noticing recently in WMT that Google had started to have problems with the spaces (although it had managed to index 70,000) we changed the format to:

www.domain.com/keyword%20A/

i.e. replacing a space with a %20
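For reference, the two spellings are the same URL once percent-encoding is normalized. Here is a minimal Python sketch of such a comparison. The `normalize` helper and the `http://` scheme are assumptions added for the example, and it says nothing about how Google itself compares URLs.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def normalize(url):
    """Decode the path, then re-encode it in one consistent way."""
    parts = urlsplit(url)
    # Caveat: an encoded slash (%2F) would be turned into a real '/' here.
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path,
                       parts.query, parts.fragment))

raw = "http://www.domain.com/keyword A/"
encoded = "http://www.domain.com/keyword%20A/"

print(normalize(raw))      # http://www.domain.com/keyword%20A/
print(normalize(encoded))  # http://www.domain.com/keyword%20A/
```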

Since doing this we have noticed in WMT that in just over 72 hours the number of indexed URLs has gone from 79,000 to 30,000!

Is Google treating the URLs as being different?

Can anyone help us?

Regards,
Wesiwyg

dstiles

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member



 
Msg#: 4052903 posted 10:36 pm on Feb 21, 2010 (gmt 0)

Spaces in URLs are just asking for trouble. Ditto certain symbols. Use hyphens or underscores.
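A minimal Python sketch of that advice, assuming plain ASCII names and hyphens as the separator (`slugify` is a name chosen for this example):

```python
import re

def slugify(text):
    """Lowercase the text and collapse every run of other characters to '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower())
    return slug.strip("-")

print(slugify("widgets: blue"))  # widgets-blue
print(slugify("keyword A"))      # keyword-a
```

Different names can collapse to the same slug under a rule this simple, so a real site would need a check for collisions.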
