Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Matching URL with encoded ampersand against robots.txt
aakk9999

WebmasterWorld Administrator 5+ Year Member



 
Msg#: 4026235 posted 2:49 am on Nov 17, 2009 (gmt 0)

Does anybody know whether Google, once it picks up URLs from a page's HTML and puts them in its to-do list, decodes them properly, i.e. replaces &amp; with & before comparing them to robots.txt (and whether anything has changed in this area recently)?

In our HTML all ampersands are encoded as &amp;. However, on occasion (the exception rather than the rule) I can see that Googlebot is trying to request URLs containing &amp;lang= or even &amp%3Blang= instead of translating this into &lang= as it should.

This then results in spurious URLs being requested and causes errors on the server.
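
To make the three forms concrete, here is a minimal Python sketch of how they relate. It says nothing about how Googlebot actually works internally; it only shows what the standard library does with entity decoding and percent-encoding. The path /page.asp?id=7 is a made-up example, not one of our real URLs.

```python
from html import unescape
from urllib.parse import quote

# Hypothetical href exactly as it appears in the HTML source
# (the ampersand is entity-encoded, as it should be in markup).
href_in_html = "/page.asp?id=7&amp;lang=en"

# What an HTML parser should hand to the crawler: the entity is decoded.
decoded = unescape(href_in_html)

# What gets requested if entity decoding is skipped and the stray
# semicolon is then percent-encoded (";" becomes "%3B").
undecoded_escaped = href_in_html.replace(";", quote(";", safe=""))

print(decoded)            # /page.asp?id=7&lang=en
print(undecoded_escaped)  # /page.asp?id=7&amp%3Blang=en
```

The first printed form is the one the server expects; the second is the spurious form showing up in the logs, and the raw href itself is the other spurious form.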

We stopped this via robots.txt a few months ago like this (and it seemed to work fine):

Disallow: /*&amp;lang=
Disallow: /*&amp%3Blang=
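
For anyone wanting to check which URLs these two rules catch, below is a rough Python sketch of wildcard matching as I understand it: "*" matches any run of characters, a trailing "$" anchors the end, and everything else is a prefix match. The function name rule_matches and the test paths are my own inventions, and real crawlers may normalise percent-encoding differently, so treat this as an approximation only.

```python
import re

def rule_matches(pattern: str, url_path: str) -> bool:
    """Approximate robots.txt path matching with '*' and a trailing '$'."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start only, which gives prefix matching.
    return re.match(regex, url_path) is not None

rules = ["/*&amp;lang=", "/*&amp%3Blang="]

def is_blocked(url_path: str) -> bool:
    return any(rule_matches(rule, url_path) for rule in rules)

print(is_blocked("/page.asp?id=7&lang=en"))        # False
print(is_blocked("/page.asp?id=7&amp;lang=en"))    # True
print(is_blocked("/page.asp?id=7&amp%3Blang=en"))  # True
```

Under these assumptions the correctly decoded URL stays crawlable and only the two malformed variants are disallowed.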

On occasion we also see GWT report a URL as "blocked by robots.txt", although when I copy and paste that URL into the robots.txt test window in GWT, it reports the URL as Allowed.

However, in the last two weeks we have noticed in GWT crawling stats that the number of KB downloaded has dropped significantly although the number of pages crawled by googlebot seems to be averaging as before.

Any ideas?

 

goodroi

WebmasterWorld Administrator Top Contributor of All Time 10+ Year Member Top Contributor of the Month



 
Msg#: 4026235 posted 12:30 am on Nov 18, 2009 (gmt 0)

As a rule I try whenever possible to avoid ampersands in the URL, and I really avoid multiple ampersands. Ampersands make things more difficult for the search engine bots, and the last thing you want is to make it harder for the bots.

When dealing with sites that have ampersands, I have found that search engine crawling issues sometimes arise because a website linking to the ampersand page uses the opposite form of the ampersand from the one the site was set up with.

Don't forget that .htaccess might be helpful for rewriting URL requests that aren't formatted the way you would prefer.
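
Whatever layer ends up doing the rewriting, the decision itself is small. Here is a rough Python sketch of just that logic, not an actual .htaccess rule: the function name canonical_target is made up, and it assumes the only malformed forms are the literal &amp; and its percent-encoded variant.

```python
from typing import Optional

def canonical_target(request_uri: str) -> Optional[str]:
    """Return the cleaned-up URI to 301-redirect to, or None if already clean."""
    fixed = request_uri
    # Collapse the un-decoded entity forms back to a plain ampersand.
    for bad in ("&amp%3B", "&amp%3b", "&amp;"):
        fixed = fixed.replace(bad, "&")
    return fixed if fixed != request_uri else None

print(canonical_target("/page.asp?id=7&amp%3Blang=en"))  # /page.asp?id=7&lang=en
print(canonical_target("/page.asp?id=7&lang=en"))        # None
```

A request that comes back with a target would get a permanent redirect to it; anything returning None is served as normal.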

I'm not sure why the crawl is using fewer KB while visiting the same number of pages. My guess would be that pages may have returned a 304 status code and/or your images were not downloaded. As long as your traffic from the search engines is doing fine, I would not be worried.

aakk9999

WebmasterWorld Administrator 5+ Year Member



 
Msg#: 4026235 posted 12:45 am on Nov 18, 2009 (gmt 0)

Thank you for the reply. The traffic and rankings are steady.

I agree with you on ampersands. We are getting ready for a URL rewrite and came across this whilst doing a detailed analysis of the URL parameters the site has.

Unfortunately, no .htaccess (IIS6, no ISAPI; we will have to use a bespoke db-based rewrite...)

