Forum Moderators: Robert Charlton & goodroi

Googlebot crawl errors for rewritten URLs

         

speedshopping

10:19 pm on Jan 6, 2011 (gmt 0)

10+ Year Member



We have a .NET site that rewrites its URLs from .aspx to friendly URLs. Whilst looking through our log files we have noticed that googlebot is sometimes returning a 200 0 0 response by crawling the .aspx raw URL, as it should, but other times the bot tries to crawl the rewritten friendly URL and returns a 404 11 0... Can anyone shed any light on why this is happening and how to fix it? We believe it's the cause of a major traffic loss which happened after the 29th Dec.

Thanks for your help!

tedster

2:26 am on Jan 8, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



googlebot is sometimes returning a 200 0 0 response by crawling the .aspx raw URL

First, you do mean that your server returns a 200 OK to googlebot, correct? And why do you say that "should be" the case? How are you rewriting the URL, technically?

Another couple of questions: where is Google finding the native, not-rewritten URL? Is your site using it in your internal links somewhere?

Also, does the rewritten URL resolve 200 OK when you request it directly with a browser?

speedshopping

10:14 pm on Jan 8, 2011 (gmt 0)

10+ Year Member



As far as I know, the re-written URL should never appear in the log file for Googlebot's requests (which it is doing) - all internal links are always the rewritten versions. One thing to add is that we have %20 (i.e. spaces) in our rewritten URLs.

Raw URL - search.aspx?search=keyword keyword

Rewritten URL - /s/keyword keyword/a/blah/

Googlebot converts to /s/keyword%20keyword/a/blah/

In the log file I can see 200 responses from Googlebot and the URL shows as search.aspx?search=keyword keyword, but other times we are seeing a 404 11 error (we think this is down to Google Image Search, which is re-encoding the %20 as %2520 in the bottom frame).
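For what it's worth, %2520 is the classic signature of double-encoding: a URL whose space is already encoded as %20 gets percent-encoded a second time, so the % itself becomes %25. A quick Python sketch of the effect, using the example path from the URLs above:

```python
from urllib.parse import quote

friendly = "/s/keyword keyword/a/blah/"

# First encoding: the space becomes %20 -- the form Googlebot normally requests
once = quote(friendly)
print(once)    # /s/keyword%20keyword/a/blah/

# Encoding the already-encoded URL a second time: % becomes %25,
# so %20 turns into %2520
twice = quote(once)
print(twice)   # /s/keyword%2520keyword/a/blah/
```

If the server then decodes the request only once, %2520 becomes the literal text %20 in the path, which matches nothing and returns a 404.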

aakk9999

5:42 am on Jan 9, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In the log file I can see 200 responses from Googlebot and the URL shows as search.aspx?search=keyword keyword

Well, I think you have (at least) two problems:

1) If your rewrite module works properly, then a request for the dynamic (not-rewritten) URL should return an HTTP 301 permanent redirect to the friendly URL (and not 200 OK).

So, if your dynamic URL is search.aspx?search=keyword keyword which is set up as a friendly URL /s/keyword keyword/a/blah/ then:
- requesting friendly /s/keyword keyword/a/blah/ should return 200 OK
- requesting dynamic search.aspx?search=keyword keyword should return 301 redirect to the above friendly URL
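To illustrate the intended behaviour, here is a minimal Python sketch of that dispatch logic (hypothetical - the real site is ASP.NET, and the /a/blah/ segment is hard-coded here purely for illustration, since the real mapping would come from the site's own data):

```python
from urllib.parse import urlsplit, parse_qs, quote

def handle_request(url):
    """Return (status, redirect_location) for an incoming request."""
    parts = urlsplit(url)
    if parts.path.endswith("search.aspx"):
        # Dynamic URL requested: answer with a 301 to the friendly form
        keyword = parse_qs(parts.query)["search"][0]
        return 301, quote(f"/s/{keyword}/a/blah/")
    # Friendly URL requested: serve it with 200 OK
    # (internally the server still rewrites it to the .aspx handler)
    return 200, None

print(handle_request("/search.aspx?search=keyword keyword"))
# (301, '/s/keyword%20keyword/a/blah/')
print(handle_request("/s/keyword%20keyword/a/blah/"))
# (200, None)
```

Note that the redirect Location is percent-encoded, so the space goes out as %20 exactly once.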

2) The second problem is why Google may be finding your dynamic (raw) URLs when you say that all internal links on the site are in "friendly" version.

a) One possibility is that Google knew of your raw URLs from before (i.e. before you implemented the rewrite). Google has a very long memory and will periodically request old URLs even if they are no longer referenced within your site. This is why you should implement the 301 redirect mentioned above.

If these dynamic URLs were not known to Google from before, then there are three more possibilities for how Google may be finding them.

b) When posting your ASP.NET form, the "action=" attribute still uses the dynamic URL when posting back the page. Ideally, this should be replaced by the friendly URL.

c) You have JavaScript somewhere on the page that creates the dynamic URL, and Google is able to understand it.

d) And lastly, a very weird case which I came across recently (also in a .NET environment with a custom rewrite module) - you may have a problem depending on how you replace dynamic URLs with friendly URLs when generating your page HTML.

Ideally, you should be replacing each link as you output each line of HTML.

However, I know of a case where the page HTML was generated by a script on the server, and then, before the content was sent back, all links on the page were replaced with friendly URLs using regular expressions. Since the page is sent back to the agent in chunks, i.e. in more than one buffer, this occasionally caused a dynamic URL to "split" across two consecutive buffers (the first part of the URL at the end of one buffer and the second part at the start of the next). Such URLs would not get replaced with the friendly version, as the regex would not find a match for a "split" URL.
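The buffer-split failure can be reproduced in a few lines. This is a hypothetical Python sketch (not the actual .NET module), with made-up URLs and a deliberately bad split point:

```python
import re

# Hypothetical post-processing step: dynamic URLs are replaced with
# friendly ones by a regex, applied to each output buffer separately.
DYNAMIC = re.compile(r'search\.aspx\?search=(\w+)')

def rewrite_chunk(chunk):
    return DYNAMIC.sub(r'/s/\1/', chunk)

html = ('<a href="search.aspx?search=widgets">x</a>'
        '<a href="search.aspx?search=gadgets">y</a>')

# Applied to the whole page, every dynamic URL is replaced:
print(rewrite_chunk(html))

# But if the buffer boundary falls in the middle of a URL, neither
# half matches the regex, and the raw .aspx URL escapes replacement:
buf1, buf2 = html[:55], html[55:]   # boundary lands inside the 2nd URL
print(rewrite_chunk(buf1) + rewrite_chunk(buf2))
```

The client (or Googlebot) sees the concatenation of the buffers, so the second link arrives as the raw search.aspx URL even though each individual buffer was "rewritten".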

This situation is more difficult to detect because, while hovering over links in your browser would occasionally show a dynamic URL on the page, refreshing the page would "fix" the problem. The "fix" happens because of the .NET "__VIEWSTATE" variable, whose content is not always the same size between two requests of the same page, so the buffer boundary does not always fall in the same place.

One way of detecting whether any dynamic URLs "slip through" on your pages is to run Xenu's Link Sleuth or a similar link-checking program against your site. If you suffer from a problem like the above, at least some dynamic URLs will show up in its results.

jdMorgan

5:20 pm on Jan 9, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Another possibility for the 'exposure' of internal filepaths (equivalent here to the old dynamic URLs) as URLs to HTTP clients is an error in rule order.

Be sure that across all config files at all levels (taken as a whole), all external redirects are encountered and executed before any internal rewrites. Otherwise, if for example an internal URL-to-filepath rewrite occurs but is followed by a domain canonicalization redirect, then the redirect will expose that previously-rewritten internal filepath as a URL.
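In mod_rewrite terms, the safe ordering looks like this (a hypothetical .htaccess sketch - the hostname and patterns are invented, and an IIS rewrite module would express the same ordering in its own syntax):

```apache
RewriteEngine On

# 1) External redirects first: canonical-host redirect
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# 2) External redirect from the old dynamic URL to the friendly URL
RewriteCond %{QUERY_STRING} ^search=(.+)$
RewriteRule ^search\.aspx$ /s/%1/? [R=301,L]

# 3) Internal rewrite last: friendly URL to the real handler (no R flag)
RewriteRule ^s/([^/]+)/a/([^/]+)/$ /search.aspx?search=$1 [L]
```

If rule 3 ran before rule 1, a friendly-URL request arriving on a non-canonical hostname would first be rewritten internally to /search.aspx?..., and the host-canonicalization redirect would then expose that internal filepath in the Location header.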

Jim

Jonny6

12:04 am on Jan 11, 2011 (gmt 0)

10+ Year Member



Google does not return the header response codes; your server does. Deal with that, including the internal links to the friendly URLs, as already perfectly explained above.