Crawl Errors in Google Webmaster Tools
I am having a crawl error issue with Google Webmaster Tools. I have researched and tested extensively and I cannot find the problem. It is reporting over 1,000 crawl errors, and the count is climbing.
Google Webmaster Tools reports tons of pages that do not exist due to a wrong path. When I go to the page listed as having the bad link, everything looks correct. I even went as far as making all the links on the entire site absolute links (full URL), and I still get the errors. Here is what it's reporting:
Pages like this that do not exist (wrong path).
The proper directory for that file is:
So it seems to be treating the link as a relative path, because here is the URL that is reported to have the bad link:
All the links on the page above (the page that allegedly has the bad links) are absolute path links. I am at a loss.
Another possibility: all my pages run through index.php, with GET variables selecting the content, so I have rewrite rules in place for more user-friendly URLs. Could my rewrite rules be the cause of this? Here are my .htaccess rewrite rules:
RewriteRule ^([^/]*)\.html$ /index.php?content=$1 [L]
RewriteRule ^([^/]*)/([^/]*)\.html$ /index.php?content=$1&manufacturer=$2 [L]
RewriteRule ^([^/]*)/([^/]*)/machine%20shop/\.html$ /index.php?content=$1&service=$2 [L]
RewriteRule ^([^/]*)/([^/]*)/([^/]*)\.html$ /index.php?content=$1&product=$2&service=$3 [L]
RewriteRule ^([^/]*)/([^/]*)/([^/]*)/([^/]*)\.html$ /index.php?content=$1&MID=$2&category=$3&manufacturer=$4 [L]
RewriteRule ^([^/]*)/([^/]*)/([^/]*)/([^/]*)/([^/]*)\.html$ /index.php?content=$1&part=$2&category=$3&MID=$4&manufacturer=$5 [L]
Thank you guys in advance for any advice/assistance on this issue. Even though crawl errors don't play too big of a role in a site's ranking and performance, this seems like something is wrong and needs to be fixed.
[edited by: Robert_Charlton at 7:51 pm (utc) on Oct 12, 2012]
[edit reason] examplified domain [/edit]
Mods will soon remove URL in your post - you should have used example.com.
And yes, mods have removed URL whilst I was typing the response...
Anyway, with regards to your question:
Firstly, it would be good to crawl your site with a tool such as Xenu's Link Sleuth and then inspect the crawl results to confirm that there is indeed no such link on your site.
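If you want a quick spot check on a single page before running a full crawler, something like the following sketch can list the links that are still relative. It is not a replacement for Xenu; the class and function names and the sample HTML are made up for illustration, and it only looks at href attributes on anchor tags.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def relative_hrefs(html):
    """Return hrefs that have no scheme or host and no leading slash."""
    collector = HrefCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        parts = urlsplit(href)
        if parts.scheme or parts.netloc:
            continue  # full URL, or protocol-relative //host/path
        if href.startswith(("/", "#", "?")):
            continue  # root-relative, fragment-only or query-only link
        found.append(href)
    return found


# Hypothetical page source; paste in the real "view source" output instead.
sample = (
    '<a href="http://www.example.com/a.html">A</a>'
    '<a href="/b.html">B</a>'
    '<a href="parts/c.html">C</a>'
)
print(relative_hrefs(sample))
```

Anything this prints is a link that the browser (and Googlebot) will resolve against the current page's directory.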
It is also possible that the link existed in the past and that you have since fixed it, but Google is still requesting it. If Google has not re-crawled the page that had the incorrect link (the page that you fixed), it will assume the link is still out there on the web, and in my experience these links get crawled more often and are reported as errors in WMT.
It is also possible that someone linked to that page while the link was in the incorrect format, so such a link exists somewhere else on the web, and Google will keep retrying it and reporting the error.
Once you have fixed the error (which it appears you have) and the fixed page has been re-crawled by Google, I would mark the WMT error as "Fixed". In my experience, once the URL that returns 404 is no longer linked from anywhere and you have cleared the error in WMT via "Fixed", the error for this URL will not appear again in the WMT error report (this may take a while and may need several cycles of marking the URL "Fixed"). Eventually Google will drop this error from the WMT error report.
It is also worth knowing that if there is another page (on your domain or on an external domain) that links to your page using that old incorrect URL format, then the 404 error will reappear in WMT even if you mark it as "Fixed" (in which case you will probably see which page links to it, i.e. where Google found it).
However, if the error reported in WMT is a 404, I would not worry as long as your site is not linking to such a URL. The 404 report in WMT is useful for exactly this reason: to check whether your own site is inadvertently linking to an incorrect URL format. Otherwise, just ignore the error.
If internal links begin with a leading slash (or begin with protocol and hostname) then you have "fixed" the problem.
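To illustrate why the leading slash matters, here is a small sketch using Python's standard urljoin, which follows the usual URL resolution rules. The page URL and hrefs are hypothetical examples on example.com, not the poster's real URLs.

```python
from urllib.parse import urljoin

# Hypothetical page that contains the links.
page = "http://www.example.com/widgets/acme.html"

hrefs = [
    "parts/gear.html",                         # relative: resolved against /widgets/
    "/parts/gear.html",                        # leading slash: resolved against the site root
    "http://www.example.com/parts/gear.html",  # protocol and hostname: used as-is
]

resolved = {href: urljoin(page, href) for href in hrefs}
for href, target in resolved.items():
    print(f"{href!r} -> {target}")
```

Only the first form picks up the directory of the linking page, which is the kind of "wrong path" URL that ends up in the crawl error report.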
Google will request the duff URLs forever. Make sure they return 404 and then move on. The frequency of requests will diminish over time.
^([^/]*)/([^/]*)/([^/]*)/([^/]*)\.html$ allows a request for example.com////.html to be considered valid and be rewritten (with blank parameter values passed to the PHP script). The * should be replaced with a + in each case.
[^/] in each rule should be [^/.] as you're looking to stop at the period before the extension, no longer looking only for folder slashes.
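The difference is easy to see by trying both patterns outside Apache. This is only a sketch with Python's re module, which handles these particular patterns the same way as the PCRE engine mod_rewrite uses; the test paths are invented, and they are written without the leading slash because that is how .htaccess rules see them.

```python
import re

# Four-segment rule as posted, and the same rule with the two suggested changes.
loose = re.compile(r"^([^/]*)/([^/]*)/([^/]*)/([^/]*)\.html$")
strict = re.compile(r"^([^/.]+)/([^/.]+)/([^/.]+)/([^/.]+)\.html$")


def groups(pattern, path):
    """Return the captured groups, or None when the path does not match."""
    m = pattern.match(path)
    return m.groups() if m else None


# Hypothetical request paths.
paths = ["parts/123/gears/acme.html", "///.html", "a/b/c/d.e.html"]

for path in paths:
    print(f"{path!r}: loose={groups(loose, path)} strict={groups(strict, path)}")
```

A normal four-segment URL matches both patterns, while the empty-segment request and the extra-dot request only get through the original one.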
Thank you guys very much. I think the consensus is to wait it out and see if they diminish. All the links are absolute, verified through viewing the source, so we will see! Also thank you for the mod rewrite advice. I made the changes.
You guys have always rocked! :D
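This also looks problematical...
RewriteRule ^([^/]*)/([^/]*)/machine%20shop/\.html$ /index.php?content=$1&service=$2 [L]
Not only the space (written as %20 in the pattern), but also the slash immediately before the .html extension.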
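As a rough illustration of both points on the machine%20shop rule from the original post: mod_rewrite matches a RewriteRule pattern against the URL path after percent-decoding, so a literal %20 in the pattern is compared with a real space, and the /\.html ending only accepts a file name that is empty. The sketch below mimics that with Python's re and unquote; the request paths are made up, and what the rule was meant to match is not known from the thread.

```python
import re
from urllib.parse import unquote

# Pattern copied from the posted rule.
posted = re.compile(r"^([^/]*)/([^/]*)/machine%20shop/\.html$")

# Hypothetical request as a browser would send it, and as Apache would see it.
raw = "services/milling/machine%20shop/.html"
decoded = unquote(raw)  # 'services/milling/machine shop/.html'

matches_raw = posted.match(raw) is not None          # literal "%20" text
matches_decoded = posted.match(decoded) is not None  # real space after decoding
matches_named = posted.match("services/milling/machine%20shop/page.html") is not None

print(decoded)
print(matches_raw, matches_decoded, matches_named)
```

So the pattern as written would only fire for a path containing the literal characters %20 and ending in /.html, which is unlikely to be what was intended. In an .htaccess pattern a space has to be escaped or the pattern quoted.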