| 2:42 pm on Jun 27, 2011 (gmt 0)|
I sometimes see this type of error caused by relative internal links combined with Apache's odd default behaviour.
Take the page:
/libraries/radio/ which has the href "../../Pictures/gamecovers/images.html". Fine.
However, the same page is probably also available with extra slashes in the URL. Apache allows this by default.
From that url, the relative link resolves to
Sometimes Googlebot or users stumble on extra-slash versions of pages. Apache happily serves them up. Navigation can break. Googlebot can get 404s.
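To make the failure concrete, here is a minimal Python sketch of how a browser-style resolver treats the href from that page. The function name `resolve_path` and the doubled-slash address `/libraries//radio/` are my own assumptions for illustration. The resolver is a simplified take on RFC 3986 dot-segment removal and works on paths only, not full URLs. I wrote it by hand because Python's `urllib.parse.urljoin`, as far as I can tell, drops empty path segments when joining, which would hide the problem.

```python
def resolve_path(base_path, relative):
    """Resolve a relative href against a base path, browser style (simplified)."""
    # Keep everything up to and including the last "/" of the base, then append the href.
    merged = base_path[: base_path.rfind("/") + 1] + relative
    output = []
    for segment in merged.split("/"):
        if segment == "..":
            # Step up one level, but never above the root (the leading empty segment).
            if len(output) > 1:
                output.pop()
        elif segment != ".":
            # Empty segments (from "//") are kept: each one counts as a directory level.
            output.append(segment)
    return "/".join(output)


if __name__ == "__main__":
    href = "../../Pictures/gamecovers/images.html"
    print(resolve_path("/libraries/radio/", href))
    # /Pictures/gamecovers/images.html
    print(resolve_path("/libraries//radio/", href))  # hypothetical extra-slash URL
    # /libraries/Pictures/gamecovers/images.html
```

The empty segment between the two slashes is treated as one more directory level, so the two `..` steps no longer climb all the way to the root.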
| 3:54 pm on Jun 27, 2011 (gmt 0)|
Are a lot of the spurious addresses pointing to a deeper nesting of directories than you've actually got on your site? Like five deep when the most you've ever got is three? If so you can globally 410 them with a couple of lines in the .htaccess, and Google will eventually give up. (With emphasis on the "eventually". A URL that returned 404 before it returned 410 seems to get crawled much longer than one that returned 410 from the start. I counted a random page of mine and they've hit the same 410 at least fifty times.)
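Before writing the 410 rule, it may help to check how deep the reported URLs really go. A rough Python sketch, assuming you have the crawl-error URLs in a list: `directory_depth`, `too_deep`, and the example addresses are hypothetical names and data, and the limit of three just echoes the figure above.

```python
from urllib.parse import urlsplit


def directory_depth(url):
    """Count the non-empty directory segments in a URL path (file name excluded)."""
    path = urlsplit(url).path
    directories = path.split("/")[:-1]  # drop whatever follows the last slash
    return sum(1 for segment in directories if segment)


def too_deep(urls, max_depth):
    """Return the URLs nested deeper than the site's real maximum depth."""
    return [url for url in urls if directory_depth(url) > max_depth]


if __name__ == "__main__":
    reported = [
        "http://www.example.com/a/b/page.html",        # made-up examples
        "http://www.example.com/a/b/c/d/e/page.html",
    ]
    for url in too_deep(reported, max_depth=3):
        print(url)
```

If nearly everything in the error report shows up in the `too_deep` list, a single depth-based rule should cover it.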
| 4:03 pm on Jun 27, 2011 (gmt 0)|
I agree this is mainly a problem when relative links are followed and misinterpreted. Combined with mod_rewrite rules that don't validate the leading folders in the requested URLs, you can quickly have a crawling nightmare on your hands.
Make sure the site uses linking that BEGINs with a leading slash and make sure your rewrite rules are tightly coded. Avoid ambiguous patterns.
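A quick way to audit the first point is to list every href that does not begin with a slash. This is only a sketch using Python's standard `html.parser`; the names `RelativeLinkFinder`, `is_document_relative`, and `find_relative_links` are made up for the example, and it looks at `<a>` tags only.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def is_document_relative(href):
    """True for hrefs like "page.html" or "../x.html" that depend on the current folder."""
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        return False  # absolute ("http://...") or protocol-relative ("//host/...")
    # Root-relative links, fragments and query-only links do not depend on the folder.
    return not href.startswith(("/", "#", "?"))


class RelativeLinkFinder(HTMLParser):
    """Collect the href of every <a> tag that is document-relative."""

    def __init__(self):
        super().__init__()
        self.relative_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and is_document_relative(value):
                self.relative_links.append(value)


def find_relative_links(html):
    finder = RelativeLinkFinder()
    finder.feed(html)
    return finder.relative_links
```

Feeding it the HTML of each page (for example from a local crawl) gives a list of links to rewrite with a leading slash.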
| 4:23 pm on Jun 27, 2011 (gmt 0)|
"Make sure the site uses linking that BEGINs with a leading slash and make sure your rewrite rules are tightly coded. Avoid ambiguous patterns."
As a novice at .htaccess code, could I ask what specifically I should look at? Would it be acceptable to post my .htaccess file for you good-natured people to check? My knowledge is limited, which is most likely the problem.
| 4:34 pm on Jun 27, 2011 (gmt 0)|
There's an Apache forum here where you can ask about specific problems with your code.
| 5:17 pm on Jun 27, 2011 (gmt 0)|
I have now created a new post in the Apache forum. Thanks for the direction.
| 5:22 pm on Jun 27, 2011 (gmt 0)|
We went through this same problem a few months ago and are still finding some issues. We eliminated about 10,000 of these URLs found in WMT by using 301s to the correct pages.
We still haven't found our root cause, but I have a very strong suspicion that it is a missing forward slash at the beginning of a link URL, as someone previously suggested.
I will mention that fixing these issues did not get us better rankings, but it did clear out about 1/3 of our indexed pages from Google.
| 5:25 pm on Jun 27, 2011 (gmt 0)|
Run a crawler like LinkSleuth over your site. It may turn up bugs in your code that are building incorrect links on pages without you ever noticing.
This happened to me, and the result was hundreds of thousands of garbage pages being indexed by Google through bad links that were not even visible on the page (no link text) due to a bug.
I thought the same thing as you: it must be a bug with Google, or some other site linking to me wrongly... but in the end the errors were in my own backyard!
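For what it's worth, links with no link text are easy to hunt for yourself. The sketch below is not how LinkSleuth works; it is just a small Python illustration with a made-up class name, `EmptyAnchorFinder`. It treats an anchor that wraps only an image as empty too, so expect some false positives.

```python
from html.parser import HTMLParser


class EmptyAnchorFinder(HTMLParser):
    """Collect href values of <a> elements that contain no visible text."""

    def __init__(self):
        super().__init__()
        self.empty_hrefs = []
        self._href = None  # href of the anchor currently open, if any
        self._text = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Whitespace-only content counts as invisible.
            if not "".join(self._text).strip():
                self.empty_hrefs.append(self._href)
            self._href = None


def find_empty_anchors(html):
    finder = EmptyAnchorFinder()
    finder.feed(html)
    return finder.empty_hrefs
```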
| 1:28 am on Jun 29, 2011 (gmt 0)|
I just ran LinkSleuth and found some errors (strange pages as described above). I have a WordPress site and recently experienced a 99% drop in traffic. I wonder if the two are related. Anyhow, the reason this is happening is that the links did not contain the full web address, only the page name, so the address was resolved relative to the current page. Somehow WordPress was tacking bizarre things onto them. So make sure every URL on the site carries the full address.
| 9:04 pm on Jun 29, 2011 (gmt 0)|
Anyone seeing a lot of pages that Google says are blocked by robots.txt, but are not? Webmaster Tools is reporting that several hundred pages are blocked by robots.txt, and I have confirmed they are not blocked. This is strange. Can this affect your ranking?
| 9:12 pm on Jun 29, 2011 (gmt 0)|
In WMT I saw that a page was blocked that should not be. I then altered the robots.txt file to unblock it.
Weeks later Google has read the robots.txt many times. Indeed, WMT confirms that the robots.txt file was read only hours ago. The robots.txt manual tester confirms that the page is not blocked, but still the WMT crawl report says the page is blocked by robots.txt.
That's not the only WMT issue. Pages that return 301 or 410 are all reported as returning 404. This is shoddy programming by Google.
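If you want a second opinion on the robots.txt side independent of WMT, Python's standard `urllib.robotparser` can test a URL against the file contents. This is a sketch only: the helper name `is_blocked` and the sample rules are invented for the example, and Python's parser may not interpret every edge case the same way Googlebot does.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, user_agent, url):
    """Return True if the given robots.txt text disallows this URL for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Made-up rules; paste the real contents of your robots.txt here instead.
    rules = "User-agent: *\nDisallow: /private/\n"
    print(is_blocked(rules, "Googlebot", "http://www.example.com/private/page.html"))
    print(is_blocked(rules, "Googlebot", "http://www.example.com/public/page.html"))
```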
| 10:52 pm on Jun 29, 2011 (gmt 0)|
I agree with you, g1smd. I appreciate that WMT isn't a core product for Google and can't have top priority. But I really expected it to be a lot less buggy by now.
| 11:20 pm on Jun 29, 2011 (gmt 0)|
Google "outsources" its robots.txt handling. That is, instead of hitting robots.txt at the beginning of each visit and acting accordingly, it's got a separate robot that only reads robots.txt, and at some future time it passes the information along to all the other googlebots.*
The "crawl errors" list is pretty much a black mystery anyway. If you've got a small enough site that it all fits on one screen, you can see "detected" dates ranging back over months. And hiding behind the "linked from" pages will be things like sitemaps from 2008, or pages that themselves haven't existed in years. When the "Linked From" column says "unavailable", you know they've hit rock bottom because they're saying "We have no idea why we believe this page exists, but we're going to keep crawling it and posting it as an error anyway."
* Conversely, Bing seems to have a morbid fascination with robots.txt. They read mine more often in a day than they read all other files in a week.