
Home / Forums Index / Google / Google SEO News and Discussion
Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
Webmaster Crawl Errors are from Nonsense URLs
hottrout




msg:4331465
 9:55 am on Jun 27, 2011 (gmt 0)

For the past few months Google Webmaster Tools has been showing increasing numbers of Not Found 404 errors in the crawl errors. It is currently at over 2200 errors. It was at 1200 last week.

The 404s are for URLs on my site that have never existed, and the page that they were supposedly linked from is no longer available.

The strange bit is that each URL is made up of valid parts of my site, but combined in an incorrect order. Let me explain.

One of the invalid URLs looks like this:

mydomain.com/libraries/radio/libraries/Pictures/gamecovers/images.htm

One part of the URL is correct
mydomain.com/libraries/radio/stations.htm

and the other part is also correct
mydomain.com/libraries/Pictures/gamecovers/images.htm

Google seems to be detecting parts of each URL and combining them.

I have no idea how this is being created and I thought that it was just a Google glitch. I have requested help on the Google webmaster forum several times now to no avail. I would appreciate anyone's help with this.

It is worth mentioning that there is quite a detailed htaccess file for the site, although I cannot see where within the rewrite rules this could be caused.

[edited by: goodroi at 12:34 pm (utc) on Jun 27, 2011]
[edit reason] Fixed URLs [/edit]

 

deadsea




msg:4331585
 2:42 pm on Jun 27, 2011 (gmt 0)

I sometimes see this type of error from relative internal linking combined with odd Apache default behavior.

Take the page:
/libraries/radio/ which has the href "../../libraries/Pictures/gamecovers/images.htm". Fine.
However, the same page is probably available with extra slashes on the URL. Apache allows this by default.
/libraries/radio///
From that URL, each "../" only strips one of the empty segments between the extra slashes, so the relative link resolves to
/libraries/radio/libraries/Pictures/gamecovers/images.htm
Sometimes Googlebot or users stumble on extra-slash versions of pages. Apache happily serves them up. Navigation can break. Googlebot can get 404s.
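
The resolution steps above can be sketched in a few lines of Python. This is a simplified, hand-rolled resolver, not a library call: the function name `resolve_href` is made up for illustration, the href `../../libraries/Pictures/gamecovers/images.htm` is an assumed piece of link markup, and the code keeps empty path segments the way browsers generally do.

```python
def resolve_href(base_path: str, href: str) -> str:
    """Resolve href against base_path, keeping empty segments (simplified)."""
    if href.startswith("/"):
        merged = href  # root-relative: the base path is ignored entirely
    else:
        # Keep the base up to and including its last "/", then append the href.
        merged = base_path[: base_path.rfind("/") + 1] + href

    resolved = []
    for segment in merged.split("/")[1:]:
        if segment == ".":
            continue
        if segment == "..":
            if resolved:
                resolved.pop()  # ".." removes one segment, even an empty one
            continue
        resolved.append(segment)
    return "/" + "/".join(resolved)


HREF = "../../libraries/Pictures/gamecovers/images.htm"  # assumed link markup

# Normal URL: the two "../" steps climb out of /libraries/radio/.
print(resolve_href("/libraries/radio/", HREF))
# Extra slashes: the two "../" steps only consume the empty segments.
print(resolve_href("/libraries/radio///", HREF))
# A root-relative href gives the same answer whatever the base looks like.
print(resolve_href("/libraries/radio///", "/libraries/Pictures/gamecovers/images.htm"))
```

The second call produces the nonsense path from the first post, while the third shows why a link that starts with a slash is immune to the extra-slash variant.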

lucy24




msg:4331625
 3:54 pm on Jun 27, 2011 (gmt 0)

Are a lot of the spurious addresses pointing to a deeper nesting of directories than you've actually got on your site? Like five deep when the most you've ever got is three? If so, you can globally 410 them with a couple of lines in the .htaccess, and Google will eventually give up. (With emphasis on the "eventually". A 410 that was previously a 404 seems to get crawled much longer than if you'd given it a 410 in the first place. I counted a random page of mine and they've hit the same 410 at least fifty times.)
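
To answer the nesting question before touching .htaccess, a rough sketch like the following could be run over the URLs exported from the crawl errors report. The names `directory_depth` and `too_deep`, and the limit of three directories, are assumptions for illustration; adjust the limit to the deepest real nesting on the site.

```python
from urllib.parse import urlsplit

MAX_REAL_DEPTH = 3  # assumption: deepest real directory nesting on the site


def directory_depth(url: str) -> int:
    """Count the directory segments in a URL path, ignoring the file name."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # Drop the last segment when the path does not end in "/" (it is a file).
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)


def too_deep(urls, max_depth=MAX_REAL_DEPTH):
    """Return the URLs nested deeper than any real page could be."""
    return [u for u in urls if directory_depth(u) > max_depth]


crawl_errors = [
    "http://mydomain.com/libraries/radio/libraries/Pictures/gamecovers/images.htm",
    "http://mydomain.com/libraries/Pictures/gamecovers/images.htm",
]
print(too_deep(crawl_errors))
```

If most of the reported 404s come back from `too_deep`, a depth-based 410 rule is likely to catch them without affecting real pages.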

g1smd




msg:4331630
 4:03 pm on Jun 27, 2011 (gmt 0)

I agree this is mainly a problem when relative links are followed and misinterpreted. Combined with mod_rewrite rules that don't validate the leading folders in the requested URLs, you can quickly have a crawling nightmare on your hands.

Make sure the site uses linking that BEGINs with a leading slash and make sure your rewrite rules are tightly coded. Avoid ambiguous patterns.
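
As a rough way to find links that do not begin with a leading slash, a page's HTML can be scanned with Python's standard `html.parser`. This is only a sketch: the class name `RelativeLinkFinder` and the sample markup are invented for illustration, and it looks at `<a href>` only, not images, scripts, or stylesheets.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class RelativeLinkFinder(HTMLParser):
    """Collect href values that are neither root-relative nor absolute."""

    def __init__(self):
        super().__init__()
        self.relative_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and self._is_path_relative(value):
                self.relative_hrefs.append(value)

    @staticmethod
    def _is_path_relative(href):
        parts = urlsplit(href)
        if parts.scheme or parts.netloc:
            return False  # absolute or protocol-relative URL
        if not parts.path:
            return False  # fragment-only or query-only link
        return not parts.path.startswith("/")


# Hypothetical page markup, for illustration only.
sample = """
<a href="/libraries/radio/stations.htm">ok</a>
<a href="../Pictures/gamecovers/images.htm">risky</a>
<a href="stations.htm">risky</a>
<a href="http://mydomain.com/">ok</a>
<a href="#top">ok</a>
"""

finder = RelativeLinkFinder()
finder.feed(sample)
print(finder.relative_hrefs)
```

Every href it reports is one that could resolve differently if the page is reached under an unexpected URL.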

hottrout




msg:4331646
 4:23 pm on Jun 27, 2011 (gmt 0)

"Make sure the site uses linking that BEGINs with a leading slash and make sure your rewrite rules are tightly coded. Avoid ambiguous patterns. "


As a novice at htaccess code, could I ask specifically what I should look at? Would it be acceptable to post my htaccess file for you good-natured people to check? My knowledge is limited, which is most likely the problem.

g1smd




msg:4331656
 4:34 pm on Jun 27, 2011 (gmt 0)

There's an Apache forum here where you can ask about specific problems with your code.

hottrout




msg:4331679
 5:17 pm on Jun 27, 2011 (gmt 0)

I have now created a new post in the Apache forum. Thanks for the direction.

MelissaLB




msg:4331681
 5:22 pm on Jun 27, 2011 (gmt 0)

We went through this same problem a few months ago, and we are still finding some issues. We eliminated about 10,000 of these URLs found in WMT by using 301s to the correct pages.

We still haven't found our root cause, but I have a very strong suspicion it is happening because of a missing forward slash at the beginning of a linking URL, as someone previously suggested.

I will mention that fixing these issues did not get us better ranking, but it did clear out about 1/3 of our indexed pages from Google.

Good Luck!

maximillianos




msg:4331682
 5:25 pm on Jun 27, 2011 (gmt 0)

Run a crawler like LinkSleuth over your site. It may turn up bugs in your code that are building incorrect links on pages without you ever noticing.

This happened to me, and the result was hundreds of thousands of garbage pages being indexed by Google through bad links that were not even visible on the page (no link text) due to a bug.

I thought the same thing as you, it must be a bug with Google, or some other site linking to me wrong... but in the end the errors were in my own backyard!

vphoner




msg:4332330
 1:28 am on Jun 29, 2011 (gmt 0)

I just ran LinkSleuth and found some errors (strange pages as described above). I have a WordPress site and recently experienced a 99% drop in traffic. I wonder if the two are related. Anyhow, the reason this is happening is that the links did not contain the full web address, only the page name, so the rest of the address was resolved relative to the current page. Somehow WordPress was tacking bizarre things onto them. So every URL on the site needs to use the full address.

vphoner




msg:4332730
 9:04 pm on Jun 29, 2011 (gmt 0)

Is anyone seeing a lot of pages that Google says are blocked by robots.txt but are not? Webmaster Tools is reporting that several hundred pages are blocked by robots.txt, and I have confirmed they are not blocked. This is strange. Can this affect your ranking?
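
One way to double-check locally that a URL is not blocked is Python's standard `urllib.robotparser`. The robots.txt content and URLs below are hypothetical, the helper name `is_blocked` is made up, and this parser may not match Google's own handling in every case (wildcard patterns, for example), so treat the result as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; substitute the site's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_blocked(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


print(is_blocked(ROBOTS_TXT, "http://mydomain.com/libraries/radio/stations.htm"))
print(is_blocked(ROBOTS_TXT, "http://mydomain.com/private/notes.htm"))
```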

g1smd




msg:4332738
 9:12 pm on Jun 29, 2011 (gmt 0)

In WMT I saw that a page was blocked that should not be. I then altered the robots.txt file to unblock it.

Weeks later Google has read the robots.txt many times. Indeed, WMT confirms that the robots.txt file was read only hours ago. The robots.txt manual tester confirms that page is not blocked, but still the WMT crawl report says the page is blocked by robots.txt.

That's not the only WMT issue. Pages that return 301 or 410 are all reported as returning 404. This is shoddy programming by Google.

tedster




msg:4332784
 10:52 pm on Jun 29, 2011 (gmt 0)

I agree with you, g1smd. I appreciate that WMT isn't a core product for Google and can't have top priority. But I really expected it to be a lot less buggy by now.

lucy24




msg:4332795
 11:20 pm on Jun 29, 2011 (gmt 0)

Google "outsources" its robots.txt handling. That is, instead of hitting robots.txt at the beginning of each visit and acting accordingly, it's got a separate robot that only reads robots.txt, and at some future time it passes the information along to all the other googlebots.*

The "crawl errors" list is pretty much a black mystery anyway. If you've got a small enough site that it all fits on one screen, you can see "detected" dates ranging back over months. And hiding behind the "linked from" pages will be things like sitemaps from 2008, or pages that themselves haven't existed in years. When the "Linked From" column says "unavailable", you know they've hit rock bottom because they're saying "We have no idea why we believe this page exists, but we're going to keep crawling it and posting it as an error anyway."


* Conversely, Bing seems to have a morbid fascination with robots.txt. They read mine more often in a day than they read all other files in a week.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved