Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 39 message thread spans 2 pages: 1 [2]
Google Webmaster Tools now shows you where your 404 errors come from!

 9:11 am on Oct 14, 2008 (gmt 0)

I have always tracked down 404 URLs with simple typos in them by doing a Google search, but that only works when the anchor text is the same as the URL in the link.

I get a few incoming links with extra punctuation on the end (usually period, occasionally comma, and very occasionally something else) often formed by poorly designed URL auto-linking routines in common forum, blog and CMS software. It is not readily possible to search for those, and so I already have a rule in my .htaccess that sends a 301 redirect to the same URL with the trailing junk stripped off.
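The trailing-junk cleanup described above can be sketched in Python. This is only an illustration of the idea, not the actual .htaccess rule: the function name `strip_trailing_junk` and the set of characters treated as junk are assumptions, and a site with valid URLs that end in one of those characters would need a narrower list.

```python
import re

# Assumed set of "junk" characters that auto-linking routines tend to
# glue onto the end of a URL; adjust it to what your own logs show.
TRAILING_JUNK = re.compile(r"[.,;:!?'\"_)\]]+$")


def strip_trailing_junk(path):
    """Return (clean_path, needs_redirect) for a requested URL path."""
    clean = TRAILING_JUNK.sub("", path)
    # Never reduce the path to an empty string; fall back to the root.
    if not clean:
        clean = "/"
    # A True flag means the server should answer with a 301 to clean_path.
    return clean, clean != path
```

A request for `/widgets.html.` would come back as `/widgets.html` with the redirect flag set, while `/widgets.html` passes through unchanged.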

Even so, I still get a few links each month that make very little sense; whoever linked was really not paying attention to what they were doing. The duff URLs show in the server logs as a Googlebot (or other bot) access (and therefore *without* any referrer information) and then a few days later appear in the Google WMT 404 report. Very often, those are the only places they show, because no human has clicked the link. I hope that someone eventually clicks on one so I can capture the referrer information, but it often does not happen.
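For anyone who wants to pull the same information out of their own logs, here is a rough Python sketch. It assumes the Apache "combined" log format (which includes the referrer and user agent); the function name `referrers_for_404s` is made up for this example. Bot hits arrive with a referrer of `-`, so those URLs are listed with an empty set until a human click supplies one.

```python
import re
from collections import defaultdict

# Assumes the Apache "combined" log format; other formats need another pattern.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def referrers_for_404s(lines):
    """Map each URL path that returned 404 to the set of referrers seen."""
    found = defaultdict(set)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m.group("status") != "404":
            continue
        # Looking the path up creates its entry even if no referrer is known.
        refs = found[m.group("path")]
        ref = m.group("referrer")
        if ref and ref != "-":
            refs.add(ref)
    return dict(found)
```

Typical use would be `referrers_for_404s(open("access.log"))`, where the log file name is whatever your host uses.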

At this point, a Google search for the duff URL occasionally finds the site where the problem link was posted, but this only works if the anchor text is the same as the link URL and the typo does not involve punctuation. There are a great many links that remain impossible to find, mainly because you can't search for stuff in the HREF on a page, so links with wordy anchor text and duff HREF can't be found. Even more important, many of the duff incoming links have weird punctuation on the end and Google just will not return results for a URL search with an underscore or a quote mark on the end.

However... with this new feature, the list of 404 errors is now much more useful. Now that information *can* be found - and very easily. What a great feature!

I have for a long time had various duff incoming links which were for a valid URL but with an additional underscore on the end, so they would fail to a 404 error. I have added a redirect for those on most sites with the problem, but some remain listed in WMT. I now discover that all of the duff links of that type come from Word documents scattered all over the web. Why this is so, I have no idea; but at least I can now look into it.

Again, this is a great feature. I think people will be extremely shocked as to how many duff links they have pointing at their site and how careless the average netizen is when they cut and paste links. My pet peeve is people who post links with lots of unnecessary parameters in them, including session IDs, and, for Google searches, stuff like &client=Firefox or &client=Opera when I am using something else - and the totally ridiculous &rls=GGGL,GGGL;GGGL:2006-17,GGGL;GGGL:en stuff you see users of Firefox posting without any thought whatsoever.
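As an illustration of trimming a link before posting it, here is a small Python sketch using the standard `urllib.parse` module. The list of parameter names in `NOISE_PARAMS` is an assumption for the example (only `client` and `rls` come from the post above; the session ID names are common guesses), and `clean_link` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that carry no meaning for other readers.
NOISE_PARAMS = {"client", "rls", "sid", "sessionid", "phpsessid"}


def clean_link(url):
    """Return the URL with the noise parameters removed from its query."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    ]
    # Rebuild the URL with only the parameters that were kept.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, a search link carrying `&client=Firefox&rls=GGGL` would be reduced to just its `q=` parameter.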

I see that WMT no longer reports the *date* of last Home Page access. It just says the site was "visited", but it does now link through to the graphical crawl report.



 11:10 am on Oct 29, 2008 (gmt 0)

How can I see the date of Googlebot's last visit to my homepage? Some of you mentioned the server logs. Can someone explain to me how to find those server logs? I need a simple, quick way to find out when Googlebot last visited.


 12:45 pm on Oct 29, 2008 (gmt 0)

What I did was to install a free Perl bot detection script that is triggered when any bot visits the least important page on my site. Then I bookmarked the data-log that lists each bot that hits this inconsequential page, so all I have to do every morning is open it in Internet Explorer to see what came by the night before -- takes 4 seconds. I don't think I'm allowed by forum rules to name the script's URL, so I'll just say to look in some of the free Perl archives (there are no doubt PHP and ASP scripts that do the same thing).

I use the least important page because I want to know when the bots are doing a deeper crawl than just the home page -- you can of course put the line of exec cgi code at the bottom of the home page or any other page that you want. This sort of program also alerts you to all the other bots that are crawling through your pages, which is useful in htaccess blocking.
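If you can download your raw access log, the same answer can be read straight from it. The Python sketch below is not the Perl script mentioned above; it assumes the Apache "combined" log format, and `last_bot_visit` is a made-up name. It matches on the user-agent string only, which anyone can fake, so treat the result as a rough indication.

```python
import re
from datetime import datetime

# Assumes the Apache "combined" log format (timestamp, request, user agent).
LOG_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def last_bot_visit(lines, path="/", agent_token="Googlebot"):
    """Return the latest time a matching bot requested the path, or None."""
    latest = None
    for line in lines:
        m = LOG_LINE.search(line)
        if not m or m.group("path") != path:
            continue
        if agent_token not in m.group("agent"):
            continue
        # Timestamps look like 29/Oct/2008:03:15:00 +0000
        when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        if latest is None or when > latest:
            latest = when
    return latest
```

Passing a different `path` checks a deeper page instead of the home page, and a different `agent_token` checks another bot.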



 7:14 pm on Oct 29, 2008 (gmt 0)

Thanks, Reno. I'll try it.


 7:23 pm on Oct 29, 2008 (gmt 0)

They are inaccurate. A lot of them say they are linked from my sitemap. I dug through the sitemap and, sure enough, the links are NOT there. I see what it is doing, though: it is processing a partial URL and not the whole URL, so that is giving it a 404.


 7:39 pm on Oct 29, 2008 (gmt 0)

... partial URL. I have seen that a few times on relative URLs.

I rarely see it, because in my own work I always use the full path starting with a / to count from the root.
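To show why root-relative links are safer, here is a short Python sketch using the standard `urljoin` function. It demonstrates how a relative href resolves against the page it appears on; the example.com URLs and the `resolve` helper name are invented for illustration, and the sketch says nothing about how Google's own crawler handles such links.

```python
from urllib.parse import urljoin


def resolve(page_url, href):
    """Resolve an href the way a browser would, relative to the page URL."""
    return urljoin(page_url, href)


page = "http://example.com/widgets/blue/index.html"

# A relative href depends on the directory of the page that contains it.
relative = resolve(page, "specs.html")

# A root-relative href (leading slash) resolves the same from any page.
root_relative = resolve(page, "/specs.html")

# The same relative href on a URL without a trailing slash lands elsewhere.
shifted = resolve("http://example.com/widgets/blue", "specs.html")
```

Here `relative` ends up under `/widgets/blue/`, `shifted` ends up under `/widgets/`, and only `root_relative` is independent of where the link was found.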


 8:13 pm on Oct 29, 2008 (gmt 0)


Can you please tell me, is the script that you use in the HTML code or does it have to be in the .htaccess file?


 9:23 pm on Oct 29, 2008 (gmt 0)


You'd install the script in your cgi-bin; then upload a shtml page with a cgi exec to your public_html.

[edited by: tedster at 9:52 pm (utc) on Oct. 29, 2008]


 9:29 pm on Oct 29, 2008 (gmt 0)

Would it be a good idea to 301 redirect the wrong URLs to the correct ones?


 11:44 pm on Oct 29, 2008 (gmt 0)

I don't have access to the root host file. Will this have an effect on uploading the file?

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved