Forum Moderators: Robert Charlton & goodroi
I get a few incoming links with extra punctuation on the end (usually a period, occasionally a comma, and very occasionally something else), often created by poorly designed URL auto-linking routines in common forum, blog and CMS software. It is not readily possible to search for those, so I already have a rule in my .htaccess that sends a 301 redirect to the same URL with the trailing junk stripped off.
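For anyone who wants to see the idea outside of .htaccess, here is a rough Python sketch of the same clean-up logic. It is not my actual rule; the function names and the set of "junk" characters are assumptions made up for illustration, and you would need to trim that set if any of your real URLs legitimately end in one of those characters.

```python
# Characters treated as trailing junk. This set is an assumption for the
# sketch; remove any character that valid URLs on your site can end with.
TRAILING_JUNK = ".,;:!?'\"_"


def strip_trailing_junk(path: str) -> str:
    """Return the path with any run of trailing junk characters removed."""
    cleaned = path.rstrip(TRAILING_JUNK)
    # Never reduce the path to an empty string; fall back to the site root.
    return cleaned or "/"


def redirect_target(path: str):
    """Return the cleaned path if a 301 redirect is warranted, else None."""
    cleaned = strip_trailing_junk(path)
    return cleaned if cleaned != path else None
```

A request for a clean URL returns None (serve it as normal); a request with junk on the end returns the path to send in the 301 Location header.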
Even so, I still get a few links each month that make very little sense; whoever linked was really not paying attention to what they were doing. The duff URLs show in the server logs as a Googlebot (or other bot) access (and therefore *without* any referrer information) and then a few days later appear in the Google WMT 404 report. Very often, those are the only places they show... because no human has clicked the link. I hope that someone eventually clicks on one so I can capture the referrer information, but it often does not happen.
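If you want to pull those referrer-less 404s out of the raw logs yourself, something along these lines would do it. This is only a sketch: it assumes your server writes the common "combined" log format, and the names LOG_LINE and referrerless_404s are invented for the example.

```python
import re
from collections import Counter

# Pattern for one line of the "combined" access log format (an assumption;
# adjust it if your server uses a custom LogFormat).
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def referrerless_404s(lines):
    """Count 404 responses per requested path where no referrer was sent."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        if match.group("status") == "404" and match.group("referrer") in ("", "-"):
            hits[match.group("path")] += 1
    return hits
```

Feed it the lines of an access log (for example, an open file object) and the paths that come back are the ones with no referrer to follow up, which is exactly the gap the WMT report fills.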
At this point, a Google search for the duff URL occasionally finds the site where the problem link was posted, but this only works if the anchor text is the same as the link URL and the typo does not involve punctuation. There are a great many links that remain impossible to find, mainly because you can't search for stuff in the HREF on a page, so links with wordy anchor text and duff HREF can't be found. Even more important, many of the duff incoming links have weird punctuation on the end and Google just will not return results for a URL search with an underscore or a quote mark on the end.
However... with this new feature, the list of 404 errors is now much more useful. Now that information *can* be found - and very easily. What a great feature!
I have for a long time had various duff incoming links which were for a valid URL but with an additional underscore on the end, so they would fail to a 404 error. I have added a redirect for those on most sites with the problem, but some remain listed in WMT. I now discover that all of the duff links of that type come from Word documents scattered all over the web. Why this is so, I have no idea; but at least I can now look into it.
Again, this is a great feature. I think people will be extremely shocked as to how many duff links they have pointing at their site and how careless the average netizen is when they cut and paste links. My pet peeve is people who post links with lots of unnecessary parameters in them, including session IDs, and, for Google searches, stuff like &client=Firefox or &client=Opera when I am using something else - and the totally ridiculous &rls=GGGL,GGGL;GGGL:2006-17,GGGL;GGGL:en stuff you see users of Firefox posting without any thought whatsoever.
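To show how little effort it takes to tidy a link before posting it, here is a small Python sketch that drops the junk parameters. The "client" and "rls" names come from the examples above; the session-ID names are just common conventions I have assumed, not a complete list, and strip_noise_params is a made-up name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names treated as noise. "client" and "rls" are from the post;
# the session-ID names are assumed common conventions, not an exhaustive list.
NOISE_PARAMS = {"client", "rls", "sid", "sessionid", "phpsessid", "jsessionid"}


def strip_noise_params(url: str) -> str:
    """Return the URL with noise parameters removed from its query string."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NOISE_PARAMS
    ]
    # Rebuild the URL with only the parameters that were kept.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Note that the query string is re-encoded on the way out, so the remaining parameters may come back with slightly different escaping than they went in with.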
I see that WMT no longer reports the *date* of last Home Page access. It just says the site was "visited", but it does now link through to the graphical crawl report.
I use the least important page because I want to know when the bots are doing a deeper crawl than just the home page -- you can of course put the line of exec cgi code at the bottom of the home page or any other page that you want. This sort of program also alerts you to all the other bots that are crawling through your pages, which is useful for .htaccess blocking.