Forum Moderators: Robert Charlton & goodroi
I get a few incoming links with extra punctuation on the end (usually period, occasionally comma, and very occasionally something else) often formed by poorly designed URL auto-linking routines in common forum, blog and CMS software. It is not readily possible to search for those, and so I already have a rule in my .htaccess that sends a 301 redirect to the same URL with the trailing junk stripped off.
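For reference, a sketch of what such a trailing-punctuation rule might look like (this is an illustration, not the poster's actual rule, and the exact set of punctuation stripped is an assumption):

```apache
# Sketch: 301-redirect any URL ending in one or more periods or
# commas to the same URL with the trailing punctuation removed.
# (Illustrative only; extend the character class as needed.)
RewriteEngine on
RewriteRule ^(.+?)[.,]+$ /$1 [R=301,L]
```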
Even so, I still get a few links each month that make very little sense; whoever linked was really not paying attention to what they were doing. The duff URLs show in the server logs as a Googlebot (or other bot) access (and therefore *without* any referrer information) and then a few days later appear in the Google WMT 404 report. Very often, those are the only places they show... because no human has clicked the link. I hope that someone eventually clicks on one so I can capture the referrer information, but it often does not happen.
At this point, a Google search for the duff URL occasionally finds the site where the problem link was posted, but this only works if the anchor text is the same as the link URL and the typo does not involve punctuation. A great many links remain impossible to find, mainly because you can't search for what is inside the href of a page, so links with wordy anchor text and a duff href can't be found. Even more importantly, many of the duff incoming links have weird punctuation on the end, and Google just will not return results for a URL search with an underscore or a quote mark on the end.
However... with this new feature, the list of 404 errors is now much more useful. Now that information *can* be found - and very easily. What a great feature!
I have for a long time had various duff incoming links which were for a valid URL but with an additional underscore on the end, so they would fail to a 404 error. I have added a redirect for those on most sites with the problem, but some remain listed in WMT. I now discover that all of the duff links of that type come from Word documents scattered all over the web. Why this is so, I have no idea; but at least I can now look into it.
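A minimal sketch of the kind of redirect described here, assuming the stray underscore sits immediately before the .html extension (illustrative, not the poster's actual rule):

```apache
# Sketch: 301 /validpage_.html to /validpage.html.
RewriteEngine on
RewriteRule ^(.+)_\.html$ /$1.html [R=301,L]
```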
Again, this is a great feature. I think people will be shocked at how many duff links they have pointing at their site and how careless the average netizen is when they cut and paste links. My pet peeve is people who post links with lots of unnecessary parameters in them, including session IDs, and, for Google searches, stuff like &client=Firefox or &client=Opera when I am using something else - and the totally ridiculous &rls=GGGL,GGGL;GGGL:2006-17,GGGL;GGGL:en stuff you see users of Firefox posting without any thought whatsoever.
I see that WMT no longer reports the *date* of last Home Page access. It just says the site was "visited", but it does now link through to the graphical crawl report.
I noticed Google reporting a broken link of the form www.example.com/gi,
Now that I can see where Google gets the error from, I can see that Google is actually extracting this "link" from a JavaScript regex replace, as below:
value.replace(/"/gi,"&quot;")
Can you elaborate on that?
Yeh - Google lists www.example.com/gi, as a broken link; I never tracked this down via spidering myself. Google now lists in WMT the linking page as a page on the site itself: a page that doesn't actually link anywhere, but merely contains the reference below:
value.replace(/"/gi,"&quot;")
So, my assumption is that in addition to looking for references to http:// to discover URLs, it is also looking at references within quotes that start with a forward slash - in this instance "/gi,". Simple pattern matching that has gone wrong in this instance. There was nothing prefixing the reference Googlebot picked up on, such as an href or src attribute, that implied it was a URL; I believe it actually confused regex delimiters with a file path.
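To illustrate that theory, here is a small sketch of a naive extractor that treats any quoted string starting with a forward slash as a site-relative link. The function name and regex are my own illustration, not Google's actual heuristic:

```javascript
// Naive link discovery: collect every double-quoted string that
// begins with a forward slash, as if it were a site-relative URL.
// (Illustrative sketch only; not Google's actual code.)
function extractRelativeLinks(source) {
  const links = [];
  const quotedPath = /"(\/[^"]*)"/g;
  let match;
  while ((match = quotedPath.exec(source)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// In the regex replace that confused the crawler, the characters
// "/gi," happen to sit between two double quotes, so a naive
// extractor reads them as a relative link.
const snippet = 'value.replace(/"/gi,"&quot;")';
console.log(extractRelativeLinks(snippet)); // → [ '/gi,' ]
```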
[edited by: Receptional_Andy at 12:54 pm (utc) on Oct. 14, 2008]
I have a number of pages where the "Linked From" column is "Unavailable".
It would be interesting to know where these URLs have come from, if they are not linked from another page.
Does it mean Google no longer has the original linking page in its index or is it more complex than that ?
Perhaps the URL was manually submitted or maybe Google sees a pattern in the URLs on your site and tries to guess some additional URLs.
I recently had a message via webmaster tools saying Google had found an excessive number of URLs on my site. Upon investigation this had been caused by a bug in one of my scripts which caused it to generate an infinite number of pages.
I fixed the bug and I now generate 404s for these URLs so they now appear in the WMT 404 report.
The interesting thing is the seed pages which started the infinite page creation all show up in WMT as Linked From - Unavailable.
And more interestingly, they bear more than a passing resemblance to the pages noted by g1smd in his original post.
I have for a long time had various duff incoming links which were for a valid URL but with an additional underscore on the end
My duff pages take the form
www.mysite.com/validpage_.html where www.mysite.com/validpage.html exists.
Further, I have a number of pages on my site which take the form www.mysite.com/dir/yyyy.html where yyyy is numeric.
WMT also shows some 404s with no linking page for www.mysite.com/dir/zzzz.html where zzzz is a random number.
Is Google guessing at pages on my site ?
I see that WMT no longer reports the *date* of last Home Page access. It just says the site was "visited"...
[edited by: tedster at 6:26 pm (utc) on Oct. 14, 2008]
They never showed that. The date was for only the Home Page last visit. The crawl graphs give a much better idea - but the best data of all is in your server logs.
[edited by: tedster at 6:27 pm (utc) on Oct. 14, 2008]
Does it mean Google no longer has the original linking page in its index or is it more complex than that ?
Is Google guessing at pages on my site ?
For instance, I used to see the link reports update some 2 to 3 days after the "Home Page last visited" date was updated - and the reported date for the last Home Page Visit, at the moment the date was updated, always reported as being some 2 to 5 days (sometimes more) ago.
Most of the reports only update every few days, maybe only once per week, and the period varies, but the date on the Content Analysis report changes every day if there is nothing to report. If there is something to report, then the date changes only once every few days, or maybe only weekly, and again the period randomly changes.
But some are unique typo problems, so I simply 301 each one to the proper page. The redirect will of course work if someone else makes the same typo, but that is unlikely in a few of these cases.
This has been a great find. I've now fixed close to 50 legit inbound links. Granted I have thousands and thousands of total inbound links, but hey, every link counts!
I've opened up the page online and hit view source in the browser. Did a control-F to find the non-existing page that it was supposedly linking to and got zero results.
I've also done a whole site search with Dreamweaver looking for a document that it says I'm linking to - got nothing. There haven't been any changes since their reported date.
Is anyone having similar problems?
Great find. Thanks for sharing, I've been wanting this feature for a while and never understood why it wasn't available.
Ditto.
About time this was added. I kept seeing 404s that I knew were not from the site but elsewhere.
One really annoying thing about this tool: when you press the "OK" button, the page jumps back to the top, forcing you to scroll down and find where you last were.
This is a phenomenon I have observed quite often in GWT's diagnostics over the past years.
I know that Google recommends using absolute hrefs wherever possible, and I am continuously trying to improve my site accordingly. But relative links are perfectly valid under the general W3C standards.
For instance, years ago, when dial-up modems were still quite common, I used to burn my website (HTML pages) on CD and send it to some customers as a substitute for a paper catalogue. A weird idea nowadays, I know, but in those days it made sense. That's where those relative links (../../../widgets.html) came from.
With all the things you reported on Googlebot following JavaScript links, I must say the engineers simply got some priorities wrong.
The tool in itself might of course be interesting and helpful, but what's the use of diagnostics from a doctor who makes far more mistakes than I do myself? The negative mental implications of Google telling me "you have a really dirty website, naughty, naughty" are quite frustrating, particularly if it simply isn't correct (or is outdated).
So, normally I don't look at this section.
A couple of months ago, I created a redirect in my .htaccess to redirect all index.html pages to the folder root. The code is below and live headers show that the code is working.
RewriteEngine on
#
# Redirect requests for index.html in any directory to "/" in the same directory
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+/)?index\.html\ HTTP
RewriteRule ^(.+/)?index\.html$ http://www.example.com/$1 [R=301,L]
#
I only use .html for page extensions. It appears that Google Webmaster Tools is trying to find index.htm links. (?) It says that the source is from my own page - but there is no such link on the source page. Is there a way to force both an index.html AND index.htm redirect to the root folder with the rewrite code?
[edited by: tedster at 3:35 am (utc) on Oct. 22, 2008]
[edit reason] switch to example.com - it can never be owned [/edit]
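For what it's worth, one way to cover both extensions is to make the final "l" in the pattern optional. A sketch based on the rule above (untested, and assuming only index.html and index.htm need handling):

```apache
RewriteEngine on
#
# Redirect requests for index.html OR index.htm in any directory to "/"
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+/)?index\.html?\ HTTP
RewriteRule ^(.+/)?index\.html?$ http://www.example.com/$1 [R=301,L]
```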