Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 39 message thread spans 2 pages.
Google Webmaster Tools now shows you where your 404 errors come from!
g1smd




msg:3765208
 9:11 am on Oct 14, 2008 (gmt 0)

I have always tracked down 404 URLs with simple typos in them by doing a Google search, but that only works when the anchor text is the same as the URL in the link.

I get a few incoming links with extra punctuation on the end (usually period, occasionally comma, and very occasionally something else) often formed by poorly designed URL auto-linking routines in common forum, blog and CMS software. It is not readily possible to search for those, and so I already have a rule in my .htaccess that sends a 301 redirect to the same URL with the trailing junk stripped off.
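The rule itself isn't quoted in the thread, but for Apache mod_rewrite a minimal sketch of the idea might look like the following (the exact set of "junk" characters is an assumption — period, comma and underscore are the ones mentioned in this thread):

```apache
# Strip trailing junk (period, comma, underscore) from the requested
# path and 301-redirect to the clean URL.
RewriteEngine on
RewriteRule ^(.*[^.,_])[.,_]+$ /$1 [R=301,L]
```

The `[^.,_]` at the end of the capture keeps the rule from firing on a path made up only of punctuation, and the `+` strips several trailing junk characters in one redirect.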

Even so, I still get a few links each month that make very little sense; whoever linked was really not paying attention to what they were doing. The duff URLs show in the server logs as a Googlebot (or other bot) access (and therefore *without* any referrer information) and then a few days later appear in the Google WMT 404 report. Very often, that is the only place they show... because no human has clicked the link. I hope that someone eventually clicks one so I can capture the referrer information, but it often does not happen.

At this point, a Google search for the duff URL occasionally finds the site where the problem link was posted, but this only works if the anchor text is the same as the link URL and the typo does not involve punctuation. There are a great many links that remain impossible to find, mainly because you can't search for stuff in the HREF on a page, so links with wordy anchor text and duff HREF can't be found. Even more important, many of the duff incoming links have weird punctuation on the end and Google just will not return results for a URL search with an underscore or a quote mark on the end.

However... with this new feature, the list of 404 errors is now much more useful. Now that information *can* be found - and very easily. What a great feature!

I have for a long time had various duff incoming links which were for a valid URL but with an additional underscore on the end, so they would fail to a 404 error. I have added a redirect for those on most sites with the problem, but some remain listed in WMT. I now discover that all of the duff links of that type come from Word documents scattered all over the web. Why this is so, I have no idea; but at least I can now look into it.

Again, this is a great feature. I think people will be shocked at how many duff links they have pointing at their site and how careless the average netizen is when they cut and paste links. My pet peeve is people who post links with lots of unnecessary parameters in them, including session IDs, and, for Google searches, stuff like &client=Firefox or &client=Opera when I am using something else - and the totally ridiculous &rls=GGGL,GGGL;GGGL:2006-17,GGGL;GGGL:en stuff you see users of Firefox posting without any thought whatsoever.

I see that WMT no longer reports the *date* of last Home Page access. It just says the site was "visited", but it does now link through to the graphical crawl report.

 

Receptional Andy




msg:3765215
 9:27 am on Oct 14, 2008 (gmt 0)

Interesting, and definitely useful.

I noticed Google reporting a broken link of the form www.example.com/gi, (trailing comma included).

Now that I can see where Google gets the error from, I can see that Google is actually extracting this "link" from a JavaScript regex replace, as below:

value.replace(/"/gi,"&quot;")

g1smd




msg:3765216
 9:30 am on Oct 14, 2008 (gmt 0)

Ah, so not a link from another site at all, but a "phantom of the bot".

Receptional Andy




msg:3765225
 9:41 am on Oct 14, 2008 (gmt 0)

Yeh - it implies that Googlebot will add stuff like "/example" to the crawl list - even without a href or obvious URL.

centime




msg:3765274
 11:39 am on Oct 14, 2008 (gmt 0)

I wish WMT had a dedicated bug-reporting or feedback function.

Marcia




msg:3765323
 12:42 pm on Oct 14, 2008 (gmt 0)

Yeh - it implies that Googlebot will add stuff like "/example" to the crawl list - even without a href or obvious URL.

Can you elaborate on that?

Receptional Andy




msg:3765327
 12:53 pm on Oct 14, 2008 (gmt 0)

Can you elaborate on that?

Yeh - Google lists www.example.com/gi, as a broken link - I never tracked this down via spidering myself. Google now lists in WMT the linking page as a page on the site itself - one that doesn't actually link, but merely contains a reference as below:

value.replace(/"/gi,"&quot;")

So, my assumption is that in addition to looking for references to http:// to discover URLs, Google is also looking at quoted references that start with a forward slash - in this instance "/gi,". Simple pattern matching that has gone wrong here. There was no href or src attribute prefixing the reference Googlebot picked up on - nothing implying it was a URL; I believe it actually confused regex delimiters with a file path.
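The hypothesis - quoted strings beginning with a forward slash being treated as candidate URLs - can be illustrated with a toy sketch. This is purely a guess at the behaviour, not Google's actual code:

```javascript
// Toy sketch of the hypothesised URL discovery: treat any double-quoted
// string that begins with "/" as a candidate relative path.
function findCandidatePaths(source) {
  const matches = source.match(/"\/[^"\s]*"/g) || [];
  return matches.map(m => m.slice(1, -1)); // strip surrounding quotes
}

// Running it over the regex-replace snippet from the page shows how
// "/gi," could be mistaken for a path: the regex's closing delimiter,
// flags and the following comma sit between two quote characters.
console.log(findCandidatePaths('value.replace(/"/gi,"&quot;")'));
```

Run against the snippet above, this extracts "/gi," - exactly the phantom URL WMT reported.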

[edited by: Receptional_Andy at 12:54 pm (utc) on Oct. 14, 2008]

g1smd




msg:3765343
 1:07 pm on Oct 14, 2008 (gmt 0)

Earlier in the year, I had some duff URLs showing in the reports that had blatantly been extracted from parts of Google's own JavaScript code that they supply for Google CSE and/or Google Analytics and which you insert in your pages. I long ago set up a redirect for those requests and they disappeared from the reports within a few weeks.

le_gber




msg:3765345
 1:08 pm on Oct 14, 2008 (gmt 0)

Very useful info indeed. I wondered where some of the reported 404s were coming from.

so I already have a rule in my .htaccess that sends a 301 redirect to the same URL with the trailing junk stripped off.
mind sharing it with us?

mark_roach




msg:3765347
 1:12 pm on Oct 14, 2008 (gmt 0)

Nice feature which throws up a question.

I have a number of pages where the "Linked From" column is "Unavailable".

It would be interesting to know where these URLs have come from, if they are not linked from another page.

Does it mean Google no longer has the original linking page in its index, or is it more complex than that?

Perhaps the URL was manually submitted or maybe Google sees a pattern in the URLs on your site and tries to guess some additional URLs.

I recently had a message via webmaster tools saying Google had found an excessive number of URLs on my site. Upon investigation this had been caused by a bug in one of my scripts which caused it to generate an infinite number of pages.

I fixed the bug and I now generate 404s for these URLs so they now appear in the WMT 404 report.

The interesting thing is the seed pages which started the infinite page creation all show up in WMT as Linked From - Unavailable.

And more interestingly, they bear more than a passing resemblance to the pages noted by g1smd in his original post.

I have for a long time had various duff incoming links which were for a valid URL but with an additional underscore on the end

My duff pages take the form

www.mysite.com/validpage_.html where www.mysite.com/validpage.html exists.

Further, I have a number of pages on my site which take the form www.mysite.com/dir/yyyy.html where yyyy is numeric.

WMT also shows some 404s with no linking page for www.mysite.com/dir/zzzz.html where zzzz is a random number.

Is Google guessing at pages on my site?

g1smd




msg:3765402
 3:00 pm on Oct 14, 2008 (gmt 0)

*** mind sharing it with us? ***

Check the WebmasterWorld Apache forum. Code for many variations gets posted time and time again as it is a recurring discussion.

I have no idea if there are parallel topics for IIS in the respective forum here.

icedowl




msg:3765447
 3:53 pm on Oct 14, 2008 (gmt 0)

Thanks for bringing this up. It made me take a closer look at one of my 404's. By using the Wayback Machine I found that I'd lost the page when I did a site rebuild back in 2005. That page will be coming back to the site real soon (after I've had some sleep).

Reno




msg:3765494
 4:32 pm on Oct 14, 2008 (gmt 0)

I see that WMT no longer reports the *date* of last Home Page access. It just says the site was "visited"...

I certainly appreciate all the various info I can get at GWT but admittedly am disappointed to see the actual date of the last googlebot visit removed -- I found that very helpful.

....................

mcglynn




msg:3765521
 4:55 pm on Oct 14, 2008 (gmt 0)

BTW, Matt Cutts blogged about this feature recently:
[mattcutts.com...]

[edited by: tedster at 6:26 pm (utc) on Oct. 14, 2008]

g1smd




msg:3765541
 5:26 pm on Oct 14, 2008 (gmt 0)

*** ... disappointed to see the actual date of the last Googlebot visit removed ... ***

They never showed that. The date was for only the Home Page last visit. The crawl graphs give a much better idea - but the best data of all is in your server logs.

[edited by: tedster at 6:27 pm (utc) on Oct. 14, 2008]

rrussell




msg:3765577
 6:33 pm on Oct 14, 2008 (gmt 0)

Great find. Thanks for sharing, I've been wanting this feature for a while and never understood why it wasn't available.

SEOMike




msg:3765711
 9:54 pm on Oct 14, 2008 (gmt 0)

Does it mean Google no longer has the original linking page in its index or is it more complex than that ?

I suspect that these are pages that Google discovered through spidering & indexed but never found any external links for them. I see some pages in WMT from directories that I specifically told GBot to stay out of but the pages were discovered nonetheless. The pages were one-off things for friends and never had any links to them from anywhere. That's probably why they say "unavailable."

Is Google guessing at pages on my site ?

Probably. GBot will do that sometimes in order to determine how your server responds to random queries.

Reno




msg:3765736
 11:02 pm on Oct 14, 2008 (gmt 0)

I do note that if you go to Diagnostics > Content analysis, there is a "Last updated" date. I'm assuming that date indicates the most recent site-wide crawl? (meaning most, though not necessarily all the pages)

...........................

g1smd




msg:3765757
 11:23 pm on Oct 14, 2008 (gmt 0)

No. Different parts of the report are updated at different times and at different frequencies.

For instance, I used to see the link reports update some 2 to 3 days after the "Home Page last visited" date was updated - and the reported date for the last Home Page Visit, at the moment the date was updated, always reported as being some 2 to 5 days (sometimes more) ago.

Most of the reports only update every few days, maybe only once per week, and the period varies, but the date on the Content Analysis report changes every day if there is nothing to report. If there is something to report, then the date changes only once every few days, or maybe only weekly, and again the period randomly changes.

trinorthlighting




msg:3765831
 3:03 am on Oct 15, 2008 (gmt 0)

Did you look under Tools at the Enhance 404 pages (Experimental) feature, and some of the Gadgets that are also there? Neat stuff.

maximillianos




msg:3766189
 4:01 pm on Oct 15, 2008 (gmt 0)

This is a great tool! I just fixed about 30 inbound links using 301 redirects. Typos, wrong case, etc.

Nice!

Gotta go fix the rest! There are about 80 broken inbound links in total...

Thanks for posting this!

g1smd




msg:3766244
 4:51 pm on Oct 15, 2008 (gmt 0)

Are you fixing them one at a time with a specific redirect that works for one URL, or are you crafting a general rule for each mistake, one that fixes any incoming URL request with that particular typo?

maximillianos




msg:3766317
 6:13 pm on Oct 15, 2008 (gmt 0)

A little of both. Most of the link errors are due to case problems, so I'm adding some general case fix redirects to handle all of them at once, along with any future ones.

But some are unique typo problems, so I simply 301 each one to the proper page. The redirect will of course work if someone else makes the same typo, but that is unlikely in a few of these cases.
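The general case-fix rules aren't quoted here, but one common Apache approach is an internal tolower map - a sketch only, and it assumes access to the main server config, since RewriteMap is not allowed in .htaccess:

```apache
# In the main server or vhost config (not .htaccess):
RewriteMap lc int:tolower

# Then, in .htaccess: if the requested path contains any uppercase
# letter, 301 to the all-lowercase version.
RewriteEngine on
RewriteCond %{REQUEST_URI} [A-Z]
RewriteRule ^(.*)$ /${lc:$1} [R=301,L]
```

This only works if the canonical URLs really are all lowercase; a site with mixed-case URLs would need per-URL rules instead.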

This has been a great find. I've now fixed close to 50 legit inbound links. Granted I have thousands and thousands of total inbound links, but hey, every link counts!

nmjudy




msg:3766413
 8:26 pm on Oct 15, 2008 (gmt 0)

I'm experiencing some unexplained problems. Webmaster Tools is reporting broken links from some of my own pages - yet when I look at my source, I don't see the broken link anywhere (the page name that doesn't exist on my site).

I've opened up the page online and hit view source in the browser. Did a control-F to find the non-existing page that it was supposedly linking to and got zero results.

I've also done a whole site search with Dreamweaver looking for a document that it says I'm linking to - got nothing. There haven't been any changes since their reported date.

Is anyone having similar problems?

Lame_Wolf




msg:3766730
 3:23 am on Oct 16, 2008 (gmt 0)

Great find. Thanks for sharing, I've been wanting this feature for a while and never understood why it wasn't available.

Ditto.
About time this was added. I kept seeing 404s that I knew were not from the site but elsewhere.

One really annoying thing about this tool is when you press the "OK" button. The page flips back to the top of the page, forcing you to scroll down and find where you last were.

g1smd




msg:3766895
 8:45 am on Oct 16, 2008 (gmt 0)

I export the list as a .csv file and look at it as a spreadsheet.

There's a benefit to Google in releasing this data. If 50,000 webmasters each clean up 1,000 links, that's a respectable amount of "noise" removed from the link graph.

maximillianos




msg:3767016
 12:55 pm on Oct 16, 2008 (gmt 0)

Here is my dilemma. And I think I already know the answer. Do I fix the links that are coming from spammy auto-generated pages? I feel kind of crappy about fixing them since those pages are scammy crap anyway. But should I just say what the hell a link is a link?

Oliver Henniges




msg:3768605
 11:13 am on Oct 18, 2008 (gmt 0)

Out of 103 errors reported on my site, there was exactly ONE from a crappy external page. 102 of the errors came from googlebot not being able to cope adequately with relative internal paths on my website.

This is a phenomenon I have observed quite often in GWT's diagnostics over the past years.

I know that Google recommends using absolute hrefs wherever possible, and I am continuously trying to improve my site accordingly. But relative links are perfectly valid under the general W3C standards.

For instance, years ago, when modem speed was still quite common, I used to burn my website (HTML pages) onto CD and send it to some customers as a substitute for a paper catalogue. A weird idea nowadays, I know, but in those days it made sense. That's where those relative links (../../../widgets.html) came from.

With all the things you reported on Googlebot following JavaScript links, I must say the engineers simply got some priorities wrong.

The tool in itself might of course be interesting and helpful, but what's the use of a diagnosis from a doctor who makes far more mistakes than I do myself? The negative mental implication of Google telling me "you have a really dirty website, naughty, naughty" is quite frustrating, particularly if it simply isn't correct (or is outdated).

So, normally I don't look at this section.

docbird




msg:3768718
 4:19 pm on Oct 18, 2008 (gmt 0)

Good tool.
Though I've several reports of links from pages on my sites that no longer exist (maybe URLs changed - however uncool), with dates given as in 2006, say. Yet, reportedly, these errors were discovered just days ago - as if these cases were discovered from Google's cache, rather than from current pages.

nmjudy




msg:3770874
 2:45 am on Oct 22, 2008 (gmt 0)

ARrrrrGH! Google Webmaster Tools is reporting broken links 'from my site' 'to my site' that don't exist.

A couple of months ago, I created a redirect in my .htaccess to redirect all index.html pages to the folder root. The code is below and live headers show that the code is working.

RewriteEngine on
#
# Redirect requests for index.html in any directory to "/" in the same directory
#
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+/)?index\.html\ HTTP
RewriteRule ^(.+/)?index\.html$ http://www.example.com/$1 [R=301,L]
#

I only use .html for page extensions. It appears that Google Webmaster Tools is trying to find index.htm links. (?) It says that the source is from my own page - but there is no such link on the source page. Is there a way to force both an index.html AND index.htm redirect to the root folder with the rewrite code?
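On the index.htm question: the pattern already posted can be made to catch both extensions by making the final "l" optional - a sketch based on the rule above:

```apache
# "html?" matches both "htm" and "html", so one rule covers both
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.+/)?index\.html?\ HTTP
RewriteRule ^(.+/)?index\.html?$ http://www.example.com/$1 [R=301,L]
```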

[edited by: tedster at 3:35 am (utc) on Oct. 22, 2008]
[edit reason] switch to example.com - it can never be owned [/edit]
