
Pages showing up in WMT as 'not found' don't actually exist

     
3:52 pm on Sep 8, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


A client has been receiving a lot of crawl errors in their Webmaster Tools account, most of which are URLs that are not being found. The problem is that those pages don't exist. The site has a blog, and the errors seem to be related to it. The pages reported as missing are listed as www.widgets/blog/product1, while the actual page is www.widgets/product1. For some reason "blog" is being inserted into nearly 1000 product page URLs. Any ideas on what could be causing this?
8:51 pm on Sept 8, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13218
votes: 348


Sounds like just another version of Google's (in)famous habit of following non-links. If the page never existed, never did exist, and you've never said or done anything to lead anyone to believe there is such a page, well, then a 404 is just what they deserve, isn't it?

That's assuming

:: cough-cough ::

that you've fine-tooth-combed the site to make sure there aren't any glitches in relative links.

Don't know about the rest of youse, but if I make a typo in a link and correct it three minutes later, at least one major search engine will have crawled the page during those three minutes. Even if, or especially if, it's a page they normally visit once in three weeks, tops.
9:42 pm on Sept 8, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


Maybe I'm misstating this. What we're finding is that hundreds of actual pages are being indexed, but with the word "blog" inserted into the URL for no reason. And these are product pages, not entries in the blog. Everything about the URL is right except for the word "blog" being inserted into it.
11:26 pm on Sept 8, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Sounds like some botched relative internal linking somewhere on the site.

Run Xenu LinkSleuth over the site and look very carefully at the reports.
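
For a concrete picture of how a botched relative link produces exactly this pattern, here is a minimal sketch (Python 3, standard library only; the URLs are made-up examples). A document-relative href resolves against the path of the page it sits on, so a link written as "product1" inside a blog page ends up under /blog/:

from urllib.parse import urljoin

blog_page = "http://www.example.com/blog/some-post"   # made-up blog page URL

# Root-relative href: resolves from the domain root, as intended.
print(urljoin(blog_page, "/product1"))
# http://www.example.com/product1

# Document-relative href: resolves against the /blog/ path,
# which is how phantom /blog/product1 URLs get manufactured.
print(urljoin(blog_page, "product1"))
# http://www.example.com/blog/product1
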
2:16 am on Sept 9, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10562
votes: 14


when you request those indexed urls do you get a 404 status code response or a 200?
have you tried a fetch as googlebot in GWT?
2:25 am on Sept 9, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 12, 2004
posts:610
votes: 3


On Google WMT, click on the URL and check the 'linked from' tab.
3:46 am on Sept 9, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


We get a 404 code, and when I clicked "linked from" it shows other URLs that have the same problem.
7:47 am on Sept 9, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13218
votes: 348


You mean that your nonexistent pages are linked from other nonexistent pages? Yup. Welcome to Webmaster Tools.
8:40 am on Sept 9, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10562
votes: 14


did you actually get a 404 status code response from the requested/non-existent url(s) or did you simply get shown or redirected to an error page?

have you tried a fetch as googlebot in GWT?
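
For reference, a rough way to check that directly is to request one of the phantom URLs without following redirects and print the raw status code. This is only a sketch (Python 3, standard library only; the URL is a placeholder):

import urllib.request
import urllib.error

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes redirects surface as HTTPError instead of being followed.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)
url = "http://www.example.com/blog/product1"   # placeholder phantom URL

try:
    resp = opener.open(url)
    print(url, "->", resp.getcode())   # 200 means the server is actually serving a page
except urllib.error.HTTPError as e:
    print(url, "->", e.code)           # 404 is the clean answer; 301/302 means a redirect to an error page
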
11:56 am on Sept 9, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 12, 2004
posts:610
votes: 3


It seems that your error pages have a relative linking problem. Make sure that all links on your error pages are absolute. Even if the 404 error page returns the correct header, Google takes note of your suggestions (links) on that page.
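
A quick way to act on that advice is to fetch whatever page your server returns for a bad URL and flag any href that isn't absolute or root-relative. A rough sketch, assuming Python 3 and a placeholder URL for the error page:

import urllib.request
import urllib.error
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

url = "http://www.example.com/definitely-not-a-real-page"   # placeholder bad URL
try:
    body = urllib.request.urlopen(url).read()
except urllib.error.HTTPError as e:
    body = e.read()   # a proper 404 response still carries the error page body

collector = HrefCollector()
collector.feed(body.decode("utf-8", "replace"))
for href in collector.hrefs:
    if not href.startswith(("http://", "https://", "/")):
        print("relative link on the error page:", href)
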
9:36 pm on Sept 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member sgt_kickaxe is a WebmasterWorld Top Contributor of All Time 5+ Year Member

joined:Apr 14, 2010
posts:3169
votes: 0


WordPress - a general note about the popular CMS that may apply to others, like Drupal, as well.

WordPress does a lot of redirecting on your behalf. If, for example, a URL is example.com/this-is-an-example-page and you visit any of these very similar URLs, you will likely end up on the same page:

- example.com/this-is-an-example-
- example.com/this-is-an-example-page..
- example.com/this-is-an-example-pa
- example.com/this-is-an-
- example.com/this-is-an-exa-randomgiberrish

Now those are all different URLs, but WordPress takes a best guess that the page the visitor really wanted was example.com/this-is-an-example-page and sends the visitor there, via 301, MOST OF THE TIME.

You need to make sure that variations of your URLs that don't exist as real destinations don't simply display the RIGHT content on the WRONG (not redirected) URL. You DO NOT want to see "example.com/this-is-an-example-page" content on "example.com/this-is-an-example-page.." (for example, note the two trailing periods).

To test simply add a couple of periods to the end of a wordpress url and see how your installation handles it (it varies by host). Unfortunately there is no easy way to turn this wordpress feature off.
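
A rough way to run that test across a few variants at once (Python 3, standard library only; example.com and the slug are placeholders):

import urllib.request
import urllib.error

canonical = "http://example.com/this-is-an-example-page"   # placeholder real URL
variants = [
    canonical + "..",
    canonical[:-4],                  # truncated slug
    canonical + "-randomgibberish",
]

for url in variants:
    try:
        resp = urllib.request.urlopen(url)          # redirects are followed
        final = resp.geturl()
        if final.rstrip("/") == canonical.rstrip("/"):
            print(url, "-> redirected to the real page (ok)")
        else:
            print(url, "-> 200 served at", final, "(duplicate content risk)")
    except urllib.error.HTTPError as e:
        print(url, "->", e.code)                    # a 404 or 410 here is also fine
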
12:15 am on Sept 10, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


It's not a WordPress site. It uses some sort of shopping program, but I'm not sure which.
4:35 am on Sept 10, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


I was wrong. The blog is actually WordPress, although the site itself isn't. I did run a report with Xenu LinkSleuth and nothing turned up.
10:40 am on Sept 10, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10562
votes: 14


i'm still curious if you actually got a 404 status code response from the "non-existent" url(s) and/or have you tried a fetch as googlebot in GWT?
12:07 pm on Sept 10, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:July 11, 2008
posts:104
votes: 0


I have a similar problem, although the new 404s I got (I also received a warning of "Increase in not found errors") are for historical pages that I've long since removed (hence the 404s). They are not linked to internally any more, even though Google still says they are.

It's almost as if Google is using historical versions of our pages to visit historical links to presently non existent pages, and then giving errors for it.
12:21 pm on Sept 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


It's almost as if Google is using historical versions of our pages to visit historical links to presently non existent pages, and then giving errors for it.

They do, and they need to ask for the page that links to the page that links to the page and confirm 404 all the way back up the chain before the data in WMT will change. This can take many months. As long as the list of 404 pages shows URLs that are really 404, there's nothing more to do.
1:09 pm on Sept 10, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


When I use fetch as googlebot, it simply says the page is not found. If I fetch the home page, everything seems fine. It just seems as if Google is creating these pages when they spider the site.
3:12 pm on Sept 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


It just seems as if Google is creating these pages when they spider the site.

That's because Google doesn't exactly "spider" or "crawl" a site - at least not the old style way. Instead Google builds a list of URLs that they've discovered and then they assign URLs from that list to googlebot.

In other words, a crawl is not done by hitting the home page, following the links that are currently there, following more links on those new pages, etc. So historical links still DO get requested, and the WMT report will show them as 404 if they are currently 404.

This does not mean that these "crawl errors" are considered a problem. They are only errors in a purely technical sense.

If you want these URLs to return a 404 status, and they do, and there are no internal links left that point to them, then you are OK. At that point you can consider the Webmaster Tools report to be an FYI only. It's not a list of things you still need to fix or else suffer some ranking problem.
3:40 pm on Sept 10, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:July 11, 2008
posts:104
votes: 0


Thanks for the explanations, it all makes sense now.

I wonder why these 404s are showing up again though, 28 months after I made the changes that led to the creation of these 404s. I'm fairly sure most of them have already been processed by Google in the past (i.e. they've shown up in WMT before, if I remember correctly, not too long after I made these changes). Maybe it's just Google being extra thorough and re-checking to be sure?
4:01 pm on Sept 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Google checks every URL they have ever seen again and again on an occasional basis forever. I have some pages that haven't existed for about 7 years that Google still requests one or two times per year.

They do this because a large proportion of page URLs that don't exist do eventually come back into use.
4:24 pm on Sept 10, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


Thanks tedster. Since these pages never existed I still don't know why they would list them.
4:27 pm on Sept 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


They will have gleaned the URL from a link, in this case probably a malformed one.

A URL "exists" as soon as there is a link to it, whether or not that URL resolves to a live server. If the URL does resolve to a server, it "exists" whether or not the request can be serviced by returning a page.
11:09 pm on Sept 10, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Oct 6, 2006
posts:49
votes: 0


Any chance the links are coming from some scraper sites? We have several hundred "incomplete" links showing in WMT as 404s that at first glance look like malformed links, but are really Googlebot following text on scraper sites that looks like a link but is not one. As in a 404 for www.somedomain.com/somedirectory/someproduc when the actual link was www.somedomain.com/somedirectory/someproductnamedTedster.html

Cheers,
Bill
1:43 pm on Sept 11, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:July 13, 2009
posts: 61
votes: 0


I have noticed an uptick in 404s in GWT as well. In my case, I am using the _trackPageview feature of GA to alias the name of some of my long tail pages like so:

_gaq.push(['_trackPageview', '/category/subcat/etc/']);

Google is treating these like real URLs when I am just using them as an alias to make my analytics more useful.

Double check the "view source" on some of your pages and make sure that you don't find the offending path somewhere in your HTML.
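
One way to do that check in bulk is to grep your templates or saved page source for _trackPageview calls and list the virtual paths they push. A small sketch, assuming Python 3 and a local ./templates directory (adjust the location and file pattern to your own setup):

import pathlib
import re

# Matches the path argument in _gaq.push(['_trackPageview', '/some/path/']);
pattern = re.compile(r"_trackPageview'\s*,\s*'([^']+)'")

for path in pathlib.Path("./templates").rglob("*.html"):   # assumed location
    text = path.read_text(errors="replace")
    for match in pattern.finditer(text):
        print(f"{path}: virtual pageview path {match.group(1)}")
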
8:29 am on Sept 19, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:July 11, 2008
posts:104
votes: 0


Just an update to my situation, which has seen 4000-5000 new 404s being added every week. Some are the older links I described, but most of the new ones seem to be of the format http://example.com/1345876986000, with a random number at the end (but always starting with 1345).

It looks like a unix timestamp with a few extra zeroes.

Apparently this is a known issue, possibly to do with the Disqus plug-in (it seems to be the common factor). Google's JohnMu has acknowledged that something isn't quite right, most likely with their JavaScript crawler, but as long as these URLs return the proper 404 response, it's "not something that would affect your site's indexing or ranking".

Google Product Forum thread here:

[productforums.google.com...]
3:42 pm on Sept 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 9, 2010
posts:1806
votes: 9


Exactly. Most of the time these never-existed URLs are the result of the JavaScript crawlers that were introduced in 2011. They seem to discover or generate links by parsing scripts, and many times this results in non-existent URLs being generated and crawled, producing 404 errors. I discovered this way back in March 2011.

But be careful with http://example.com/some-slug-here/1345876986000 formats in WordPress, as it will actually return 200 OK with the content of http://example.com/some-slug-here/. The best way to address this is by using canonical URLs.
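
To verify the canonical is actually doing its job for those timestamp variants, a rough check is to fetch one and look for a rel=canonical pointing at the clean URL. This is a sketch only (Python 3, standard library; the URLs are placeholders, and the regex assumes rel appears before href in the link tag):

import re
import urllib.request
import urllib.error

clean = "http://example.com/some-slug-here/"          # placeholder clean URL
variant = clean + "1345876986000"                     # timestamp-suffixed variant

try:
    html = urllib.request.urlopen(variant).read().decode("utf-8", "replace")
except urllib.error.HTTPError as e:
    print(variant, "->", e.code)                      # a 404 means no duplicate to worry about
else:
    m = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']', html, re.I)
    if m and m.group(1).rstrip("/") == clean.rstrip("/"):
        print(variant, "-> 200, but canonical points at", m.group(1))
    else:
        print(variant, "-> 200 with no matching canonical (duplicate content risk)")
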
3:46 pm on Sept 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 9, 2010
posts:1806
votes: 9


www.widgets/blog/product1


is there any JavaScript in your site template that has/uses the word "blog" within the script? Google might be parsing that script and discovering these URLs. Then when they crawl them and don't find them, they report 404 errors. It is again the result of their buggy JavaScript crawler.
3:52 pm on Sept 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


I get requests for
example.com/$1
from Google on several sites due to their wayward javascript detection.
 
