
Pages showing up in WMT as 'not found' don't actually exist

     
3:52 pm on Sep 8, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


A client has been receiving a lot of crawl errors in their Webmaster Tools account, most of which are URLs that are not being found. The problem is that those pages don't exist. The site has a blog, and the errors seem to be related to it. The pages reported as missing are listed as www.widgets/blog/product1, while the actual page is www.widgets/product1. For some reason "blog" is being inserted into nearly 1000 product page URLs. Any ideas on what could be causing this?
8:51 pm on Sept 8, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13218
votes: 348


Sounds like just another version of Google's (in)famous habit of following non-links. If the page never existed, never did exist, and you've never said or done anything to lead anyone to believe there is such a page, well, then a 404 is just what they deserve, isn't it?

That's assuming

:: cough-cough ::

that you've fine-tooth-combed the site to make sure there aren't any glitches in relative links.

Don't know about the rest of youse, but if I make a typo in a link and correct it three minutes later, at least one major search engine will have crawled the page during those three minutes. Even if, or especially if, it's a page they normally visit once in three weeks, tops.
9:42 pm on Sept 8, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


Maybe I'm misstating this. What we're finding is that hundreds of actual pages are being indexed, but with the word "blog" inserted into the URL for no reason. And these are product pages, not entries in the blog. Everything about the URL is right except for the word "blog" being inserted into it.
11:26 pm on Sept 8, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Sounds like some botched relative internal linking somewhere on the site.

Run Xenu LinkSleuth over the site and look very carefully at the reports.
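
For a concrete picture of how a botched relative link produces exactly this pattern, here is a minimal sketch (Python 3, standard library only; the URLs are made-up examples). A document-relative href resolves against the path of the page it sits on, so a link written as "product1" inside a blog page ends up under /blog/:

from urllib.parse import urljoin

blog_page = "http://www.example.com/blog/some-post"   # made-up blog page URL

# Root-relative href: resolves from the domain root, as intended.
print(urljoin(blog_page, "/product1"))
# http://www.example.com/product1

# Document-relative href: resolves against the /blog/ path,
# which is how phantom /blog/product1 URLs get manufactured.
print(urljoin(blog_page, "product1"))
# http://www.example.com/blog/product1
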
2:16 am on Sept 9, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10562
votes: 14


when you request those indexed urls do you get a 404 status code response or a 200?
have you tried a fetch as googlebot in GWT?
2:25 am on Sept 9, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 12, 2004
posts:610
votes: 3


On Google WMT, click on the URL and check the 'linked from' tab.
3:46 am on Sept 9, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


We get a 404 code, and when I clicked "linked from" it shows other URLs that have the same problem.
7:47 am on Sept 9, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:13218
votes: 348


You mean that your nonexistent pages are linked from other nonexistent pages? Yup. Welcome to Webmaster Tools.
8:40 am on Sept 9, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10562
votes: 14


did you actually get a 404 status code response from the requested/non-existent url(s) or did you simply get shown or redirected to an error page?

have you tried a fetch as googlebot in GWT?
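
For reference, a rough way to check that directly is to request one of the phantom URLs without following redirects and print the raw status code. This is only a sketch (Python 3, standard library only; the URL is a placeholder):

import urllib.request
import urllib.error

class NoRedirect(urllib.request.HTTPRedirectHandler):
    # Returning None makes redirects surface as HTTPError instead of being followed.
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

opener = urllib.request.build_opener(NoRedirect)
url = "http://www.example.com/blog/product1"   # placeholder phantom URL

try:
    resp = opener.open(url)
    print(url, "->", resp.getcode())   # 200 means the server is actually serving a page
except urllib.error.HTTPError as e:
    print(url, "->", e.code)           # 404 is the clean answer; 301/302 means a redirect to an error page
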
11:56 am on Sept 9, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 12, 2004
posts:610
votes: 3


It seems that your error pages have a relative linking problem. Make sure that all links on your error pages are absolute. Even if the 404 error page returns the correct header, Google takes note of your suggestions (links) on that page.
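
A quick way to act on that advice is to fetch whatever page your server returns for a bad URL and flag any href that isn't absolute or root-relative. A rough sketch, assuming Python 3 and a placeholder URL for the error page:

import urllib.request
import urllib.error
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

url = "http://www.example.com/definitely-not-a-real-page"   # placeholder bad URL
try:
    body = urllib.request.urlopen(url).read()
except urllib.error.HTTPError as e:
    body = e.read()   # a proper 404 response still carries the error page body

collector = HrefCollector()
collector.feed(body.decode("utf-8", "replace"))
for href in collector.hrefs:
    if not href.startswith(("http://", "https://", "/")):
        print("relative link on the error page:", href)
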
9:36 pm on Sept 9, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member sgt_kickaxe is a WebmasterWorld Top Contributor of All Time 5+ Year Member

joined:Apr 14, 2010
posts:3169
votes: 0


WordPress - a general note about the popular CMS that may apply to others, like Drupal, as well.

WordPress does a lot of redirecting on your behalf. If, for example, a URL is example.com/this-is-an-example-page and you visit any of these very similar URLs, you will likely end up on the same page:

- example.com/this-is-an-example-
- example.com/this-is-an-example-page..
- example.com/this-is-an-example-pa
- example.com/this-is-an-
- example.com/this-is-an-exa-randomgiberrish

Now those are all different URLs, but WordPress takes a best guess that the page the visitor really wanted was example.com/this-is-an-example-page and sends the visitor there, via 301, MOST OF THE TIME.

You need to make sure that variations of your URLs that don't exist as real destinations don't simply display the RIGHT content on the WRONG (not redirected) URL. You DO NOT want to see "example.com/this-is-an-example-page" content on "example.com/this-is-an-example-page.." (for example, note the two trailing periods).

To test simply add a couple of periods to the end of a wordpress url and see how your installation handles it (it varies by host). Unfortunately there is no easy way to turn this wordpress feature off.
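
A rough way to run that test across a few variants at once (Python 3, standard library only; example.com and the slug are placeholders):

import urllib.request
import urllib.error

canonical = "http://example.com/this-is-an-example-page"   # placeholder real URL
variants = [
    canonical + "..",
    canonical[:-4],                  # truncated slug
    canonical + "-randomgibberish",
]

for url in variants:
    try:
        resp = urllib.request.urlopen(url)          # redirects are followed
        final = resp.geturl()
        if final.rstrip("/") == canonical.rstrip("/"):
            print(url, "-> redirected to the real page (ok)")
        else:
            print(url, "-> 200 served at", final, "(duplicate content risk)")
    except urllib.error.HTTPError as e:
        print(url, "->", e.code)                    # a 404 or 410 here is also fine
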
12:15 am on Sept 10, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


It's not a WordPress site. It uses some sort of shopping program, but I'm not sure which.
4:35 am on Sept 10, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


I was wrong. The blog is actually WordPress, although the site itself isn't. I did run a report with Xenu LinkSleuth and nothing turned up.
10:40 am on Sept 10, 2012 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10562
votes: 14


i'm still curious if you actually got a 404 status code response from the "non-existent" url(s) and/or have you tried a fetch as googlebot in GWT?
12:07 pm on Sept 10, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:July 11, 2008
posts:104
votes: 0


I have a similar problem, although the new 404s I got (I also received a warning of "Increase in not found errors") are for historical pages that I've long since removed (hence the 404s). They are not linked to internally any more, even though Google still says they are.

It's almost as if Google is using historical versions of our pages to visit historical links to presently non existent pages, and then giving errors for it.
12:21 pm on Sept 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


It's almost as if Google is using historical versions of our pages to visit historical links to presently non existent pages, and then giving errors for it.

They do, and they need to ask for the page that links to the page that links to the page and confirm 404 all the way back up the chain before the data in WMT will change. This can take many months. As long as the list of 404 pages shows URLs that are really 404, there's nothing more to do.
1:09 pm on Sept 10, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


When I use fetch as googlebot, it simply says the page is not found. If I fetch the home page, everything seems fine. It just seems as if Google is creating these pages when they spider the site.
3:12 pm on Sept 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
posts:37301
votes: 0


It just seems as if Google is creating these pages when they spider the site.

That's because Google doesn't exactly "spider" or "crawl" a site - at least not the old style way. Instead Google builds a list of URLs that they've discovered and then they assign URLs from that list to googlebot.

In other words, a crawl is not done by hitting the home page, following the links that are currently there, following more links on those new pages, etc. So historical links still DO get requested, and the WMT report will show them as 404 if they are currently 404.

This does not mean that these "crawl errors" are considered a problem. They are only errors in a purely technical sense.

If you want these URLs to return a 404 status, and they do, and there are no internal links left that point to them, then you are OK. At that point you can consider the Webmaster Tools report to be an FYI only. It's not a list of things you still need to fix or else suffer some ranking problem.
3:40 pm on Sept 10, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:July 11, 2008
posts:104
votes: 0


Thanks for the explanations, it all makes sense now.

I wonder why these 404s are showing up again though, 28 months after I made the changes that led to the creation of these 404s. I'm fairly sure most of them have already been processed by Google in the past (i.e. they've shown up in WMT before, if I remember correctly, not too long after I made these changes). Maybe it's just Google being extra thorough and re-checking to be sure?
4:01 pm on Sept 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


Google checks every URL they have ever seen again and again on an occasional basis forever. I have some pages that haven't existed for about 7 years that Google still requests one or two times per year.

They do this because a large proportion of page URLs that don't exist do eventually come back into use.
4:24 pm on Sept 10, 2012 (gmt 0)

New User

5+ Year Member

joined:Feb 15, 2008
posts: 20
votes: 0


Thanks tedster. Since these pages never existed I still don't know why they would list them.
4:27 pm on Sept 10, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


They will have gleaned the URL from a link, in this case probably a malformed one.

A URL "exists" as soon as there is a link to it, whether or not that URL resolves to a live server. If the URL does resolve to a server, it "exists" whether or not the request can be serviced by returning a page.
11:09 pm on Sept 10, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:Oct 6, 2006
posts:49
votes: 0


Any chance the links are coming from some scraper sites? We have several hundred "incomplete" links showing in WMT as 404s that at first glance look like malformed links, but are really Googlebot following text on scraper sites that looks like a link but is not one. As in a 404 for www.somedomain.com/somedirectory/someproduc when the actual link was www.somedomain.com/somedirectory/someproductnamedTedster.html

Cheers,
Bill
1:43 pm on Sept 11, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:July 13, 2009
posts: 61
votes: 0


I have noticed an uptick in 404s in GWT as well. In my case, I am using the _trackPageview feature of GA to alias the name of some of my long tail pages like so:

_gaq.push(['_trackPageview', '/category/subcat/etc/']);

Google is treating these like real URLs when I am just using them as an alias to make my analytics more useful.

Double check the "view source" on some of your pages and make sure that you don't find the offending path somewhere in your HTML.
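
One way to do that check in bulk is to grep your templates or saved page source for _trackPageview calls and list the virtual paths they push. A small sketch, assuming Python 3 and a local ./templates directory (adjust the location and file pattern to your own setup):

import pathlib
import re

# Matches the path argument in _gaq.push(['_trackPageview', '/some/path/']);
pattern = re.compile(r"_trackPageview'\s*,\s*'([^']+)'")

for path in pathlib.Path("./templates").rglob("*.html"):   # assumed location
    text = path.read_text(errors="replace")
    for match in pattern.finditer(text):
        print(f"{path}: virtual pageview path {match.group(1)}")
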
8:29 am on Sept 19, 2012 (gmt 0)

Junior Member

5+ Year Member

joined:July 11, 2008
posts:104
votes: 0


Just an update to my situation, which has seen 4000-5000 new 404s being added every week. Some are the older links I described, but most of the new ones seem to be of the format http://example.com/1345876986000, with a random number at the end (but always starting with 1345).

It looks like a unix timestamp with a few extra zeroes.

Apparently this is a known issue, possibly to do with the Disqus plug-in (it seems to be the common factor). Google's JohnMu has acknowledged that something isn't quite right, most likely with their JavaScript crawler, but as long as these URLs return the proper 404 response, it's "not something that would affect your site's indexing or ranking".

Google Product Forum thread here:

[productforums.google.com...]
3:42 pm on Sept 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 9, 2010
posts:1806
votes: 9


Exactly. Most of the time these never-existed URLs are the result of the JavaScript crawlers that were introduced in 2011. They seem to discover or generate links by parsing scripts, and many times this results in non-existent URLs being generated and crawled, producing 404 errors. I discovered this way back in March 2011.

But be careful with http://example.com/some-slug-here/1345876986000 formats in WordPress, as it will actually return 200 OK with the content of http://example.com/some-slug-here/. The best way to address this is by using canonical URLs.
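
To verify the canonical is actually doing its job for those timestamp variants, a rough check is to fetch one and look for a rel=canonical pointing at the clean URL. This is a sketch only (Python 3, standard library; the URLs are placeholders, and the regex assumes rel appears before href in the link tag):

import re
import urllib.request
import urllib.error

clean = "http://example.com/some-slug-here/"          # placeholder clean URL
variant = clean + "1345876986000"                     # timestamp-suffixed variant

try:
    html = urllib.request.urlopen(variant).read().decode("utf-8", "replace")
except urllib.error.HTTPError as e:
    print(variant, "->", e.code)                      # a 404 means no duplicate to worry about
else:
    m = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']', html, re.I)
    if m and m.group(1).rstrip("/") == clean.rstrip("/"):
        print(variant, "-> 200, but canonical points at", m.group(1))
    else:
        print(variant, "-> 200 with no matching canonical (duplicate content risk)")
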
3:46 pm on Sept 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:Mar 9, 2010
posts:1806
votes: 9


www.widgets/blog/product1


is there any JavaScript in your site template that has/uses the word "blog" within the script? Google might be parsing that script and discovering these URLs. Then when they crawl them and don't find them, they report 404 errors. It is again the result of their buggy JavaScript crawler.
3:52 pm on Sept 19, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


I get requests for
example.com/$1
from Google on several sites due to their wayward javascript detection.
 
