
Google SEO News and Discussion Forum

    
Pages showing up in WMT as 'not found' don't actually exist
Copywriter39




msg:4492684
 3:52 pm on Sep 8, 2012 (gmt 0)

A client has been receiving a lot of crawl errors in their Webmaster Tools account, most of which are URLs that are not being found. The problem is that these URLs don't exist. The site has a blog, and the errors seem to be related to it. The pages reported as missing are listed as www.widgets/blog/product1, but the actual page is www.widgets/product1. For some reason "blog" is being inserted into the URLs of nearly 1000 product pages. Any ideas on what could be causing this?

 

lucy24




msg:4492798
 8:51 pm on Sep 8, 2012 (gmt 0)

Sounds like just another version of Google's (in)famous habit of following non-links. If the page never existed, and you've never said or done anything to lead anyone to believe there is such a page-- well, then a 404 is just what they deserve, isn't it?

That's assuming

:: cough-cough ::

that you've fine-tooth-combed the site to make sure there aren't any glitches in relative links.

Don't know about the rest of youse, but if I make a typo in a link and correct it three minutes later, at least one major search engine will have crawled the page during those three minutes. Even if, or especially if, it's a page they normally visit once in three weeks, tops.

Copywriter39




msg:4492814
 9:42 pm on Sep 8, 2012 (gmt 0)

Maybe I'm misstating this. What we are finding is that hundreds of actual pages are being indexed, but with the word "blog" inserted into the URL for no reason. And these are product pages, not entries in the blog. So everything about the URL is right except for the inserted word "blog".

g1smd




msg:4492821
 11:26 pm on Sep 8, 2012 (gmt 0)

Sounds like some botched relative internal linking somewhere on the site.

Run Xenu LinkSleuth over the site and look very carefully at the reports.
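
To illustrate what "botched relative linking" can look like, here is a minimal sketch in Python. The domain, the page URL, and the HTML snippet are all made up for illustration; the point is only that a link written as `product1` on a page under /blog/ resolves to /blog/product1, while `/product1` resolves to the real product URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def resolve_links(page_url, html):
    """Return (raw href, resolved URL) pairs, resolved the way a browser or crawler would."""
    collector = LinkCollector()
    collector.feed(html)
    return [(href, urljoin(page_url, href)) for href in collector.hrefs]


# Hypothetical blog page containing one relative and one root-relative link.
page_url = "http://www.example.com/blog/"
html = '<a href="product1">Widget</a> <a href="/product1">Widget</a>'

for raw, resolved in resolve_links(page_url, html):
    print(raw, "->", resolved)
```

Running something like this over the blog templates is a quick way to see whether any link to a product page is missing its leading slash.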

phranque




msg:4492844
 2:16 am on Sep 9, 2012 (gmt 0)

when you request those indexed urls do you get a 404 status code response or a 200?
have you tried a fetch as googlebot in GWT?

levo




msg:4492846
 2:25 am on Sep 9, 2012 (gmt 0)

On Google WMT, click on the URL and check the 'linked from' tab.

Copywriter39




msg:4492853
 3:46 am on Sep 9, 2012 (gmt 0)

We get a 404 code, and when I clicked 'linked from', the links come from other URLs that have the same problem.

lucy24




msg:4492898
 7:47 am on Sep 9, 2012 (gmt 0)

You mean that your nonexistent pages are linked from other nonexistent pages? Yup. Welcome to Webmaster Tools.

phranque




msg:4492912
 8:40 am on Sep 9, 2012 (gmt 0)

did you actually get a 404 status code response from the requested/non-existent url(s) or did you simply get shown or redirected to an error page?

have you tried a fetch as googlebot in GWT?
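
One way to answer the status-code question without a browser getting in the way is to request the URL without following redirects. The sketch below uses only Python's standard library; the function names and the wording of the descriptions are my own, and the URL in the usage note is a placeholder.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10):
    """Request a URL without following redirects; return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": "status-check-sketch"})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe(status, location=None):
    """Map a raw status code onto the distinction asked about above."""
    if status == 404:
        return "real 404"
    if status in (301, 302, 303, 307, 308):
        return f"redirect to {location}"
    if status == 200:
        return "200 OK (a 'not found' page served here would be a soft 404)"
    return f"other status {status}"
```

Usage would be along the lines of `print(describe(*fetch_status("http://www.example.com/blog/product1")))`. A "real 404" is the answer you want for a URL that should not exist.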

levo




msg:4492934
 11:56 am on Sep 9, 2012 (gmt 0)

It seems that your error pages have a relative linking problem. Make sure that all links on your error pages are absolute. Even if the 404 error page sends the correct header, Google still takes note of the suggestions (links) on that page.
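
A small sketch of why this would produce non-existent pages that are "linked from" other non-existent pages. It assumes a 404 template containing a relative link; the domain and the link targets are invented for illustration.

```python
from urllib.parse import urljoin

# Hypothetical URL that does not exist and therefore serves the 404 template.
missing_url = "http://www.example.com/blog/product1"

# Links as they might appear in the 404 template (assumed for illustration).
relative_link = "product2"
root_relative_link = "/product2"
absolute_link = "http://www.example.com/product2"

for href in (relative_link, root_relative_link, absolute_link):
    print(href, "->", urljoin(missing_url, href))
```

Only the relative form resolves to another URL under /blog/, which would itself return the same 404 template and so keep the chain going. The root-relative form is included for comparison; it resolves to the same URL as the absolute form.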

Sgt_Kickaxe




msg:4493095
 9:36 pm on Sep 9, 2012 (gmt 0)

WordPress - a general note about the popular CMS that may apply to others, like Drupal, as well.

WordPress does a lot of redirecting on your behalf. If, for example, a URL is example.com/this-is-an-example-page and you visit any of these very similar URLs, you will likely end up on the same URL:

- example.com/this-is-an-example-
- example.com/this-is-an-example-page..
- example.com/this-is-an-example-pa
- example.com/this-is-an-
- example.com/this-is-an-exa-randomgiberrish

Now those are all different urls but wordpress takes a best guess that the page the visitor really wanted was - example.com/this-is-an-example-page and it sends the visitor there, via 301, MOST OF THE TIME.

You need to make sure that variations of your URLs that don't exist as real destinations don't simply display the RIGHT content on the WRONG (not redirected) URL. You DO NOT want to see "example.com/this-is-an-example-page" content on "example.com/this-is-an-example-page.." (for example, note the two trailing periods).

To test, simply add a couple of periods to the end of a WordPress URL and see how your installation handles it (it varies by host). Unfortunately there is no easy way to turn this WordPress feature off.
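
If you want to try more than one variation, a small helper can generate near-miss URLs to request by hand or with any HTTP client. This is only a sketch: the function name and its parameters are made up, and the defaults are chosen to reproduce some of the examples listed above.

```python
def url_variants(url, cuts=(2, 4), dots=2):
    """Build near-miss versions of a URL: trailing periods and truncated slugs."""
    base = url.rstrip("/")
    variants = [base + "." * dots]
    for cut in cuts:
        if cut < len(base):
            variants.append(base[:-cut])
    return variants


# Hypothetical post URL, taken from the example above.
for variant in url_variants("http://example.com/this-is-an-example-page"):
    print(variant)
```

For each variant, a 301 to the real URL or a 404 is fine. What you are looking for is a 200 response that shows the real page's content on the wrong URL.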

Copywriter39




msg:4493115
 12:15 am on Sep 10, 2012 (gmt 0)

It's not a wordpress site. It uses some sort of shopping program, but not sure which.

Copywriter39




msg:4493143
 4:35 am on Sep 10, 2012 (gmt 0)

I was wrong. The blog is actually wordpress, although the site itself isn't. I did run a report with Xenu LinkSleuth and nothing appeared.

phranque




msg:4493193
 10:40 am on Sep 10, 2012 (gmt 0)

i'm still curious if you actually got a 404 status code response from the "non-existent" url(s) and/or have you tried a fetch as googlebot in GWT?

AG4Life




msg:4493207
 12:07 pm on Sep 10, 2012 (gmt 0)

I have a similar problem, although the new 404's I got (I also received a warning of "Increase in not found errors") are historical pages that I've long since removed (hence the 404s). They are not linked to internally any more, even though Google still says they are.

It's almost as if Google is using historical versions of our pages to visit historical links to presently non-existent pages, and then giving errors for it.

g1smd




msg:4493214
 12:21 pm on Sep 10, 2012 (gmt 0)

"It's almost as if Google is using historical versions of our pages to visit historical links to presently non-existent pages, and then giving errors for it."

They do, and they need to ask for the page that links to the page that links to the page and confirm 404 all the way back up the chain before the data in WMT will change. This can take many months. As long as the list of 404 pages shows URLs that are really 404, there's nothing more to do.

Copywriter39




msg:4493230
 1:09 pm on Sep 10, 2012 (gmt 0)

When I use fetch as googlebot, it simply says the page is not found. If I do the home page everything seems fine. It just seems as if Google is creating these pages when they spider the site.

tedster




msg:4493319
 3:12 pm on Sep 10, 2012 (gmt 0)

"It just seems as if Google is creating these pages when they spider the site."

That's because Google doesn't exactly "spider" or "crawl" a site - at least not the old style way. Instead Google builds a list of URLs that they've discovered and then they assign URLs from that list to googlebot.

In other words, a crawl is not done by hitting the home page, following the links that are currently there, following more links on those new pages, etc. So, historical links still DO get requested and the WMT report will show them as 404 if they are currently 404.

This does not mean that these "crawl errors" are considered a problem. They are only errors in a purely technical sense.

If you want these URLs to return a 404 status, and they do, and there are no internal links left that point to them, then you are OK. At that point you can consider the Webmaster Tools report to be an FYI only. It's not a list of things you still need to fix or else suffer some ranking problem.
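
The difference between following today's links and working through a stored list of URLs can be shown with a toy model. This is not Google's actual implementation, just an illustration of the idea described above; the function name and the sample URLs are invented.

```python
def recheck_known_urls(known_urls, live_pages):
    """Toy model: every URL ever discovered is re-requested, whether still linked or not."""
    return {url: (200 if url in live_pages else 404) for url in sorted(known_urls)}


# Hypothetical data: URLs discovered at any time in the past...
known_urls = {"/", "/product1", "/blog/product1", "/old-page"}
# ...versus the pages that exist today.
live_pages = {"/", "/product1"}

report = recheck_known_urls(known_urls, live_pages)
not_found = [url for url, status in report.items() if status == 404]
print(not_found)
```

In this model the malformed and removed URLs keep showing up as 404s even though nothing on the live site links to them, which matches what the report shows.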

AG4Life




msg:4493332
 3:40 pm on Sep 10, 2012 (gmt 0)

Thanks for the explanations, it all makes sense now.

I wonder why these 404s are showing up again though, 28 months after I made the changes that led to the creation of these 404s. I'm fairly sure most of them have already been processed in the past by Google (i.e. they've shown up in WMT before, if I remember correctly, not too long after I made these changes). Maybe just Google being extra thorough/re-checking just to be sure?

g1smd




msg:4493336
 4:01 pm on Sep 10, 2012 (gmt 0)

Google checks every URL they have ever seen again and again on an occasional basis forever. I have some pages that haven't existed for about 7 years that Google still requests one or two times per year.

They do this because a large proportion of page URLs that don't exist do eventually come back into use.

Copywriter39




msg:4493360
 4:24 pm on Sep 10, 2012 (gmt 0)

Thanks tedster. Since these pages never existed I still don't know why they would list them.

g1smd




msg:4493361
 4:27 pm on Sep 10, 2012 (gmt 0)

They will have gleaned the URL from a link, in this case probably a malformed one.

A URL "exists" as soon as there is a link to it, whether or not that URL resolves to a live server. If the URL does resolve to a server, the URL "exists" whether or not the request can be serviced by returning a page.

Bill_H




msg:4493521
 11:09 pm on Sep 10, 2012 (gmt 0)

Any chance the links are coming from some scraper sites? We have several hundred "incomplete" links showing in WMT as 404s that at first glance look like malformed links, but are really Googlebot following text on scraper sites that looks like a link but is not. As in a 404 for www.somedomain.com/somedirectory/someproduc when the actual link was www.somedomain.com/somedirectory/someproductnamedTedster.html

Cheers,
Bill

manny123




msg:4493766
 1:43 pm on Sep 11, 2012 (gmt 0)

I have noticed an uptick in 404s in GWT as well. In my case, I am using the _trackPageview feature of GA to alias the name of some of my long tail pages like so:

_gaq.push(['_trackPageview', '/category/subcat/etc/']);

Google is treating these like real URLs when I am just using them as an alias to make my analytics more useful.

Double check the "view source" on some of your pages and make sure that you don't find the offending path somewhere in your HTML.
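
If there are too many pages to eyeball, the same check can be scripted. A minimal sketch, assuming you already have the page source as a string; the function name and the sample source are made up for illustration.

```python
def find_in_source(html, needle):
    """Return (line number, stripped line) for every source line containing needle."""
    return [
        (number, line.strip())
        for number, line in enumerate(html.splitlines(), start=1)
        if needle in line
    ]


# Hypothetical page source with an analytics alias like the one described above.
source = """<html>
<head>
<script>
_gaq.push(['_trackPageview', '/category/subcat/etc/']);
</script>
</head>
</html>"""

for number, line in find_in_source(source, "/category/subcat/"):
    print(number, line)
```

Searching for the path fragment that appears in the 404 report (for example "/blog/") shows whether it occurs anywhere in the markup or inline scripts.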

AG4Life




msg:4497241
 8:29 am on Sep 19, 2012 (gmt 0)

Just an update to my situation, which has seen 4000-5000 new 404s being added every week. Some are the older links as I described, but most of the new ones seem to be of the format http://example.com/1345876986000, with a random number at the end (but always starting with 1345).

It looks like a unix timestamp with a few extra zeroes.

Apparently this is a known issue, possibly to do with the Disqus plug-in (it seems to be the common factor). Google's JohnMu has acknowledged that something isn't quite right, most likely with their JavaScript crawler, but as long as these URLs return the proper 404 response, then it's "not something that would affect your site's indexing or ranking".

Google Product Forum thread here:

[productforums.google.com...]
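
The timestamp guess is easy to check: a Unix timestamp in milliseconds has three more digits than one in seconds. The sketch below treats any 13-digit final path segment as a possible millisecond timestamp, which is an assumption made for illustration; the function name is also made up.

```python
import re
from datetime import datetime, timezone

# Assumption: a 13-digit final path segment may be a millisecond timestamp.
MILLIS_SEGMENT = re.compile(r"/(\d{13})/?$")


def decode_millis_url(url):
    """Return the UTC datetime encoded in a trailing 13-digit segment, or None."""
    match = MILLIS_SEGMENT.search(url)
    if match is None:
        return None
    seconds = int(match.group(1)) // 1000  # drop the three extra digits
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(decode_millis_url("http://example.com/1345876986000"))
```

The example URL decodes to 25 August 2012 (UTC), which fits the "always starting with 1345" pattern for URLs generated around that time.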

indyank




msg:4497371
 3:42 pm on Sep 19, 2012 (gmt 0)

Exactly. Most of the time these never-existed URLs are the result of their JavaScript crawlers, which were introduced in 2011. They seem to discover or generate links by parsing the script, and many times this results in non-existent URLs being generated and crawled, which produces 404 errors. I discovered this way back in March 2011.

But be careful with http://example.com/some-slug-here/1345876986000 formats in WordPress, as it will actually return 200 OK with the content of http://example.com/some-slug-here/. The best way to address this is by using canonical URLs.

indyank




msg:4497373
 3:46 pm on Sep 19, 2012 (gmt 0)

"www.widgets/blog/product1"


Is there any JavaScript in your site template that has/uses the word "blog" within the script? Google might be parsing that script and discovering these URLs. Then when they crawl them and don't find them, they report 404 errors. It is again the result of their buggy JavaScript crawler.

g1smd




msg:4497379
 3:52 pm on Sep 19, 2012 (gmt 0)

I get requests for example.com/$1 from Google on several sites due to their wayward JavaScript detection.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved