Pages showing up in WMT as 'not found' don't actually exist

     
3:52 pm on Sep 8, 2012 (gmt 0)

5+ Year Member



A client has been receiving a lot of crawl errors in their webmaster account, most of which are URLs that are not being found. The problem is that these pages don't exist. The site has a blog, and the errors seem to be related to it. The pages reported as missing are listed as www.widgets/blog/product1, but the actual page is www.widgets/product1. For some reason it's inserting "blog" into nearly 1000 product page URLs. Any ideas on what could be causing this?
8:51 pm on Sep 8, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Sounds like just another version of Google's (in)famous habit of following non-links. If the page never existed, and you've never said or done anything to lead anyone to believe there is such a page, well, then a 404 is just what they deserve, isn't it?

That's assuming

:: cough-cough ::

that you've fine-tooth-combed the site to make sure there aren't any glitches in relative links.

Don't know about the rest of youse, but if I make a typo in a link and correct it three minutes later, at least one major search engine will have crawled the page during those three minutes. Even if, or especially if, it's a page they normally visit once in three weeks, tops.
9:42 pm on Sep 8, 2012 (gmt 0)

5+ Year Member



Maybe I'm misstating this. What we are finding is that hundreds of actual pages are being indexed, but with the word "blog" inserted into the URL for no reason. These are product pages, not entries in the blog. So everything about the URL is right except for the inserted word "blog".
11:26 pm on Sep 8, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Sounds like some botched relative internal linking somewhere on the site.

Run Xenu LinkSleuth over the site and look very carefully at the reports.
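If a link checker is not at hand, the same idea can be sketched by hand. This is a minimal illustration using only Python's standard library, not a replacement for Xenu: it lists the document-relative links in a chunk of HTML and shows what each one resolves to from a given page. The class name `RelativeLinkAudit`, the helper `audit`, the sample HTML, and the www.example.com URLs are all made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class RelativeLinkAudit(HTMLParser):
    """Collect <a href> values that are resolved relative to the current page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.findings = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        parts = urlsplit(href)
        # Document-relative: no scheme, no host, and no leading "/", "#" or "?".
        if not parts.scheme and not parts.netloc and not href.startswith(("/", "#", "?")):
            self.findings.append((href, urljoin(self.page_url, href)))


def audit(page_url, html):
    """Return (href, resolved URL) pairs for document-relative links."""
    parser = RelativeLinkAudit(page_url)
    parser.feed(html)
    return parser.findings


sample = '<a href="product1">Widget</a> <a href="/product2">Other</a>'
print(audit("http://www.example.com/blog/", sample))
# [('product1', 'http://www.example.com/blog/product1')]
```

A link written as `product1` picks up the `/blog/` directory of the page it sits on, while `/product2` does not, which is the kind of glitch that would produce the reported URLs.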
2:16 am on Sep 9, 2012 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



when you request those indexed urls do you get a 404 status code response or a 200?
have you tried a fetch as googlebot in GWT?
2:25 am on Sep 9, 2012 (gmt 0)

10+ Year Member



On Google WMT, click on the URL and check the 'linked from' tab.
3:46 am on Sep 9, 2012 (gmt 0)

5+ Year Member



We get a 404 code, and when I clicked "linked from", the links are from other URLs that have the same problem.
7:47 am on Sep 9, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



You mean that your nonexistent pages are linked from other nonexistent pages? Yup. Welcome to Webmaster Tools.
8:40 am on Sep 9, 2012 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



did you actually get a 404 status code response from the requested/non-existent url(s) or did you simply get shown or redirected to an error page?

have you tried a fetch as googlebot in GWT?
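The difference between a real 404 and being shown or redirected to an error page can be checked from a script as well. A minimal sketch with Python's standard library, assuming nothing about the site: `NoRedirect`, `first_status` and `classify` are names invented here, and the labels are only my wording for the cases described above.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the first status stays visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # a 3xx response is then raised as HTTPError


def first_status(url, timeout=10):
    """Return (status code, Location header) of the first response for url."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, None
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")


def classify(status, location=None):
    """Label the response in the terms used in this thread."""
    if status in (404, 410):
        return "hard 404: the URL itself reports not found"
    if status in (301, 302, 303, 307, 308):
        return "redirect to %s: check what the target returns" % location
    if status == 200:
        return "200: if the body is an error page, this is a soft 404"
    return "other status: %d" % status
```

Calling `classify(*first_status(url))` on one of the reported URLs shows which case applies; only the first case is a true 404 response.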
11:56 am on Sep 9, 2012 (gmt 0)

10+ Year Member



It seems that your error pages have a relative linking problem. Make sure that all links on your error pages are absolute. Even if the 404 error page returns the correct header, Google takes note of the suggestions (links) on that page.
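To see why this produces nonexistent pages "linked from" other nonexistent pages, consider a rough sketch, assuming a shared error template that contains the three hypothetical links below. The URLs and the helper name `resolve_all` are invented for the illustration.

```python
from urllib.parse import urljoin


def resolve_all(requested_url, hrefs):
    """Resolve each link as a crawler would, relative to the URL that was requested."""
    return {href: urljoin(requested_url, href) for href in hrefs}


# Hypothetical links inside the error-page template.
template_links = ["product2", "/product3", "http://www.example.com/product4"]

# The error page is served at whatever nonexistent URL was requested.
requested = "http://www.example.com/blog/product1"
for href, target in resolve_all(requested, template_links).items():
    print(href, "->", target)
# product2 -> http://www.example.com/blog/product2
# /product3 -> http://www.example.com/product3
# http://www.example.com/product4 -> http://www.example.com/product4
```

Only the first, document-relative link inherits the bogus `/blog/` directory, so each 404 page hands the crawler another URL that will also 404.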
9:36 pm on Sep 9, 2012 (gmt 0)

WebmasterWorld Senior Member sgt_kickaxe is a WebmasterWorld Top Contributor of All Time 5+ Year Member



Wordpress - a general note about the popular CMS that may apply to others like drupal as well.

Wordpress does a lot of redirecting on your behalf. If, for example, a URL is example.com/this-is-an-example-page and you visit any of these very similar URLs, you will likely end up on the same URL:

- example.com/this-is-an-example-
- example.com/this-is-an-example-page..
- example.com/this-is-an-example-pa
- example.com/this-is-an-
- example.com/this-is-an-exa-randomgiberrish

Now those are all different urls but wordpress takes a best guess that the page the visitor really wanted was - example.com/this-is-an-example-page and it sends the visitor there, via 301, MOST OF THE TIME.

You need to make sure that variations of your URLs that don't exist as real destinations don't simply display the RIGHT content on the WRONG (not redirected) URL. You DO NOT want to see "example.com/this-is-an-example-page" content on "example.com/this-is-an-example-page.." (for example, note the two trailing periods).

To test simply add a couple of periods to the end of a wordpress url and see how your installation handles it (it varies by host). Unfortunately there is no easy way to turn this wordpress feature off.
12:15 am on Sep 10, 2012 (gmt 0)

5+ Year Member



It's not a wordpress site. It uses some sort of shopping program, but not sure which.
4:35 am on Sep 10, 2012 (gmt 0)

5+ Year Member



I was wrong. The blog is actually wordpress, although the site itself isn't. I did run a report with Xenu LinkSleuth and nothing appeared.
10:40 am on Sep 10, 2012 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



i'm still curious if you actually got a 404 status code response from the "non-existent" url(s) and/or have you tried a fetch as googlebot in GWT?
12:07 pm on Sep 10, 2012 (gmt 0)

5+ Year Member



I have a similar problem, although the new 404's I got (I also received a warning of "Increase in not found errors") are historical pages that I've long since removed (hence the 404s). They are not linked to internally any more, even though Google still says they are.

It's almost as if Google is using historical versions of our pages to visit historical links to presently non existent pages, and then giving errors for it.
12:21 pm on Sep 10, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



"It's almost as if Google is using historical versions of our pages to visit historical links to presently non existent pages, and then giving errors for it."

They do, and they need to ask for the page that links to the page that links to the page and confirm 404 all the way back up the chain before the data in WMT will change. This can take many months. As long as the list of 404 pages shows URLs that are really 404, there's nothing more to do.
1:09 pm on Sep 10, 2012 (gmt 0)

5+ Year Member



When I use fetch as googlebot, it simply says the page is not found. If I do the home page everything seems fine. It just seems as if Google is creating these pages when they spider the site.
3:12 pm on Sep 10, 2012 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



"It just seems as if Google is creating these pages when they spider the site."

That's because Google doesn't exactly "spider" or "crawl" a site - at least not the old style way. Instead Google builds a list of URLs that they've discovered and then they assign URLs from that list to googlebot.

In other words, a crawl is not done by hitting the home page, following the links that are currently there, following more links on those new pages, etc. So, historical links still DO get requested and the WMT report will show them as 404 if they are currently 404.
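As a toy model of that behaviour (purely illustrative, and in no way a description of Google's actual systems), think of a URL list that only ever grows. The function names and paths here are made up.

```python
def discover(known_urls, page_links):
    """Add every link seen on any fetched page to the permanent URL list."""
    for links in page_links.values():
        known_urls.update(links)
    return known_urls


def status_report(known_urls, live_urls):
    """Re-request every known URL, whether or not anything still links to it."""
    return {url: 200 if url in live_urls else 404 for url in sorted(known_urls)}


known = set()
discover(known, {"/": ["/product1", "/old-page"]})  # the site as it once was
discover(known, {"/": ["/product1"]})               # the site today, link removed
report = status_report(known, live_urls={"/", "/product1"})
print(report)
# {'/old-page': 404, '/product1': 200}
```

Removing the link to `/old-page` does not remove it from the list, so it keeps being requested and keeps showing up as a 404.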

This does not mean that these "crawl errors" are considered a problem. They are only errors in a purely technical sense.

If you want these URLs to return a 404 status, and they do, and there are no internal links left that point to them, then you are OK. At that point you can consider the Webmaster Tools report to be an FYI only. It's not a list of things you still need to fix or else suffer some ranking problem.
3:40 pm on Sep 10, 2012 (gmt 0)

5+ Year Member



Thanks for the explanations, it all makes sense now.

I wonder why these 404s are showing up again though, 28 months after I made the changes that led to the creation of these 404s. I'm fairly sure most of them have already been processed in the past by Google (i.e. they've shown up in WMT before, if I remember correctly, not too long after I made these changes). Maybe it's just Google being extra thorough and re-checking to be sure?
4:01 pm on Sep 10, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Google checks every URL they have ever seen again and again on an occasional basis forever. I have some pages that haven't existed for about 7 years that Google still requests one or two times per year.

They do this because a large proportion of page URLs that don't exist do eventually come back into use.
4:24 pm on Sep 10, 2012 (gmt 0)

5+ Year Member



Thanks tedster. Since these pages never existed I still don't know why they would list them.
4:27 pm on Sep 10, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



They will have gleaned the URL from a link, in this case probably a malformed one.

A URL "exists" as soon as there is a link to it, whether or not that URL resolves to a live server. If the URL does resolve to a server, the URL "exists" whether or not the request can be serviced by returning a page.
11:09 pm on Sep 10, 2012 (gmt 0)

5+ Year Member



Any chance the links are coming from some scraper sites? We have several hundred "incomplete" links showing in WMT as 404s. At first glance they look like malformed links, but they are really Googlebot following text on scraper sites that looks like a link but is not one. For example, a 404 for www.somedomain.com/somedirectory/someproduc when the actual link was www.somedomain.com/somedirectory/someproductnamedTedster.html

Cheers,
Bill
1:43 pm on Sep 11, 2012 (gmt 0)

5+ Year Member



I have noticed an uptick in 404s in GWT as well. In my case, I am using the _trackPageview feature of GA to alias the name of some of my long tail pages like so:

_gaq.push(['_trackPageview', '/category/subcat/etc/']);

Google is treating these like real URLs when I am just using them as an alias to make my analytics more useful.

Double check the "view source" on some of your pages and make sure that you don't find the offending path somewhere in your HTML.
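A quick way to do that check in bulk is to pull every quoted string that looks like a path out of the page source, script blocks included. This is a rough sketch with a deliberately simple pattern; the name `path_like_strings` and the sample markup are invented, and a pattern this loose will also catch ordinary `href` values.

```python
import re

# A quoted string that starts with "/" and contains only typical path characters.
PATH_LITERAL = re.compile(r"[\"'](/[A-Za-z0-9_./-]*)[\"']")


def path_like_strings(source):
    """Return the distinct path-looking string literals found in page source."""
    return sorted(set(PATH_LITERAL.findall(source)))


sample = """
<a href="/product1">Widget</a>
<script>
_gaq.push(['_trackPageview', '/category/subcat/etc/']);
</script>
"""
print(path_like_strings(sample))
# ['/category/subcat/etc/', '/product1']
```

Anything in that list that is not a real page on the site is a candidate for what the crawler is picking up.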
8:29 am on Sep 19, 2012 (gmt 0)

5+ Year Member



Just an update to my situation, which has seen 4000-5000 new 404s being added every week. Some are the older links as I described, but most of the new ones seem to be of the format http://example.com/1345876986000, with a random number at the end (but always starting with 1345).

It looks like a unix timestamp with a few extra zeroes.
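The three extra zeroes fit a timestamp in milliseconds, which is the unit JavaScript's `Date` uses. A small check in Python, where the function name `decode_ms_timestamp` is made up for the example:

```python
from datetime import datetime, timezone


def decode_ms_timestamp(path_segment):
    """Interpret a numeric URL segment as milliseconds since the Unix epoch (UTC)."""
    millis = int(path_segment)
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


print(decode_ms_timestamp("1345876986000").isoformat())
# 2012-08-25T06:43:06+00:00
```

The value decodes to late August 2012, a few weeks before this post, which is consistent with a script-generated timestamp rather than a random number.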

Apparently this is a known issue, possibly to do with the Disqus plug-in (which seems to be the common factor). Google's JohnMu has acknowledged that something isn't quite right, most likely with their Javascript crawler, but as long as these URLs return the proper 404 response, then it's "not something that would affect your site's indexing or ranking".

Google Product Forum thread here:

[productforums.google.com...]
3:42 pm on Sep 19, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Exactly, most of the time these never-existed URLs are the result of their Javascript crawlers that were introduced in 2011. They seem to discover or generate links by parsing the script, and many times this results in non-existent URLs being generated and crawled, resulting in 404 errors. I discovered this back in March 2011.

But be careful with http://example.com/some-slug-here/1345876986000 formats in WordPress, as it will actually return 200 OK with the content of http://example.com/some-slug-here/. The best way to address this is by using canonical URLs.
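To confirm that a page served on the wrong URL still points at the right one, the canonical link can be read out of the HTML. This is a minimal sketch with Python's standard library; `CanonicalFinder`, `canonical_of` and the sample markup are assumptions made for the illustration.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> in a document."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")


def canonical_of(html):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


page = '<head><link rel="canonical" href="http://example.com/some-slug-here/"></head>'
print(canonical_of(page))
# http://example.com/some-slug-here/
```

If the HTML returned for the URL with the number appended declares the clean URL as canonical, the duplicate is at least labelled as such.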
3:46 pm on Sep 19, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



www.widgets/blog/product1


Is there any Javascript in your site template that uses the word "blog" within the script? Google might be parsing that script and discovering these URLs. Then when they crawl them and don't find them, they report 404 errors. It is again the result of their buggy Javascript crawler.
3:52 pm on Sep 19, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I get requests for
example.com/$1
from Google on several sites due to their wayward javascript detection.
 
