
Google SEO News and Discussion Forum

    
Why is Google indexing my 404 page?
Can anyone give advice?
Vimes
msg:734748, 4:18 am on Sep 23, 2005 (gmt 0)

Hi,

I'm using ISAPI, and for some reason G has started indexing my custom 404 page.

The server header is correct, so why would G fully index this page?

This is not on: not only has my page total tripled with pages that do not exist, G is indexing my 404 page itself.

I'm living by the guidelines, so why isn't G?

(bitter and twisted . com)

Any help greatly appreciated.

Vimes

 

powerfulponder
msg:734749, 4:04 am on Sep 25, 2005 (gmt 0)

How is the bot reaching your 404 pages? Does your site link to them in some way?

g1smd
msg:734750, 6:36 am on Sep 25, 2005 (gmt 0)

What URL is the "404 page" indexed as?

If Google is directly accessing www.domain.com/errorpages/error404.html then, of course, that page will be served with status "200".

You should also disallow the /errorpages/ folder in the robots.txt file, or add <meta name="robots" content="noindex"> to the <head> section of each of your error pages (I assume that you have pages for errors 401 and 403, etc., too). A minimal robots.txt sketch follows at the end of this post.

If pages like www.domain.com/some.folder/some.content.page.html are still showing up in the index up to two years after they ceased to exist, then this is widespread, especially for pages marked as Supplemental.
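
A minimal sketch of the robots.txt approach, assuming the error pages live in an /errorpages/ folder as above (adjust the path to your own layout):

User-agent: *
Disallow: /errorpages/

Note that robots.txt only stops compliant bots from crawling those URLs; the meta noindex tag is what keeps a page that has already been discovered out of the index.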

The Contractor
msg:734751, 12:23 pm on Sep 25, 2005 (gmt 0)

Simply place the following meta tag in your custom 404 page. I am assuming you have a custom 404 with your site's navigation in it.

<meta name="robots" content="noindex,follow">

Vimes
msg:734752, 10:03 am on Sep 26, 2005 (gmt 0)

OK, thanks g1smd.

That would seem to be the problem.

Still trying to track down how the bot actually got to the /errorfolder/. I'm guessing the rule wasn't there, or wasn't working correctly, and that's how Gbot got to the folder.

All that you mentioned seems to be happening. I dislike having to filter through thousands of pages that no longer exist just to see a more accurate page total.

I will block it in robots.txt anyway, but this only means that the pages will resurface the next time Yahoo increases its index size :P

Thanks again.

Vimes.

johan
msg:734753, 11:25 am on Sep 26, 2005 (gmt 0)

All that is not needed. I am no expert, but I am sure you have to set that page up as a 404 in your server admin; this gives the page a header (unseen by us) that tells the spider it is a 404 page, so Google then does not index it. Anyhow, I don't think it's a big deal, and your site will live.

jd01
msg:734754, 5:23 pm on Sep 26, 2005 (gmt 0)

The most likely problem is you have your error page location defined incorrectly in either the httpd.conf file (usually set by the control panel) or in the .htaccess.

This will result in a 302 status, not a 404, but your custom error page will be served.
ErrorDocument 404 http://yoursite.com/yourpath/errordoc.html

This will result in a 404, with your custom error page being served.
ErrorDocument 404 /yourpath/errordoc.html

The fact that the page is being indexed at multiple URLs indicates that this is the problem -- SEs are very good at following HTTP standards, and if a 404 were served when a page is not found, they would not index the content of that location.

If what I mentioned is the case, the other corrections are really just work-arounds, not solutions.

You might try a header check [webmasterworld.com] if you are not sure what your server is telling SEs and browsers.
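
If your host runs PHP 5, another quick way to see the raw status line is PHP's get_headers() function. A minimal sketch (the URL is a placeholder for a known-missing page on your own site):

<?php
// Fetch just the response headers for a URL that should not exist,
// and print the status line the server actually sends.
$headers = get_headers('http://www.example.com/no-such-page.html');
echo $headers[0]; // expect something like "HTTP/1.1 404 Not Found"
?>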

Justin

Brian
msg:734755, 6:54 pm on Sep 26, 2005 (gmt 0)

Pardon me if I'm being a little dumb here, but who cares if Google indexes 404 pages?

I've often seen it index mine, but then it does all kinds of strange stuff that matters not at all.

g1smd
msg:734756, 7:08 pm on Sep 26, 2005 (gmt 0)

If your 404 page is a duplicate of your main index page, then it can cause the index page to go AWOL.

It can also affect PR, and several other things. I think there is a thread from a few months back, somewhere in the forum, with a much longer list of problems that it can cause...

jd01
msg:734757, 10:07 pm on Sep 26, 2005 (gmt 0)

"I've often seen it index mine, but then it does all kinds of strange stuff that matters not at all."

Used to think the same thing, then realized G (and Y, M, etc.) does not often do things that do not matter to them... I know the issue being discussed causes indexing problems, because a part of one of my sites is dynamic, and did not return a 404 when no results were found, just an empty page. SEs indexed these empty pages (about 300) and quit indexing the rest of my site, so I started serving proper 404 headers -- within about 3 weeks, my indexed pages jumped 10-fold in G and tripled in Y.

If you are trying to rank a site, it is generally best to follow the standards and communicate with SEs correctly, even if it seems irrelevant or useless to do so.

Justin

Vimes
msg:734758, 4:24 am on Sep 27, 2005 (gmt 0)

Hi, thanks all for your input.

Jd, the server header has always been the correct response on the 404; I learnt that one a while ago. But as I've looked into this deeper and deeper over the last few days, the confusing thing is what I'm getting indexed at the moment:

1. http://mysite.com/yourpath/errordoc.html
2. http://mysite.com/folder/docname.html

Now, URL one is the path to my error page. OK, fine: Gbot has somehow found my error folder and indexed the 404 page.

URL two is a "genuine" broken link for which G has indexed the error page (cache dates, everything), yet the server header for this URL is correct.

Checking through my logs for some server problem, I can find nothing. I think I even found the Gbot request that indexed the page, and it was given the correct 404 response. (I'm not 100% sure of that, as my log files are huge, so I might have missed another hit.)

Still looking... I agree SEs are good at following the HTTP standards, and this is why I'm really confused.

I can only think there is or was a conflict between rules in the httpd.conf file, but as of yet I'm not seeing it.

<sigh>

Vimes.

jd01
msg:734759, 5:16 am on Sep 27, 2005 (gmt 0)

Are your pages dynamic? If they are, that could be causing issues.

Another idea I have had: if the error doc was sending the wrong header at some other time (it used to be a 302 or something), the SEs may store this information in a separate location from the main URLs, and it may have to 'expire' from that location before it is no longer requested. (IOW, if a page was a 302 instead of a 404, that location may be stored differently and re-requested from the original location until a pre-specified period of time has passed.)

Does not really explain why a new cache would be created now though...

If the headers are correct, and the page is not dynamic (if it is, I have some other thoughts), you might try serving a blank page with a meta refresh after a few seconds to a page of links, and see if that helps...

Some other thoughts:

I would also check the full page headers using something like web-sniffer.net... It will allow you to check for proper 304s based on date and ETag. When you see a 404, the date should be current, and there should be no content length, mod date, ETag, etc. served. You should also not have the option to check for a 304; if you do, G might be getting a Not Modified header (304) instead of the 404 and be calling the cache updated, even though they are requesting, but not accessing, the page.
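
To test that last point yourself, here is a rough PHP sketch (the URL is a placeholder for a known-missing page): it requests the page with an If-Modified-Since header and prints the status line, which should be a 404, never a 304.

<?php
// Ask for a missing page as a conditional request and show what the
// server answers. A 304 here would explain stale caches being kept.
// file_get_contents() returns false on a 404 (hence the @), but the
// received status line is still available in $http_response_header.
$context = stream_context_create(array('http' => array(
    'header' => "If-Modified-Since: Fri, 01 Jul 2005 00:00:00 GMT\r\n"
)));
@file_get_contents('http://www.example.com/missing-page.html', false, $context);
echo $http_response_header[0]; // expect "HTTP/1.1 404 Not Found", not a 304
?>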

You could test the above by making a change to the 404 error page, then checking the cache of a (real) missing page that is cached by G even though it does not exist (obviously, its cache date must be later than the change made to the 404 page). If the change is present on the missing page, then G is re-caching your custom 404 page, and somewhere it is not getting the right message from the headers (it would probably be getting either a 302 or 200 instead). If the change is not present, then you can pretty much assume G is getting a 304 for some reason, and keeping an old cache of the page as current.

If you can narrow it down to one of these two situations, it should be easier to figure out how to fix it.

Very interested to know if you find anything...

Justin

Vimes
msg:734760, 5:54 am on Sep 27, 2005 (gmt 0)

Hi Jd,

Thanks, that's given me some thought.

I had already made a change to the 404 page...
I'll get back when/if I know any more.

thanks again.

Vimes.

cws3di
msg:734761, 8:04 am on Sep 27, 2005 (gmt 0)

jd01
part of one of my sites is dynamic, and did not return a 404 when no results were found, just an empty page. SEs indexed these empty pages (about 300) and quit indexing the rest of my site, so I started serving proper 404 headers --

**************************************

I would really appreciate an explanation of how you solved this problem. I have a relatively simple PHP site where I have kept the parameters short and simple, but G came in a while back and spidered a bunch of non-existent parameters. Those returned "empty pages" rather than 404s, and of course they are all "duplicate" empty pages. How to prevent this?

www.example.com/page1.php

I have provided users with a way to sort by price, features, type, etc., so the proper urls would be:

www.example.com/page1.php?sortby=type&

G has added a bunch of different non-existent parameters after the ? like:

www.example.com/page1.php?abc=xyz& which returns a basic page (with my standard header and logo) with no data, i.e. an empty page.

Hundreds of these!
If you have a solution that returns a 404 for non-existent parameters, please help!

jd01
msg:734762, 9:35 am on Sep 27, 2005 (gmt 0)

You need to have your database connection at the top of the page, and make sure you do not output anything before you check to see if there is a result. (Once you have sent output, you can no longer set the header.) Then you can use something like this:

$query = "QUERY STUFF";
$result = mysql_query($query);      // run the query first
$row = mysql_fetch_array($result);  // then fetch a row from the result
if (empty($row)) { header("HTTP/1.0 404 Not Found"); exit(); }

This will simply return a 404 and an empty page if there is no result for a query. (I check for a specific condition in the DB, so it may not be 'cut and paste' ready, but is close.)

Hope this helps.

Justin

Added: You can also set the header based on missing variables, e.g.:

if (!isset($_GET['yourvariable']) || !isset($_GET['anothervariable']) || !isset($_GET['somethingelse'])) { header("HTTP/1.0 404 Not Found"); exit(); }

The basics are you need to check at the beginning of the script, and it has to be for something that *should* be there. If it is not, then serve the 404. You should also check to see if the variables are empty().
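
For cws3di's case of invented parameters, a variant of the same idea is to whitelist the only parameters the script understands and 404 everything else. A rough sketch, again to be run before any output ($allowed below is a hypothetical list; adjust it to your own pages):

<?php
// Hypothetical whitelist of the only query parameters page1.php uses.
$allowed = array('sortby', 'page');
foreach (array_keys($_GET) as $param) {
    if (!in_array($param, $allowed)) {
        // Unknown parameter (e.g. ?abc=xyz): serve a real 404 and stop.
        header("HTTP/1.0 404 Not Found");
        exit();
    }
}
?>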

g1smd
msg:734763, 7:03 pm on Sep 27, 2005 (gmt 0)

if (!isset($_GET['yourvariable']) || !isset($_GET['anothervariable']) || !isset($_GET['somethingelse'])) { header("HTTP/1.0 404 Not Found"); exit(); }

What happens to a user-agent that requests a page that does not exist, using HTTP 1.1, and is sent the message "HTTP/1.0 404 Not Found" in return?

jd01
msg:734764, 10:27 pm on Sep 27, 2005 (gmt 0)

It will still register a proper 404 -- if there is a concern, you can use the protocol of the request in the response. Something like:

if(!isset($result) OR empty($result)) { header($_SERVER['SERVER_PROTOCOL']." 404 Not Found"); exit(); }

Justin

Edited: Wording

g1smd
msg:734765, 2:10 pm on Sep 29, 2005 (gmt 0)

Hmm, another peek into the minutiae of the inner workings:

A site online for 2 years with just 5 pages of information had all the files deleted in late July or early August, and sat with nothing on the server (except a standard 404 error message) for the next 6 weeks or so. I must also mention that the old site navigation pointed to another 4 or 5 files that never actually appeared online: those links were 404 for the whole two years that the old site was online.

Google had listed the published pages normally for all the time that the site had been online. These pages were deleted in late July or early August.

By mid-September the listed pages were all shown as Supplemental results. Additionally, there was no longer a link to a cached copy, and the green text no longer included the file size (the "Similar pages" link was still present, though).

In mid-September a new site was put on the server. It doesn't use the same file names as the deleted files had used, but several of the file names are the same as files that were mentioned in the old site's navigation but had never actually gone online.

After about a week, the Supplemental pages representing the stuff deleted in July/August have now regained a "cached page" link pointing to a cache copy from July 1st, and the file size for those deleted pages is once again being reported in the green text.

As we have no intention of using those filenames, I have taken the opportunity to see what happens to those old links by doing three experiments. One is to upload a file in place of an old file, the new file having a meta noindex tag (I assume this page will quickly drop from the list). Another is to upload a file that contains a link to the new index page, but without a meta noindex tag (I assume that this will get indexed with its new minimalist content). Finally, for another file I have uploaded nothing, and will let that URL continue to deliver a 404 as it has already done for several months (I assume that this will eventually drop out, but may take a long time).

I find it interesting that when we reactivated the old domain with new files, the Supplemental results for the deleted files, results that looked like they were fading out, suddenly regained a cache from 11 weeks ago, and started reporting the size of the old files too...

stargeek
msg:734766, 12:24 pm on Sep 30, 2005 (gmt 0)

I have a page indexed, and in the Supplemental results, that has never existed on the site, in a directory that has never existed; it is currently delivering a custom 404, and it is cached and indexed.

g1smd
msg:734767, 6:21 pm on Oct 2, 2005 (gmt 0)

There is a page that I looked at earlier in the year that no longer exists on a particular website. I wanted to look at that page again.

In the Google search results last week the page was shown as a URL-only Supplemental entry, but no cache was available. The page isn't found at archive.org either. What to do?

Based on my post above, where activity on a domain seemed to lead to the re-introduction of cached pages that were about to be "retired", I wondered if merely having new links pointing to the Supplemental page might make anything happen at all.

So I tried it. A link from a PR6 page, and one from a PR7.

After only 4 or 5 days the URL is still Supplemental, and still shown as URL-only, but the cache is now back and is dated June 2005.

In the SERPs it says that the page is over 800k in size... and it is, and all of it is there, cached by Google (and the single-word search term that is used to find the page is 420k down the file!).

Google caches more than 100k sometimes. Google can index more than the first 100k sometimes. Google never throws old data away; they can bring back whatever they want, whenever they want to do so.

beren
msg:734768, 11:57 pm on Oct 2, 2005 (gmt 0)

I use custom 404s too.

Is there any danger that they will be interpreted as doorway pages, given that they link to other pages on the site but no other page links to the custom 404 page?
