This 37 message thread spans 2 pages.
Phantom Pages Indexed
Does a Duplicate Content Penalty Loom?
My site has about 1350 pages. Using a site:www.mysite.com search Google shows 4450.
That's been happening for the last 6 months or so, but not consistently. Occasionally Google will show a number that is reasonably close to accurate. (I'm assuming that's probably an issue of which data center the search is drawn from)
Now, yesterday, prompted by a comment in another thread, I did a site:mysite.com search and it came up with 4515 pages.
Also I've noticed a growing number of URL only listings when I do the site: type searches. Again, the number is not consistent.
When Google returns the 4400+ number of pages, the number of URL only listings is also much larger.
Initially the URL only listings showed up for only bottom level, low traffic pages. Recently the URL only listings began to include higher traffic, higher level pages.
A year or so ago I deleted about 1,100 pages, and made the folders they had been in noindex, nofollow. I needed to keep the folders because of other content (images) that still are in those folders.
I also deleted a couple hundred other pages and replaced them with pages that redirected (meta refreshed actually) to my home page. I've since deleted those couple hundred pages.
Does anyone have an idea of why Google is showing so many phantom pages?
Am I likely headed for, or currently suffering, a duplicate content penalty?
If so, how do I avoid or recover from it?
Possible solutions I'm considering
Creating a 301 redirect from mysite.com to www.mysite.com.
[I don't know much about 301 redirects so I searched around WW and found this thread
An Introduction to Redirecting URLs on an Apache Server [webmasterworld.com]
by DaveAtIFG. I'm hoping I can understand enough of that to work my way through doing a 301 redirect if that's the best course of action.]
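For anyone following along, here is a minimal .htaccess sketch of that non-www to www 301 (example.com is a placeholder, not the poster's domain - adjust to your own host and test before relying on it):

```apache
# Redirect example.com/anything to www.example.com/anything with a permanent 301
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```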
Because the issue seems to be intermittent, I don't know if I actually have to do anything?
I'm hoping for some guidance, suggestions, or ideas on what to do if anything.
I would have three questions right off the bat:
Do you see any of those url-only pages having a cached page in Google and if so what is the date on the cached page (usually a very old date)?
Is there a difference in pages returned between using the site:domain.com with and without the www. before the domain name (ex. site:www.domain.com and site:domain.com)?
Last - Have you verified that the urls shown in the site: command are all really listed as on your domain, or do some show as other sites that link to you with some sort of redirect, or possibly even a copy of your pages?
Actually - 4th question - do any of the pages have "supplemental result" appended to the listing in the site:command?
Not an expert but this sounds similar to what happened to me, the complicating factor being I do have some 'content' that can be found elsewhere.
I would certainly do the 301 redirect from ursite.com to www.ursite.com.
If your site is database driven/dynamic:-
I would check very carefully what happens when you delete a page, an item in the database is deleted, or you renumber the categories. Is Google coming back for pages that don't exist now, and what response does it get?
Check that pages with variables in the url are not being duplicated by the variable changing position in your code or being omitted.
e.g. page.php?cat=10&page=1 could be the same as page.php?cat=10 or page.php?page=1&cat=10
If you have variables like the above it would be best to rewrite these urls using mod_rewrite if you need to, and use 301 redirects on the old versions.
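To make the duplicate-variable problem above concrete, here is a rough Python sketch (the helper name is mine, not from the thread) that normalizes query strings so reordered variants compare equal - handy for auditing your own logs for accidental duplicates:

```python
# Hypothetical helper: treat page.php?cat=10&page=1 and page.php?page=1&cat=10
# as the same page by sorting the query parameters alphabetically.
from urllib.parse import urlsplit, parse_qsl, urlencode

def canonical_url(url: str) -> str:
    """Return the URL path with its query parameters in sorted order."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return f"{parts.path}?{query}" if query else parts.path

# Both orderings collapse to one canonical form:
# canonical_url("page.php?cat=10&page=1") == canonical_url("page.php?page=1&cat=10")
```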
I actually asked about this, but it was never posted. I wanted to know if the deleted pages
(via [services.google.com:8882...] ) are really deleted or just hidden from users. I deleted almost every page via the above just to make sure months ago and changed the path.
I "have" over 12000 pages according to Google. Only 1500 or so are valid in that site, though. Does Google keep the numbers to inflate its index (hey, we have 8 billion), or...?
domain.com has been 301d to www.domain for 4-5 months and no links show under domain.com
Thanks for the bump Brett.
Since I originally wrote this post I have created the 301 redirect with the help of JdMorgan and others in the Apache Forum.
From other reading and suggestions here I may also need to make a change in my internal url structure.
I constructed the site using relative links. I've been advised that I probably should change at least all the internal links to my main pages to absolute links.
|Do you see any of those url-only pages having a cached page in Google and if so what is the date on the cached page (usually a very old date)? |
They have no cached page showing.
|Is there a difference in pages returned between using the site:domain.com with and without the www. before the domain name (ex. site:www.domain.com and site:domain.com)? |
There are more pages (about 100, but it varies a bit from time to time) listed in the non-www. search.
|do some show as other sites that link to you with some sort of redirect or possibly even a copy of your pages? |
There are a handful of off-site redirecting to my site.
|do any of the pages have "supplemental result" |
|If your site is database driven/dynamic:- |
My site is 100% static pages.
Thanks for your comments and questions everyone.
I definitely agree with the change to the linking structure (I have done the same thing and it definitely helped), as well as the 301s you've put up. It will be interesting to see how long it takes to change things, as I've seen it take anywhere from 3 days to 6 weeks recently. I believe a lot of it depends on how much you get botted every day.
One other thing to check in your server config just to make sure what you are doing with the 301 will help:
check and see if your server config has
the servername as www.domain.com or domain.com
One other check is to make sure that any directory paths you have set up (that you are changing to absolute paths) have the trailing slash on the directory (ex. www.domain.com/subdir/ instead of www.domain.com/subdir). The reason for that is to prevent an additional 301 being performed by the server, and possibly Google interpreting them as two different directories, which has been reported lately by some.
Well, I've had the 301 up for about a week now.
Just did site: and site:www. searches. All pages on both searches are now showing the site as www.
But both searches are still showing the phantom pages, although the number is now the same on both searches, 4530 pages.
I'm working on changing the internal link style from relative to absolute. That's turning out to be harder than I thought it would be.
>> Does anyone have an idea of why Google is showing so many phantom pages?
Googlebot has gotten "better" at identifying pages. Also, the index seems to contain pages that are long gone, as well as "pages" that never actually have been pages. Both of these things make G's index look larger than M's and Y!'s. Other than that, it seems potentially harmful to webmasters, as G can't always tell the difference between real pages and the rest.
Also - the total number of pages for any search is just an estimate, and this estimate is sometimes totally unreliable.
>> Am I likely headed for, or currently suffering, a duplicate content penalty?
I hate to cry wolf, but I'd say "likely", as in "likely headed for". If it's just G's estimate that's way off, then no problem - if G confuses your pages, then there's potentially a problem. I don't know if you are suffering - how's your traffic and rankings?
>> If so, how do I avoid or recover from it?
Go over the pages you can find in the Google index. Make notes of the ones that seem odd to you in any way and inspect those pages manually. Use a server header checker, and make sure they return a "200 OK" code (or perhaps "304 Not Modified"). If Google has pages listed as belonging to your site, but they are not there, request a removal.
Then, go over your site with a fine tooth comb. Even try to break it by entering URLs that you know aren't there, and/or URLs that look a lot like those that are there (e.g. ".htm" instead of ".html", or "index" instead of "index.php", or "filename.htm/foldername/filename.htm"). Whenever you manage to get a page shown on a URL that shouldn't show anything, close that hole.
Even though all this might seem extremely complicated and a lot of work, there's a chance that Google just can't calculate the number of pages correctly and you're all right anyway.
First, thanks for your post and please excuse my delay in responding.
I've now completely converted all my internal page links from relative to absolute, what a nightmare! I didn't convert my image links though. I'm not sure if that's needed.
Total Pages Shown
Both site:www.domain... and site:domain.com... searches are still showing 4540 pages, all showing the www.domain... urls. Looking at the first thousand listings, about 20% are url only, in both search versions.
I don't know what this means, but when I add a specific term to the search ( +"term") the results show a page count that is pretty close to accurate, ...
with NO url only listings shown.
I do have to do more than one search like that to cover all my site because I'm using +terms that are exclusive to various sections of the site. But add the numbers from the various "section related searches" together and they add up to close to the number of pages I actually have.
Server Header Checker
OK, here's where my knowledge level gets exceeded. :) I've never done this. Is this something I can do online? I'm off to see if I can figure this out.
OK, I found a server header checker, that was easy enough.
I'm running some of the url only listings through it now.
So I've now run all the url only listings I could find through a handy dandy server header checker.
They all came back as "Status: HTTP/1.1 200 OK"
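For anyone who'd rather script this check than hunt for an online tool, here is a small standard-library sketch (a plain HEAD request; the function names are mine). The thread treats "200 OK" and "304 Not Modified" as healthy responses, and 302s as the red flag:

```python
# Rough sketch of a server header checker using only the Python standard library.
import http.client
from urllib.parse import urlsplit

def head_status(url: str) -> int:
    """Issue a HEAD request to the URL and return the HTTP status code."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()

def looks_healthy(status: int) -> bool:
    """200 OK (or 304 Not Modified) is fine; 302s and 410s are worth investigating."""
    return status in (200, 304)
```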
One thing did pop out at me in the process though. So I dug around the site a little. In an earlier post I mention having deleted about 1,100 pages a year or so ago. I also said I made their directories (folders) noindex/nofollow.
That was wrong apparently. I really had simply used my robots.txt file to exclude the SEs from the main folder the pages were in. That means that there are 190 or so other pages remaining in that set of folders that had the urls indexed, but not the contents.
About 70+/- of the url only listings showing in the first 1,000 were some of those pages. That seems to be about 1/3 of all the url only listings in the first 1,000.
Traffic and Rankings
Claus asked about my traffic and rankings. Both are within what seems to be a normal range for my site at this time. I got a fairly big boost in traffic (for me) after Google's late December update and that seems to be holding fairly well at the moment.
re: the internal linking that was mentioned.
relative = /page.html
absolute = [domain.com...]
Relative: ../page.htm or even ../../page.htm (each ../ goes up one folder).
<<Bumping to see if anyone has any input?>>
Unfortunately, yes, I have some input for you...one of my sites just hit 35,000+ pages (3 times the actual count), and the bad news is that the main money pages have been dropping out of the serps one by one for 2 months, now. The REALLY bad news is that other money pages are going gray-bar. These are pages that have ranked highly for years and have 1,000+ back links.
Each week more and more pages go title-less.
I originally thought the problem was occurring because some dork (me) forgot to disallow: /cgi-bin/ and one of my scripts was spitting out all the excess pages. But it doesn't appear that's the case.
And what's funny is that I can never find ANY of these 28,000+ excess pages in a site:domain check...not in Google, Yahoo, ATW...NO WHERE...not one.
The only thing I can come up with is that some of these pages have a tremendous number of scrapers linking to them...(one has 5,900+)... and Google is simply crumbling under the stress of all that garbage it has to wade through.
It's either that or someone has been sticking pins in a jk321 voodoo doll.
[edited by: jk3210 at 5:44 am (utc) on Mar. 8, 2005]
<<1350 / 4400>>
I wonder if there's any significance to the fact that both of these sites (yours and mine) are returning approximately 3 times the actual page count?
During early mornings, my site has 901 pages according to Google. During the evenings this number is 4900+.
In reality my site has something like 1000+ valid pages.
The listings are jampacked with "supplemental results", as they should be, and it's all my own fault. This is the site where I learn, and I have learned lots of lessons.
I have used dynamic subfolders that have resulted in possible duplication (or triplication, or even 7 identical pages with different urls).
I have, during the course of the years, changed the name of the index.php page that is actually used for all pages on the site. I have changed its name 4 or 5 times and all these urls are still valid.
You know, I'm actually happy that at least one of the indexes shows something that resembles an accurate result!
First, i really have to say that i'm pretty annoyed that the webmaster task is slowly getting more and more to the point of constantly having to check, double-check, and edit server settings and pages (as well as monitoring Search Engines for errors).
So, a lot of time has to be spent on tasks that are really irrelevant to the users of the site, only to keep the SE bots happy and stay out of trouble. Also, most webmasters are simply not technically skilled enough to do all this stuff (let alone doing it constantly and consistently).
>> I really had simply used my robots.txt file to exclude the SEs from the main folder
ken_b you nailed it! congrats :) These issues can be hard to identify as you tend to forget those URLs. The spiders don't forget, however.
Now, for everybody else reading this, i'll just emphasize it: Exclusion by robots.txt can lead to URL only listings
If you want to exclude a folder by robots.txt you should do so before the contents of the folder has been spidered. After the contents of the folder has been spidered, adding a robots.txt exclusion will only mean that Googlebot can't access some files that it knows are there. So, they turn "URL only".
Google does not remove URLs that are in the robots.txt file once these URLs are in the index. Googlebot just stops spidering them. If you want them removed, you have to specifically "ask for this".
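To make the distinction concrete, a robots.txt exclusion looks like this ("/old-folder/" is a placeholder path). It only stops future crawling; it does not remove URLs that are already in the index:

```
User-agent: *
Disallow: /old-folder/
```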
So, you have two options:
Try the Google URL removal tool [google.com] first - there's a few different methods to choose from, and in my experience this is quick and it works. If you already have that folder in your robots.txt you can actually use this file to get the URLs removed.
I should add that I have only used it for individual URLs, not whole folders, but I have no reason to doubt that it also works when removing a whole folder. The tool does not allow for removal of a lot of individual URLs if you use the robots.txt file. There's a limit to the size of the robots.txt when using the removal tool.
Put this meta code in the <head> section of each individual file:
<meta name="robots" content="noindex,follow">
You can of course just make a blank page with this tag as well (and a "nofollow" tag if there are no links on it)
Put the meta code on those pages, and remove the "robots.txt" exclusion of the folder. Then wait for the pages to get spidered - you might want to make a nice page for Googlebot with nothing but links to those pages and submit that url [google.com] to Google. It's also possible that you can submit that URL using the URL removal tool.
>> some of these pages have a tremendous number of scrapers linking to them...(one has 5,900+)...
If this is really the reason (and it does sound like a lot of duplicate content to me) then this is another case where Google mixes up the real page with "derivatives". This is a really bad thing for webmasters as there is usually no way we can track all these scrapers continuously. Plus: we should really not have to care about that - SEs should just fix this!
If it was only a few you could file a DMCA complaint [google.com]. This takes time [webmasterworld.com] and it's a lot of work for so many pages.
>> I have changed its name 4 or 5 times and all these urls are still valid.
Nikke, that's the danger of a dynamic site ;) As a webmaster, you should really not have to worry about this, but sadly you have to anyway, as it's exactly this kind of stuff that confuses the bots.
When I search for my domain name (not a "site:" search, just my domain name) the results look like a directory of scraper sites. I've not worried about that in the past because I thought that as long as I didn't link back to those sites they couldn't hurt me. I hope I'm not wrong about that. And a lot of those pages are just unlinked text used for the keyword density I guess, like bare text copies of a Google serps page.
This is todays task. I think I'll add this to all the pages that still exist in this section.
Noindex/nofollow as Related to Deleted Pages
This is where I could have another part of the issue hidden. I've mentioned earlier having deleted about 1,100 pages from the site.
Those pages are gone from the site, and I haven't seen them in the serps or showing up as 404s in my logs, but it sounds like they could be lingering around in the G database somewhere and showing up in the phantom page count.
Unfortunately I no longer have a list of the old urls laying around.
Thanks for your help and comments folks.
I had the same problem with Google retaining pages after I had deleted them. I'd been adding the now empty directory to robots.txt after deleting the pages (duh!), meaning Google would ignore the directory and so be unable to find out that the pages had indeed gone.
So I removed the directory from robots.txt and added
RewriteRule ^.*$ - [G]
to an htaccess file in the directory, which will return a '410 Gone' response. As a result the pages were removed from the Google index within a couple of weeks without having to submit them to Google's removal page.
If you want to remove pages from a directory, but keep images in the same directory indexed, you could use
RewriteRule ^.*\.html$ - [G]
Carrot63; Thanks for your post it's much appreciated.
I just checked my logs for yesterday and see that Googlebot visited 1434 of my 1377 pages. [Not sure yet if that means I have a few straggler pages on the server, but as time allows, I'm hunting thru my folders to see.]
This is the first time in a long time that G has crawled so many pages at once. It'll be interesting to see how that affects me.
Most of my pages are "Evergreen", to use a term I picked up from EuropeforVisitors. I haven't made wholesale changes on the site for about a year.
Mostly the lower level pages only change when I insert a new page between a couple of old ones.
Are you able to see the urls of any of your phantom pages in Google or Yahoo?
Nope. In G the page count is so high I kind of assume they're hiding in the great beyond (past the 1,000 page cut off).
Y doesn't seem to count the phantom pages, thankfully.
<<Y doesn't seem to count the phantom pages, thankfully>>
Interesting --Yahoo *does* count mine.
Just to recap for anyone who's interested, here's the sequence of events my problem took:
11/2004- I noticed 3X the correct page count. During investigation, I found 5,000+ scrapers linking to index.html, with a few using tracker2.php.
-During Allegra the site lost rankings for 100s of peripheral terms and traffic began to drop.
2/24- Certain main pages (not the index page) went gray-bar.
2/28- Google stopped spidering.
3/1- Lost number 1 rank for all but one previously held terms.
3/4- Index page (previously #1 ranked for a $$$ term) dropped to 200+, AND its listed url went from "www.domain.com" to "domain.com"
3/7- A search for "www.domain.com" intermittently yields a "No information available" result.
3/8- Ordered two cases of Dewars.
That's what I also saw before my site went down because of the googlejacking. I had 2300 real pages listed, then Google said 4000 pages, then sometime later it went down every week; now 250 pages are indexed.
The most interesting observation I've seen on Google: we had a template called detalsselect.cfm
with a productID attached it would look like detalsselect.cfm?100
on one of the catalogue listings we mistyped the CASE on the template
so now it was coded as DetalsSelect.cfm?100
Both pages got indexed by Google. It is the same page - duhhh, why keep 2 copies?
we have 1000 products.
Now we changed the website and, just for instance for this productID, I went and submitted detalsselect.cfm?100 to be deleted from the index.
5 days later I got a response from Google that it was successful.
That was 3 weeks ago.
Today I found a cached copy of the second page, DetalsSelect.cfm?100, in the index.
It is the same file - is it duplicate content? Am I being unreasonable?
we too have the same problem, 4600 pages indexed in reality only 1500 or so.
ALSO, we used to have a different contact phone number for our website and now switched to a 1-800 #.
Today I searched site:www.mysite.com +"old number" - and you would never believe what I found.
A link to my domain name with a snippet and a link to the cached page. Now in the snippet it says the date Mar-26-2004, but when I click on the cached page it shows me a cached page from Mar-16-2005.
Either the amount of pages that Google shows as indexed is bogus, their algorithm is worth zilch and they are inflating the numbers, or I am a citizen of the planet MARS.
(two paragraphs of mumbling and cursing deleted)
This is a wonderful thread and a topic which has really been bugging me for a long time.
Claus ... many thanks! I've asked before about a specific url, but nobody ever mentioned doing a header checker and I didn't know enough to do it.
On the first page of results for site:mysite.com I found a weird page (phantom page which doesn't belong there) which I have reported to Google ... but nothing has been done. It shows like this:
Please note the space after the second /
To do the header checker, I removed the space and got ...
Status: HTTP/1.1 302 Found
Date: Sat, 19 Mar 2005 12:05:00 GMT
Server: Apache/1.3.27 (Unix) (Red-Hat/Linux) PHP/4.3.0
Keep-Alive: timeout=15, max=100
Content-Type: text/html; charset=iso-8859-1
So now what do I do? I have already reported this to Google with no results ... but at that time, I didn't know that page had a 302 redirect. Do I report it again with this new information and hope they do something, or is there something I can do now that's safe and won't kill my site altogether?
<added> Could this explain the sudden drop in traffic?
>> Server header checker
Just for those that haven't tried one of these: If you go to your Control Panel here at WebmasterWorld, you will see a link on the left saying "Server Headers" - that's it.
>> So now what do I do?
Liane, that one's an easy one, fortunately. Well, sort of... (YMMV)
First, the URL in Google with an extra space is not the right URL. Google adds spaces in long URLs in the serps, so that they don't break the display on very small screens. The real URL (the one that Google has indexed) is simply the same one, just without the space.
Second, the real URL exists on your site, at least as a 302 redirect to your front page. I don't know if there is also a physical page - perhaps you have taken it down a while ago and replaced it with the redirect.
So, what you should do is:
- First, remove the 302 redirect from that URL to your front page. It is probably found as a rule in your .htaccess file if you are on an Apache server.
- Second, put up a real html page on that URL. You don't need a lot of content (if any) - all you need is to put this in the file:
<meta name="robots" content="noindex">
- Third, go to the Google URL removal tool [google.com] and request to get that specific page removed
That should take care of it. For that URL, at least.
Preventing future errors
Please, do check your .htaccess file to make sure you don't redirect other URLs to your home page - especially not using 302's. Doing this spells "trouble". The only URL that should redirect to your homepage is the version without www in front of it (or, the one with www if you prefer that your domain does not have www in it - some do) - and only using a 301, not a 302.
(if you have "vanity domains" -- eg. shorter versions, spelling errors and such -- these should of course also redirect to the main domain using a 301)
For all other URLs that you don't know where to point, I suggest redirecting to your site map instead, and only with 301 redirects.
How to identify a 302
This is a 302:
RewriteRule .* [example.com...] [R,L]
This is a 301:
RewriteRule .* [example.com...] [R=301,L]
Note that the difference is only the "=301" part after the "R". Your rules might look a little different. In general, you can safely assume that if there is no "=301" then it is not a 301 redirect. Adding "=301" makes a rule issue a 301, so add or change "=301" in any rule that does not have it.
Claus, you've helped a lot of people. I know you've been reporting on this for a long time now, even back when no one was listening.
And I want to reiterate to folks that this is a GOOGLE-caused problem. No one should have to completely alter every one of their websites because of this. There is nothing wrong with 302s; it's all broken GOOGLE's fault.
--Kenn (and if it wasn't for us webmasters, there would be no Google... Google, the world's biggest 'scraper site', was born off the backs of webmasters. Now they break the backs of webmasters.)
I've had a similar problem with my site for a year now. Google indexes about 70% of my dynamic pages as URL-only, and most of my html pages are listed as supplemental results. I put some redirect code in my apache httpd.conf file redirecting all SEs to www.mysite.com over a year ago, and I changed all of my internal page links to absolute months ago, and it didn't really seem to change anything. My site went from having duplicate pages for www.mysite.com and mysite.com to now having nothing but supplemental pages. If I run a site:www.mysite search I get 3800 pages vs 3500 for a site:mysite.com search. Is that significant? My robots.txt file doesn't disallow anything... should I disallow my cgi-bin or other folders?