Using my testing keywords within the sector I watch, I see 18.104.22.168 displaying results similar to a few other DCs. However, what used to be site #1 isn't there anymore, not only on 22.214.171.124 but on several other DCs too.
Call it minor re-shuffling, maybe ;-)
Earlier you stated...
>> "Supplemental Results cannot harm you if they are for URLs that are 404, or are for URLs that are redirects?"
Well scrap that. Now they can. And very badly. <<
We were seeing googlebot crawl some very unusual URL combinations on our site, things like index.php/links/links/blog/index.php/portfolio?method=something
As you can see, this is a very unusual URL. We crawled our entire website using various tools and could not re-create these kinds of URLs on our site, but when you typed one in directly, it loaded a page with content minus CSS, images, etc. Needless to say, it was a duplicate content penalty screaming "come get me".
In order to combat this, we set up some commands in our index.php file that redirect anyone or anything to our home page and give a 404 header if they try to access a URL that does not exist in our folder structure. It looks like this...
$to = 'firstname.lastname@example.org';
$subject = 'Improper Link';
$message = 'LINK: '.$_SERVER['PHP_SELF'].'?'.$_SERVER['QUERY_STRING']."\r\n";
$message .= 'REFERRER: '.$_SERVER['HTTP_REFERER']."\r\n";
$message .= 'HOST: '.$_SERVER['REMOTE_ADDR']."\r\n";
$message .= 'USER AGENT: '.$_SERVER['HTTP_USER_AGENT']."\r\n";
$headers = 'From: email@example.com' . "\r\n" .
    'Reply-To: firstname.lastname@example.org' . "\r\n" .
    'X-Mailer: PHP/' . phpversion();
// Send the alert email, then return a real 404 status
// (headers must go out before any page output).
mail($to, $subject, $message, $headers);
header("HTTP/1.0 404 Not Found");
header("Status: 404 Not Found");
This will also email us any time someone or something tries to access a url that we have not set up. We thought it was due to some improper formatting of our httpd.conf file until we started checking other websites.
Slashdot is also prone to these types of URLs, as are some other major websites. I believe it is something in Apache that allows it. [slashdot.org...] (the slashdot URLs are pulling in their CSS sheets)
I can only assume that someone exploited this on our server and basically started feeding URL combinations to googlebot that would result in a lot of supplemental/duplicate content issues.
We cannot 301 these URLs as they do not exist, per se, so we had no choice but to 404 them.
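For what it's worth, the Apache feature that makes /index.php/extra/segments still execute index.php is its PATH_INFO handling. If the server is Apache 2, one way to close the door at the server level, rather than in index.php, is the AcceptPathInfo directive; treat this as a sketch and check first whether any legitimate URLs on the site rely on path-info:

```apache
# httpd.conf or .htaccess (Apache 2.0.30+):
# refuse trailing path segments after a real file, so
# /index.php/links/links/blog/... gets a 404 from Apache itself.
AcceptPathInfo Off
```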
Are you now saying that it will cause me more trouble in the long run?
[edited by: classa at 8:25 pm (utc) on Oct. 17, 2006]
I am not completely sure, but what you have already done is mostly a good safeguard against this problem.
The one thing that is not quite correct is the content that you say you show for a 404 error.
Never redirect to the home page, or show an exact duplicate of your home page, as the 404 error page. Doing that can confuse the heck out of Google.
Make the 404 page include some basic site navigation, like the index page has, to get the user on their way to what they were probably looking for; but don't use a direct copy of, or the actual, root index page itself.
Do make sure that the final HTTP status code really is 404. Many sites return 302 or 200 when you see the error page, and that is always a problem.
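A minimal PHP sketch of that advice (error-page.php is a hypothetical template name; the key points are a genuine 404 status sent before any output, and a body that is not the home page):

```php
<?php
// Genuine 404 status, sent before any body output.
header('HTTP/1.0 404 Not Found');

// A simple "not found" page with basic site navigation,
// not a redirect to, or a copy of, the home page.
include 'error-page.php';  // hypothetical error-page template
exit;
```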
We do have a 404 page with basic navigation. I misrepresented what the script is doing.
So, you believe we are doing what we need to do to combat this?
I edited my posting earlier with a sample url from slashdot that shows how it can be done. Check it, then hopefully one of the mods will remove it...
There are a lot of scams out there that exploit that, creating links to phantom pages, and there you are with a duplicate penalty. There are some good tips on how to avoid some of those tricks at
What about using a 410 status code for these phantom pages?
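Mechanically it is just a different status line. A sketch (phantom_status_line is a hypothetical helper name, not anything from this thread):

```php
<?php
// Pick the status line for a phantom URL:
// 410 Gone says the URL is deliberately and permanently dead;
// 404 Not Found only says it could not be found right now.
function phantom_status_line($gone_for_good)
{
    return $gone_for_good ? 'HTTP/1.0 410 Gone'
                          : 'HTTP/1.0 404 Not Found';
}

// For pages that will never come back, send the 410.
header(phantom_status_line(true));
```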
It's jumped back to the results I'm seeing on the others now, Reseller.
Well, I'm seeing a bit of a rollback for some of our pages. I know they have been spidered and were showing a fresh cache date of 1-2 weeks old, but now some are back to February. Traffic is also down 18%.
A site with tens of thousands of pages, and which had almost all of its URLs changed a few years ago, is showing a weird effect.
I feel sure that the entire site was listed at non-www under the old URLs, and continued to be listed at non-www for the new URLs. The site did not have a non-www to www 301 redirect in place, and mainly linked to non-www as the root, or did not specify the domain when linking to root. Most internal links did not specify a domain at all.
A year ago a load of disallow statements were set up in the robots.txt file in order to delist duplicate content of many types. Additionally, all the old URLs that the site used to have, all of which had been 404 for more than a year, were also set to be disallowed in the robots.txt file.
In the last few months, the site has had almost all non-www listings dropped, and the site is now almost all listed at the equivalent www URLs instead.
Last week, all of the hard-coded non-www internal links were changed to point to www (especially hard-coded links to root) instead, and a site-wide 301 redirect was also added from non-www to www too.
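For anyone wanting to do the same, the usual mod_rewrite sketch for a site-wide non-www to www 301 looks like this (example.com is a placeholder domain; assumes mod_rewrite is enabled):

```apache
RewriteEngine On
# Permanently redirect every non-www request to the www hostname.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```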
For URLs disallowed in robots.txt for at least the last year - these are URLs which have all been 404 for several years - these have been shown as URL-only entries for quite a while, but today a number of them now have a TITLE! This is the old page title from about three years ago. None have a snippet. None have a link to cache. Most of the entries that have gained a title also have the "similar pages" link back in place too.
The cached page for the www root did update a few days ago to show the new version, the one that links to www root, but today the cache date has dropped back to a time before the page was changed and the redirect added, and it now shows the root www page as linking to non-www root again.
Where did that ancient title data reappear from, after so long? What triggered its reappearance?
My programmers and I have been watching the datacenter fluctuations for a few weeks now (still getting serps from 126.96.36.199), and we noticed something on one of them that seemed to be showing results for a website that was dropped from the 1st page of serps well over a year ago.
With no real scientific explanation, it seems like Google has been serving ancient serps from an old datacenter that they dusted off.
Sorry that I don't have an explanation, just wanted to mention that we have seen the same weird (ancient) results.
My default Google is showing results from 188.8.131.52. Is this old or new data?
That datacentre is only a few months old (at least, those IP addresses have only been active for a few months), so I would guess that it is new data...
|A year ago a load of disallow statements were set up in the robots.txt file in order to delist duplicate content of many types. Additionally, all the old URLs that the site used to have, all of which had been 404 for more than a year, were also set to be disallowed in the robots.txt file. |
Only a sidestep, and correct me if I'm wrong: if you disallow a page in robots.txt, the spider will not visit it and, as a consequence, will KEEP it in the index if it is already there due to prior indexation. (But this is a question rather than a statement.)
I found the best way to remove a page from SE indices is to leave it on the web with a "noindex" metatag, plus a simple static link to direct any potential human visitor to the appropriate page.
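One caveat with that approach: the robot can only see the noindex tag if the URL is NOT disallowed in robots.txt, since a disallowed URL is never fetched at all. The tag itself is just:

```html
<!-- The page must remain fetchable (not disallowed in robots.txt)
     for the spider to see this and drop the URL from the index. -->
<meta name="robots" content="noindex, follow">
```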
This was all done long before anyone really understood what supplemental results and URL-only listings were actually all about, and it hasn't been changed since.
g1smd, I experienced a similar situation, with 301s from non-www to www, https to http, and the php extension to html (all put into place June 2006). The 301s took hold and were showing exclusively from July to the end of September. As of the beginning of this month, most SERPs are php, and include non-www and also https. Also, php, non-www, and https URLs that had not been crawled by googlebot for months are again being crawled and following the 301s. Seems like my SERPs were rolled back to pre-June SERPs.
Anyone else seeing drastic changes for the better on 64.233.183?
All the other DCs have a good chunk of my pages in supplemental. This one has just pulled them all out of supplemental. I hope this spreads to the other DCs, and fast!
184.108.40.206 is looking real nice today.
Take a look at 220.127.116.11 too - notice the lack of Supplemental Results.
For the first time EVER, Google has every single page of my site indexed exactly as I'd choose to index it myself, with not a SINGLE supplemental result.
Now if they'd only roll it out globally over the weekend.
Either way, I think something is brewing.
Those data centres are an absolute nightmare.
They may have taken pages out of the supplemental index, but a number of dead pages on some of our sites that should be deleted or supplemental are showing in it.
I, on the other hand, hope they don't roll this data out - it's a mess.
Sorry RichTC, but for myself, in my tiny niche, 18.104.22.168 is SWEET!
Roll it out NOW G, please!
Yes, we are noticing major flux for all of our watched keywords right now. Started perhaps 12 hours ago, weren't sure then if it was just a minor change or a fluke, but it is appearing more and more that there is a major change in progress here... Some good, some bad... But since it is in the middle of the flux, nothing is for sure until the dust settles, so not sure yet if we would consider this a positive or a negative. Past experience has been that big moves happen over the course of a weekend (This is Friday, so this matches up), and that it can take up to 48 hours for it to all be done...
Gotta agree with Oaktown - 22.214.171.124 is SWEET - keep it up Goog
edit: They're ALL rolling now! YeeeeHaaaawww as Howard Dean once put it
Looks like the fairly widespread phenomenon of ineptly losing some pages for specific searches has returned, with the pages appearing as normal when &filter=0 is added as well as for other searches with different keyword combos.
I see that, as well as 7 different results for site:domain.com depending on which datacentre you hit (gfe-au and eh are best, kr is the worst) and the total page count going up for &filter=0 results on some.
The terms "Data Refresh" and "Data Push" have been mentioned a few times in this thread. But do you know exactly:
what's the difference between a Data Refresh and a Data Push?
Of course not, I guess :-)
Well, our good friend Matt Cutts has the answer [mattcutts.com] :
....., they’re pretty close. A data push is a superset of a data refresh (that is, a data refresh is an instance of a data push). Typically a data refresh is something with a well-established history, e.g. we have automatic tests in place and the data to be sent out is sent out automatically (assuming that all the automatic sanity checks pass). A data push may tend to be less automated and thus may undergo more evaluation. But the terms are often used interchangeably by Googlers, because we know what we mean, so you’re pressing for a pretty fine connotation.
Thanks Matt. Much appreciated!
Way To Go Reseller! What a find!
Another useless re-post of a useless post. He told you nothing. Without telling what the data sets consist of, you learned nothing, with no clues to anything.
Reseller, what data is it? Is it true PR? Is it a list of spam sites to be removed or penalized? Is it a list of authority domains, or a list of preferred lunch items for the food bar? You do not know, he did not say, and without this information there is nothing.
Back to watching,
I'm seeing single PR0 pages with no links, pages OR content ranking high on very competitive phrases.
Something is not right... that's for sure!
But what about a data pop?
Sorry just had to ask.
System: The following message was spliced on to this thread from: http://www.webmasterworld.com/google/3135191.htm [webmasterworld.com] by tedster - 11:51 pm on Oct. 25, 2006 (EDT -4)
Looking at the top of each Class C
I see movement on:
The results on these datacenters:
and on these:
are similar (for my terms, at least).
They are also distinctly different from these:
Do you see anything interesting on this set of DCs?
Try your testing keywords/ key phrases and tell us what you see.
Thanks a bunch!
jwc would like at least one of those ip addys.