Datacenter Watch: 2006-10-02

Forum Moderators: Robert Charlton & goodroi

andrewshim

10:21 am on Oct 2, 2006 (gmt 0)

< continued from: [webmasterworld.com...] >

Does anyone know if 216.239.51.104 is being used as a test server?

I seem to see my site at #9 across ALL datacenters except for this one, which has me fluctuating between #1 and #4.

[edited by: tedster at 11:16 pm (utc) on Oct. 3, 2006]

reseller

7:36 pm on Oct 17, 2006 (gmt 0)

outland88

Using my testing keywords within the sector I watch, I see 66.102.11.107 display results which are similar to a few other DCs. However, what used to be site #1 isn't there anymore, not only on 66.102.11.107 but on several other DCs too.

Call it minor re-shuffling, maybe ;-)

classa

8:18 pm on Oct 17, 2006 (gmt 0)

g1smd,

Earlier you stated...

>> "Supplemental Results cannot harm you if they are for URLs that are 404, or are for URL that are redirects?"
Well scrap that. Now they can. And very badly. <<

We were seeing googlebot crawl some very unusual url combinations on our site. We were seeing things like index.php/links/links/blog/index.php/portfolio?method=something

As you can see, this is a very unusual url. We crawled our entire website using various tools and could not re-create these kinds of urls on our site, but when you typed it in directly, it loaded a page with content minus css, images, etc. Needless to say, it was a duplicate content penalty screaming "come get me".

In order to combat this, we set up some commands in our index.php file that would re-direct anyone or anything to our home page and give a 404 header if they tried to access a URL that does not exist via our folder structure and it looks like this...

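<?php
// /index.php splits into ['', 'index.php']; more than 2 parts means
// extra path segments were appended after the script name.
$list = explode('/', $_SERVER['PHP_SELF']);

if (count($list) > 2):

    // Build the redirect target, keeping any query string.
    $location = "http://www.urlremoved.com";
    if (strlen($_SERVER['QUERY_STRING']) > 0):
        $location = $location . "/index.php?" . $_SERVER['QUERY_STRING'];
    endif;

    // Report the bad request by email.
    $to = 'scott@urlremoved.com';
    $subject = 'Improper Link';
    $message = 'LINK: ' . $_SERVER['PHP_SELF'] . '?' . $_SERVER['QUERY_STRING'] . "\r\n";
    $message .= 'REFERRER: ' . ($_SERVER['HTTP_REFERER'] ?? '') . "\r\n";
    $message .= 'HOST: ' . $_SERVER['REMOTE_ADDR'] . "\r\n";
    $message .= 'USER AGENT: ' . ($_SERVER['HTTP_USER_AGENT'] ?? '') . "\r\n";
    $headers = 'From: info@urlremoved.com' . "\r\n" .
        'Reply-To: info@urlremoved.com' . "\r\n" .
        'X-Mailer: PHP/' . phpversion();
    mail($to, $subject, $message, $headers);

    // Caution: PHP answers a Location header with a 302 unless a 201 or
    // 3xx status is already set, so the 404 lines do not survive it.
    header("HTTP/1.0 404 Not Found");
    header("Status: 404 Not Found");
    header("Location: " . $location);
    exit;
endif;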

This will also email us any time someone or something tries to access a url that we have not set up. We thought it was due to some improper formatting of our httpd.conf file until we started checking other websites.

Slashdot is also prone to having these types of URLs, as were some other major websites. I believe it is something in Apache that allows it. [slashdot.org...] (the Slashdot URLs are pulling in their CSS sheets)

I can only assume that someone exploited this in our server and basically started to feed url combinations to googlebot that would result in a lot of supplemental/duplicate content issues.

We cannot 301 these URLs as they do not exist per se, so we had no choice but to 404 them.

Are you now saying that it will cause me more trouble in the long run?

[edited by: classa at 8:25 pm (utc) on Oct. 17, 2006]

g1smd

8:23 pm on Oct 17, 2006 (gmt 0)

I am not completely sure, but what you have already done is mostly a good safe-guard against this problem.

The one thing that is not quite correct is the content that you say you show for a 404 error.

Never redirect to the home page, or show an exact duplicate of your home page, as the 404 error page. Doing that can confuse the heck out of Google.

Give the 404 page some basic site navigation, as on the index page, to get the user on their way to what they were probably looking for, but don't use a direct copy of the root index page, or the root index page itself.

Do make sure that the final HTTP status code really is 404. Many sites use 302 or 200 when you see the error page; and that is always a problem.
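
One rough way to check that last point, as a minimal sketch only: PHP's get_headers() follows redirects by default and returns the header lines of every response in the chain, so the status codes can be read off in order. The helper names status_chain() and is_clean_404() are made up for illustration, and www.example.com is a placeholder.

```php
<?php
// Hypothetical helper: extract every HTTP status code from a list of raw
// header lines, in the order the responses arrived.
function status_chain(array $headerLines): array
{
    $codes = [];
    foreach ($headerLines as $line) {
        // Status lines look like "HTTP/1.1 404 Not Found".
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $codes[] = (int) $m[1];
        }
    }
    return $codes;
}

// Hypothetical helper: true only when the URL answered 404 directly,
// with no redirect hop in front of it.
function is_clean_404(array $headerLines): bool
{
    return status_chain($headerLines) === [404];
}

// Usage (needs network access, so left commented out):
// $lines = get_headers('http://www.example.com/no-such-page');
// var_dump($lines !== false && is_clean_404($lines));
```

A chain such as 302 followed by 200, or 302 followed by 404, fails the check, which matches the cases described above as a problem.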

classa

8:28 pm on Oct 17, 2006 (gmt 0)

g1smd

We do have a 404 page with basic navigation. I misrepresented what the script is doing.

So, you believe we are doing what we need to do to combat this?

I edited my posting earlier with a sample url from slashdot that shows how it can be done. Check it, then hopefully one of the mods will remove it...

toothake

9:39 pm on Oct 17, 2006 (gmt 0)

There are a lot of scams out there that exploit that, creating links to phantom pages, and there you are with a duplicate penalty. There are some good tips on how to avoid some of those tricks at
[webmasterworld.com...]

classa

10:04 pm on Oct 17, 2006 (gmt 0)

What about using a 410 status code for these phantom pages?

outland88

11:11 pm on Oct 17, 2006 (gmt 0)

It's jumped back to the results I'm seeing on the others now, Reseller.

Bewenched

3:49 am on Oct 18, 2006 (gmt 0)

Well, I'm seeing a bit of a rollback for some of our pages. I know they have been spidered and were showing a fresh cache date of 1-2 weeks old, but now some are back to February. Also, traffic is down 18%.

g1smd

10:24 pm on Oct 18, 2006 (gmt 0)

A site with tens of thousands of pages, and which had almost all of the URLs changed a few years ago, has a weird effect.

I feel sure that the entire site was listed at non-www under the old URLs, and continued to be listed at non-www for the new URLs. The site did not have a non-www to www 301 redirect in place, and mainly linked to non-www as the root, or did not specify the domain when linking to root. Most internal links did not specify a domain at all.

A year ago a load of disallow statements were set up in the robots.txt file in order to delist duplicate content of many types. Additionally, all the old URLs that the site used to have, all of which had been 404 for more than a year, were also set to be disallowed in the robots.txt file.

In the last few months, the site has had almost all non-www listings dropped, and the site is now almost all listed at the equivalent www URLs instead.

Last week, all of the hard-coded non-www internal links were changed to point to www (especially hard-coded links to root) instead, and a site-wide 301 redirect was also added from non-www to www too.
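
For anyone wanting to picture the site-wide non-www to www 301 mentioned here, this is a minimal sketch of the idea in PHP, not the setup actually used on that site. The function name canonical_target() and the host www.example.com are assumptions for illustration.

```php
<?php
// Hypothetical helper: returns the www URL to 301 to, or null when the
// request already arrived on the canonical host.
function canonical_target(
    string $host,
    string $requestUri,
    string $canonicalHost = 'www.example.com'
): ?string {
    // Host names are case-insensitive, so compare without regard to case.
    if (strcasecmp($host, $canonicalHost) === 0) {
        return null;
    }
    return 'http://' . $canonicalHost . $requestUri;
}

// Usage at the top of a front controller:
// $target = canonical_target($_SERVER['HTTP_HOST'], $_SERVER['REQUEST_URI']);
// if ($target !== null) {
//     header('Location: ' . $target, true, 301); // third argument sets the status
//     exit;
// }
```

Passing 301 as the third argument to header() matters, because a bare Location header would otherwise go out as a 302.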

For URLs disallowed in robots.txt for at least the last year - these are URLs which have all been 404 for several years - these have been shown as URL-only entries for quite a while, but today a number of them now have a TITLE! This is the old page title from about three years ago. None have a snippet. None have a link to cache. Most of the entries that have gained a title also have the "similar pages" link back in place too.

The cached page for the www root did update a few days ago to show the new version, the one that links to www root, but today the cache date has dropped back to a time before the page was changed and the redirect added, and it now shows the root www page as linking to non-www root again.

Where did that ancient title data reappear from, after so long? What triggered its reappearance?

classa

2:16 pm on Oct 19, 2006 (gmt 0)

g1smd,

My programmers and I have been watching the datacenter fluctuations for a few weeks now (still getting serps from 216.239.51.104), and we noticed something on one of them that seemed to be showing results for a website that was dropped from the 1st page of serps well over a year ago.

With no real scientific explanation, it seems like Google has been serving ancient serps from an old datacenter that they dusted off.

Sorry that I don't have an explanation, just wanted to mention that we have seen the same weird (ancient) results.

b2net

7:25 pm on Oct 19, 2006 (gmt 0)

My default Google is showing results from 209.85.135.104. Is this old or new data?

g1smd

7:34 pm on Oct 19, 2006 (gmt 0)

That datacentre is only a few months old, or we think that it is only a few months old (those IP addresses have only been active for a few months), so I would guess that it is new data...

Oliver Henniges

12:02 pm on Oct 20, 2006 (gmt 0)

>> A year ago a load of disallow statements were set up in the robots.txt file in order to delist duplicate content of many types. Additionally, all the old URLs that the site used to have, all of which had been 404 for more than a year, were also set to be disallowed in the robots.txt file. <<

Only a sidestep, and correct me if I'm wrong: if you disallow a page in robots.txt, the spider will not visit it and, as a consequence, will KEEP it in the index if it is already there from prior indexing. (But this is more a question than a statement.)

I found the best way to remove a page from SE indices is to leave it on the web with a "noindex" metatag plus a simple static link to redirect any potential human visitor to the appropriate page.
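
As a rough sketch of what such a placeholder page could look like when generated from PHP: the function name retired_page_html(), the page title and the example URL are all made up for illustration.

```php
<?php
// Hypothetical helper: build a minimal placeholder page for a retired URL.
// The robots meta tag asks search engines not to index the page; the plain
// link sends human visitors on to the replacement page.
function retired_page_html(string $newUrl, string $label): string
{
    // Escape both values so they are safe inside HTML.
    $href = htmlspecialchars($newUrl, ENT_QUOTES, 'UTF-8');
    $text = htmlspecialchars($label, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
        . "<html>\n<head>\n"
        . "<meta name=\"robots\" content=\"noindex\">\n"
        . "<title>Page moved</title>\n"
        . "</head>\n<body>\n"
        . "<p>This page has moved: <a href=\"{$href}\">{$text}</a></p>\n"
        . "</body>\n</html>\n";
}

// Usage:
// echo retired_page_html('http://www.example.com/new-page.html', 'New page');
```

The page has to stay crawlable for the noindex tag to be seen, so it should not also be disallowed in robots.txt.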

g1smd

12:43 pm on Oct 20, 2006 (gmt 0)

This was all done long before anyone really understood what supplemental results and URL-only listings were actually all about, and it hasn't been changed since.

jobonet

1:11 pm on Oct 20, 2006 (gmt 0)

g1smd, I experienced a similar situation, with 301s from non-www to www, https to http, and php extension to html (all put into place June 2006). The 301s took hold and were showing exclusively from July to the end of September. As of the beginning of this month, most SERPs are php, and include non-www and also https. Also, php, non-www, and https URLs that had not been crawled by googlebot for months are again being crawled and following the 301s. Seems like my SERPs were rolled back to pre-June SERPs.

rden17

4:32 pm on Oct 20, 2006 (gmt 0)

Anyone else seeing drastic changes for the better on 64.233.183?

All the other DCs have a good chunk of my pages in supplemental. This one has just pulled them all out of supplemental. I hope this spreads to the other DCs, and fast!

JackR

4:57 pm on Oct 20, 2006 (gmt 0)

64.233.1.183 is looking real nice today.

Take a look at 209.85.135.103 too - notice the lack of Supplemental Results.

For the first time EVER, Google has every single page of my site indexed exactly as I'd choose to index it myself, with not a SINGLE supplemental result.

Now if they'd only roll it out globally over the weekend.

Either way, I think something is brewing.

RichTC

5:03 pm on Oct 20, 2006 (gmt 0)

Those data centres are an absolute nightmare.

They may have taken pages out of the supplemental index, but a number of dead pages on some of our sites that should be deleted or supplemental are showing in it.

I, on the other hand, hope they don't roll this data out. It's a mess.

oaktown

5:41 pm on Oct 20, 2006 (gmt 0)

Sorry RichTC, but for myself, in my tiny niche, 209.85.135.103 is SWEET!

Roll it out NOW G, please!

helpnow

6:02 pm on Oct 20, 2006 (gmt 0)

Yes, we are noticing major flux for all of our watched keywords right now. Started perhaps 12 hours ago, weren't sure then if it was just a minor change or a fluke, but it is appearing more and more that there is a major change in progress here... Some good, some bad... But since it is in the middle of the flux, nothing is for sure until the dust settles, so not sure yet if we would consider this a positive or a negative. Past experience has been that big moves happen over the course of a weekend (This is Friday, so this matches up), and that it can take up to 48 hours for it to all be done...

rden17

6:35 pm on Oct 20, 2006 (gmt 0)

Gotta agree with Oaktown - 209.85.135.103 is SWEET - keep it up Goog

edit: They're ALL rolling now! YeeeeHaaaawww as Howard Dean once put it

steveb

10:17 pm on Oct 20, 2006 (gmt 0)

Looks like the fairly widespread phenomenon of ineptly losing some pages for specific searches has returned, with the pages appearing as normal when &filter=0 is added as well as for other searches with different keyword combos.

g1smd

10:32 pm on Oct 20, 2006 (gmt 0)

I see that, as well as 7 different results for site:domain.com depending on which datacentre you hit (gfe-au and eh are best, kr is the worst) and the total page count going up for &filter=0 results on some.
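
For anyone comparing datacentres by hand, a small helper along these lines can build the URLs to check. This is only a sketch: the function name datacenter_urls() is made up, the /search?q= path is an assumption, and example.com is a placeholder.

```php
<?php
// Hypothetical helper: build one search URL per datacenter IP so the same
// query can be compared side by side. The /search?q= path is an assumption.
function datacenter_urls(array $ips, string $query, bool $unfiltered = false): array
{
    $urls = [];
    foreach ($ips as $ip) {
        $url = 'http://' . $ip . '/search?q=' . urlencode($query);
        if ($unfiltered) {
            // Append the filter=0 parameter discussed above.
            $url .= '&filter=0';
        }
        $urls[$ip] = $url;
    }
    return $urls;
}

// Usage:
// print_r(datacenter_urls(['64.233.183.104', '209.85.135.103'], 'site:example.com', true));
```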

reseller

5:34 am on Oct 23, 2006 (gmt 0)

Hi Folks

The terms "Data Refresh" and "Data Push" have been mentioned a few times on this thread. But do you know exactly:

What's the difference between Data Refresh and Data Push?

Of course not, I guess :-)

Well, our good friend Matt Cutts has the answer [mattcutts.com] :

....., they're pretty close. A data push is a superset of a data refresh (that is, a data refresh is an instance of a data push). Typically a data refresh is something with a well-established history, e.g. we have automatic tests in place and the data to be sent out is sent out automatically (assuming that all the automatic sanity checks pass). A data push may tend to be less automated and thus may undergo more evaluation. But the terms are often used interchangeably by Googlers, because we know what we mean, so you're pressing for a pretty fine connotation.

Thanks Matt. Much appreciated!

WW_Watcher

2:04 pm on Oct 23, 2006 (gmt 0)

Way To Go Reseller! What a find!

Another useless re-post, of a useless post. He told you nothing. Without telling what the data sets consist of, you learned nothing, with no clues to anything.

Reseller, what data is it? Is it true PR, is it a list of spam sites to be removed or penalized, is it a list of authority domains, is it a list of preferred lunch items for the food bar? You do not know, he did not say, and without this information there is nothing.

Back to watching,
WW_Watcher

petehall

2:55 pm on Oct 23, 2006 (gmt 0)

I'm seeing single PR0 pages with no links, pages OR content ranking high on very competitive phrases.

Something is not right... that's for sure!

theBear

3:54 pm on Oct 23, 2006 (gmt 0)

But what about a data pop?

Sorry just had to ask.

SEOcritique

3:13 am on Oct 26, 2006 (gmt 0)

System: The following message was spliced on to this thread from: http://www.webmasterworld.com/google/3135191.htm [webmasterworld.com] by tedster - 11:51 pm on Oct. 25, 2006 (EDT -4)

Looking at the top of each Class C
I see movement on:

66.249.85.19
72.14.215.19
72.14.221.18
209.85.135.17

lfgoal

1:57 pm on Oct 26, 2006 (gmt 0)

The results on these datacenters:

[66.249.85.19...]
[72.14.215.19...]
[72.14.221.18...]
[209.85.135.17...]

and on these:

[64.233.183.99...]
[64.233.183.104...]
[64.233.183.99...]
[64.233.183.104...]

are similar (for my terms, at least).

They are also distinctly different from these:

[216.239.53.99...]
[216.239.53.104...]
[66.102.7.99...]
[66.102.7.104...]

reseller

10:00 pm on Nov 4, 2006 (gmt 0)

Hi Folks

Do you see anything interesting on this set of DCs?

[64.233.171.99...]
[64.233.171.104...]
[64.233.171.107...]

Try your testing keywords/ key phrases and tell us what you see.

Thanks a bunch!
