Forum Moderators: Robert Charlton & goodroi
Does anyone know if 216.239.51.104 is being used as a test server?
I seem to see my site at #9 across ALL datacenters except for this one, which has me fluctuating between #1 and #4.
[edited by: tedster at 11:16 pm (utc) on Oct. 3, 2006]
Earlier you stated...
>> "Supplemental Results cannot harm you if they are for URLs that are 404, or are for URL that are redirects?"
Well scrap that. Now they can. And very badly. <<
We were seeing googlebot crawl some very unusual url combinations on our site. We were seeing things like index.php/links/links/blog/index.php/portfolio?method=something
As you can see, this is a very unusual url. We crawled our entire website using various tools and could not re-create these kinds of urls on our site, but when you typed it in directly, it loaded a page with content minus css, images, etc. Needless to say, it was a duplicate content penalty screaming "come get me".
In order to combat this, we set up some commands in our index.php file that would re-direct anyone or anything to our home page and give a 404 header if they tried to access a URL that does not exist via our folder structure and it looks like this...
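<?php
// PHP_SELF is "/index.php" for a normal request, which splits into two parts.
// More than two parts means extra path segments were appended to the URL.
$list = explode('/', $_SERVER['PHP_SELF']);
if (count($list) > 2):
    $location = "http://www.urlremoved.com";
    if (strlen($_SERVER['QUERY_STRING']) > 0):
        $location = $location."/index.php?".$_SERVER['QUERY_STRING'];
    endif;
    // Build the notification email describing the bad request.
    $to = 'scott@urlremoved.com';
    $subject = 'Improper Link';
    $message = 'LINK: '.$_SERVER['PHP_SELF'].'?'.$_SERVER['QUERY_STRING']."\r\n";
    // The referrer and user agent headers are optional, so guard against missing values.
    $message.= 'REFERRER: '.(isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '')."\r\n";
    $message.= 'HOST: '.$_SERVER['REMOTE_ADDR']."\r\n";
    $message.= 'USER AGENT: '.(isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '')."\r\n";
    $headers = 'From: info@urlremoved.com' . "\r\n" .
        'Reply-To: info@urlremoved.com' . "\r\n" .
        'X-Mailer: PHP/' . phpversion();
    // Send the notification (the post describes this email; the call itself was missing).
    mail($to, $subject, $message, $headers);
    header("HTTP/1.0 404 Not Found");
    header("Status: 404 Not Found");
    header("location: ".$location);
endif;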
This will also email us any time someone or something tries to access a url that we have not set up. We thought it was due to some improper formatting of our httpd.conf file until we started checking other websites.
Slashdot is also prone to having these types of urls, as were some other major websites. I believe it is something in apache that will allow it. [slashdot.org...] (slashdot urls are pulling in their css sheets)
I can only assume that someone exploited this in our server and basically started to feed url combinations to googlebot that would result in a lot of supplemental/duplicate content issues.
We cannot 301 these urls as they do not exist per se, so we had no choice but to 404 them.
Are you now saying that it will cause me more trouble in the long run?
[edited by: classa at 8:25 pm (utc) on Oct. 17, 2006]
The one thing that is not quite correct is the content that you say you show for a 404 error.
Never redirect to the home page, or show an exact duplicate of your home page, as the 404 error page. Doing that can confuse the heck out of Google.
Give the 404 page some basic site navigation, like the index page has, to get the user on their way to what they were probably looking for, but don't use a direct copy of the root index page, or the index page itself.
Do make sure that the final HTTP status code really is 404. Many sites use 302 or 200 when you see the error page; and that is always a problem.
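As a rough illustration of the advice above, here is a minimal sketch of an index.php guard that answers with a real 404 and a small navigation page instead of redirecting. It is not the poster's actual fix. The helper names has_extra_path() and send_not_found(), the /index.php script path and the navigation links are all assumptions made for this example, and it needs PHP 7 or later. One detail worth knowing: the PHP manual for header() says that sending a Location header also sets a 302 status unless a 201 or 3xx code has already been set, so a 404 header followed by a Location header is likely to reach the client as a redirect rather than as a 404.

```php
<?php
// Hypothetical helper: true when the script was requested with extra path
// segments appended, e.g. /index.php/links/links/blog.
function has_extra_path(string $phpSelf, string $script = '/index.php'): bool
{
    return $phpSelf !== $script && strpos($phpSelf, $script . '/') === 0;
}

// Sends a 404 status plus a short page with basic navigation.
// No Location header is sent, so the status is not replaced by a redirect.
function send_not_found(): void
{
    http_response_code(404);
    echo "<h1>Page not found</h1>\n";
    // Placeholder links; replace with the site's real navigation.
    echo '<p><a href="/">Home</a> | <a href="/sitemap.php">Site map</a></p>' . "\n";
}

// Only act on real web requests, not when the file is run from the command line.
if (PHP_SAPI !== 'cli' && has_extra_path($_SERVER['PHP_SELF'] ?? '')) {
    send_not_found();
    exit;
}
```

Whatever approach is used, it is worth confirming the final status code with a header checker, since the error page can look the same in a browser whether the server sent 200, 302 or 404.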
We do have a 404 page with basic navigation. I misrepresented what the script is doing.
So, you believe we are doing what we need to do to combat this?
I edited my posting earlier with a sample url from slashdot that shows how it can be done. Check it, then hopefully one of the mods will remove it...
I feel sure that the entire site was listed at non-www under the old URLs, and continued to be listed at non-www for the new URLs. The site did not have a non-www to www 301 redirect in place, and mainly linked to non-www as the root, or did not specify the domain when linking to root. Most internal links did not specify a domain at all.
A year ago a load of disallow statements were set up in the robots.txt file in order to delist duplicate content of many types. Additionally, all the old URLs that the site used to have, all of which had been 404 for more than a year, were also set to be disallowed in the robots.txt file.
In the last few months, the site has had almost all non-www listings dropped, and the site is now almost all listed at the equivalent www URLs instead.
Last week, all of the hard-coded non-www internal links (especially hard-coded links to root) were changed to point to www instead, and a site-wide 301 redirect from non-www to www was also added.
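For readers who have not set one up, a site-wide non-www to www 301 might look roughly like the sketch below when done in PHP. This is not necessarily how the site described here did it (such redirects are often done in the web server configuration instead). The function name canonical_redirect() and the example.com domain are placeholders invented for this example, it assumes plain http and a Host header without a port, and it needs PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the www URL to redirect to, or null when the
// request does not use the bare (non-www) host. example.com is a placeholder.
function canonical_redirect(string $host, string $requestUri, string $bare = 'example.com'): ?string
{
    // Host names are case-insensitive, so compare without regard to case.
    if (strcasecmp($host, $bare) !== 0) {
        return null;
    }
    return 'http://www.' . $bare . $requestUri;
}

// Only act on real web requests, not when the file is run from the command line.
if (PHP_SAPI !== 'cli') {
    $target = canonical_redirect($_SERVER['HTTP_HOST'] ?? '', $_SERVER['REQUEST_URI'] ?? '/');
    if ($target !== null) {
        // The third argument sets the status explicitly, so this goes out as a 301.
        header('Location: ' . $target, true, 301);
        exit;
    }
}
```

Keeping the path and query string in the target means each non-www URL redirects to its exact www equivalent rather than to the root.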
For URLs disallowed in robots.txt for at least the last year - these are URLs which have all been 404 for several years - these have been shown as URL-only entries for quite a while, but today a number of them now have a TITLE! This is the old page title from about three years ago. None have a snippet. None have a link to cache. Most of the entries that have gained a title also have the "similar pages" link back in place too.
The cached page for the www root did update a few days ago to show the new version, the one that links to www root, but today the cache date has dropped back to a time before the page was changed and the redirect added, and it now shows the root www page as linking to non-www root again.
Where did that ancient title data reappear from, after so long? What triggered its reappearance?
My programmers and I have been watching the datacenter fluctuations for a few weeks now (still getting serps from 216.239.51.104), and we noticed something on one of them that seemed to be showing results for a website that was dropped from the 1st page of serps well over a year ago.
With no real scientific explanation, it seems like Google has been serving ancient serps from an old datacenter that they dusted off.
Sorry that I don't have an explanation, just wanted to mention that we have seen the same weird (ancient) results.
A year ago a load of disallow statements were set up in the robots.txt file in order to delist duplicate content of many types. Additionally, all the old URLs that the site used to have, all of which had been 404 for more than a year, were also set to be disallowed in the robots.txt file.
Only a sidestep, and correct me if I'm wrong: if you disallow a page in robots.txt, the spider will not visit it and, as a consequence, the page will be KEPT in the index if it is already there from prior indexing. (But this is more a question than a statement.)
I found the best way to remove a page from SE indices is to leave it on the web with a "noindex" metatag plus a simple static link to redirect any potential human visitor to the appropriate page.
Take a look at 209.85.135.103 too - notice the lack of Supplemental Results.
For the first time EVER, Google has every single page of my site indexed exactly as I'd choose to index it myself, with not a SINGLE supplemental result.
Now if they'd only roll it out globally over the weekend.
Either way, I think something is brewing.
The terms "Data Refresh" and "Data Push" have been mentioned a few times in this thread. But do you know exactly:
what's the difference between Data Refresh and Data Push?
Of course not, I guess :-)
Well, our good friend Matt Cutts has the answer [mattcutts.com] :
....., they’re pretty close. A data push is a superset of a data refresh (that is, a data refresh is an instance of a data push). Typically a data refresh is something with a well-established history, e.g. we have automatic tests in place and the data to be sent out is sent out automatically (assuming that all the automatic sanity checks pass). A data push may tend to be less automated and thus may undergo more evaluation. But the terms are often used interchangeably by Googlers, because we know what we mean, so you’re pressing for a pretty fine connotation.
Thanks Matt. Much appreciated!
Another useless re-post, of a useless post. He told you nothing. Without telling what the data sets consist of, you learned nothing, with no clues to anything.
Reseller, what data is it? Is it true PR, is it a list of spam sites to be removed or penalized, is it a list of authority domains, is it a list of preferred lunch items for the food bar? You do not know, he did not say, and without this information there is nothing.
Back to watching,
WW_Watcher
66.249.85.19
72.14.215.19
72.14.221.18
209.85.135.17
[66.249.85.19...]
[72.14.215.19...]
[72.14.221.18...]
[209.85.135.17...]
and on these:
[64.233.183.99...]
[64.233.183.104...]
[64.233.183.99...]
[64.233.183.104...]
are similar (for my terms, at least).
They are also distinctly different from these:
[216.239.53.99...]
[216.239.53.104...]
[66.102.7.99...]
[66.102.7.104...]
Do you see anything interesting on this set of DCs?
[64.233.171.99...]
[64.233.171.104...]
[64.233.171.107...]
Try your testing keywords/ key phrases and tell us what you see.
Thanks a bunch!