
My staging server is indexed!

I think Google is penalizing for duplicate content.


threecrans

7:19 pm on Aug 27, 2002 (gmt 0)

10+ Year Member



Currently, our entire website is accessible from two places on the Internet:

  • Our "main server". The server to which we _want_ all of our traffic to go.
  • A "staging server" which contains 99% identical content to the "main server"...I don't want traffic to go here at all.

The problem is, our "staging server" hosted a lot of website content at one time. A lot of links to this server are still around from various Internet locales and I have been unsuccessful in eradicating these links. As a result, the "staging server" is often indexed by Google. Today I noticed a PR0 on one of the "staging server" pages...and I can only attribute this to Google thinking I am duplicating content. So, here are my options:

  1. Completely remove the "staging server" from the Internet. I don't want to do this because we still get a fair amount of traffic to this server and I don't want them to get a DNS error.
  2. Disallow all pages in robots.txt. I am afraid to do this because the "staging server" and the "main server" share all content (all pages are published from "staging" to "main") including the robots.txt file.
  3. Redirect all traffic from "staging" to "main" (with a 302 response code). This seems like the best option. How will Google react to this though? Will it be apparent what I am doing? Any chance of a penalty from doing this?

I'm probably just being paranoid...but hoping I get a couple of backups on option 3...Thanks.

Knowles

7:30 pm on Aug 27, 2002 (gmt 0)

10+ Year Member



Put a robots.txt on the staging server blocking all spiders. Forgot to add: just change the publishing so it doesn't publish the robots.txt file, so that you don't end up blocking the main site. The only other option I can see would be to take the staging server offline and only have it local.
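For reference, a staging-only robots.txt that blocks all compliant spiders is just two lines:

```text
User-agent: *
Disallow: /
```

Any well-behaved crawler that fetches this from the staging hostname will skip the whole site; the risk discussed below is only that this file might get published to the live server by mistake.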

threecrans

8:11 pm on Aug 27, 2002 (gmt 0)

10+ Year Member



I am very cautious about using a robots.txt file on the staging server. I could change the project settings (I use Visual InterDev) so the file is not published to the main server, but I am afraid of what would happen were it accidentally published...so I would rather not take this option unless it is my only one. Anyone else who has had synchronization issues with InterDev would probably understand why.

I would rather take the staging server completely offline than run the risk of being penalized by Google or inadvertently telling Google to not index my site. Hoping I don't have to do that.

You didn't mention anything about option #3. Is it your opinion that this isn't a feasible option? If so, why?

[edited by: threecrans at 8:43 pm (utc) on Aug. 27, 2002]

Knowles

8:14 pm on Aug 27, 2002 (gmt 0)

10+ Year Member



If you do #3, how do you view the staging site? I assume it's up for you to test things. If you do a 302 you will send yourself to the main server instead of seeing the staging server which would make it pointless. The only way around that I can think of would be excluding your own IP via .htaccess, but I'm not aware of how to do that.

threecrans

8:27 pm on Aug 27, 2002 (gmt 0)

10+ Year Member



If you do a 302 you will send yourself to the main server instead of seeing the staging server which would make it pointless.

True, it does make it kind of pointless. But there are advantages.


  1. No chance of inadvertently publishing a robots.txt file telling Google to ignore my entire server.
  2. Visitors will not get a DNS error.

Also, I think I have a solution to allow it to still be internally viewable.

main server name --> "main.com"
staging server name --> "staging.main.com"

The site is an ASP site, so I check the Request.ServerVariables("SERVER_NAME") in the Session_OnStart event something like:


' Check whether this request came in on the staging host
If InStr(Request.ServerVariables("SERVER_NAME"), "staging.main") <> 0 Then
    ' It is staging -- redirect (302) to the same path on main
    ' (note: PATH_INFO does not include the query string)
    Response.Redirect "http://www.main.com" & Request.ServerVariables("PATH_INFO")
End If

I should be able to view internally still by using the internal name which is not the same as Request.ServerVariables("SERVER_NAME"). ASP people, with this methodology is there anything I am not accounting for?

JayC

8:28 pm on Aug 27, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A robots.txt on the staging server, excluding all spiders, is the best way to be sure that content doesn't get indexed. Perhaps you could get around the risk of that robots.txt being inadvertently present on the live server by setting up a cron job to check on it? Maybe overwrite robots.txt every hour with a copy placed in a different directory, existing only on the live server where you know that synchronization won't affect it...
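A sketch of that safeguard, assuming a Unix-like live server (all paths here are hypothetical, adjust them to your own layout): keep a master copy of the live robots.txt outside the synchronized web root, and restore it on a schedule so an accidentally published "block everything" file is undone within the hour.

```shell
# Master copy lives outside the web root, where publishing can't touch it.
mkdir -p /tmp/safe /tmp/webroot
printf 'User-agent: *\nDisallow:\n' > /tmp/safe/robots.master.txt  # live copy: allow all
# The command the cron job would run every hour:
cp /tmp/safe/robots.master.txt /tmp/webroot/robots.txt
# A crontab entry on the live server might look like:
#   0 * * * * cp /home/safe/robots.master.txt /var/www/html/robots.txt
cat /tmp/webroot/robots.txt
```

Note this only works on the live (Unix) side; as pointed out later in the thread, cron has no direct equivalent on an IIS box, though the Windows Task Scheduler could run the same kind of copy.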

threecrans

8:39 pm on Aug 27, 2002 (gmt 0)

10+ Year Member



Perhaps you could get around the risk of that robots.txt being inadvertantly present on the live server by setting up a cron job to check on it?

And give up any chance of a decent night's sleep for the rest of my life? :)

It's living a little close to the edge...but not a bad idea.

So that's two votes against the 302 redirect. What is the reasoning, though? What will Google (or any other spider) not like about this?

jatar_k

8:47 pm on Aug 27, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



I have used a setup like this and had the two servers configured differently.

You needed a user and pass to access the dev server.

our dev server was also set up with every domain as a sub of ours. So the live corp site was www.company1.com but the dev site was company1.ourdomain.com, and required a user and pass set up via .htaccess. Dev sites were never spidered, and when I moved things to the live server only the server itself was different, to provide access.
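On the Apache/Solaris setup described here, the .htaccess protecting a dev vhost would look something like this (the AuthUserFile path and realm name are made up for the example):

```text
AuthType Basic
AuthName "Dev Server"
AuthUserFile /home/dev/.htpasswd
Require valid-user
```

The matching .htpasswd file is created with the htpasswd utility that ships with Apache; spiders never get past the 401 challenge, so nothing behind it is indexed.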

just a thought.

PaulPaul

8:54 pm on Aug 27, 2002 (gmt 0)

10+ Year Member



I totally agree with jatar_k.

Just change the staging server access rights: give yourself, and whoever else needs it, a username and password, and you're done. You can update your live server, no problem. And since googlebot doesn't have the username and password, it will not get in, and thus not index the site.

BTW: How does Google know of the site? Do you have links pointing to the staging server? Or did you submit it to SEs?

Knowles

8:58 pm on Aug 27, 2002 (gmt 0)

10+ Year Member



The problem is, our "staging server" hosted a lot of website content at one time.

Since it's on IIS, how do you set up the 302 or the user/password? You can't use .htaccess, can you?

Also, since it's on IIS the cron job won't work; there might be a Windows alternative, though.

threecrans

9:10 pm on Aug 27, 2002 (gmt 0)

10+ Year Member



jatar_k: You needed a user and pass to access the dev server.

That's a great idea. The only drawback is that users hitting the staging server will not be forwarded to the main server...but at least it stops the indexing.

PaulPaul: BTW: How does google know of the site?? Do you have links pointing to the staging server?? Or did you submit it to SE's??

It used to host a portion of our site...which has since been moved to the main site. As a result there are quite a few links on the Internet pointing to the staging server.

jatar_k

9:14 pm on Aug 27, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



sorry, my example above was on Solaris servers and I don't know the IIS equivalents.

What about having a redirect from the old staging domain to the live site and applying a new subdomain to the actual staging server? You then keep all the traffic presently going to it, and you get a clean staging server.

dev.main.com or some such, could be anything. Then you can set up the server to keep people out.

threecrans

9:54 pm on Aug 27, 2002 (gmt 0)

10+ Year Member



What about having a redirect from the old staging domain to the live site and applying a new sub domain to the actual staging server.

Another very good suggestion. This is probably what I will do if no better alternative arises.

Please! Can someone explain to me why a 404 or a DNS error is better than a 302?

Slade

12:15 am on Aug 28, 2002 (gmt 0)

10+ Year Member



Can't you just take your robots.txt out of your main publishing store, then create a dev-publish and a real-publish to put them in?

(Might be a long way around, but would allow you to make other differences localizable.)

...just a thought...

Slade

12:18 am on Aug 28, 2002 (gmt 0)

10+ Year Member



Re: a new server name.

This doesn't help if you allow access to mysecretserver.main.com from the Internet.

It will still need a robots.txt or other keep-away method, because it could be found by accident somehow: back to 1^2 (square one).

bcc1234

12:24 am on Aug 28, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Put a robots.txt on the server blocking all spiders. Forgot to add, just change the publishing to not publish the robots.txt file so that you dont end up blocking on the main site. The only other option I can see would be to take the staging server offline and only have it local.

Configure your web server to pull the robots.txt file from a separate location. So you'll have something like /home/zzz/robots.txt, and all hosts will be configured to serve this file on all domains.

Your sites will still be in their own directories, and that way you won't copy robots.txt to the production server by mistake.
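With Apache (contemporary 1.3/2.0 syntax), that setup can be sketched with an Alias mapping /robots.txt on every vhost to one file outside the document roots; the /home/zzz path follows the example above and is otherwise arbitrary:

```text
# In httpd.conf, inside each <VirtualHost> (or globally for all of them):
Alias /robots.txt /home/zzz/robots.txt

# The aliased directory must be readable by the server:
<Directory /home/zzz>
    Order allow,deny
    Allow from all
</Directory>
```

Since /home/zzz is never part of any site's publishing tree, synchronization can't overwrite the file Apache actually serves.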

jdMorgan

2:43 am on Aug 28, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



threecrans,

I like jatar_k's idea of a new subdomain. If this was my site, I would:

Put 301 permanent redirects on the current staging server domain to forward the existing-link traffic to the "published" domain.

Then create a new subdomain on the staging server, and move your in-development site content there. If you warn all development participants not to publish a link to the new subdomain content, then you won't need a different robots.txt or username/password access, and that will obviate your worries about having an incorrect robots.txt published. It is very hard to keep domains and subdomains "secret" on the web, but if you don't publish links to the new staging subdomain, and if you refrain from using Google's (and other SEs') toolbars in that subdomain, it should remain low-profile enough to be ignored by the search engines.

Also, a new staging subdomain will allow you to place a robots.txt file on the old staging server domain in the future if you can get the majority of the high-traffic external links to the old staging server domain updated. Until then, you may not want to lose the link-pop and PageRank of those old incorrect links.

Until that time the 301 permanent redirects will prevent most search engines from listing your old staging server domain. AFAIK, 302 temporary redirects won't work for the SEs, only to redirect users. A 301 is required to get the SEs to drop the old staging server domain, even if there are still links to it on the web.

Jim

threecrans

2:32 pm on Aug 28, 2002 (gmt 0)

10+ Year Member



My solution.

Ok, I went with the idea first put forth by jatar_k (new subdomain) and backed up by jdMorgan...also taking into account advice by Slade (what if new subdomain gets indexed?).

I changed the domain staging.main.com to point to a new server, one that doesn't have duplicate content. The server that was originally staging.main.com is completely inaccessible from the Internet (but still accessible internally). I created a custom 404 error page on the new server (the one that is now staging.main.com) with links pointing to appropriate pages on the main server. That way a 302 is never sent. Anyone see any potential problems here?

Nobody directly addressed the problem with sending a 302...so I'll put forth my own theory. Let me know what you think.

I have two servers with nearly identical content. Google thinks the one with a lower page rank (i.e. the staging server) is engaging in a less than wholesome tactic of duplicating page content, thus several PR0 penalties crop up on some of the pages of the staging server.

If I 302 redirect from staging server to main server, it appears my main server is now engaging in the spamming tactic as well, in which case the PR0 penalties could propagate to main.

This is based on absolutely no hard evidence. Please correct me if I am missing something.

jdMorgan

7:41 pm on Aug 28, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



threecrans,

I strongly recommend you use a 301-Moved Permanently redirect to avoid problems. A 301 effectively "tells" the search engine that you've moved the pages and are not attempting a duplicate-content exploit. The 301 causes the search engine to drop the old address (your staging server), update the URL it is using, and for Google, transfer the PR from the old URL to the new URL if it's higher. I believe that's what you want to accomplish.

A 302-Moved Temporarily will just make it keep what it's got, and check again the next time it spiders your domain, so your problem of having your staging server listed does not go away.

Just my opinion, though...

Jim

duggelz

7:45 pm on Aug 28, 2002 (gmt 0)



Google tells you right on their website how to handle this:

Finally, if your old URLs redirect to your new site using HTTP 301 (permanent) redirects, our crawler will know to use the new URL. Changes made in this way will take 6-8 weeks to be reflected in Google.

This is the optimal way to handle it: your PR will be correctly merged, googlebot will be happy, etc., as GoogleGuy has confirmed in the past.

HTTP 302 redirects, on the other hand, may well result in the server being eliminated as a duplicate. And HTTP 404 responses will mean that you lose all the PR that would otherwise flow to the main server via these links to the staging server.

jatar_k

7:47 pm on Aug 28, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



agreed jdMorgan,

People often refer to 302 as "moved", but as jd points out, 302 is more along the lines of "keep these pages listed, not the ones I am sending you to, because these are still the real pages."

301 means "gone forever, so list where I am sending you instead."

301 is what you need threecrans.

threecrans

2:00 pm on Aug 29, 2002 (gmt 0)

10+ Year Member



Did some research on the possibility of sending a 301. I use IIS, and I can configure the server (Properties -> Home Directory -> Redirection to a URL) to send a 301 when any page is requested. With this configuration, though, the response status line reads "301 Error".

I could intercept the request in the global.asa (Session_OnStart) to send a "301 Moved" or a "301 Moved Permanently" to any requests.
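A minimal classic-ASP sketch of that interception, run at the top of each request on the old staging host (www.main.com stands in for the real live hostname, per the earlier example; as with the earlier snippet, PATH_INFO does not carry the query string):

```text
' Send an explicit 301 status line, point the client at the live
' server, and stop any further processing of the page.
Response.Status = "301 Moved Permanently"
Response.AddHeader "Location", "http://www.main.com" & Request.ServerVariables("PATH_INFO")
Response.End
```

Setting Response.Status directly avoids Response.Redirect, which in classic ASP always emits a 302.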

Does Google take into account the response status text when indexing, or does it simply lump all "301" responses into the same category? If it does take the status text into account, which is better: "301 Error", "301 Moved", or "301 Moved Permanently"? (I'm guessing "Moved Permanently".)

jdMorgan

4:07 pm on Aug 29, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



threecrans,

This topic - of IIS sending a "weird" text string with the 301 code - came up here recently. You might try a site search - I would, but I'm off to a meeting. AFAIK, the meaning of 301 is "Moved Permanently", no matter what text is sent with the code. This is defined by the first digit of the code - 2xx is "OK", 3xx is "moved" or "not modified", 4xx is "error" or "authorization required", etc. I suspect that the text is for human eyes only. But, I'm not sure if all clients will respond correctly to the 301 code and ignore the text.

Jim