
Forum Moderators: Robert Charlton & goodroi


Dev Mirror of site got indexed by google - 301 or disallow? or both?

     
11:32 am on May 6, 2011 (gmt 0)

Junior Member

10+ Year Member

joined:Dec 2, 2003
posts:76
votes: 0


Hi all,
Purely by chance, last night I discovered Google has indexed the dev mirror of our live site on our dyndns domain. Serves me right for thinking Google would never discover it by itself.

As soon as I realised this, I implemented a 301 rule in the dev server's .htaccess to redirect every single page to its counterpart on the live site. I've also added the dev site to our Webmaster Tools account, so I could potentially ask for a removal if need be.
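For reference, the redirect rule on the dev box would look something like this (a sketch only; www.example.com is a placeholder for the live hostname):

```apache
# Dev-server .htaccess: 301 every request to its
# counterpart on the live site.
RewriteEngine on
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```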

My question is: what are the next steps? Should I:

a. leave everything as is, just the 301, and hope Google removes the dev domain from the index
b. additionally add a noindex & nofollow meta tag
c. additionally request removal of the whole site through Webmaster Tools
d. all of the above?
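For option (b), one way to apply noindex/nofollow across the whole dev site without editing any templates is an X-Robots-Tag response header (a sketch; assumes mod_headers is enabled, and note that the bot has to be able to fetch the page to see the header):

```apache
# Serve "noindex, nofollow" on every response from the dev box.
Header set X-Robots-Tag "noindex, nofollow"
```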

I'm also wondering how the crawler found this dev IP. I'm pretty sure there are no backlinks to it; the only possibility I can think of is Google using its tracking data (from office users accessing the dev box through their browsers) to harvest these IPs and then crawl them. What gives?

Also, the funny thing is our rankings don't seem to have been affected. The dev box has been live for the last few months, and if I pull up the Google Analytics data, our organic traffic has been on a steady upward curve over the last year.
Another funny thing (it's only funny because I would be crying right now had I discovered our rankings on a downward curve) is that we now rank on page one for some high-value keywords, and I can see the dev domain on page 1 too!
4:40 pm on May 6, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


First, it's not hurting anything, so you have some options:
B + C might be an idea.
I wouldn't worry about A personally ... You don't have any links coming in to it, so just get it out of the index.

For future developments and what I might do rather than the above:
e. None of the Above

In the .htaccess or (better imo) httpd.conf
# Tell them it's gone.

# They don't need to see it anyway.
# Just remember to remove the block from a live site
# That's why I go httpd.conf on the dev server:
# When I move the .htaccess I don't have to
# worry about it being in there.

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} google|bing|slurp [NC]
RewriteRule .? - [G]
5:13 pm on May 6, 2011 (gmt 0)

Junior Member

10+ Year Member

joined:Dec 2, 2003
posts:76
votes: 0


What exactly does this rule do to crawlers? Just deny?

I was worried a duplicate version of the site in Google's index might be hurting the rankings; that is the general mantra on the net, but as I said, I haven't noticed this in Analytics...
5:33 pm on May 6, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


What exactly does this rule do to crawlers? Just deny?

It serves them a '410 Gone' error. [w3.org...]

They'll usually drop the pages from the index quickly and not revisit them as often as they would with some other methods used to remove or de-index pages.

...but as I said, I haven't noticed this in Analytics

I personally believe the stats, not the hype...
6:02 pm on May 6, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 7, 2003
posts: 750
votes: 0


Put a rel canonical on every page of your live site pointing to itself. That way, when Googlebot finds the dev site, each page there carries a rel canonical pointing to the live site.
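The tag described above sits in the head of every page on the live site, each page pointing at its own live URL (a sketch; www.example.com and the page path are placeholders):

```html
<!-- Self-referencing canonical on the live site; when this page
     is mirrored to the dev box, it still points at the live URL. -->
<link rel="canonical" href="http://www.example.com/widgets.html">
```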
6:15 pm on May 6, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


I wouldn't want a dev server crawled ... You can serve [F] (403 Forbidden) if you don't want to use [G] Gone, but either way, I would (and do) keep them off dev boxes by blocking them at the root ... If there were links or 'some gain' from the site being found and indexed I'd go with deadsea's idea, but a dev box? Nah, imo they don't need to be crawling it at all...
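The [F] variant is the same rule as before with the flag swapped, serving 403 Forbidden instead of 410 Gone:

```apache
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} google|bing|slurp [NC]
RewriteRule .? - [F]
```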
6:29 pm on May 6, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 7, 2003
posts: 750
votes: 0


I try to firewall my dev servers off too, but there have been numerous occasions where sites that I've worked with had a dev site become accessible to the outside world accidentally:

1) firewall misconfiguration
2) preview site launched publicly on purpose
3) developer working from home with the site on the laptop and no firewall at home

We decided that we shouldn't have our dev servers crawled, but as a precaution we would rel canonical every page to itself on the live site. That way if a dev site does get crawled, it isn't an SEO problem.

We judged that the rel canonical was less risky than a custom robots.txt because there is no special configuration needed for the live site.
6:44 pm on May 6, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


Ah, rel="canonical" to the live site 'in addition to' a block of some kind, yes, definitely, why not?
I totally agree.
8:33 pm on May 6, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


If you block them, they cannot see the rel="canonical".

In the long term I would put the site behind a .htpasswd challenge. Before that, you need to remove the URLs from the SERPs.

Immediately throwing up the .htpasswd rule would mean Google takes a long time to drop the URLs from the SERPs. Before adding the .htpasswd rule, I would serve a 301 redirect to the live site for a month, then maybe serve a 404 for a few weeks.

I have seen customers sign up, log in, and try to buy stuff on dev sites! Use .htpasswd to keep them all out.
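A minimal version of that .htpasswd challenge (a sketch; the file path and realm name are placeholders, and the credentials file is created with the htpasswd utility first):

```apache
# Create the credentials file once, e.g.:
#   htpasswd -c /etc/apache2/.htpasswd devuser
# Then require a login for the whole dev site:
AuthType Basic
AuthName "Dev server - staff only"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```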
8:39 pm on May 6, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member themadscientist is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 14, 2008
posts:2910
votes: 62


I try to firewall my dev servers off too, but there have been numerous occasions where sites that I've worked with had a dev site become accessible to the outside world accidentally:

We decided that we shouldn't have our dev servers crawled, but as a precaution we would rel canonical every page to itself on the live site. That way if a dev site does get crawled, it isn't an SEO problem.

Ah, rel="canonical" to the live site 'in addition to' a block of some kind, yes, definitely, why not?

If you block them, they cannot see the rel="canonical".

We know ... The 'plan' from those quotes is to block them, but as a 'secondary precaution' there's nothing wrong with adding a rel="canonical", in fact, then it will be present on the live site too when the dev site is moved! ;)
12:57 am on May 7, 2011 (gmt 0)

Junior Member

10+ Year Member

joined:Dec 2, 2003
posts:76
votes: 0


Kool, thanks for all the comments...

I think I will leave it be for a month with the 301 in place, then issue a 401 to bots, and perhaps ask for a removal after 2 months. Possibly rel canonical too, but it shouldn't be a problem, as I'm the only one tinkering with the boxes, so once it's blocked it's blocked forever.

edit: the reasoning behind the 301, regardless of no incoming links, is for the bots: they will pick up the redirect and hopefully leave only the destination page (on the live site) in the index. Logically the next step seems to be a 401 to hide the server from Googlebot's long arm.