The problem is that if you search 'pages from UK only', which is where most of their traffic comes from, the site is nowhere to be found, because that datacenter keeps indexing an https version of the index page. It's only that datacenter, though. Why, and how can I fix this?
Each port must have its own robots.txt file. In particular, if you serve
content via both http and https, you'll need a separate robots.txt file for each
of these protocols. For example, to allow Googlebot to index all http pages
but no https pages, you'd use the robots.txt files below.
For your http protocol (http://yourserver.com/robots.txt):
User-agent: *
Allow: /
For the https protocol (https://yourserver.com/robots.txt):
User-agent: *
Disallow: /
Source: Webmaster Help Center (http://www.google.com/support/webmasters/bin/answer.py?answer=35302&query=https&topic=&type=)
I have done all the recommended redirects with .htaccess (non-www to www, and index.* to /) as recommended by the experts here, but I still have issues with G indexing the https version along with the http version.
Back To Watching
WW_Watcher
User-agent: *
Disallow: /
Then I added the following to my .htaccess, under my current redirects:
# For requests arriving on the SSL port, serve robots_ssl.txt in place of robots.txt
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]
Does anyone see any issue with this? Will this stop the indexing of my site as https, and still work fine with http?
I tested http://www.example.com/robots.txt and it shows my normal robots.txt
I tested https://www.example.com/robots.txt and it shows the contents of the robots_ssl.txt
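The same check can also be reasoned about offline. The following is only a minimal sketch, assuming the two files contain exactly the rules quoted earlier in this thread; the helper name `allowed` is hypothetical, and Python's standard-library parser is a stand-in that may not match Googlebot's own interpretation in every detail.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of the two files, copied from the rules quoted in this thread.
HTTP_ROBOTS = """User-agent: *
Allow: /
"""
HTTPS_ROBOTS = """User-agent: *
Disallow: /
"""

def allowed(robots_txt, user_agent, url):
    """Hypothetical helper: does this robots.txt text permit user_agent to fetch url?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# The http file should allow crawling; the https file should block it.
http_ok = allowed(HTTP_ROBOTS, "Googlebot", "http://www.example.com/")
https_ok = allowed(HTTPS_ROBOTS, "Googlebot", "https://www.example.com/")
print(http_ok, https_ok)
```

This only confirms what the rules mean once a crawler has fetched them; it says nothing about how quickly a given datacenter will pick up the change.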
Thanks In Advance!
Back To Watching
WW_Watcher
IE: secure.mydomain
Then you have two completely separate domain roots and two separate robots.txt files.
I'd be a bit worried about Google getting mixed up by the method you came up with, WW_Watcher.
But hey I don't have a port 443 service to worry about.
This was one of two solutions they had listed to solve my problem. The other was a PHP include that puts a noindex into the page when the request comes in over https, but I did not want to have to alter every page on the site to add the PHP include.
This appears to be working quite well from every way I have looked at it.
I found this solution by searching on how to stop google from indexing https, and found an article written by Dan Johnson, Technical & Marketing Consultant at SEO Workers.
Back To Watching
WW_Watcher
For your http protocol (http://yourserver.com/robots.txt):
User-agent: *
Allow: /
For the https protocol (https://yourserver.com/robots.txt):
User-agent: *
Disallow: /
But guess what? Every other version of Google is fine with it, and the site is ranking as it should everywhere except google.co.uk 'pages from the UK'. It has stopped indexing https, but it has also stopped indexing the index page altogether. It is still indexing the rest of the site, just not the index page.
What should I try next? This one has got me really stumped!
Were they previously indexing both the secure and standard pages, or only the secure pages? If the standard pages were not indexed properly, it may take some time for the datacenter to "catch up" with the data. Optionally, you might try disallowing Googlebot specifically rather than using the global disallow on the secure pages.
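As a rough illustration of the Googlebot-specific option, here is a minimal sketch that assumes the https robots.txt holds a single Googlebot group; the bot name `SomeOtherBot` is made up for the example, and Python's standard-library parser is used only as an approximation of how a crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# Assumed https robots.txt that names Googlebot instead of using the global "*".
GOOGLEBOT_ONLY = """User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(GOOGLEBOT_ONLY.splitlines())

# Googlebot is blocked from the secure pages...
googlebot_blocked = not parser.can_fetch("Googlebot", "https://www.example.com/")
# ...while a crawler with no matching group (hypothetical name) is unaffected.
other_allowed = parser.can_fetch("SomeOtherBot", "https://www.example.com/")
print(googlebot_blocked, other_allowed)
```

The trade-off is that other search engines could still crawl the https pages, so this only makes sense if Google is the only engine indexing the secure duplicates.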
I am assuming you have both the secure and standard pages on the same subdomain, like we did at webnauts (which prompted me to write the above-referenced article).
Dan Johnson