Forum Moderators: Robert Charlton & goodroi
This is my first post, so I will try to be 'clear'...
Here is the problem:
Google recently started indexing and linking to our main site with [example.com...] instead of http://www.example.com
I have read about "canonicalization" and have also read about adding a second robots.txt file, etc., and it all seems quite confusing...
First off, I am not sure we have the ability to put up two robots.txt files, and second, is there not a SIMPLE way to make sure Google only crawls the http:// content? Our site is quite large, so modifying every page (i.e. in the meta tags) would be quite cumbersome...
Also, when Google started doing this, the number of pages we had indexed began to drop :-(
Any help would be really appreciated!
Thanks
[edited by: Robert_Charlton at 5:58 pm (utc) on Mar. 22, 2009]
[edit reason] changed to example.com - it can never be owned [/edit]
The best way to fix a canonical problem with the https: protocol is not to allow the server to make that response at all: install the secure cert only on a dedicated hostname, such as secure.example.com, and not on the main domain. I do understand that this may be an undertaking that requires recoding a lot of pages.
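As a rough sketch, assuming Apache with name-based virtual hosts (the hostnames and file paths here are hypothetical placeholders), the idea is that only the dedicated secure hostname is configured to answer on port 443:

```apache
# Hypothetical sketch: only secure.example.com gets an SSL vhost,
# so https://www.example.com/ has no secure configuration to answer it.
<VirtualHost *:443>
    ServerName secure.example.com
    SSLEngine on
    SSLCertificateFile /path/to/secure.example.com.crt
    SSLCertificateKeyFile /path/to/secure.example.com.key
    DocumentRoot /var/www/secure
</VirtualHost>
```

Checkout pages would then live only under secure.example.com, and the main www hostname never responds over https at all.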
Beyond that, what kind of server and hosting are you working with? If you're on Apache and can use your own .htaccess file, or if you're on IIS, then this thread [webmasterworld.com] has some accurate recommendations.
As a minimum first step, why not put a band-aid on the situation: add the canonical tag to those https: URLs that Google has already indexed. Then study the situation to decide on your next steps.
We are on Apache, I have access to our own .htaccess file and the robots.txt file.
What I have done thus far (and I can't tell if it will hurt/work or not..) is the following:
On my robots.txt file I have added:
Disallow: [example.com...]
Disallow: [example.com...]
On the .htaccess I did nothing "yet," as I am not sure what to actually add.
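For what it's worth, one common approach (a sketch only — I'm borrowing your /securedfolder/ name; adjust the excluded path to your actual checkout area) is to 301-redirect https requests for ordinary pages back to their http equivalents:

```apache
RewriteEngine On
# If the request came in over SSL (port 443)...
RewriteCond %{SERVER_PORT} ^443$
# ...and it is not inside the secure checkout area...
RewriteCond %{REQUEST_URI} !^/securedfolder/
# ...send a permanent redirect to the plain-http canonical URL.
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

That way Google's existing https: listings get folded back into the http: versions over time, while the checkout pages keep working over SSL.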
And then I went and found all the pages that we actually use for the checkout process (i.e. with https) and made all the links from those pages "absolute" URLs pointing to http://www.
Makes sense?
If you know what I should/should not put in my robots.txt and in our .htaccess file, that would be great :-)
Thanks.
[edited by: Robert_Charlton at 8:39 pm (utc) on Mar. 23, 2009]
[edit reason] deactived link [/edit]
A robots.txt file never mentions a protocol or domain; the protocol and domain it applies to are implied from that which was requested in order to get that file. So Disallow lines containing full URLs will do nothing — Disallow takes paths only.
You could have two robots.txt files robotshttp.txt and robotshttps.txt and set up an internal rewrite (using RewriteRule) so that the appropriate one is served depending on the protocol in the request. You'd need a RewriteCond to test for that.
[edited by: g1smd at 12:29 pm (utc) on Mar. 23, 2009]
So the Disallow lines with full URLs that I added will do nothing?
I have also done
Disallow: /securedfolder/
So that the checkout pages will not be indexed?
Does that make sense?
1) the canonical tag - I built a function to ensure that all pages return http in the tag, not https. It incidentally replaces /index.ext with / to remove the home-page canonicalization problem. There are a lot of other things that can be added (selected query strings, etc.) but that's the basis.
2) if the UA belongs to a robot, block all of the payment etc. pages completely. My own test for UA is quite complex, since it also deals with blocked IPs, exploits, etc., but a simple one for known Google/Yahoo/MSN UAs is a start.
3) keep all payment and forms etc pages in separate folders and ban them in robots.txt.
4) add a robots meta tag of noindex,nofollow,noarchive to every such page.
This should be a good start but remember that a few bad bots will always get through.
The only one in my list that absolutely requires technical knowledge is 2).
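For item 2, a minimal .htaccess sketch of the UA test (the bot names are just the common search-engine crawlers; the folder names are hypothetical placeholders for your own payment and form folders):

```apache
RewriteEngine On
# If the user-agent matches a known search-engine crawler...
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
# ...refuse any request into the payment or forms folders (403 Forbidden).
RewriteRule ^(payment|forms)/ - [F,L]
```

A real deployment would use a longer UA list and probably IP checks as well, but this is the shape of it.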
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]
Create robots_ssl.txt:
User-agent: *
Disallow: /
I will let you know if/when it works... Is there any way for me to "speed up" the proper re-indexing?
THANKS
what is the canonical tag?
Check out the Hot Topics area [webmasterworld.com], which is always pinned to the top of this forum's index page. There's a good thread in there about the canonical tag from when it was introduced in February 2009.
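In short, the canonical tag is a link element placed in a page's head section that tells search engines which URL is the preferred version of that page. For example (the URL here is illustrative):

```html
<head>
  <!-- Tells search engines the preferred (http) URL for this page -->
  <link rel="canonical" href="http://www.example.com/page.html" />
</head>
```

Putting this on the https: copy of a page, pointing at the http: URL, tells Google to consolidate the two versions.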