HTTPS being indexed as main site - easiest fix for large site? - Google Search and SEO forum at WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

HTTPS being indexed as main site - easiest fix for large site?

KBTeachers

9:50 am on Mar 22, 2009 (gmt 0)

Hi all,

This is my first post, so I will try to be 'clear'...

Here is the problem:

Google recently started indexing and linking to our main site with [example.com...] instead of http://www.example.com

I have read about "canonicalization" and have also read about adding a second robot.txt file etc... and all seems quite confusing...

1st off, I am not sure we have the ability to put two robots files, and second, is there not a SIMPLE way to make sure Google only crawls http:// content? our site is quite large, and therefore doing a mod on all the pages (i.e. in the meta tags) would be quite cumbersome...

Also, when Google started doing this, it started lowering the number of pages that we had indexed :-(

Any help would be really appreciated!

Thanks

[edited by: Robert_Charlton at 5:58 pm (utc) on Mar. 22, 2009]
[edit reason] changed to example.com - it can never be owned [/edit]

tedster

4:11 am on Mar 23, 2009 (gmt 0)

Hello KBTeachers, and welcome to the forums.

The best way to fix a canoncial problem with the https: protocol is not to allow the server to make that response at all. Installing the secure cert only on a dedicated hostname, such as secure.example.com, and not on the main domain. I do understand that this may be an undertaking that requires recoding a lot of pages.

Beyond that, what kind of server and hosting are you working with? If you're on Apache and can use your own .htaccess file, or if you're on IIS, then this thread [webmasterworld.com] has some accurate recommendations.

As a minimum, first step, why not put a bandaid on the situation: add the canonical tag to those https: urls that Google has already indexed. Then study the situation to decide on your next steps.

KBTeachers

12:17 pm on Mar 23, 2009 (gmt 0)

Hi Tedster.

We are on Apache, I have access to our own .htaccess file and the robots.txt file.

What I have done thus far (and I can't tell if it will hurt/work or not..) is the following:

On my robots.txt file I have added:

Disallow: [example.com...]
Disallow: [example.com...]

on the htaccess I did nothing "yet" as I am not sure what to actually add?

And then I went and found all the pages that i had that we actually use for checkout process (i.e. with https) and made all the links from these pages with "absolute' urls pointing to http://www. )

Makes sense?

if you know of what I should/should not put in my robots.txt and in our htaccess file that would be great :-)

Thanks.

[edited by: Robert_Charlton at 8:39 pm (utc) on Mar. 23, 2009]
[edit reason] deactived link [/edit]

g1smd

12:25 pm on Mar 23, 2009 (gmt 0)

You can only put path parts in the robots.txt file. That is, do not include domain names.

The protocol and domain they apply to is implied from that which was requested in order to get that file.

You could have two robots.txt files robotshttp.txt and robotshttps.txt and set up an internal rewrite (using RewriteRule) so that the appropriate one is served depending on the protocol in the request. You'd need a RewriteCond to test for that.

[edited by: g1smd at 12:29 pm (utc) on Mar. 23, 2009]

KBTeachers

12:27 pm on Mar 23, 2009 (gmt 0)

So I guess the
Disallow: [example.com...]
Disallow: [example.com...]

will do nothing?

I have also done
Disallow: /securedfolder/

So that the checkout pages will not be indexed?

Does that make sense?

g1smd

3:41 pm on Mar 23, 2009 (gmt 0)

The folder rule is the only one that could possibly work, but it still doesn't solve the other issues.

dstiles

10:52 pm on Mar 23, 2009 (gmt 0)

I've used a few of approaches to preventing bots accessing with https and to keep http accesses out of payment sections as well (not all payment methods require high security - eg invoiced and some types of paypal).

1) the canonical tag - I built a function to ensure that all pages returned http in the tag, not https. It incidentally replaces /index.ext with / to remove the home page canonicalization problem. There are a lot of other things that can be added (selected querystrings etc) but that's the basis.

2) if the UA belongs to a robot, block all of the payment etc pages completely. My own test for UA is quite complex since it deals with blocked IPs, exploits etc but a simple one for known google/yahoo/msn UAs is a start.

3) keep all payment and forms etc pages in separate folders and ban them in robots.txt.

4) add a robots tag of noindex,nofollow,noarchive to every page.

This should be a good start but remember that a few bad bots will always get through.

Key_Master

11:06 pm on Mar 23, 2009 (gmt 0)

In .htaccess:


RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

Create robots_ssl.txt:

User-agent: *
Disallow: /

dstiles

8:14 pm on Mar 24, 2009 (gmt 0)

Which, of course, is not available to all. Not least because 99.9% of web site owners have no idea how to use .htaccess even if a) it exists on the server and b) they're given access to it. The canonocal tag is available to all.

The only one in my list that absolutely requires technical knowledge is 2).

KBTeachers

2:58 pm on Mar 29, 2009 (gmt 0)

hey Key_Master, I have implemented your suggestion:
"In .htaccess:

RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ robots_ssl.txt [L]

Create robots_ssl.txt:

User-agent: *
Disallow: / '

I will let you know if/when it works... Is there anyway for me to 'speed up" the proper re indexing?

THANKS

particleman

3:15 pm on Mar 29, 2009 (gmt 0)

Just to share my experience. I had this exact same issue, seems google crawled into our payment process which let them onto the entire https:// version of our site. Over the years over 1000 pages got indexed like that. We started with the robots.txt fix seeing pages change over very slowly, however, the canonical tag was really when we started to see pages change over much faster. We also blocked our payment pages in the robots file.

KBTeachers

3:23 pm on Mar 29, 2009 (gmt 0)

what is the canonical tag? I am not sure I understand where/what it is specifically....

g1smd

4:49 pm on Mar 29, 2009 (gmt 0)

Its a new tag that tells Google what the URL to be indexed for the page is, if it is different to the URL that Google used to access the page. The tag also works for Yahoo, Live, and Ask.

There are several threads about it here.

tedster

3:30 am on Mar 30, 2009 (gmt 0)

what is the canonical tag?

Check out the Hot Topics area [webmasterworld.com], which is always pinned to the top of this forum's index page. There's a good thread in there about the canonical tag from when it was introduced in February 2009.