
Google SEO News and Discussion Forum

    
Prevent Google (and others) indexing https pages?
lee_sufc




msg:4619823
 4:17 pm on Oct 29, 2013 (gmt 0)

I have one page on my site (cart.html) that uses https://

To prevent any potential issues with duplicate content in Google, I have done the following:

1) Inserted <meta name="robots" content="noindex,nofollow"> into the head of cart.html

2) Changed all internal links on that page to absolute links using 'http://'

Is this OK? Is there anything else I should be doing?

 

netmeg




msg:4619835
 5:05 pm on Oct 29, 2013 (gmt 0)

As a general rule, I NOINDEX all cart pages anyway. But depending on what web server you're running on, there are ways to prevent https from being indexed. If you're on Apache, then maybe hie thee to the Apache forum here.

phranque




msg:4619872
 8:13 pm on Oct 29, 2013 (gmt 0)

1) this is correct

2) you should refer to the canonical url which in this case starts with https://

3) make sure robots aren't excluded from crawling this url by checking https://www.example.com/robots.txt

lee_sufc




msg:4619883
 9:24 pm on Oct 29, 2013 (gmt 0)

Thanks for the replies...

@phranque - in relation to 2) does this mean I should change all links in the cart page to be secure? (at present the page links to non-https versions).

3) Not sure I understand? I have cart.html blocked in my robots.txt as I don't want it indexed at all - why would you suggest not excluding it?

lee_sufc




msg:4619885
 9:36 pm on Oct 29, 2013 (gmt 0)

in addition to the above, I have also added rel="nofollow" to the cart link from each product page (trying to cover all bases).

phranque




msg:4619899
 11:00 pm on Oct 29, 2013 (gmt 0)

- if your cart page uses secure protocol then you should link to secure protocol.
otherwise you are requiring a 301 redirect every time someone clicks through to your cart page.

- if you exclude a bot from crawling /cart.html then the bot will never see the meta robots noindex.
while googlebot won't be able to index the content on an excluded page, google may index the url itself, using a title chosen by google, and using the following description in the SERP snippet:
A description for this result is not available because of this site's robots.txt - learn more [support.google.com]
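
for illustration (hypothetical robots.txt for this site), a Disallow line like the one below is exactly what stops googlebot from fetching the page and seeing the meta robots noindex, so it would need to be removed:

User-agent: *
# this blocks crawling, so the meta noindex on /cart.html is never seen
Disallow: /cart.html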


- rel="nofollow" is not at all about noindexing, it's about not passing PageRank and anchor text across links.
if you put rel="nofollow" on all your internal links to the cart page then you are relying on external links to get that page discovered and noindexed.
read the Webmaster Tools Help page for rel="nofollow":
http://support.google.com/webmasters/answer/96569?hl=en [support.google.com]
you'll note there are 3 suggested cases for using rel="nofollow" - untrusted content, paid links and crawl prioritization.

aakk9999




msg:4619900
 11:10 pm on Oct 29, 2013 (gmt 0)


lee_sufc:
2) Changed all internal links on that page to absolute links using 'http://'

phranque:
2) you should refer to the canonical url which in this case starts with https://

lee_sufc:
@phranque - in relation to 2) does this mean I should change all links in the cart page to be secure? (at present the page links to non-https versions).

phranque:
- if your cart page uses secure protocol then you should link to secure protocol.
otherwise you are requiring a 301 redirect every time someone clicks through to your cart page.

I think phranque and lee_sufc are talking at cross purposes here.

"That page" in the first lee_sufc quote above is the cart page. What I believe lee_sufc is asking is whether outgoing links from the cart page should be changed to link to the http version of other pages by using fully qualified URLs.

My answer is yes - for all outgoing links that should be http. If you use root-relative href= values, those links will resolve to the https scheme for your non-https pages.

To further qualify: if, for example, you have a main nav where all links are normally http, then on the https cart page this main nav should be fully qualified - href="http://example.com/main-nav-link1" etc. - and not href="/main-nav-link1", as the latter would lead to the https version of the page, which is not what you want.

Equally, the link to the cart from other pages should be fully qualified with https.
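
To put that in markup terms (example.com used purely for illustration):

<!-- on the https cart page: outgoing links fully qualified as http -->
<a href="http://www.example.com/">Home</a>

<!-- on every other (http) page: the cart link fully qualified as https -->
<a href="https://www.example.com/cart.html">View cart</a>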

[edited by: aakk9999 at 11:15 pm (utc) on Oct 29, 2013]

lee_sufc




msg:4619901
 11:13 pm on Oct 29, 2013 (gmt 0)

Thanks - the only thing I don't understand is the 301 redirect part. Why would this happen if I link to non-https pages from the cart? The reason for not using https links back to the other pages was to further reduce any chance of duplicate content.

lee_sufc




msg:4619903
 11:18 pm on Oct 29, 2013 (gmt 0)

aakk9999 thanks - that makes more sense to me.

Just to clarify (sorry, I probably didn't make it clear enough originally).

Cart.html is the page where my clients pay. It is the only page across the entire site using SSL. On this page, links back to the rest of the site (e.g. Home) point to the http:// version.

I then added rel="nofollow" to the cart links on the product pages and blocked the cart page in robots.txt.

aakk9999




msg:4619904
 11:19 pm on Oct 29, 2013 (gmt 0)

Because they might have been picked up in the past?

Ideally:
a) On the cart page, link to all other (non-https) pages using absolute URLs that include the scheme (http)
b) On all other pages, link to the cart page using an absolute https URL; links to all other pages can be root-relative (which keeps them http)
c) For any URL that should be http: if you receive an https request, return a 301 to the http version of the page
d) Equally, for the cart page, an http request should 301 to the https version

So why have the redirects even if you link with the correct scheme?
It is very easy to get the whole site indexed under the wrong scheme if there is just one link somewhere (on your site OR on an external site pointing to yours) that uses https for a page that should be http - the root-relative URLs your other pages use will then spread the wrong scheme across the site.

JD_Toims




msg:4619917
 12:54 am on Oct 30, 2013 (gmt 0)

I think I'd just keep it super simple [or maybe super oops proof is more accurate] and handle it via .htaccess:

RewriteEngine on
# Keep the bots off the cart page and the page out of the index
RewriteCond %{HTTP_USER_AGENT} (?:google|bing)bot [NC]
RewriteRule cart\.html$ - [F]

# Canonicalize everything except cart to http & www
RewriteCond %{HTTP_HOST} !^(www\.example\.com)$ [OR]
RewriteCond %{HTTPS}<>%{REQUEST_URI} !^(off<>|on<>/cart\.html$)
RewriteRule .? http://www.example.com%{REQUEST_URI} [R=301,L]

ZydoSEO




msg:4619936
 3:36 am on Oct 30, 2013 (gmt 0)

Preventing the bots from crawling the cart page will prevent any PageRank or "link juice" passed to the cart page (which is often linked to by almost every page on the site) from being recirculated around the site. It will just go into a black hole and be wasted.

I would add a <meta name="robots" content="noindex"> element to the cart page and allow it to be crawled. This will allow the cart page to accumulate PageRank/link juice and pass it back out on any outbound links that might appear on the cart page while preventing it from ever appearing in the SERPs.

Blocking the cart page with robots.txt will NOT necessarily prevent it from showing in the SERPs should Google decide it is a good match for some particular query based on inbound links they have found. It simply prevents it from being crawled (and from accumulating & distributing PageRank).

Then I would use mod_rewrite to implement two redirects (sketched at the end of this post):

1) If the cart page is requested with something other than HTTPS, then 301 redirect to the cart page WITH HTTPS

2) If any non-cart page on the site is requested with something other than HTTP, then 301 redirect to that same page WITH HTTP.

Simply fixing the links on your site so they link to pages with the proper protocol is not fool-proof. Another site can still link to a non-cart page with HTTPS and get those non-cart pages (possibly even your entire site, if you use relative URLs in your on-site links) indexed with HTTPS, creating the very situation you're trying to prevent.
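
A rough sketch of those two redirects in .htaccess (assuming Apache mod_rewrite, www.example.com as the canonical host and /cart.html as the cart page - adjust to suit your setup) would be something like:

RewriteEngine On

# 1) Cart page requested over plain HTTP -> 301 to the HTTPS cart
RewriteCond %{HTTPS} ^off$
RewriteRule ^cart\.html$ https://www.example.com/cart.html [R=301,L]

# 2) Any non-cart URL requested over HTTPS -> 301 to the HTTP version
RewriteCond %{HTTPS} ^on$
RewriteCond %{REQUEST_URI} !^/cart\.html$
RewriteRule .? http://www.example.com%{REQUEST_URI} [R=301,L]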

phranque




msg:4619977
 6:40 am on Oct 30, 2013 (gmt 0)

thanks for catching my fumble, aakk9999!

i misunderstood lee_sufc and responded too quickly.


lee_sufc, all the correct answers are in these preceding 3 posts by aakk9999, JD_Toims and ZydoSEO.

lucy24




msg:4620001
 9:06 am on Oct 30, 2013 (gmt 0)

# Keep the bots off the cart page and the page out of the index
RewriteCond %{HTTP_USER_AGENT} (?:google|bing)bot [NC]
RewriteRule cart\.html$ - [F]

Well, that would definitely keep robots out (though why bother constraining it to google and bing? does any human UA contain the string "bot"?) but how would it prevent indexing?

RewriteCond %{HTTPS}<>%{REQUEST_URI} !^(off<>|on<>/cart\.html$)

Can you translate into English? I remember seeing the <> notation once before but can't remember the explanation.

g1smd




msg:4620005
 10:07 am on Oct 30, 2013 (gmt 0)

<> is an arbitrary delimiter that would never appear in a real request. Its use here is merely to unambiguously concatenate two pieces of data.

Make sure the cart page is NOT blocked in robots.txt.
You should place meta robots noindex on the cart page itself.
Redirect cart requests to https and www.
Redirect non-cart requests to http and www.
Links going out from the cart page should begin http.
Link to the cart page as https from all other pages of the site.
You can link out from all other pages of the site as http if you want, but with the redirect in place I rarely bother. I only declare protocol on links pointing to and from the small number of https pages.

lucy24




msg:4620007
 10:27 am on Oct 30, 2013 (gmt 0)

<> is an arbitrary delimiter that would never appear in a real request. Its use here is merely to unambiguously concatenate two pieces of data.

Now, if someone else can translate that into English ;) Is it hidden in the apache docs somewhere?

In context it looks as if
a<>b c<>d
means "a is c and b is d".
Does it have to go in a single line to work with the [OR] in the preceding RewriteCond? "condition x, or both y and z"

phranque




msg:4620008
 10:28 am on Oct 30, 2013 (gmt 0)

but how would it prevent indexing?


a 403 response won't get indexed.

phranque




msg:4620009
 10:37 am on Oct 30, 2013 (gmt 0)

Now, if someone else can translate that into English ;) Is it hidden in the apache docs somewhere?


it's an arbitrary text string and the < and > characters have no special meaning in a regular expression.
this works as a delimiter in this case because you will never find the '<>' string in %{HTTPS} and %{REQUEST_URI} will always have a leading slash.

lee_sufc




msg:4620091
 4:52 pm on Oct 30, 2013 (gmt 0)

Could I just check with everyone that this would be OK? Rather than worrying about blocking the cart etc., how about this in my .htaccess to simply redirect anything that isn't cart.html back to http:

RewriteEngine On
RewriteCond %{HTTPS} on
RewriteCond %{REQUEST_URI} !cart.html
RewriteRule ^(.*)$ http://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

or should I stick with JD_Toims' suggestion above and use the following:

RewriteCond %{HTTP_HOST} !^(www\.example\.com)$ [OR]
RewriteCond %{HTTPS}<>%{REQUEST_URI} !^(off<>|on<>/cart\.html$)
RewriteRule .? http://www.example.com%{REQUEST_URI} [R=301,L]

Thanks for all your help so far - it's been very useful!

[edited by: aakk9999 at 5:40 pm (utc) on Oct 30, 2013]
[edit reason] Unlinked RewriteRule [/edit]

JD_Toims




msg:4620101
 6:32 pm on Oct 30, 2013 (gmt 0)

but how would it prevent indexing?

What phranque said.

Now, if someone else can translate that into English ;)

What phranque said + additional examples, because it could be anything that won't match or cause issues, so:

%{HTTPS} will always == on | off
%{REQUEST_URI} will always begin with a /

So the line could be:
RewriteCond %{HTTPS}MyRocker%{REQUEST_URI} !^(offMyRocker|onMyRocker/cart\.html$) [OR]
RewriteCond %{HTTPS}-InSpace-%{REQUEST_URI} !^(off-InSpace-|on-InSpace-/cart\.html$) [OR]
RewriteCond %{HTTPS}<topic>%{REQUEST_URI} !^(off<topic>|on<topic>/cart\.html$)

Basically, the use of <> is for "separation" of the variables, which makes the line easier to read and understand [well, more readable at least lol]. It also makes the single condition essentially the equivalent of testing %{HTTPS} and %{REQUEST_URI} in two separate conditions combined into one.
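
On its own, that single combined condition is doing the same job as these two ANDed conditions (a sketch, for comparison only):

RewriteCond %{HTTPS} ^on$
RewriteCond %{REQUEST_URI} !^/cart\.html$

Keeping it as one condition is what lets it sit cleanly after the %{HTTP_HOST} condition with the [OR] flag.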

RewriteRule ^(.*)$ http://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

Hard-coding the www / non-www variant into the redirect location is almost always better than using %{HTTP_HOST}

Personally, I can't think of an occasion right now where I'd recommend the use of %{HTTP_HOST} in a redirect location, just because it lends itself so well to "chaining" or "stacking" redirects by sending you to the www / non-www variant you're already visiting.

In this specific case, if the site is canonicalized to www and you request a non-www location [not cart] using https, you'll likely first be redirected to an http request for the non-www variant of the location you wanted, then be redirected again to the www variant. That's a redirect chain, and it would not happen if the canonical www / non-www host were specified in the first redirect target.

* Oh, and if you're going to go with what you have, remove the (.*) and use .? -- there's no reason at all to match and store a back-reference on every single request while using %{REQUEST_URI} and never back-referencing anything anywhere in the ruleset.
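
Applied to the rule quoted above, those two changes would give something like this (canonical host hard-coded for illustration):

RewriteCond %{HTTPS} on
RewriteCond %{REQUEST_URI} !cart.html
RewriteRule .? http://www.example.com%{REQUEST_URI} [L,R=301]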

** I vote for using mine -- It accomplishes the canonicalization of the entire site in a single rule and single redirect, so there's less room for error and less chance of stacking redirects than there would be using individual rules/conditions for everything. ;)

lucy24




msg:4620158
 9:21 pm on Oct 30, 2013 (gmt 0)

it could be anything that won't match or cause issues

Ah. Now all I have to do is wrap my brain around the idea of following %{SOME_THING} with an intentionally invalid value ... and somehow this does not result in 500 consecutive server explosions.

"cause issues" is awfully ambiguous though.

JD_Toims




msg:4620168
 9:58 pm on Oct 30, 2013 (gmt 0)

and somehow this does not result in 500 consecutive server explosions.

This is probably better for an Apache thread, but the short version is:

The left side of a condition is a string which accepts and, prior to comparison, expands variables and back-references. This means, even though I don't know why someone would, the left side of a condition could be: SomeTextYouTypeIn

It may be easier for some to understand with an example of what is basically a PHP equivalent:

NOTE: ['HTTPS'] is non-empty if HTTPS is used in PHP, otherwise it's empty, so it's easier to "see" with ['REQUEST_PROTOCOL']
Also, I'm comparing "string to string" in PHP rather than "string to regex".


if($_SERVER['REQUEST_PROTOCOL'].'<>'.$_SERVER['REQUEST_URI'] !== 'HTTPS/1.1<>/cart.html') { /* do something */ }

[edited by: JD_Toims at 10:08 pm (utc) on Oct 30, 2013]

lee_sufc




msg:4620169
 10:02 pm on Oct 30, 2013 (gmt 0)

For someone like myself who has no idea about Apache, reading these last few posts gives me brain-ache - it's like you're all speaking a different language!

phranque




msg:4620178
 10:48 pm on Oct 30, 2013 (gmt 0)

that's funny - to me it sounds like an echo.

phranque




msg:4620179
 10:50 pm on Oct 30, 2013 (gmt 0)

and somehow this does not result in 500 consecutive server explosions.


it's just a RewriteCond.
how much damage could it DO?

lucy24




msg:4620195
 12:24 am on Oct 31, 2013 (gmt 0)

it's just a RewriteCond.
how much damage could it DO?

Try making a RewriteCond with a trailing escaped literal space and see.

I think my problem with <> is that I see it and interpret it as !=. I can't remember which language uses this notation, but I must have known it at some point.

g1smd




msg:4620217
 2:11 am on Oct 31, 2013 (gmt 0)

The usual test is to see if "varname1" equals "value1".

The new test is simply looking to see if "varname1<>varname2" is equal to "value1<>value2".

In theory you could use:
"varname1@@varname2" and "value1@@value2" or
"varname1##varname2" and "value1##value2"
or various other characters.
