Google indexing https pages as different pages?
helenp - msg:4557747 - 9:47 am on Mar 23, 2013 (gmt 0)

Hi,
Google is indexing many extra pages lately. Checking the indexed pages in Google, I saw results like this:

Widgets in placename with gizmo for sale
https://www.mysite.com/sales/widgets_for_sale_placename.htm?id=113...
Description 1

Widgets in placename with gizmo for sale
www.mysite.com/sales/widgets_for_sale_placename.htm?id=113
Description 1

Also, pages blocked by robots.txt and without any parameters are duplicated like that (I had to click on "view more pages" in the Google search results to see these).

Do I have something wrong in my .htaccess file?
AddType application/x-httpd-php5 .htm .html
RewriteEngine On
RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.mysite.com/$1 [R=301,L]
RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ https://www.mysite.com/$1 [R=301,L]
# Redirect non-canonical to www
RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{HTTP_HOST} !^(www\.mysite\.com)?$
RewriteRule (.*) http://www.mysite.com/$1 [R=301,L]
RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{HTTP_HOST} !^(www\.mysite\.com)?$
RewriteRule (.*) https://www.mysite.com/$1 [R=301,L]
# Don't allow URLs like mysite.com/pagina1.htm/maps/ etc
RewriteRule ^((?:[^./]+/)*[^./]+\.(?:html?|php))/ http://www.mysite.com/$1 [R=301,L]



[edited by: Robert_Charlton at 4:39 pm (utc) on Mar 25, 2013]
[edit reason] removed specifics, per Charter [/edit]

 

helenp - msg:4557800 - 5:03 pm on Mar 23, 2013 (gmt 0)

I have set up a redirect on every page as described on this page, so I hope the duplicate pages will be dropped soon:
https://sites.google.com/site/onlyvalidation/page/301-redirect-https-to-http-on-apache-server

lucy24 - msg:4557839 - 8:40 pm on Mar 23, 2013 (gmt 0)

Your htaccess is fine as far as it goes, but it only covers index pages and the domain name. You need a third piece that looks at the port-and-protocol combination. Something like

RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{SERVER_PROTOCOL} https

and vice versa. (If someone says it is better to use {HTTPS} on / off, they are probably right.)
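For illustration only, here is a minimal sketch of such a rule pair using the HTTPS on/off variable. It assumes, purely as an example, that the pages which must stay secure all live under a hypothetical /secure/ path; the conditions would need to match however the http/https split is actually defined on the site:

# https request for a page outside the secure area: send back to http
RewriteCond %{HTTPS} on
RewriteCond %{REQUEST_URI} !^/secure/
RewriteRule (.*) http://www.mysite.com/$1 [R=301,L]

# plain http request for a page inside the secure area: send to https
RewriteCond %{HTTPS} !on
RewriteCond %{REQUEST_URI} ^/secure/
RewriteRule (.*) https://www.mysite.com/$1 [R=301,L]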

It is a good idea to leave a blank line after each RewriteRule. It doesn't affect the way the module runs, but makes it easier to read and to keep organized.

Does robots.txt even have a setting for protocol? News to me :(

Remember that robots.txt doesn't prevent indexing; it only prevents crawling. So make sure all links are correct: either https or http, but never both.

All this is assuming that any given page can be http or https but not both. If a page can be accessed either way, you're pretty well stuck.

seoskunk - msg:4557849 - 9:14 pm on Mar 23, 2013 (gmt 0)

You can write a rewrite rule to serve a different robots.txt for a mirrored https site... something like this:

RewriteCond %{HTTPS} =on
RewriteRule ^robots\.txt$ robots-ssl.txt [L]

In robots-ssl.txt

User-agent: *
Disallow: /

[webmasterworld.com...]

helenp - msg:4557854 - 10:31 pm on Mar 23, 2013 (gmt 0)

All this is assuming that any given page can be http or https but not both. If a page can be accessed either way, you're pretty well stuck.

Thanks lucy24 and seoskunk.
I remember very well that when I bought the https certificate I asked and was told that Google considers http and https the same page, so I just didn't bother about it and didn't research further...

What I did as an emergency fix, and it works rather well because when somebody navigates away from an https page they are redirected back to http, was to follow this page:
https://sites.google.com/site/onlyvalidation/page/301-redirect-https-to-http-on-apache-server
So I added to the http pages:
if ( ! empty( $_SERVER['HTTPS'] ) && $_SERVER['HTTPS'] !== 'off' )
{
$host = $_SERVER['HTTP_HOST'];
$request_uri = $_SERVER['REQUEST_URI'];
$good_url = "http://" . $host . $request_uri;

header( "HTTP/1.1 301 Moved Permanently" );
header( "Location: $good_url" );
exit;
}

and to the https pages:
if ( empty( $_SERVER['HTTPS'] ) || $_SERVER['HTTPS'] === 'off' )
{
$host = $_SERVER['HTTP_HOST'];
$request_uri = $_SERVER['REQUEST_URI'];
$good_url = "https://" . $host . $request_uri;

header( "HTTP/1.1 301 Moved Permanently" );
header( "Location: $good_url" );
exit;
}


Isn't this a good solution?
If it is, are there any problems with keeping it like this? It is rather good and easy, and I don't have to touch the .htaccess code.
Thanks,

helenp - msg:4557859 - 10:35 pm on Mar 23, 2013 (gmt 0)

Remember that robots.txt doesn't prevent indexing; it only prevents crawling. So make sure all links are correct: either https or http, but never both.

I never had both kinds of links. I only have links to the https pages, and those links are absolute; the rest are root-relative.
So when Google enters the absolute https links and then keeps crawling, Google stays on the https site...
However, I had heard that Google doesn't make any difference between https and http... which is not so.

seoskunk - msg:4557860 - 10:35 pm on Mar 23, 2013 (gmt 0)

Great solution if you don't use a secure server for payment.

helenp - msg:4557861 - 10:37 pm on Mar 23, 2013 (gmt 0)

Great solution if you don't use a secure server for payment.

Sorry, I don't quite understand.
I use PayPal, but all the payment pages have this on top:
if ( empty( $_SERVER['HTTPS'] ) || $_SERVER['HTTPS'] === 'off' )
{
$host = $_SERVER['HTTP_HOST'];
$request_uri = $_SERVER['REQUEST_URI'];
$good_url = "https://" . $host . $request_uri;

header( "HTTP/1.1 301 Moved Permanently" );
header( "Location: $good_url" );
exit;
}
They should be https; the rest should be http and have the opposite 301 redirect.

seoskunk - msg:4557864 - 10:46 pm on Mar 23, 2013 (gmt 0)

Cool. From your code, I take it you don't want affiliates then...

helenp - msg:4557950 - 11:08 am on Mar 24, 2013 (gmt 0)

Now I feel a bit lost, as I thought from the start that https and http were considered the same page.

Before, I had relative links. I changed the links between index pages to absolute, and the rest I changed to root-relative links, just to avoid mixing up the folders.

Then, on the pages that needed https, I used an absolute https link, so that if a visitor entered an https page and kept browsing instead of leaving, they were browsing the site over https; but as this was not the most frequent case it didn't matter.

So I am not sure how to do the linking now, or maybe it doesn't matter at all, as I have a 301 on every page saying whether the correct URL is http or https.

Should I use an absolute link to the https pages, or is a root-relative link just fine, since the 301 on the page says it should be an https page?

And the same the other way: if Google enters an https page and keeps crawling from there, it is crawling the https site.
I feel confused.

No, my site does not use affiliates, thanks.

phranque - msg:4557953 - 11:40 am on Mar 24, 2013 (gmt 0)

Isnt this a good solution?


this could work ok as long as all non-canonical requests get redirected to the canonical url in one hop.
if you do one redirect in your .htaccess file and then a 2nd redirect in your script - this solution is not so good.

lucy24 - msg:4557961 - 11:50 am on Mar 24, 2013 (gmt 0)

If you have links between http pages and https pages, these are the options I can think of:

--use links with leading / and let users stay with https even when it isn't needed. This means the googlebot will eventually have duplicate versions of all your http pages. (Not of https pages, because I assume you don't let http get to those.)
--use links with / and then redirect internal traffic to http as needed. (Generally bad, because you shouldn't be redirecting your own links.)
--use complete links, including protocol, anywhere you link between http and https pages.

Oh, wait, there's one more possibility.

--change the whole site over to https. It will solve one group of problems, but may create others. (I assume there are drawbacks to using https, otherwise all websites everywhere would do it all the time.)

robots.txt is tricky because according to google-- I just found this yesterday while looking up something else--
The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted.

That makes it sound as if they expect separate robots.txt files for http and https, even if they belong to the same domain. But what are you supposed to do, rewrite to different robots.txt files depending on the robot's protocol? You might be able to do this on a brand-new site: Here are the rules for http, here are the ones for https. You can't really change at this point, though.

helenp - msg:4557962 - 11:51 am on Mar 24, 2013 (gmt 0)

this could work ok as long as all non-canonical requests get redirected to the canonical url in one hop.
if you do one redirect in your .htaccess file and then a 2nd redirect in your script - this solution is not so good.

This part I did not understand: "in one hop".
I'm not sure I understood you, but I suppose you mean that I can't redirect [page1.php...] to https://page1.php in .htaccess and then, on the page itself, redirect the same page back to [page1.php...]

I think I don't, as the only redirects I have are index.htm to / and non-www to www,
and I do this both for http and https.

These are the only redirects I have in my .htaccess file:

# REDIRECT htm INDEX PAGES to index/

RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.mysite.com/$1 [R=301,L]

RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ https://www.mysite.com/$1 [R=301,L]

# Redirect non-canonical to www

RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{HTTP_HOST} !^(www\.mysite\.com)?$
RewriteRule (.*) http://www.mysite.com/$1 [R=301,L]

RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{HTTP_HOST} !^(www\.mysite\.com)?$
RewriteRule (.*) https://www.mysite.com/$1 [R=301,L]

# Don't permit paths like mysite.com/pagina1.htm/maps/ etc
RewriteRule ^((?:[^./]+/)*[^./]+\.(?:html?|php))/ http://www.mysite.com/$1 [R=301,L]

Thanks,

[edited by: helenp at 12:05 pm (utc) on Mar 24, 2013]

helenp - msg:4557963 - 11:58 am on Mar 24, 2013 (gmt 0)

--use links with leading / and let users stay with https even when it isn't needed. This means the googlebot will eventually have duplicate versions of all your http pages. (Not of https pages, because I assume you don't let http get to those.)

Lucy, this is what I was doing, but yesterday I saw that Google had indexed about 50 pages and I have not added that many. So I went into one folder, to limit the number of pages, searched in Google for that folder, and saw pages that should be http indexed both with http and with https.
So I have duplicate content.
So, as I said before, as an emergency solution I added a 301 redirect to every page, the http pages to http and the https pages to https, so at this moment users will never stay on https when it isn't needed; as soon as they leave the https pages they are redirected back to http.

helenp - msg:4557964 - 12:02 pm on Mar 24, 2013 (gmt 0)

The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted.

That makes it sound as if they expect separate robots.txt files for http and https, even if they belong to the same domain. But what are you supposed to do, rewrite to different robots.txt files depending on the robot's protocol? You might be able to do this on a brand-new site: Here are the rules for http, here are the ones for https. You can't really change at this point, though.

I'm not sure I get this. I requested my domain's robots.txt with both http and https and got the file both ways, so doesn't Google see the same robots.txt file for both http and https?

lucy24 - msg:4557975 - 12:49 pm on Mar 24, 2013 (gmt 0)

so doesn't google see the robots.txt file equal for both http and https

The question is whether they know or care that it's the same. Their own statement makes it sound as if robots.txt only "counts" for https pages if they get it with https, and it only counts for http if they get it with http. With some robots it would be easy to tell because maybe they will make a fresh request for robots.txt before starting in on the https pages. But the googlebot doesn't work that way.

Not sure I understood you but I suppose you mean that I can't redirect [page1.php...] to https://page1.php in .htaccess and then on the page redirect the same page this time to [page1.php...]

No, he means that for example you could have htaccess redirecting
https://www.example.com/directory/index.html
to
https://www.example.com/directory/

and then separately you will have the page php redirecting to
http://www.example.com/directory/

But since we are talking about internal links, this is not likely to happen anyway. The form of the page name, including domain, will be correct already. So there would never be more than one redirect.

helenp - msg:4557980 - 12:57 pm on Mar 24, 2013 (gmt 0)

No, he means that for example you could have htaccess redirecting
https://www.example.com/directory/index.html
to
https://www.example.com/directory/

and then separately you will have the page php redirecting to
http://www.example.com/directory/

But since we are talking about internal links, this is not likely to happen anyway. The form of the page name, including domain, will be correct already. So there would never be more than one redirect.

Phew.
I have absolute links between the directories that have their own index pages, and I do redirect index.htm to /,
and then separately I have every page redirecting to either http only or https only.
I think I have it OK; however, my head is hard, I guess.

helenp - msg:4557983 - 1:01 pm on Mar 24, 2013 (gmt 0)

The question is whether they know or care that it's the same. Their own statement makes it sound as if robots.txt only "counts" for https pages if they get it with https, and it only counts for http if they get it with http. With some robots it would be easy to tell because maybe they will make a fresh request for robots.txt before starting in on the https pages. But the googlebot doesn't work that way.


Hm, I see what you mean now.
So how can one do a separate robots.txt?
Anyway, with the redirect I have on all pages I won't get any more pages indexed in duplicate, as Google will leave https as soon as the crawler leaves the https pages.
At least that is what I guess.
Or do I have to do a robots.txt that excludes pages for the https site?
Isn't this a bug? Google should see http as the same page as https; I don't understand why they index both.

Edited: seoskunk showed how to do a robots.txt for https,
but will Google search for a robots.txt for SSL when changing from http to https?


You can write a different robots.txt rule for a mirrored https... something like this :

RewriteCond %{HTTPS} =on
RewriteRule ^robots\.txt$ robots-ssl.txt [L]

In robots-ssl.txt

User-agent: *
Disallow: /
[webmasterworld.com...]

phranque - msg:4557992 - 1:30 pm on Mar 24, 2013 (gmt 0)

That makes it sound as if they expect separate robots.txt files for http and https, even if they belong to the same domain. But what are you supposed to do, rewrite to different robots.txt files depending on the robot's protocol? You might be able to do this on a brand-new site: Here are the rules for http, here are the ones for https. You can't really change at this point, though.

that means the secure server may be (and often should be) on a different subdomain and technically could easily be on a separate server or even a separate hosting service, so don't expect http://example.com/robots.txt to speak for the exclusions you intended for https://secure.example.com/ - even if they are the identical rules.
keep in mind the secure server and the non-secure server are distinct servers even if they happen to use the same document root directory.

if in your specific implementation, the requests for http://example.com/robots.txt and https://example.com/robots.txt happen to serve the same file and that's sufficient for your requirements for both servers, then you're good to go.

helenp - msg:4558004 - 1:52 pm on Mar 24, 2013 (gmt 0)

I'm starting to wonder whether the 301 I added to all pages is good:
when Google enters an https page and then keeps crawling, it will go to an http page over https, but as the page has a 301 to http, Google won't crawl the wrongly indexed duplicate https files, and then won't drop them either, or am I confused?

I just had a "brilliant" or "stupid" idea.
I added the https version as a site in Google Webmaster Tools, as it is impossible to request removal of URLs on my http site there because it automatically adds http:// in front.
So I went into my brand-new https site and was going to remove an indexed https page, but then I started to think: does Google really see it as a different page, or will it remove both the https and the http page, which have the same content and path and differ only in protocol?
Any idea?

phranque - msg:4558050 - 6:51 pm on Mar 24, 2013 (gmt 0)

google does in fact see the different urls as different pages when crawling although when indexing it might recognize it as duplicate content and treat them as one.
but you don't necessarily know if the correct url will be indexed, especially if your content is referring to the wrong one.
the 301 is the correct signal to tell google to use the correct protocol as googlebot will eventually recrawl those incorrectly indexed urls and replace them with the correct url.

helenp - msg:4558053 - 7:03 pm on Mar 24, 2013 (gmt 0)

Thanks phranque.
So I just have to sit down and wait.
I suppose it is a mad idea to remove the https pages in Webmaster Tools; anyway, I don't dare without any references about it.

lucy24 - msg:4558091 - 10:08 pm on Mar 24, 2013 (gmt 0)

but will google search for a robots.txt for ssl when changing from http to https?

It doesn't need to. The rule is set up as a rewrite, so the googlebot doesn't know it is getting a different file.

But, again, this is only necessary if you want to serve up different crawling rules for http and https. If you've got your 301 redirects in place, that should be all you need in the long term.

when indexing it might recognize it as duplicate content and treat them as one

I thought the essence of Duplicate Content was that it doesn't recognize two pages as one-- even in cases where an ordinary human can tell at a glance.

helenp - msg:4558229 - 12:41 pm on Mar 25, 2013 (gmt 0)

Phew, Google will drive me mad...
Before, I had links like this showing up as soft 404 errors:
page.htm/folder/folder/page2.htm
I changed from relative to root-relative links and made fixes in .htaccess; then the duplicated https and http appeared, which I am now sitting and waiting to be fixed.

Now, under HTML Improvements > Duplicate title tags in Webmaster Tools, some pages like this appeared:
/folder/page.htm/page2.htm
When I click on them, they all take me to page.htm.
What is causing this? This is a nightmare.
Maybe it is because I had to add a new widget related to the widget on page.htm, so I added the new content to page.htm, and on page.htm there is a link to the old content under a new filename. I did this because the page has been in Google for many years and the new content is more important than the old; it's higher-level content.
But that is stupid; Google should be able to handle that.

helenp - msg:4558235 - 12:53 pm on Mar 25, 2013 (gmt 0)

There was another duplicated page, also stupid.
I have this page, which exists:
/mysiteaboutwidget.htm
On this page there is some JavaScript to switch between Celsius and Fahrenheit, and a page like this was reported with a duplicate title. This page does not exist, and I don't think anybody would link to it like that:
/mysiteaboutwidget.htm?fahrenheit=&celsius=
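For illustration only: if one wanted to clean that up at the server level, a minimal mod_rewrite sketch could 301 the stray query-string version back to the clean URL. The file name is just the example above, and the trailing "?" on the target is what drops the query string:

RewriteCond %{QUERY_STRING} ^fahrenheit=&celsius=$
RewriteRule ^(mysiteaboutwidget\.htm)$ http://www.mysite.com/$1? [R=301,L]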

g1smd - msg:4558238 - 1:08 pm on Mar 25, 2013 (gmt 0)

Late to the conversation, but an explanation of what your original htaccess does, and its limitations is in order...

The original code redirects:
- http index.html (but not index.php) requests to http, www and "/"
- https index.html (but not index.php) requests to https, www and "/"
- http requests for any hostname other than exactly www.example.com to http and www
- https requests for any hostname other than exactly www.example.com to https and www

This code is good, essential even, however there is nothing in there to force any particular page, using the URL format in your example, to be http or https. You should also add provision for .php to the index redirects.
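For example (illustration only, keeping the pattern style already used in this thread; example.com as a placeholder), the http variant of the index redirect extended to cover index.php as well might look like:

RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.(?:html?|php)\ HTTP/
RewriteRule ^(([^/]+/)*)index\.(?:html?|php)$ http://www.example.com/$1 [R=301,L]

with the same change made in the https variant.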

There is a rule to redirect requests for example.com/file.php/something and example.com/folder/file.php/something to http (and the same for .htm/ and .html/ requests), but the pattern in that rule doesn't match your example URLs.

The usual method is to have www.example.com as http and store.example.com as https, or to define certain folders as https and the rest of the site as http. In your case, unless there's a simple rule you can add to the htaccess to take care of this, it is probably better to do the checking and redirecting from within your PHP script.
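For instance (a sketch only, assuming the store.example.com layout mentioned above), the split could be enforced with host-based conditions:

# secure hostname requested without SSL: push to https
RewriteCond %{HTTPS} !on
RewriteCond %{HTTP_HOST} ^store\.example\.com$ [NC]
RewriteRule (.*) https://store.example.com/$1 [R=301,L]

# main hostname requested over SSL: push back to http
RewriteCond %{HTTPS} on
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule (.*) http://www.example.com/$1 [R=301,L]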

The rules should also change order: 1 - 2 - 3 - 4 - 5 should be 1 - 2 - 5 - 3 - 4.

[edited by: g1smd at 1:24 pm (utc) on Mar 25, 2013]

helenp - msg:4558242 - 1:23 pm on Mar 25, 2013 (gmt 0)

This code is good, essential even, however there is nothing in there to force any particular page to be http or https.

The usual method is to have www.example.com as http and store.example.com as https, or to define certain folders as https and the rest of the site as http. In your case, it is probably better to do the checking and redirecting from within your PHP script.


I'm not sure, but I think that is what I am doing, which I posted in a previous post.
I don't have any folder for https, but I link to the https pages like this: https://www.mysite/page.php

Then, as an emergency solution, I have added this to all pages that should be http:
if ( ! empty( $_SERVER['HTTPS'] ) && $_SERVER['HTTPS'] !== 'off' )
{
$host = $_SERVER['HTTP_HOST'];
$request_uri = $_SERVER['REQUEST_URI'];
$good_url = "http://" . $host . $request_uri;

header( "HTTP/1.1 301 Moved Permanently" );
header( "Location: $good_url" );
exit;
}

and the reverse (if not https, redirect to the https URL) on the pages that should be https.

Is this what you mean? Thanks.

[edited by: helenp at 1:30 pm (utc) on Mar 25, 2013]

helenp - msg:4558243 - 1:24 pm on Mar 25, 2013 (gmt 0)

The rules should change order: 1 - 2 - 3 - 4 - 5 should be 1 - 2 - 5 - 3 - 4.

You just added this. Do you mean in the .htaccess file?

lucy24 - msg:4558328 - 4:39 pm on Mar 25, 2013 (gmt 0)

From first post:
RewriteCond {not 443}
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.mysite.com/$1 [R=301,L]

RewriteCond {443}
RewriteRule ^(([^/]+/)*)index\.html?$ https://www.mysite.com/$1 [R=301,L]

RewriteCond {wrong domain, not 443}
RewriteRule (.*) http://www.mysite.com/$1 [R=301,L]

RewriteCond {wrong domain, 443}
RewriteRule (.*) https://www.mysite.com/$1 [R=301,L]

{get rid of extra path info}
RewriteRule ^((?:[^./]+/)*[^./]+\.(?:html?|php))/ http://www.mysite.com/$1 [R=301,L]


The domain-name-canonicalization rules (3 and 4) should come after absolutely all other redirects.
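Written out in full (the same five rules from the first post, simply reordered 1 - 2 - 5 - 3 - 4, with mysite.com kept as the placeholder), that would be:

# 1 and 2: index redirects
RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.mysite.com/$1 [R=301,L]

RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ https://www.mysite.com/$1 [R=301,L]

# 5: get rid of extra path info
RewriteRule ^((?:[^./]+/)*[^./]+\.(?:html?|php))/ http://www.mysite.com/$1 [R=301,L]

# 3 and 4: non-canonical hostname to www, last of all
RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{HTTP_HOST} !^(www\.mysite\.com)?$
RewriteRule (.*) http://www.mysite.com/$1 [R=301,L]

RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{HTTP_HOST} !^(www\.mysite\.com)?$
RewriteRule (.*) https://www.mysite.com/$1 [R=301,L]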

I think you said in another post that your site parses html as php, so you actually don't have pages in .php except in the https area?

helenp - msg:4558334 - 4:56 pm on Mar 25, 2013 (gmt 0)

I think you said in another post that your site parses html as php, so you actually don't have pages in .php except in the https area?

The reason for parsing .htm as PHP is that there are pages more than 10 years old with the .htm extension, and I didn't want to change them and need 301s.
So there are many pages ending with .php, both in http and in https.
Pages with a lot of PHP in them I make .php, as it's easier to get correct syntax coloring.
