Forum Moderators: Robert Charlton & goodroi

How to remove already cached https pages

AjiNIMC

5:16 am on Jan 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Some of my https pages got cached because of a small bug (ignorance) with robots.txt, and they can act as duplicate content. I would like to remove them from the Google listings. What's the fastest way?

I have done the following:

  1. Added a robots.txt for https, blocking all content.
  2. Added an entry in the Webmaster Tools removal section for the pages (it is still in pending status). We had to add the https version as a new domain, as described in the Google Webmaster Tools help.

What else can be done to remove these pages from the cache and avoid duplicate content penalties?

Thanks,
AjiNIMC

Robert Charlton

5:45 am on Jan 12, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



In my experience, robots.txt offers limited help for this issue... and if Google is already seeing "https-creep" throughout your site, you're going to need to use mod_rewrite (on Apache) or ISAPI_Rewrite on Windows.
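A minimal sketch of that mod_rewrite approach (Apache .htaccess; example.com and the /secure/ path are placeholders, and it assumes the http version is canonical for everything outside the secure area):

```apache
RewriteEngine On
# requests arriving on the SSL port (443) for pages that are not
# meant to be secure get a 301 back to the canonical http URL
RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{REQUEST_URI} !^/secure/
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Once the 301s are in place, Google re-crawls the https URLs, follows the redirects, and drops the duplicates from the index over time.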

There's a section on Duplicate Content in Hot Topics, which is pinned to the top of the Google Search Forum home page. Take a look particularly at this thread....

HTTPS versus HTTP [webmasterworld.com] - one more duplicate area

AjiNIMC

5:56 am on Jan 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



mod_rewrite (on Apache)

Yeah, I used .htaccess to serve a different robots.txt for https, and it is working fine now.
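For reference, such a rewrite might look something like this (a sketch with assumed file names, not necessarily the exact rule used here):

```apache
RewriteEngine On
# hand the secure port a separate, disallow-everything robots file
RewriteCond %{SERVER_PORT} ^443$
RewriteRule ^robots\.txt$ /robots_ssl.txt [L]
```

where robots_ssl.txt contains:

```
User-agent: *
Disallow: /
```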

I will check the topic. I can't do any canonicalization, as I want both http and https to serve the same content for a better user experience.

Let me check the topic before adding more to it.

AjiNIMC

5:57 am on Jan 12, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I saw the topic; it covers the same ground, but what is a better way of removing the pages from the cache and the listings?

Thanks for the help.
AjiNIMC

Robert Charlton

7:20 pm on Jan 12, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I can't do any canonicalization as I want both http and https to appear with the same content for a better user experience.

If you insist on this, you won't have a better search engine experience. The same content on more than one URL is essentially the definition of duplicate content. I'd read through all those Hot Topics articles on dupe content carefully.

...what is a better way of removing it from cache and listing?

Again... mod_rewrite.

AjiNIMC

3:20 pm on Jan 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you insist on this, you won't have a better search engine experience. The same content on more than one URL is essentially the definition of duplicate content. I'd read through all those Hot Topics articles on dupe content carefully.

I sometimes wonder (if they are not doing it already) why search engines can't recognize the common mistakes a webmaster might make with duplicate content. The last time I read Matt, he said they do apply intelligence at that level.

Again... mod_rewrite.

It will take some time to take effect, right?

Thanks for the replies.

AjiNIMC

Miamacs

3:44 pm on Jan 13, 2008 (gmt 0)

10+ Year Member



robots.txt will only stop Google from crawling the page, i.e. refreshing the data in its cache. If it's already in there, it'll stay there even after you add the disallow.

URL removal only works with URLs that are otherwise served (or not served) in a way that signals they should NOT be indexed (e.g. a 404/410 status, a ROBOTS NONE meta, and such). It only speeds up the process; it won't initiate it. Google lists robots.txt as one of the 'proper' signals, but experience shows it's much slower than in-page directives or NOT FOUND status codes.

If you don't like mod_rewrite, here's an HTML/programming solution...

1. Remove the robots.txt disallows so that Google takes notice of the changes...
2. ...and add NOINDEX, NOARCHIVE, or any other synonyms (dynamically).
3. When Google crawls the pages, they'll drop out.
4. If the cache gets updated, or the indexed dupes don't fall out (they will, though), THEN use the URL removal tool.
5. To be on the safe side, and to save bandwidth, once they're out you could add the robots.txt disallows again.
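Step 2 might look like this in the page head, emitted only when the request arrives over https (how you detect that depends on your server-side setup):

```html
<!-- on the https copy of the page only -->
<meta name="robots" content="noindex,noarchive">
```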

...

Robert Charlton

6:50 pm on Jan 13, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I sometimes wonder (if they are not doing already) why can't search engines understand the common mistakes a webmaster might do with dups content.

If you're serving up both versions of the content intentionally, how can they possibly know?

By adding the WMT preferences, Google has taken one step toward reading your mind... but you've still got to contend with site visitors who might want to link to you. If multiple versions of your site are available, visitors who like your site are likely to link to whichever version they happen to see. This perpetuates the error. Even if you've successfully blocked search engine bots from the https version, visitors still see it, and you risk splitting your inbound link votes.

Going back to your previous comment about user experience...

I can't do any canonicalization as I want both http and https to appear with the same content for a better user experience.

It seems what you really want is for the user to find your site whether they type in https or http. This is, in fact, what permanent redirection with mod_rewrite would accomplish. What would change is that the incorrect version no longer appears in the address bar. With a proper setup, pages that are supposed to be https would continue to be https, and pages that are supposed to be http would be displayed that way.

With a proper setup of DNS and mod_rewrite, you can even correct typos in the number of w's in "www", so "ww" or "wwww", e.g., would be rewritten to "www". Etc....
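A sketch of that hostname correction (example.com is a placeholder, and DNS must already resolve the typo hostnames to your server):

```apache
RewriteEngine On
# any host other than www.example.com -- including "ww" or "wwww"
# typos -- gets a 301 to the canonical hostname
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```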

AjiNIMC

7:06 pm on Jan 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you're serving up both versions of the content intentionally, how can they possibly know?

Match the content; if a site is serving the same content on both, consider it to be the same without imposing penalties.

It seems what you really want is for the user to find your site whether they type in https or http

It's not about that, but about avoiding the IE warning when you shift from http to https and vice versa.

AjiNIMC

tedster

7:15 pm on Jan 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Match the content; if a site is serving the same content on both, consider it to be the same without imposing penalties.

You're not getting a penalty when PR is split between two URLs with the same content. You're just getting back a true reflection of what your server is doing. Any search engine needs to rank by the URL, not just by the "content". That is the technical nature of the web. Clarifying the technical side of your website is important if you want to be "heard" clearly and unambiguously.

Robert Charlton

11:17 pm on Jan 13, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



It's not about that, but about avoiding the IE warning when you shift from http to https and vice versa.

I assume you're talking about the warning message that says that the name on the security certificate is invalid or does not match the name on the site, and asks you to click yes if you want to proceed.

This warning is most likely an indicator that you have other canonical problems on your site as well. E.g., if you had a www canonical issue and had bought your certificate for [example.com...] the certificate would not be valid on [example.com...] and you would get the message for anyone accessing your secure pages without the www --

Here's a thread that discusses that in more detail:

SSL Certificate problems
SSL not showing correctly
[webmasterworld.com...]

This is another reason to clean up your canonical issues.

g1smd

11:56 pm on Jan 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have used the "alternative https/ssl robots file served using a rewrite" method before. It is not ideal, in several ways.

It doesn't prevent people seeing the "wrong" URL in their browser and then copying and pasting it and posting it as a link elsewhere.

Vimes

4:03 am on Jan 14, 2008 (gmt 0)

10+ Year Member



Hi,

If it's not about the typing of https:// or http://, and all you want to do is speed up the removal of these https:// pages from the indices, it's a little more work of course, but you could make the secure site a subdomain [secure.example.com...] and place a new disallow-all robots.txt file on that, then make [example.com...] or [example.com...] return a 404/410 page. This will be a lot quicker at removing those files.
But it does sound as if you have other canonical issues that should also be fixed while you are doing this.

Vimes.

AjiNIMC

5:24 am on Jan 14, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you go to IE >> Tools >> Options >> Advanced and check this option under Security:

"Warn if changing between secure and not secure mode"

Then when you click on an https link from an http page, it says, "You are about to view pages over a secure connection" (click here to continue, or something).

And when you click on an http link from an https page, it says:
"You are about to leave a secure Internet connection. It will be possible for others to view information you send."

For a customer who is confused about what a browser even is, this is confusing stuff. As a marketer I would like to avoid such things as much as possible. Since I am dealing with a lot of $ on the site, it is even more painful to hear customers tell you that your website is giving errors (which are basically these warnings).

AjiNIMC

Robert Charlton

5:45 am on Jan 14, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If you go to IE >> Tools >> Options >> Advanced and check this option under Security

AjiNIMC - You can't assume that any of your customers will go into IE and check or uncheck anything. We're suggesting you fix it on your server.

I feel your pain about how steep the learning curve on all this is. You do need to do the canonicalization you're resisting; it's the only dependable way I know of. I've also had no luck with the "https/ssl robots file" approach.

One thing I should add, btw... the images on your secure pages also need to be served from a secure server. If your pages are on a secure server but their images aren't, that will also trigger a warning message.
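One general way to avoid that (not specific to this site): reference images with relative URLs so they inherit the page's scheme:

```html
<!-- an absolute http:// URL triggers the warning on an https page -->
<img src="http://www.example.com/images/logo.gif" alt="logo">
<!-- a relative URL is fetched over https when the page is https -->
<img src="/images/logo.gif" alt="logo">
```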

AjiNIMC

6:16 am on Jan 14, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Robert, you are not getting me. I am not getting any error messages on either http or https, but when people shift from one to the other it creates a problem.

For example, people search for XYZ and land on http pages. If they click on /signup/ on http itself, there is no warning sign.

But if they search for XYZ, land on http pages, and click on /signup/ on https, then it prompts the above message, which sometimes scares the visitor into aborting.

I am in an urgent meeting so I'm not able to post details; I will do that later tonight.

Thanks,
AjiNIMC