
Forum Moderators: martinibuster


Yahoo Site Explorer Bug Wrongly Handling https/http



10:42 am on May 13, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


I have a site that has some CGI scripts that use https. Yahoo has picked up a link from somewhere and thinks I have two subdomains: http://www.example.com and [example.com....]

When I try to delete [example.com...] in SiteExplorer, it then wants to delete all of my site, i.e. all of the pages that use http as well as the [example.com...] index.

As a result, Yahoo seems to be applying a dupe penalty, and my site, which used to be at #1, is now around #50 for the main target keyword term.

Is there a simple 301 redirect solution that will sort this out without harming my ranking elsewhere (I'm #2 on Google for this term), or am I better off allowing Yahoo to wither on the vine?

Many thanks for any pointers.




11:24 am on May 14, 2008 (gmt 0)

10+ Year Member

Hi Sid,

I've had this exact same problem for over a year. I've also got a robots.txt file on my https site that denies indexing, yet Yahoo still indexes the https as a sub-domain.
As for a redirect, I've had a server-wide 301 redirect from non-www to www for years now, and still Yahoo indexes some of my URLs with the non-www prefix. Yahoo's capability of displaying and following 301 redirects is very poor in my experience; they seem to display URLs from links rather than what the server is telling the bot to do.
Trying to locate the offending page that links in to the https site is the only solution that I can suggest.
I'd greatly appreciate any help if anyone else has detailed information on how to remove these, as Yahoo still has a strong following in some parts of the world.



1:00 pm on May 14, 2008 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member

The use of the word "subdomain" in this thread is misleading. No matter what Yahoo is displaying or how it is displaying it, [example.com...] is not in any way a subdomain of http://example.com. HTTPS is a different protocol than HTTP, and is not related to domains or subdomains at all.

Best practices to avoid this problem:

1) When linking from a secure page to a non-secure page, use a canonical URL, e.g. <a href="http://example.com/non-secure-page">, rather than page-relative or server-relative links (don't use <a href="non-secure-page">, <a href="../non-secure-page"> or <a href="/pages/non-secure-page"> )

2) When linking from a non-secure page to a secure page, again use a canonical URL, e.g. <a href="https://example.com/secure-page">

3) Server-side, detect non-secure (i.e. HTTP) access requests to secure pages, and redirect them to change the protocol to HTTPS.

4) Server-side, detect secure (i.e. HTTPS) access requests to non-secure pages, and redirect them to change the protocol to HTTP.

5) If your secure and non-secure pages are actually stored on the server in different filespaces, then use robots.txt files to tell search engines to stay away from https unless it's appropriate.

Points 3 and 4 are easy if you tag your secure and non-secure page URLs in some way, for example, making them appear to be in separate directories under your site root. This makes determining which protocol should be used to access them very easy using Apache mod_rewrite or ISAPI Rewrite on IIS.
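To make points 3 and 4 concrete, here is a rough sketch of what the mod_rewrite rules might look like in a site-root .htaccess, assuming (hypothetically — this is not anyone's actual layout in this thread) that the secure pages are tagged by a /secure/ path prefix:

```apache
RewriteEngine On

# 3) Plain-HTTP request for a secure page: redirect to HTTPS
RewriteCond %{HTTPS} !on
RewriteRule ^secure/ https://www.example.com%{REQUEST_URI} [R=301,L]

# 4) HTTPS request for a non-secure page: redirect to HTTP
RewriteCond %{HTTPS} on
RewriteCond %{REQUEST_URI} !^/secure/
RewriteRule ^ http://www.example.com%{REQUEST_URI} [R=301,L]
```

The 301 status tells search engines the redirect is permanent, so they should drop the wrong-protocol URL from the index rather than keep both.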



2:30 pm on May 14, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Many thanks for taking the trouble to reply.

Vimes, I've been through with a fine-tooth comb and did find a couple of relative links that were in effect pointing to [....] I've cleaned these out.

JD, I hoped that you might pick up on this thread. I think, from what you have said, that my best solution might be to do the server-side detection that you detail in 4. I have a snippet included in my Perl CGI files like so.

$inboundurl = $ENV{'SCRIPT_URI'};
$inboundurl =~ s/^http:/https:/;

if ($ENV{'SCRIPT_URI'} =~ m/^http:/) {
    # permanent redirect to the https version of the same URL
    print "Status: 301 Moved Permanently\n";
    print "Location: $inboundurl\n\n";
    exit;
}
This is easy in the Perl scripts but until recently would have been impossible (I think) on what were static HTML files in the rest of the site. I have recently started to change my static pages and have used the instructions that you gave elsewhere to have .html files parsed as PHP, mainly to do simple includes. I've not changed all of my pages over yet but will be doing so in the next couple of weeks. I am therefore now in a position to include some code to detect the request protocol in my pages.

I hope that you don't mind me asking a couple of cheeky questions. Please feel free to tell me to find out for myself, but if you have answers to hand they would be very much appreciated.

Can you suggest some PHP to do the opposite of what I am doing above?

Do you think that Yahoo might misinterpret this as cloaking?

Many thanks


PS I've pointed out to Yahoo that http and https are just different protocols. They suggested that this was a bug in SiteExplorer.


9:42 pm on May 16, 2008 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

There's a related thread in the Google Search forum right now, with a good approach that would work for all search engines. It's not just Yahoo that treats different protocols as different URLs when they both resolve.



10:15 pm on May 16, 2008 (gmt 0)

WebmasterWorld Senior Member pageoneresults is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

On IIS, you can nip this one in the bud real quick with regard to http vs. https getting indexed...

RewriteCond %HTTPS ^on$
RewriteRule /robots.txt /robots.https.txt [I,O,L]

The robots.https.txt file...

User-agent: *
Disallow: /

All of the SEs have had challenges in this area, and so have webmasters. It's an ongoing challenge, as many are just becoming aware of these types of issues. I remember doing site: searches in Google and finding home pages indexed under the https protocol.
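For anyone on Apache rather than IIS, a rough equivalent of the same trick (a sketch only, assuming mod_rewrite is enabled and a robots.https.txt file with the Disallow rules above sits in the site root):

```apache
RewriteEngine On

# Serve the disallow-everything robots file for HTTPS requests only;
# plain-HTTP requests for /robots.txt are untouched
RewriteCond %{HTTPS} on
RewriteRule ^robots\.txt$ /robots.https.txt [L]
```

Because this is an internal rewrite rather than a redirect, a crawler fetching https://www.example.com/robots.txt simply sees the blocking file, while the http version of robots.txt is unaffected.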


2:48 am on May 17, 2008 (gmt 0)

WebmasterWorld Senior Member vincevincevince is a WebmasterWorld Top Contributor of All Time 10+ Year Member

I think the proper way to do this is through your httpd.conf file, where you define your VirtualHosts. If you use different document roots for the different protocols (identified by listen port, 80 vs. 443), then the only files which can be accessed via https:// will be the secure files.
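A minimal sketch of that layout (the document roots and certificate paths here are hypothetical placeholders):

```apache
Listen 80
Listen 443

# Plain-HTTP site: only the non-secure files live here
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/public
</VirtualHost>

# HTTPS site: a separate document root holding only the secure pages
<VirtualHost *:443>
    ServerName www.example.com
    DocumentRoot /var/www/secure
    SSLEngine on
    SSLCertificateFile /path/to/cert.pem
    SSLCertificateKeyFile /path/to/key.pem
</VirtualHost>
```

With separate roots, an https:// request for a non-secure page simply 404s instead of producing a duplicate of the http content.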


3:38 am on May 17, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Nice call, Ted, this error has been cropping up on websites I have seen lately.


3:22 pm on May 18, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


Many thanks to all that contributed to this thread.

I've been away for a couple of days but have now implemented the second robots.txt file and .htaccess solution as suggested. This seems like a very simple and elegant solution. Paranoia makes me hope that it doesn't have any unexpected results.

Thanks again


PS I'll report back in a couple of weeks.


8:17 am on May 19, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

As if by magic, this morning there are no https URLs listed for my site in SiteExplorer, except that when I click on the "2 sub domains" link the site root is still there as an https. When I "Explore" that, there are no inlinks and no other https URLs listed.

Hopefully the penny will fully drop with Yahoo in due course.



