Home / Forums Index / Yahoo / Yahoo Search Engine and Directory
Forum Library, Charter, Moderators: martinibuster

Yahoo Search Engine and Directory Forum

    
Yahoo Site Explorer Bug Wrongly Handling https/http
Hissingsid
10:42 am on May 13, 2008 (gmt 0)

Hi,

I have a site with some CGI scripts that use https. Yahoo has picked up a link from somewhere and now thinks I have two subdomains: http://www.example.com and https://www.example.com.

When I try to delete https://www.example.com in Site Explorer, it then wants to delete all of my site, i.e. all of the pages that use http as well as the https://www.example.com index.

As a result, Yahoo seems to be applying a duplicate-content penalty, and my site, which used to be at #1 for the main target keyword term, is now around #50.

Is there a simple 301 redirect solution that will sort this out without harming my rankings elsewhere (I'm #2 on Google for this term), or am I better off letting Yahoo wither on the vine?

Many thanks for any pointers.

Cheers

Sid

 

Vimes
11:24 am on May 14, 2008 (gmt 0)

Hi Sid,

I've had this exact same problem for over a year. I also have a robots.txt file on my https site that disallows indexing, yet Yahoo still indexes the https URLs as a subdomain.
As for redirects: I've had a server-wide 301 redirect from non-www to www for years now, and Yahoo still indexes some of my URLs with the non-www prefix. Yahoo's handling of 301 redirects is very poor in my experience; it seems to display URLs as it finds them in links rather than doing what the server tells the bot to do.
The only solution I can suggest is to try to locate the offending page that links in to the https site.
I'd greatly appreciate it if anyone has detailed information on how to remove these, as Yahoo still has a strong following in some parts of the world.

Vimes.

jdMorgan
1:00 pm on May 14, 2008 (gmt 0)

The use of the word "subdomain" in this thread is misleading. No matter what Yahoo is displaying or how it is displaying it, https://example.com is not in any way a subdomain of http://example.com. HTTPS is a different protocol from HTTP, and is not related to domains or subdomains at all.

Best practices to avoid this problem:

1) When linking from a secure page to a non-secure page, use a canonical URL, e.g. <a href="http://example.com/non-secure-page">, rather than page-relative or server-relative links (don't use <a href="non-secure-page">, <a href="../non-secure-page"> or <a href="/pages/non-secure-page"> )

2) When linking from a non-secure page to a secure page, again use a canonical URL, e.g. <a href="https://example.com/secure-page">

3) Server-side, detect non-secure (i.e. HTTP) access requests to secure pages, and redirect them to change the protocol to HTTPS.

4) Server-side, detect secure (i.e. HTTPS) access requests to non-secure pages, and redirect them to change the protocol to HTTP.

5) If your secure and non-secure pages are actually stored on the server in different filespaces, then use robots.txt files to tell search engines to stay away from https unless it's appropriate.

Points 3 and 4 are easy if you tag your secure and non-secure page URLs in some way, for example, making them appear to be in separate directories under your site root. This makes determining which protocol should be used to access them very easy using Apache mod_rewrite or ISAPI Rewrite on IIS.
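For example, assuming you tag your secure pages with a /secure/ path prefix (substitute whatever scheme you actually use), points 3 and 4 might be sketched like this in an Apache .htaccess file. This is illustrative only and untested; adjust the hostname and prefix to your setup:

```apache
# Sketch only -- "/secure/" is a placeholder for however you tag secure URLs.
RewriteEngine On

# 3) Non-secure request for a secure page: redirect to HTTPS
RewriteCond %{HTTPS} !=on
RewriteRule ^secure/(.*)$ https://www.example.com/secure/$1 [R=301,L]

# 4) Secure request for a non-secure page: redirect to HTTP
RewriteCond %{HTTPS} =on
RewriteCond %{REQUEST_URI} !^/secure/
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The R=301 flag makes the redirects permanent, which is what you want search engines to see.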

Jim

Hissingsid
2:30 pm on May 14, 2008 (gmt 0)

Many thanks for taking the trouble to reply.

Vimes I've been through with a fine tooth comb and did find a couple of relative links that were in effect pointing to https://. I've cleaned these out.

JD, I hoped that you might pick up on this thread. I think, from what you have said, that my best solution might be the server-side detection that you detail in point 4. I have a snippet included in my Perl CGI files, like so:

##################################
# If this secure script was requested over plain http,
# send a 301 redirect to the https version of the same URL.
$inboundurl = $ENV{'SCRIPT_URI'};

if ($inboundurl =~ m/^http:/) {
    $inboundurl =~ s/^http:/https:/;
    print "Status: 301 Moved Permanently\n";
    print "Location: $inboundurl\n\n";
    exit;
}
##################################

This is easy in the Perl scripts, but until recently it would have been impossible (I think) in the static HTML files that make up the rest of the site. I have recently started converting my static pages, using the instructions you gave elsewhere for having .html files parsed as PHP, mainly to do simple includes. I've not changed all of my pages over yet but will be doing so in the next couple of weeks, so I am now in a position to include some code to detect the request protocol in my pages.

I hope that you don't mind me asking a couple of cheeky questions. Please feel free to tell me to find out for myself, but if you have answers to hand they would be very much appreciated.

Can you suggest some PHP to do the opposite of what I am doing above?
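To show the sort of thing I mean, my first rough stab for a non-secure page would be something like this — completely untested, and assuming the server sets the usual HTTPS environment variable:

```php
<?php
// Untested sketch: if this non-secure page was requested over https,
// send a 301 redirect back to the http version of the same URL.
if (isset($_SERVER['HTTPS']) && $_SERVER['HTTPS'] == 'on') {
    header('HTTP/1.1 301 Moved Permanently');
    header('Location: http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI']);
    exit;
}
?>
```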

Do you think that Yahoo might misinterpret this as cloaking?

Many thanks

Sid

PS I've pointed out to Yahoo that http and https are just different protocols. They suggested that this was a bug in SiteExplorer.

tedster
9:42 pm on May 16, 2008 (gmt 0)

There's a related thread in the Google Search forum right now, with a good approach that would work for all search engines. It's not just Yahoo that treats different protocols as different URLs when they both resolve.

[webmasterworld.com...]

pageoneresults
10:15 pm on May 16, 2008 (gmt 0)

On IIS, you can nip this one in the bud real quick in regards to http vs https getting indexed...

RewriteCond %HTTPS ^on$
RewriteRule /robots.txt /robots.https.txt [I,O,L]

The robots.https.txt file...

User-agent: *
Disallow: /
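For Apache folks, a roughly equivalent .htaccess sketch, assuming mod_rewrite is available and a robots.https.txt file as above — untested:

```apache
# Untested sketch: serve the blocking robots file for requests made over https.
RewriteEngine On
RewriteCond %{HTTPS} ^on$
RewriteRule ^robots\.txt$ /robots.https.txt [L]
```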

All of the SEs have had challenges in this area, and so have webmasters. It's an ongoing challenge, as many are just becoming aware of these types of issues. I remember doing site: searches in Google and finding home pages indexed under the https protocol.

vincevincevince
2:48 am on May 17, 2008 (gmt 0)

I think the proper way to do this is through your httpd.conf file, where you define your VirtualHosts. If you use different document roots for the different protocols (identified by listen port, 80 vs. 443), then the only files which can be accessed via https:// will be the secure files.
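A bare-bones sketch of what I mean — the paths and certificate names here are placeholders, not anything from a real config:

```apache
# httpd.conf sketch -- paths and certificate names are placeholders.
<VirtualHost *:80>
    ServerName www.example.com
    DocumentRoot /var/www/public
</VirtualHost>

<VirtualHost *:443>
    ServerName www.example.com
    DocumentRoot /var/www/secure
    SSLEngine on
    SSLCertificateFile /etc/ssl/certs/example.crt
    SSLCertificateKeyFile /etc/ssl/private/example.key
</VirtualHost>
```

With separate DocumentRoots, an https request simply cannot reach a non-secure file, so there is nothing for the engines to duplicate.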

CainIV
3:38 am on May 17, 2008 (gmt 0)

Nice call, Ted, this error has been cropping up on websites I have seen lately.

Hissingsid
3:22 pm on May 18, 2008 (gmt 0)

Hi,

Many thanks to all that contributed to this thread.

I've been away for a couple of days but have now implemented the second robots.txt file and .htaccess solution as suggested. This seems like a very simple and elegant solution. Paranoia makes me hope that it doesn't have any unexpected results.

Thanks again

Sid

PS I'll report back in a couple of weeks.

Hissingsid
8:17 am on May 19, 2008 (gmt 0)

As if by magic, this morning there are no https URLs listed for my site in Site Explorer, except that when I click on the "2 subdomains" link the site root is still there as https. When I "Explore" that, there are no inlinks and no other https URLs listed.

Hopefully the penny will fully drop with Yahoo in due course.

Cheers

Phil


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved