Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
How to stop crawlers from crawling https:// pages
vreelmistee1




msg:3585965
 10:20 am on Feb 27, 2008 (gmt 0)

How can I stop crawlers from crawling https:// pages?

 

webdoctor




msg:3587359
 6:26 pm on Feb 28, 2008 (gmt 0)

Q: Are you serving identical content via both http and https?

If I visit your site and request http://www.example.com/foo.html is the very same content also available at https://www.example.com/foo.html?

vreelmistee1




msg:3589679
 6:18 am on Mar 3, 2008 (gmt 0)

Hi Webdoctor,
Yes. In Google, some of my pages, including the index page, have been crawled under both http:// and https://.
Ex: [mydomain.co.uk...]
https://www.mydomain.co.uk/
[mydomain.co.uk...]
https://www.mydomain.co.uk/product1.html

Both of the URLs above serve the same content.
So how can I stop Google from crawling my pages over https://?
And does this kind of issue have any bad effect on my site or its ranking?

incrediBILL




msg:3615096
 6:29 am on Mar 31, 2008 (gmt 0)

If I follow the problem, you probably have ecommerce on your site and allow the bots to crawl your shopping cart page and/or checkout page, which is where this error typically starts.

From an old Google web page, assuming this is still accurate:

Each port must have its own robots.txt file. In particular, if you serve content via both http and https, you'll need a separate robots.txt file for each of these protocols. For example, to allow Googlebot to index all http pages but no https pages, you'd use the robots.txt files below.

For your http protocol (http://yourserver.com/robots.txt):
User-agent: *
Allow: /

For the https protocol (https://yourserver.com/robots.txt):
User-agent: *
Disallow: /

However, this is a problem if your HTTP and HTTPS sites share the same root directory. In that case you would need a small Perl or PHP script to serve up the proper robots.txt file depending on whether or not the secure server is being used.
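
For anyone wondering what such a script might look like, here is a minimal sketch of the same idea in Python rather than Perl or PHP. It assumes the site can run a WSGI application and that the server sets `wsgi.url_scheme` correctly (it may not if HTTPS is terminated on a proxy in front of the app). The names `robots_body` and `app` are made up for illustration.

```python
# Sketch only: choose the robots.txt body from the request scheme.
# The two bodies are the ones quoted from the Google page above.
HTTP_ROBOTS = "User-agent: *\nAllow: /\n"
HTTPS_ROBOTS = "User-agent: *\nDisallow: /\n"


def robots_body(scheme):
    """Return the robots.txt text for the given URL scheme."""
    return HTTPS_ROBOTS if scheme.lower() == "https" else HTTP_ROBOTS


def app(environ, start_response):
    """Hypothetical WSGI app that answers only /robots.txt."""
    if environ.get("PATH_INFO") != "/robots.txt":
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found\n"]
    # wsgi.url_scheme is "http" or "https" per the WSGI spec (PEP 3333).
    body = robots_body(environ.get("wsgi.url_scheme", "http")).encode("ascii")
    start_response(
        "200 OK",
        [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
    )
    return [body]
```

The web server would still need to route requests for /robots.txt to this script instead of a static file; how to do that depends on the server setup.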

incrediBILL




msg:3615097
 6:31 am on Mar 31, 2008 (gmt 0)

BTW, another solution is to conditionally add a robots meta tag containing "NOINDEX,NOFOLLOW" to the pages being served by the HTTPS server.
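
A rough sketch of that conditional tag, assuming the pages are generated by a Python template step that knows the request scheme. The function names `robots_meta` and `render_head` are hypothetical, and a real site would plug this into whatever templating it already uses.

```python
from html import escape


def robots_meta(scheme):
    """Return a noindex/nofollow meta tag for HTTPS requests, else an empty string."""
    if scheme.lower() == "https":
        return '<meta name="robots" content="noindex,nofollow">'
    return ""


def render_head(title, scheme):
    """Build a minimal <head> element; the robots tag appears only on HTTPS pages."""
    parts = ["<head>", "<title>" + escape(title) + "</title>"]
    tag = robots_meta(scheme)
    if tag:
        parts.append(tag)
    parts.append("</head>")
    return "\n".join(parts)
```

Note that a crawler has to be allowed to fetch the HTTPS page in order to see the meta tag, so this probably should not be combined with a blanket Disallow in the HTTPS robots.txt.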

bilalseo




msg:3643177
 6:41 pm on May 6, 2008 (gmt 0)

I agree with the last post. :) The best approach is to use the robots meta tag (noindex, nofollow).

jdMorgan




msg:3643197
 6:56 pm on May 6, 2008 (gmt 0)

You could also use mod_rewrite to detect the protocol and serve an alternate robots.txt file for HTTPS.

And while you're at it, add some rules so that HTTPS pages are redirected if requested via HTTP, and HTTP pages are redirected if requested via HTTPS. Just one of many "canonicalizations" you should do so that each page on your site is directly-accessible by one and only one URL...

Jim
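
The protocol canonicalization described above is normally written as mod_rewrite rules, but the decision logic itself can be sketched in Python. This is only an illustration under stated assumptions: the list of paths that belong on HTTPS (`SECURE_PREFIXES`) is hypothetical, default ports are assumed, and `canonical_url` is a made-up helper name.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical: paths that should only be reachable over HTTPS.
SECURE_PREFIXES = ("/cart", "/checkout")


def is_secure_path(path):
    """True if the path is one of the secure prefixes or lies beneath one."""
    return any(path == p or path.startswith(p + "/") for p in SECURE_PREFIXES)


def canonical_url(url):
    """Return the URL to redirect to (with a 301), or None if already canonical."""
    parts = urlsplit(url)
    wanted = "https" if is_secure_path(parts.path) else "http"
    if parts.scheme == wanted:
        return None
    # Swap only the scheme; host, path, query and fragment stay the same.
    return urlunsplit((wanted, parts.netloc, parts.path, parts.query, parts.fragment))
```

For example, `canonical_url("https://www.example.com/product1.html")` gives the http:// version of the same page, while a checkout URL requested over http:// is sent to https://.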

vreelmistee




msg:3643584
 6:03 am on May 7, 2008 (gmt 0)

Hey jdMorgan,

Can you please explain which rules to use and how to write them?


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved