http:// and https:// - Robots.txt File - Sitemaps, Meta Data, and robots.txt forum at WebmasterWorld

Forum Moderators: goodroi

http:// and https:// - Robots.txt File

Need help

bilalseo

8:32 am on Nov 21, 2009 (gmt 0)

Dear Readers and Webmasters

I have run into a problem with my search engine rankings. My rankings dropped from the first page to nowhere. The ranking was stable with the domain http://www.example.com, and then suddenly, after a fresh crawl, it went down and the site now shows up with https:// in Google. I don't know why Google indexed https://, but the whole ranking went down and down... :(

My question is: what should I do to get Google to crawl the old http:// URLs again, and to stop it from visiting and indexing the site under https://?

One option is a robots.txt file. I'm not keen on using robots.txt, and if that is the way to handle it, which user-agent should I use?

Another option is to wait for a sunny day ;)

Looking forward to your replies... guys, I'm very worried.

Thanks,

Bilal

[edited by: engine at 4:20 pm (utc) on Nov. 23, 2009]
[edit reason] examplified [/edit]

jdMorgan

1:32 pm on Nov 21, 2009 (gmt 0)

Usually, http and https are set up as two different 'servers' or 'accounts' in your web hosting. If this is the case, then put a robots.txt file into your https/SSL server with:

# Disallow all robots from fetching all resources
User-Agent: *
Disallow: /

Another option is to detect unwanted requests for https and redirect them using a 301-Moved Permanently redirect to the canonical URL using the http protocol instead.

How you'd do that depends on your server type and version, your coding skills, and your preferences.
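As a rough illustration only, here is a minimal sketch of such a redirect, assuming a Python WSGI application. The names `force_http` and `canonical_http_url` are hypothetical, and the sketch assumes the front-end server sets `wsgi.url_scheme` correctly for SSL requests. The same idea can be expressed in whatever redirect mechanism your own server offers.

```python
from urllib.parse import quote


def canonical_http_url(environ):
    """Rebuild the requested URL with the http scheme (hypothetical helper)."""
    host = environ.get("HTTP_HOST") or environ["SERVER_NAME"]
    # Drop the default SSL port so the http URL does not point at :443.
    if host.endswith(":443"):
        host = host[:-4]
    path = quote(environ.get("SCRIPT_NAME", "") + environ.get("PATH_INFO", "")) or "/"
    query = environ.get("QUERY_STRING", "")
    return "http://" + host + path + ("?" + query if query else "")


def force_http(app):
    """Wrap a WSGI app so https requests get a 301 to the http URL."""
    def middleware(environ, start_response):
        if environ.get("wsgi.url_scheme") == "https":
            headers = [("Location", canonical_http_url(environ)),
                       ("Content-Length", "0")]
            start_response("301 Moved Permanently", headers)
            return [b""]
        # Plain http requests pass through untouched.
        return app(environ, start_response)
    return middleware
```

Note that this redirects every https request, so it only suits a site that has no pages that must be served over SSL.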

Why did Google index those URLs? Because you allowed it to do so, and someone, somewhere linked to the https 'version' of your site -- either accidentally or maliciously.

Jim

dstiles

9:40 pm on Nov 21, 2009 (gmt 0)

I force all my mixed SSL/non-SSL sites to non-SSL with the Canonical tag. Seems to work.
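For readers unfamiliar with the canonical tag, a minimal sketch of generating one that always points at the http URL might look like this. The function name `canonical_link` is hypothetical, and the sketch assumes the http and https sites share the same host name and paths.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link(requested_url):
    """Return a canonical <link> element pointing at the http version of a URL."""
    parts = urlsplit(requested_url)
    # Keep host, path and query; force the scheme to http and drop any fragment.
    canonical = urlunsplit(("http", parts.netloc, parts.path or "/", parts.query, ""))
    return '<link rel="canonical" href="%s">' % escape(canonical, quote=True)


print(canonical_link("https://www.example.com/products?id=7"))
```

The resulting element goes in the page's head section, on both the http and the https copy of the page.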

I also detect the user-agent and if it's not obviously a browser I kill SSL links. As far as I'm concerned my sites' SSL pages should not be visible until the ordering process begins, so they are of no concern to SEs.

bilalseo

4:14 pm on Nov 23, 2009 (gmt 0)

Thanks guys, for the great and useful replies :) Thanks a lot.

Bilal

enigma1

12:43 pm on Nov 30, 2009 (gmt 0)

I don't like the idea of forcing redirects from SSL to non-SSL, even for spiders, as they will then index the page with http. Secure scripts have to run in https and should have specific code to verify this.

In other words, if I run a store, I don't want a customer to reach the create-account page over http. One way around it is to use noindex/nofollow meta tags and to add the rel=nofollow attribute to the links that point to the various SSL pages I don't want SEs to index. Or use some text form instead of links, so spiders cannot follow.

bilalseo

6:20 pm on Nov 30, 2009 (gmt 0)

Thanks guys, the problem is fixed by adding two different robots.txt files. One robots.txt file allows crawling, and the other tells spiders not to index pages under https://. This can also be solved with a single robots.txt file, but to avoid the hassle I just used two different robots.txt files :)
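The two-file setup can be sketched as follows. This is only an illustration, assuming a Python handler that knows the request scheme; the names `robots_txt_for` and `allows` are hypothetical, and the exact rules used on the site are not given in the thread. The check uses the standard library's `urllib.robotparser` to confirm how a crawler would read each file.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow permits everything; "Disallow: /" blocks everything.
HTTP_RULES = "User-agent: *\nDisallow:\n"
HTTPS_RULES = "User-agent: *\nDisallow: /\n"


def robots_txt_for(scheme):
    """Pick the robots.txt body for the scheme the request arrived on."""
    return HTTPS_RULES if scheme == "https" else HTTP_RULES


def allows(rules, url, agent="Googlebot"):
    """Report whether the given rules let the agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


print(allows(robots_txt_for("http"), "http://www.example.com/page"))
print(allows(robots_txt_for("https"), "https://www.example.com/page"))
```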

Thanks Guys

Bilal

dstiles

10:01 pm on Nov 30, 2009 (gmt 0)

Enigma1: Note my second paragraph.

Basically, on my sites only sensitive pages (eg payment) are fed via SSL. Any other page (eg product, "about") is publicly visible and fed via non-SSL and available to visitors and SE bots.

Exceptions are T&C, AUP and suchlike, which I feel are seldom of any business to SEs. Hence I remove links to those pages entirely, and to (eg) carts and payment pages, if I sniff an SE bot IP - in fact anything that does not seem to be a browser.

Customers are switched between SSL/non-SSL according to page sensitivity so should never see (eg) a standard product page via SSL and vice versa.

The canonical is to ensure that if some toolbar (whatever) feeds an SSL page to an SE and it tries to follow it with a bot then the page is correctly reassigned to non-SSL. If the bot tries to read a cart or payment page it gets fed a 405 or similar. Ditto for most contact forms.
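A rough sketch of that kind of gatekeeping, for illustration only: it sniffs the user-agent string rather than the IP address, and the path prefixes and bot tokens below are made-up assumptions, not the rules actually used on these sites.

```python
# Hypothetical path prefixes for pages that bots should never read.
SENSITIVE_PREFIXES = ("/cart", "/payment", "/contact")
# Illustrative substrings only; real bot detection needs a fuller list.
BOT_TOKENS = ("bot", "crawler", "spider", "slurp")


def looks_like_bot(user_agent):
    """Treat a missing user-agent, or one containing a bot token, as a bot."""
    ua = (user_agent or "").lower()
    return not ua or any(token in ua for token in BOT_TOKENS)


def status_for(path, user_agent):
    """Return 405 for bots requesting sensitive pages, otherwise 200."""
    if path.startswith(SENSITIVE_PREFIXES) and looks_like_bot(user_agent):
        return 405
    return 200
```

User-agent strings are trivial to fake, which is one reason to combine a check like this with IP-based detection.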

You cannot bet on SEs getting it right. Some of them (alright, ALL of them) are sometimes very invasive and one has to resort to "firewall" and other methods of rebuttal.