Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Best way to prevent indexing of https pages
hippypink




msg:4035786
 11:46 pm on Dec 2, 2009 (gmt 0)

Here are some things I think might work, with pros and cons, though to be honest I don't know how to do some of them.

1. Using a wildcard in the robots.txt file. Probably the easiest option; however, as far as I know this is only respected by Google (and not Bing / Yahoo). Robots.txt isn't always reliable either.

2. Writing a PHP script which dynamically inserts noindex, nofollow meta tags. I don't want to have to edit an entire site to add stuff like this.
Similarly, using PHP to deliver two different robots.txt files (one for secure, and one for non-secure). Again, way too much work.

3. Using 301 redirects. Even messier.

4. Placing https site in different folder. Not really an option for me.

5. X-Robots-Tag in htaccess. This would be great, if I knew an htaccess rule that allowed me to do this. I am fairly new to htaccess stuff.

Can someone provide a working example?
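
The thread itself never posts the htaccess rule. On Apache the header would normally be set with a mod_headers `Header set X-Robots-Tag ...` rule scoped to secure requests. As a rough illustration of the same idea (option 5), here is a minimal Python sketch that adds the header only to responses served over HTTPS. The names `NoIndexOnHttps`, `demo_app` and `response_headers` are made up for this example, and it assumes the server reports the scheme correctly in `wsgi.url_scheme` (behind a proxy that may not be the case).

```python
from wsgiref.util import setup_testing_defaults


class NoIndexOnHttps:
    """WSGI middleware: add X-Robots-Tag to every response served over HTTPS."""

    def __init__(self, app, directives="noindex, nofollow"):
        self.app = app
        self.directives = directives

    def __call__(self, environ, start_response):
        # Assumption: the server sets wsgi.url_scheme to "https" for secure requests.
        secure = environ.get("wsgi.url_scheme") == "https"

        def patched_start_response(status, headers, exc_info=None):
            if secure:
                # Drop any existing X-Robots-Tag so the header is not duplicated.
                headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
                headers.append(("X-Robots-Tag", self.directives))
            return start_response(status, headers, exc_info)

        return self.app(environ, patched_start_response)


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


def response_headers(app, scheme):
    """Call a WSGI app once with the given scheme and return its headers as a dict."""
    environ = {}
    setup_testing_defaults(environ)
    environ["wsgi.url_scheme"] = scheme
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured.update(headers)

    list(app(environ, start_response))
    return captured
```

Wrapping the whole application this way avoids editing individual pages, which was the objection to the meta tag approach in option 2.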

 

goodroi




msg:4036895
 12:49 pm on Dec 4, 2009 (gmt 0)

this older thread will probably give you some good insights
[webmasterworld.com...]

hippypink




msg:4037179
 7:21 pm on Dec 4, 2009 (gmt 0)

Yes, that is great info, and I just implemented it.
However, it is known that Google will still index a given page if there are enough links to it, and ignore the robots.txt rules.

Perhaps this is a good thread to discuss the X-Robots-Tag and how to implement it for the same purpose, just so there's some documentation on the subject?
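
Whichever way the header ends up being set, it helps to confirm that the secure pages really send it. A small Python sketch of that check is below: given the X-Robots-Tag values from a response, it reports whether any of them asks for no indexing. The function name `forbids_indexing` is made up for this example, and the parsing is deliberately simplified: it treats a crawler-scoped value such as `googlebot: noindex` the same as an unscoped one.

```python
def forbids_indexing(header_values):
    """Return True if any X-Robots-Tag value carries "noindex" (or "none")."""
    for value in header_values:
        for directive in value.split(","):
            token = directive.strip().lower()
            # A directive may carry a user-agent prefix, e.g. "googlebot: noindex".
            # Simplification: keep only the text after the last colon.
            token = token.rsplit(":", 1)[-1].strip()
            # "none" is shorthand for "noindex, nofollow".
            if token in ("noindex", "none"):
                return True
    return False
```

For example, `forbids_indexing(["noindex, nofollow"])` is true, while `forbids_indexing(["nofollow"])` is false.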

goodroi




msg:4037650
 2:08 pm on Dec 5, 2009 (gmt 0)

i think there might be some confusion.

google does honor and abide by robots.txt and will not crawl or index a page if it is properly disallowed.

if google does discover links pointing to a url, they may still list the url in their search results, but they will not crawl the page or index its content if that url is disallowed by robots.txt.

url discovery is very different from crawling and indexing. by finding links on other websites and looking at the anchor text, plus incorporating url listing information from Yahoo, Best of the Web and the DMOZ directory, a search engine can learn a lot about a url without ever having to crawl or index it. this is one way to abide by robots.txt and still include as many urls in the search index as possible.
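
To make the distinction concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It assumes a site that serves one robots.txt over http (allow everything) and another over https (disallow everything); the two file bodies, `example.com` and the helper `may_crawl` are invented for illustration. The check only answers "may this url be crawled?". Nothing in it can stop a url from being listed because other sites link to it.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, one per scheme.
HTTP_ROBOTS = "User-agent: *\nDisallow:\n"     # empty Disallow: everything may be crawled
HTTPS_ROBOTS = "User-agent: *\nDisallow: /\n"  # nothing may be crawled


def parser_for(robots_txt):
    """Build a parser from robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# robots.txt applies per scheme and host, so pick the parser by scheme.
PARSERS = {"http": parser_for(HTTP_ROBOTS), "https": parser_for(HTTPS_ROBOTS)}


def may_crawl(url, agent="Googlebot"):
    """Return True if the robots.txt for this url's scheme permits crawling it."""
    return PARSERS[urlsplit(url).scheme].can_fetch(agent, url)
```

With these rules, `may_crawl("http://example.com/page.html")` is true and `may_crawl("https://example.com/page.html")` is false, yet the https url could still be discovered through links elsewhere.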

hippypink




msg:4038063
 8:04 am on Dec 6, 2009 (gmt 0)

[yoast.com...]


Crawler directives
The robots.txt file only contains the so-called Crawler directives, telling search engines, identified by their User-agent:, where they are not allowed to go by using Disallow: and where they can (and should) go by using Allow:, and by pointing them at a Sitemap:.

As Sebastian pointed out and explains thoroughly in another brilliant post, pages that search engines aren't allowed to spider, can still show up in the search results, when they have enough links pointing at them. This basically means that if you want to really hide something from the search engines and thus from people using search, robots.txt just isn't good enough.

[edited by: goodroi at 2:41 pm (utc) on Dec. 6, 2009]
[edit reason] Please do not republish entire blog posts [/edit]

goodroi




msg:4038158
 2:35 pm on Dec 6, 2009 (gmt 0)

having urls show up in the search results is very different from ignoring robots.txt. google does follow robots.txt. strictly speaking, google has followed robots.txt about 99.9% of the time, because there has been the occasional programming glitch with googlebot. these involved very rare and unusual situations. once google was notified of them, they worked to fix them so that robots.txt would be handled properly the next time.

