Forum Moderators: goodroi

Allowing robots to index secure directories in root

naomi

10:51 pm on Mar 30, 2005 (gmt 0)

10+ Year Member

Is there a way to allow Googlebot, for example, to access folders that are in a secure part of a web server and that normally require logging in with a specific level of access?

Robert Paulson

3:40 am on Mar 31, 2005 (gmt 0)

I was just looking for an answer to a similar question - I run a vB forum and am looking for a way to let spiderbots index forums that a non-registered user can't access. However, there are also forums I don't want the bots to access. I believe I've seen the end result of this once, so I'm hoping the smart folks here might point me in the right direction.

naomi

2:03 pm on Mar 31, 2005 (gmt 0)

I think the second part of your question can be resolved by a /robots.txt file in the html part of your website. I am guessing you would write the following:

User-agent: *
Disallow: /cgi-bin/ #or whatever the physical path is to the forums that you do not want spidered.

This is just a guess though. Jury is still out on our original query.

Robert Paulson

4:39 am on Apr 2, 2005 (gmt 0)

Has this never been done before, or is it beyond the coding ability of the folks here?

.

*Kinda hoping that charges up someone enough to show off their talents*

naomi

7:03 pm on Apr 5, 2005 (gmt 0)

I can't believe no one is answering! I know it can be done because Google News is the one that suggested it to me.

encyclo

7:22 pm on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

It can be done, just not with robots.txt - the only thing robots.txt does is limit access, not extend it.

The method you need is to use cloaking: detect whether the visitor is a spider (from known IP addresses) and let it access the content, while normal visitors are redirected to a login page. You will need to build or buy a cloaking solution and adapt your forum software accordingly.
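
The gatekeeping branch described above can be sketched in a few lines of Python. This is a hypothetical illustration, not a drop-in solution - the network list and function name are invented for the example, and you would maintain the list from the engines' published information:

```python
import ipaddress

# Hypothetical list of networks your chosen spiders crawl from
SPIDER_NETWORKS = [ipaddress.ip_network("66.249.64.0/20")]

def response_for(client_ip: str) -> str:
    """Serve full content to known spider IPs; send everyone else to login."""
    ip = ipaddress.ip_address(client_ip)
    if any(ip in net for net in SPIDER_NETWORKS):
        return "200 full-article"   # spider gets the real page
    return "302 /login"             # human visitor is redirected to log in
```

In a real forum you would hook this decision into the page controller before the registration check runs.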

Bear in mind that you should also disallow caching of the page (otherwise a user could click the Cache link next to the search result and view the content without logging in). You should also consider the visitor's reaction when they try to access your content and are faced with a login request: you will probably find that the majority of users hit "back" rather than register to view the page. Cloaking is also risky in that it can get your site banned.
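
The cache opt-out mentioned above is done with a robots meta tag in the page head - a standard directive that Google honors for its Cache link:

```html
<!-- Ask engines not to keep a cached copy of this page -->
<meta name="robots" content="noarchive">
```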

jatar_k

7:23 pm on Apr 5, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member

I don't really see the point, to be honest.

if you need a certain level of access why expose them to spiders?

we would call this cloaking: IP-based content delivery - if the IP belongs to a robot then just let it in.

but there are a couple of issues here, the first being my question above, and the second:

cloaking is not easy, straightforward, for the faint-hearted, or allowed. The guidelines of various search engines specifically mention cloaking as a no-no. That doesn't mean you can't do it, it just means you should really understand the risks of what you are doing.

<added>and encyclo wins by a nose ;)

Robert Paulson

8:15 pm on Apr 5, 2005 (gmt 0)

Thanks for the responses!

The only reason I thought of this was I did a Google search a while back and got a result for a vB forum page. I could view the cached page, but I couldn't access any of the other pages in that thread, or the rest of the forum. However, if I ran similar searches, I could find other Google-cached pages that I could view, but again, only those pages. (I don't recall whether I could view only the cached page and not the live page.)

The information was relevant enough that I registered as a member to get full access, so I thought it was a pretty slick idea.

Would this also be a cloaking mechanism? Thanks in advance for the responses. And thanks to Naomi for keeping this thread alive. :)

jatar_k

8:32 pm on Apr 5, 2005 (gmt 0)

>> Would this also be a cloaking mechanism

i would think so

naomi

8:52 pm on Apr 5, 2005 (gmt 0)

Thank you, thank you, thank you for all of your guidance!

Ok, now that I have gotten my thanks out of the way...

RE: >> If you need a certain level of access why expose them to spiders?

We are a proprietary news site and want Google News to index our content for inclusion in Google News Search.

They have accepted our site but said the following:

"In order to add your news articles to Google News, our crawler needs to be able to access the content on your site. Currently, crawlers cannot fill out registration forms, nor do they support cookies. Given that, in order to successfully crawl your site, we need to be able to circumvent your registration page. The easiest way to do this is to configure your webservers to not serve the registration when our crawlers visit your pages (when the User-Agent is "Googlebot"). You can verify that the request is actually from our robot by making sure the IP address is within the range of 66.249.64.0/20."
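
The two checks in Google's note (the "Googlebot" User-Agent plus the source IP range) can be combined into one test. A minimal sketch - the function name is invented, and in production you would keep the range current with whatever Google publishes:

```python
import ipaddress

GOOGLEBOT_NET = ipaddress.ip_network("66.249.64.0/20")  # range quoted above

def is_googlebot(user_agent: str, client_ip: str) -> bool:
    # Require both signals: a UA claiming Googlebot AND a source IP
    # inside Google's crawl range, so UA spoofers are rejected.
    return ("Googlebot" in user_agent
            and ipaddress.ip_address(client_ip) in GOOGLEBOT_NET)
```

A request that passes this check would be served the full article; anything else gets the normal registration page.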

RE: >> You should also consider the reaction of the visitor when attempting to access your content and being faced with a request to log in: you will probably find that the majority of users will hit "back" rather than register to view the page.

Visitors can surf our site, but because we are a proprietary news site they must purchase access to view the complete article rather than just the abstract (which is visible to all).

How this relates to this particular post is that we want Google News to crawl the entire article, not just the abstract. So I've been looking for a way to get around it.

I didn't realize that the robots.txt file could only disallow access - now I know, and will seek knowledge elsewhere.

phew...that was a long post

Thanks again!

naomi

8:58 pm on Apr 5, 2005 (gmt 0)

A cloaking mechanism was mentioned in previous posts but what if the content is in a folder that allows indexing but is on a secure part of the server - what they call the 'root', rather than in the 'html' part of the site?

Is there another way around to allow access besides cloaking?

jatar_k

9:30 pm on Apr 5, 2005 (gmt 0)

can it be accessed via the web?

if it has an actual URL then the spider can grab it

naomi

2:46 pm on Apr 14, 2005 (gmt 0)

Yeah, I'll have to test that out - the physical path that is.