User-agent: *
Disallow: /cgi-bin/   # or whatever the physical path is to the forums that you do not want spidered.
This is just a guess, though. The jury is still out on our original query.
The method you need is cloaking: detect whether the visitor is a spider (from known IP addresses) and allow it to access the content, while normal visitors are redirected to a login page. You will need to build or buy a cloaking solution and adapt your forum software accordingly.
Bear in mind that you should also disallow caching of the page (otherwise a user could click on the Cache link next to the SERP listing to view the content without logging in). You should also consider the reaction of the visitor when attempting to access your content and being faced with a request to log in: you will probably find that the majority of users will hit "back" rather than register to view the page. Cloaking is also risky in that it may get your site banned.
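If it helps, here is a rough sketch of the idea in Python (purely illustrative: the IP range below is a documentation placeholder, the login URL and page body are made up, and you would adapt the logic to whatever your forum software actually runs on):

# Minimal WSGI sketch: serve the page to whitelisted spider IPs,
# redirect everyone else to the login page, and mark the pages the
# spiders do see as "noarchive" so no public cached copy is offered.
from ipaddress import ip_address, ip_network
from wsgiref.simple_server import make_server

# Placeholder range (RFC 5737 documentation block) -- replace with the
# crawler ranges published by the engines you care about.
SPIDER_NETWORKS = [ip_network("192.0.2.0/24")]

def is_spider(remote_ip):
    try:
        addr = ip_address(remote_ip)
    except ValueError:
        return False
    return any(addr in net for net in SPIDER_NETWORKS)

def app(environ, start_response):
    if is_spider(environ.get("REMOTE_ADDR", "")):
        # Crawler: full content, but no cached copy in the SERPs.
        start_response("200 OK", [("Content-Type", "text/html"),
                                  ("X-Robots-Tag", "noarchive")])
        return [b"<html><body>Full thread content here</body></html>"]
    # Ordinary visitor: off to the registration/login page.
    start_response("302 Found", [("Location", "/login")])
    return [b""]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()

The same "noarchive" instruction can also go in the page itself as <meta name="robots" content="noarchive"> if you would rather not touch the headers.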
If you need a certain level of access, why expose them to spiders?
We would call this cloaking, or IP-based content delivery: if the IP belongs to a robot, then just let it in.
But there are a couple of issues here, the first being my question above, and the second being this:
Cloaking is not easy, straightforward, for the faint-hearted, or allowed. The guidelines of various search engines specifically mention cloaking as a no-no. This doesn't mean that you can't do it; it just means that you should really understand the risks of what you are doing.
[added] And encyclo wins by a nose ;)
The only reason I thought of this was that I did a Google search a while back and got a result for a vB forum page. I could view that cached page, but I couldn't access any of the other pages in that thread, or the rest of the forum. However, if I ran similar searches, I could find other Google cached pages that I could view, but again, I was only able to view those pages. (I don't recall whether I could view only the cached page and not the page on the site itself.)
The information was relevant enough that I registered as a member to get full access, so I thought it was a pretty slick idea.
Would this also be a cloaking mechanism? Thanks in advance for the responses. And thanks to Naomi for keeping this thread alive. :)
Ok, now that I have gotten my thanks out of the way...
RE: >> If you need a certain level of access, why expose them to spiders?
We are a proprietary news site and want Google News to index our content for inclusion in Google News Search.
They have accepted our site but said the following:
"In order to add your news articles to Google News, our crawler needs to be able to access the content on your site. Currently, crawlers cannot fill out registration forms, nor do they support cookies. Given that, in order to successfully crawl your site, we need to be able to circumvent your registration page. The easiest way to do this is to configure your webservers to not serve the registration when our crawlers visit your pages (when the User-Agent is "Googlebot"). You can verify that the request is actually from our robot by making sure the IP address is within the range of 66.249.64.0/20."
RE: >> You should also consider the reaction of the visitor when attempting to access your content and being faced with a request to log in: you will probably find that the majority of users will hit "back" rather than register to view the page.
Visitors can surf our site, but because we are a proprietary news site, they must purchase access to view the complete article rather than just the abstract (which is visible to all).
How this relates to this particular post is that we want Google News to crawl the entire article, not just the abstract, so I've been looking for a way to get the crawler around the registration page.
I didn't realize that the robots.txt file could only disallow access. Now I know, and I will seek knowledge elsewhere.
phew...that was a long post
Thanks again!