I have a site with a protected members' section and a "visitors room". Articles etc. in the visitors room obviously get spidered, and these form the basis by which everyone finds the site.
The content of the members' area does not get spidered because of the password system, and it is this that causes the problem. It's a catch-22: if I could get the content listed, it would increase the chances of people finding the site... but then there would be no point in joining!
It occurs to me that I could keep a second copy of the files with the links removed or changed, so that these could be spidered but would only give access to the one document (i.e. the only link on the page would be back to the visitors area content page).
Since there would be no links into these pages from the visitors area, that raises the question of how the spiders will find them. So, the big question is: could robots.txt be used to list the specific files' URLs to get them listed that way?
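(A note on that last question: robots.txt is an exclusion protocol - crawlers fetch it to learn what they may not request, and there is no standard directive for submitting URLs. Even the Allow line some engines support only grants permission to crawl; it doesn't announce a page. Spiders still discover pages by following links. A minimal sketch of the semantics, using Python's standard urllib.robotparser and hypothetical paths:)

```python
# Minimal sketch of what a spider does with robots.txt: it asks
# "may I fetch this URL?", nothing more. The paths are hypothetical.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /members/",   # keeps compliant spiders out
]

rp = RobotFileParser()
rp.parse(rules)

# Exclusion works...
print(rp.can_fetch("*", "http://www.domain.com/members/article1.html"))  # False
# ...but a page outside the Disallow list is merely permitted;
# the spider still has to find a link to it before it will crawl it.
print(rp.can_fetch("*", "http://www.domain.com/spider-only/article1.html"))  # True
```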
I hope this makes sense.
Regards.
Perplexed (well, I am)
How about open-to-spidering "abstract pages" which list, say, abstracts of all the articles in each topic area? That way you could create lots of keyword-rich pages, and each one could link to your sign-up page. Just make sure you title the pages "abstracts" or "summaries" or something like that; otherwise you would be misleading people into thinking they will be getting whole articles.
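Something along these lines would generate such a page; this is only a sketch, and the article data, output file name, and /signup URL are placeholders rather than anything from this thread:

```python
# Sketch: build a keyword-rich "abstracts" page that teases each
# article and links to the sign-up page. All names are hypothetical.
from html import escape

articles = [
    {"title": "Article One", "abstract": "A short, keyword-rich summary..."},
    {"title": "Article Two", "abstract": "Another teaser paragraph..."},
]

rows = "\n".join(
    f"<h3>{escape(a['title'])}</h3>\n<p>{escape(a['abstract'])} "
    f'<a href="/signup">Join to read the full article</a></p>'
    for a in articles
)

page = f"""<html>
<head><title>Article Abstracts</title></head>
<body>
<h1>Abstracts</h1>
{rows}
</body>
</html>"""

with open("abstracts.html", "w") as f:
    f.write(page)
```

One such page per topic area gives the spiders plenty of natural keywords to index without exposing the full articles.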
I am not sure how these "advanced searches" work. I would not have known how to do that, but obviously others would.
Can I not just put a list of things to spider in robots.txt? i.e. Allow: /article1, Allow: /article2, etc.
I was about to ask how people would get a full list and access to them, but I suppose they could just type www.domain/robots.txt to see the list... Is there any way of blocking this without blocking the spider?
I think I am beginning to ramble :)
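On the question above about hiding the robots.txt list from humans without blocking the spider: the file has to be served to anything that requests it, so the only trick available is serving a different robots.txt depending on who asks - which is cloaking by another name. A rough sketch (the bot names and the WSGI server are my own illustration, not anything the posters describe):

```python
# Sketch: serve the real robots.txt only to known crawler user-agents,
# and a harmless empty one to everyone else.
from wsgiref.simple_server import make_server

KNOWN_BOTS = ("Googlebot", "Slurp", "msnbot")  # illustrative list

REAL_ROBOTS = b"User-agent: *\nDisallow: /members/\n"
EMPTY_ROBOTS = b"User-agent: *\nDisallow:\n"

def app(environ, start_response):
    if environ.get("PATH_INFO") == "/robots.txt":
        ua = environ.get("HTTP_USER_AGENT", "")
        body = REAL_ROBOTS if any(bot in ua for bot in KNOWN_BOTS) else EMPTY_ROBOTS
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```

User-agent strings are trivially faked, so this hides the list from casual visitors only.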
I don't know the exact mechanisms that are going on, but if you go to Google News and click through to an article at the New York Times, you get straight in.
If you go to the New York Times front page and click on the same article, you have to go through a registration process.
I'll do a bit of digging and see if I can figure out exactly what they're doing - it might help you out!
The Google News bot crawls nytimes (probably partners.nytimes.com) as a registered user; the URL is then replaced with "www.nytimes.com" and an authentication key is appended that allows the article to be viewed when the visitor is referred via Google News with the appropriate key in the URL.
Therefore, as soon as you try to navigate off the page referred to by Google News, your authentication fails and the site channels you through the registration process.
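The posters are guessing at the mechanics here, but the general technique of a per-URL key is straightforward to sketch. Assuming an HMAC over the article path with a server-side secret (an assumption for illustration - nothing here is confirmed about NYTimes):

```python
# Sketch of a per-article authentication key appended to a URL,
# in the spirit of the NYTimes/Google News guesswork above.
# The secret, parameter name, and paths are all hypothetical.
import hashlib
import hmac

SECRET = b"server-side-secret"  # never exposed to visitors

def sign(path: str) -> str:
    """Return the article URL with an auth key bound to this path."""
    key = hmac.new(SECRET, path.encode(), hashlib.sha256).hexdigest()[:16]
    return f"https://www.example.com{path}?auth={key}"

def verify(path: str, key: str) -> bool:
    """Check the key against this path; any other page fails."""
    expected = hmac.new(SECRET, path.encode(), hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(expected, key)

signed = sign("/2004/01/article.html")
print(signed)
print(verify("/2004/01/article.html", signed.split("auth=")[1]))  # True
print(verify("/2004/01/other.html", "deadbeefdeadbeef"))          # False
```

Because the key is bound to one path, following any internal link produces an unsigned URL - exactly the "authentication fails as soon as you navigate off the page" behaviour described above.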
You used to be able to go straight to "partners." instead of "www." and bypass the registration process at NYTimes, but too many people got wind of that, so it was stopped.
So, in the absence of collusion with whichever search engine you want to achieve this effect through, I don't see that you have any option other than cloaking - i.e. allow search engine spiders to see your "protected" content, and then allow access to it only by referral from those search engines. Easily abused by a crafty user, of course.
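A hedged sketch of that cloaking decision: spider user-agents get the full article, humans get it only when the Referer header points at a search engine, and everyone else lands on the sign-up page. The bot names, referrer patterns, and /signup target are illustrative assumptions:

```python
# Sketch of referrer-based cloaking: index for spiders, admit humans
# only when they arrive from a search results page.
SPIDER_UAS = ("Googlebot", "Slurp")
ALLOWED_REFERRERS = ("google.", "search.yahoo.")

def access_decision(user_agent: str, referrer: str) -> str:
    if any(bot in user_agent for bot in SPIDER_UAS):
        return "serve full article"   # let it be indexed
    if any(se in referrer for se in ALLOWED_REFERRERS):
        return "serve full article"   # arrived from a search engine
    return "redirect to /signup"      # everyone else joins first

print(access_decision("Googlebot/2.1", ""))
print(access_decision("Mozilla/5.0", "https://www.google.com/search?q=..."))
print(access_decision("Mozilla/5.0", ""))
```

Both signals - user-agent and referrer - are set by the client and trivially forged, which is the "easily abused by a crafty user" caveat above.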
<snip suggestion that you'd already discounted!>
At the end of the day, given that you are charging for the information, I think you automatically lose your "right" to have it indexed by a public search facility, unless you get into bed with them as NYTimes has had to.
Spiders crawl from page to page and follow links.
>would they find a page that only had outbound links?
No, not unless you submitted just that page - and even then it probably won't rank in Google; you've got to have inbound links as a prerequisite.