Forum Moderators: mack


Search Engines - Robots.txt and protected files

searching protected files


Perplexed

9:12 am on May 6, 2003 (gmt 0)

10+ Year Member



Hi Fellas.
I am new here and could do with a little help. Sorry if this gets a bit long.

I have a site with a protected members section and a "visitors room". Articles etc. in the visitors room obviously get spidered, and these form the basis by which everyone finds the site.

The content of the members area does not get spidered because of the password system, and it is this that causes the problem. It's a catch-22.... If I could get the content listed it would increase the chances of people finding the site.... but then there would be no point in joining!

It occurs to me that I could have a second copy of the files with the links removed or changed, so that these could be spidered but only give access to the one document ( i.e. the only link on the page would be back to the visitors area content page ).

Since there would be no links into these pages from the visitors area, it raises the question of how the spiders will find them.... So... the big question is... could the robots.txt be used to list the specific file URLs to get them listed that way?

I hope this makes sense.

Regards.
Perplexed ( well, I am )

chiyo

9:23 am on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm.. interesting. If it could be done (I wouldn't have a clue how to do it, but it sounds like you may need some sort of cloaking, which is always a risk), wouldn't visitors realise that they can get all the articles for free by doing advanced searches on the search engines? Say by using "domain:yoursite.com keywords", in more cases than you would like?

How about open-to-spidering "abstract pages" which list, say, abstracts of all articles in each topic area? You may then be able to create lots of keyword-rich pages. This whole page could link to your sign-up page. Just make sure that you title the page "abstracts" or "summaries" or something like that, or otherwise you would mislead people into thinking they will be getting whole articles.

Perplexed

9:33 am on May 6, 2003 (gmt 0)

10+ Year Member



Hi.
I suppose that is a possibility. I did play around with giving access to a part of each article, but when I accessed them ( with an attempt at an impartial mind ) I could see that I would have been annoyed at getting a lead-in to a pay area... If people could see the whole of the article they want, they wouldn't mind the intro to the site as a whole.

I am not sure about how these "advanced searches" work. I would not have known how to do that but obviously others would.

Can I not just put a list of things to search in the robots.txt? i.e. Allow: /article1, Allow: /article2, etc.

I was about to ask how people would get a full list and access to them, but I suppose they could just type www.domain/robots.txt to see the list.... Is there any way of blocking this without blocking the spider?
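What I'm picturing is something like this ( the article names are made up for illustration, and I gather the Allow: line is a nonstandard extension that only some spiders honour ):

```
# What I'm picturing - article names made up,
# and Allow: is only honoured by some spiders
User-agent: *
Disallow: /members/
Allow: /visitors/article-1.html
Allow: /visitors/article-2.html
```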

I think I am beginning to ramble :)

dmorison

9:36 am on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Have a sniff around how Google News and the New York Times website work together.

I don't know the exact mechanisms that are going on, but if you go to Google News and click through to an article at the New York Times you get straight in.

If you go to the New York Times front page and click on the same article you have to go through a registration process.

I'll do a bit of digging and see if I can figure out exactly what they're doing - it might help you out!

Perplexed

9:41 am on May 6, 2003 (gmt 0)

10+ Year Member



That would be good dmorison, thanks.

It's comforting in a way.... I thought the answer to this would be obvious and I was too thick to see it. It's kinda nice to know it isn't.

dmorison

10:18 am on May 6, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ok, there's obviously manual collusion to make it work (Google being a big search company and NYTimes being a major "news" source) but it's something along the lines of this:

The Google News Bot crawls nytimes (probably partners.nytimes.com) as a registered user, and then replaces the URL with "www.nytimes.com" and appends an authentication key that allows the article to be viewed when being referred via Google News and with the appropriate key in the URL.

Therefore, as soon as you try and navigate off the page referred to by Google News, your authentication fails and the site channels you through the registration process.
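Guessing wildly at the mechanics, the key part might work something like this ( a Python sketch; the secret, the key length and the "ex" parameter name are all invented for illustration - I don't know the real scheme ):

```python
import hashlib
import hmac

# Invented shared secret - the real arrangement between Google and
# the NYTimes is not public knowledge.
SECRET = b"shared-with-the-search-partner"

def signed_url(article_path):
    """Append an authentication key to an article URL, roughly how
    the Google News links might carry one."""
    key = hmac.new(SECRET, article_path.encode(), hashlib.sha1).hexdigest()[:16]
    return "http://www.nytimes.com%s?ex=%s" % (article_path, key)

def key_is_valid(article_path, key):
    """The site recomputes the key for the requested article; a key
    issued for one article fails on any other page."""
    expected = hmac.new(SECRET, article_path.encode(), hashlib.sha1).hexdigest()[:16]
    return hmac.compare_digest(expected, key)
```

The key only validates for the exact article it was issued for, which would be why navigating anywhere else off that page drops you into the registration process.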

You used to be able to go straight to "partners." instead of "www." and bypass the registration process at NYTimes, but too many people got wind of that so it was stopped.

So, in the absence of collusion with any search engine through which you want to achieve this effect, I don't see that you have any option other than cloaking - i.e. allow search engine spiders to see your "protected" content, and then allow access to it only by referral from those search engines. Easily abused by a crafty user of course.
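A very crude sketch of that referral check ( again in Python; the bot and referrer strings are just illustrative, not a real list - and of course the User-Agent header is trivially spoofed ):

```python
# Rough sketch of a referral/user-agent gate for a protected article.
# The spider and referrer lists are illustrative, not exhaustive,
# and would need maintaining.
SPIDER_AGENTS = ("googlebot", "slurp", "msnbot")
ENGINE_REFERRERS = ("google.", "yahoo.", "msn.")

def allow_full_article(user_agent, referrer):
    """Return True if this request may see the full article text."""
    ua = (user_agent or "").lower()
    ref = (referrer or "").lower()
    if any(bot in ua for bot in SPIDER_AGENTS):
        return True   # a spider: serve the full content so it gets indexed
    if any(eng in ref for eng in ENGINE_REFERRERS):
        return True   # a visitor arriving from a search result
    return False      # everyone else goes to the sign-up page
```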

<snip suggestion that you'd already discounted!>

At the end of the day, given that you are charging for the information, I think you automatically lose your "right" to have it indexed by a public search facility, unless you get into bed with them like the NYTimes have had to do.

Perplexed

10:32 am on May 6, 2003 (gmt 0)

10+ Year Member



Thanks for that. I still think there has to be a way around this - I just can't figure out what it is.

As a basic premise, would listing the files in the robots.txt actually get them spidered when there were no other links into them?

Perplexed

7:06 am on May 7, 2003 (gmt 0)

10+ Year Member



Hi
Still trying to work this out after a decent night's sleep, but I need something clarified.

Do spiders go through the site directory and read all allowed files, or do they actually go to one page and then follow links from page to page? i.e. would they find a page that only had outbound links?

JamesR

5:34 pm on May 7, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Do spiders go through the site directory and read all allowed files or do they actually go to one page and then follow links from page to page?

From page to page, following links.

>would they find a page that only had outbound links?

No, not unless you submitted just that page, but it probably won't even rank in Google - gotta have inbound links as a prerequisite.
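In other words a spider is basically doing something like this ( a toy sketch with a made-up page graph, just to show the mechanics ):

```python
from collections import deque

# Toy link graph standing in for a site. The orphan page links OUT,
# but nothing links TO it - just like the hidden copies discussed above.
SITE = {
    "/index.html": ["/visitors/article-1.html"],
    "/visitors/article-1.html": ["/index.html"],
    "/orphan.html": ["/index.html"],
}

def crawl(start):
    """Breadth-first crawl: fetch a page, queue its links, repeat."""
    seen = set()
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page in seen:
            continue
        seen.add(page)
        queue.extend(SITE.get(page, []))
    return seen
```

Start the crawl at /index.html and the orphan page never even enters the queue, because nothing links to it - which is why a page with only outbound links stays invisible.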