Allowing spidering of password-protected pages - Cloaking forum at WebmasterWorld - WebmasterWorld

Allowing spidering of password-protected pages

Using agent delivery to redirect visitors to a registration form

Robert Charlton

6:42 am on Mar 15, 2003 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

A non-profit educational site I've optimized is considering using password protection in order to boost registration on the site. I have questions about the wisdom of this, but I'm confining questions here to making sure the site will be properly spidered and not run afoul of the engines.

The site's IT department would use agent delivery to allow spidering of the pages. They're not experienced in agent delivery, and neither am I. The plan is that spiders would go straight to the content, while visitors would be redirected to a registration form.

Here's the check they'd be doing for crawlers (Apache server):


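 # Known crawlers skip the redirect. /i ignores case; /x lets the
 # pattern span lines, so the space in "ask jeeves" must be escaped.
 if ( $browser =~
 /robot|slurp|crawl|scooter|googlebot|libwww|JennyBot|polybot|
 ferret|spider|psbot|openbot|zyborg|webstream\.net|archiver|
 internetseer|pompos|ask\ jeeves|teleportpro|mercator|
 python-urllib|webzip|slysearch|netsweeper/xi ) {
 return;
 }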

If it's not a crawler, they'd do a redirect for the visitor to the appropriate registration page:


 print $cgi->redirect(-uri =>
"http://$ENV{'HTTP_HOST'}/cgi-bin/getReg/$section/$topic");

I'm not qualified to comment on whether the above would work, but I'd appreciate all thoughts I can pass on.

I want to make sure that the pages will get spidered, and also would value opinions on whether the procedure would withstand manual inspection by the engines. I'm assuming it would, and that the engines would manually inspect before they'd ban. The site is highly respected, clearly non-profit, and the content delivered to the user after registration is exactly what the spider sees.

Nick_W

7:38 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

I'd say a hand check would most likely be okay, but nothing is certain. That's what you need to tell the client, I think...

As for the code? - What language is it?

Nick

jdMorgan

11:58 pm on Mar 15, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Robert Charlton,

Gotta run, but here's a quick cut and paste with a few more you may want to include - depends on your market and what directories you may be listed in. This is straight out of .htaccess, so the format's all wrong, but you can pick out any user-agents you might want.

Jim

# Excite spider (may be out of business)
RewriteCond %{HTTP_USER_AGENT} !^ArchitextSpider$
#
# ExactSeek spider
RewriteCond %{HTTP_USER_AGENT} !^ExactSeek\ Crawler/[1-9][0-9]?\.[0-9]{1,2}$
#
# Fast robot
RewriteCond %{HTTP_USER_AGENT} !^FAST\-WebCrawler/[1-9][0-9]?\.[0-9]{1,2}.*\ \(.*fast.*\)$
#
# GigaBlast robot
RewriteCond %{HTTP_USER_AGENT} !^Gigabot/[1-9][0-9]?\.[0-9]{1,2}$
#
# Looksmart robots
RewriteCond %{HTTP_USER_AGENT} !^MARTINI$
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[0-9]{1,2}\.[0-9]{1,2}.*\(compatible\;\ Zealbot\ [1-9][0-9]?\.[0-9]{1,2}\)$
#
# Lycos spiders
RewriteCond %{HTTP_USER_AGENT} !^Lycos_Spider_\(.*\)$
#
# Microsoft link checker libwww-perl/5.51
RewriteCond %{REMOTE_HOST} !^.*\.microsoft\.com$
#
# NationalDirectory WebSpider 1.3
RewriteCond %{HTTP_USER_AGENT} !^NationalDirectory\-WebSpider/[1-9][0-9]?\.[0-9]{1,2}$
#
# Openfind spider
RewriteCond %{HTTP_USER_AGENT} !^Openfind\ data\ gatherer\,\ Openbot/[1-9][0-9]?\.[0-9]{1,2}\+\(.*openfind.*\)$
#
# Polybot robot from NY Polytechnical
RewriteCond %{HTTP_USER_AGENT} !^polybot\ [1-9][0-9]?\.[0-9]{1,2}\ \(.*cis\.poly\.edu/polybot/\)$
#
# ScrubTheWeb spider
RewriteCond %{HTTP_USER_AGENT} !^Scrubby/[1-9][0-9]?\.[0-9]{1,2}\ \(.*scrubtheweb.*\)$
#
# SearchHippo spider
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(compatible\;\ Fluffy\ the\ spider\;\ .*searchhippo.*\)$
#
# Teoma robots
RewriteCond %{HTTP_USER_AGENT} !^Teoma [NC]
#
# Thunderstone
RewriteCond %{HTTP_USER_AGENT} !^Mozilla/[1-9][0-9]?\.[0-9]{1,2}\ \(compatible\;\ T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E\)
#
#Vagabondo/2.0 MT (webagent@NOSPAMwise-guys.nl)
RewriteCond %{HTTP_USER_AGENT} !^Vagabondo/[1-9][0-9]?\.[0-9]{1,2}\ MT\ \(webagent.*wise\-guys\.nl\)$
#
# Yahoo directory checker
RewriteCond %{REMOTE_HOST} !^.*\.corp\.yahoo\.com$
#
# appie 1.1 (www.walhello.com)
# BunnySlippers (from tide.microsoft.com)
#
# DMOZ ODP robot
RewriteCond %{HTTP_USER_AGENT} !^Robozilla/[1-9][0-9]?\.[0-9]{1,2}$
#
# DMOZ ODP editor
RewriteCond %{HTTP_USER_AGENT} !^Tulipchain
#
# SITE CHECKING TOOLS
#
# W3C_Validator W3C_Validator/1.183 libwww-perl/5.64
RewriteCond %{HTTP_USER_AGENT} !^W3C\_Validator/[1-9][0-9]?\.[0-9]{1,4}\ libwww\-perl/[1-9][0-9]?\.[0-9]{1,3}$
#
# Search Engine World Robots.txt Validator
RewriteCond %{HTTP_USER_AGENT} !^Search\ Engine\ World\ Robots\.txt\ Validator
#
# Xenu Link Sleuth 1.2c
RewriteCond %{HTTP_USER_AGENT} !^Xenu\ Link\ Sleuth\ [1-9][0-9]?\.[0-9]{1,2}
#
# LinkScan link checker
RewriteCond %{HTTP_USER_AGENT} !^LinkScan/[0-9]{1,2}\.[0-9]{1,2}\ Unix$
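
The patterns above are mod_rewrite conditions, anchored and version-checked, so they can't be pasted into the Perl check as they stand. Here is a minimal sketch of one way to fold a few of the names into a Perl pattern, assuming a plain case-insensitive substring match is acceptable. The list `@extra_agents`, the helper `is_extra_crawler`, and the choice of names are illustrative only. The two `REMOTE_HOST` conditions test the requesting hostname rather than the user-agent and are not covered here.

```perl
use strict;
use warnings;

# A few agent names picked from the .htaccess list above (not exhaustive).
my @extra_agents = ('ArchitextSpider', 'FAST-WebCrawler', 'Gigabot',
                    'Zealbot', 'Lycos_Spider_', 'Scrubby', 'Teoma',
                    'Robozilla');

# quotemeta escapes regex metacharacters so each name matches literally.
my $alternation = join '|', map { quotemeta($_) } @extra_agents;
my $extra_re    = qr/$alternation/i;

# Hypothetical helper: true if the User-Agent contains one of the names.
sub is_extra_crawler {
    my ($browser) = @_;
    return defined $browser && $browser =~ $extra_re ? 1 : 0;
}

print is_extra_crawler('Gigabot/2.0') ? "crawler\n" : "visitor\n";
```

A substring match is looser than the anchored conditions above: any user-agent that merely contains one of these names will pass, which is easier to maintain but also easier to spoof.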

Robert Charlton

5:50 pm on Mar 16, 2003 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

>>As for the code? - What language is it?<<

I don't know... Thought you guys would. ;) I'm not even sure where it's used on the server. I'm pretty much a beginner here. I think I understand the principles... I know I don't understand the specifics.

The string "modperl" appears in some of the fake directories that were created when we made the URLs more user- and search-engine-friendly, so the language might be Perl.

Jim - Thanks for the additional user agents. Is there a complete list on WebmasterWorld of bots they should allow? And where are updates to the spider list available? I assume keeping the list current is one of the costs of maintaining such a system. Also, I wonder, does Freshbot have a separate ID, or is it simply googlebot?

jdMorgan

7:50 pm on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Just "googlebot" will work for the user-agent of either deepcrawler or freshbot. And yes, they'll need to keep the list current, but as long as you cover the majors, there's not too much risk of disaster.

Jim

andreasfriedrich

8:26 pm on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>What language is it?
>>language might be Perl [perl.com]

It is indeed Perl [perl.com], probably using Lincoln Stein's CGI [perldoc.com] module.

Andreas

Robert Charlton

11:02 pm on Mar 16, 2003 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

Thanks.... I took a look at the Perl links, and they're way over my head. Can anyone easily tell whether the code I posted has basically got things right, or whether there's something which has been overlooked?

andreasfriedrich

11:22 pm on Mar 16, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

>>tell whether the code I posted has basically got things
>>right

I cannot tell for sure from the code snippets you posted, but I believe it is OK. Where does the return [perldoc.com] statement return to?

Since this is just UA delivery you can easily check by using a fake UA.
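
To make both points concrete, here is a minimal sketch that assumes the two posted snippets sit inside one subroutine, so that `return` simply hands control back to the caller, which then serves the page as usual. The helper names `is_crawler` and `registration_redirect`, the host `www.example.org`, and the section/topic values are hypothetical. The sketch only computes the redirect target instead of calling `$cgi->redirect`, so it runs without the CGI module.

```perl
use strict;
use warnings;

# Crawler pattern from the first post; /i ignores case, /x lets the
# pattern span lines (so the space in "ask jeeves" is escaped).
my $CRAWLER_RE = qr/robot|slurp|crawl|scooter|googlebot|libwww|JennyBot|polybot|
    ferret|spider|psbot|openbot|zyborg|webstream\.net|archiver|
    internetseer|pompos|ask\ jeeves|teleportpro|mercator|
    python-urllib|webzip|slysearch|netsweeper/xi;

# Hypothetical helper: true if the User-Agent looks like a known crawler.
sub is_crawler {
    my ($browser) = @_;
    return defined $browser && $browser =~ $CRAWLER_RE ? 1 : 0;
}

# Hypothetical helper: returns nothing for crawlers (serve the page as-is),
# otherwise the registration URL a visitor would be redirected to.
sub registration_redirect {
    my ($browser, $host, $section, $topic) = @_;
    return if is_crawler($browser);
    return "http://$host/cgi-bin/getReg/$section/$topic";
}

# Fake-UA check: one crawler string, one ordinary browser string.
for my $ua ('Googlebot/2.1 (+http://www.googlebot.com/bot.html)',
            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)') {
    my $target = registration_redirect($ua, 'www.example.org', 'science', 'physics');
    print defined $target ? "redirect -> $target\n" : "serve page\n";
}
```

The same check can be made against the live site by sending requests with each of the two User-Agent strings and comparing the responses: the crawler string should get the page, the browser string a redirect.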

Andreas

Robert Charlton

12:11 am on Mar 17, 2003 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

>>Where does the return statement return to?<<

I can't tell you in terms of code, but in terms of functionality I understand that it simply returns the website (to the bots) as it would be without the registration form.

Is there a recommended list online of "good" robots that's kept updated?

anallawalla

9:14 am on Mar 17, 2003 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

I have often seen Google showing hits to pages that are subscriber-only and wondered how it got in. You will need to deny caching or else people will be able to view the cached content.

Robert Charlton

5:56 pm on Mar 17, 2003 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

>>You will need to deny caching or else people will be able to view the cached content.<<

Good point. I'm guessing that using the "noarchive" tag would trigger a manual inspection at some point.

I came across a Google News thread about Getting subscription pages indexed [webmasterworld.com]. There was a generally negative reaction to the idea of indexing such content:

>>I would find it incredibly annoying if I came across a site doing this. You are wasting my time and reducing the quality of the search results.<<

Anyone have any idea whether the engines would react this way too?