
Sitemaps, Meta Data, and robots.txt Forum

    
Efficient way to block multiple bots?
JAB Creations
msg:3300197 - 4:22 pm on Apr 2, 2007 (gmt 0)

I want to block DiamondBot and WebarooBot. My initial guess would be something like this...

User-agent: DiamondBot, WebarooBot
Disallow: /

Would the bot names be separated by commas? I've read the official site, though it did not provide enough examples.

- John

 

jdMorgan
msg:3300262 - 5:09 pm on Apr 2, 2007 (gmt 0)

This construct will work for some robots, but not all.
User-agent: robot1
User-agent: robot2
Disallow: /


The robots that do not understand this syntax (which is described in the robots.txt Standard) may assume that your robots.txt is invalid and either go away without spidering or spider the whole site.
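For what it's worth, the lowest-common-denominator alternative, understood by every robot that reads robots.txt at all, is simply to repeat the record once per robot, each record ended by a blank line:

User-agent: DiamondBot
Disallow: /

User-agent: WebarooBot
Disallow: /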

A more bullet-proof method is to write multiple robots.txt files, and then internally rewrite to one of them based on the requesting user-agent. For example, you might have three files:

robots-all.txt:
# Allow access to all URLs
User-agent: *
Disallow:


robots-none.txt:
# Deny access to all URLs
User-agent: *
Disallow: /


robots-some.txt:
# Allow access to only some URLs
User-agent: *
Disallow: /cgi-bin
Disallow: /restricted_pages/


With those files, and a bit of mod_rewrite, each robot can be steered to the appropriate robots.txt file.
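A rough sketch of that mod_rewrite glue in .htaccess (the user-agent patterns below are only placeholders -- substitute the robots you actually want to steer):

# Hand each requesting robot the appropriate robots file
RewriteEngine On

# Known-bad robots get the deny-all file
RewriteCond %{HTTP_USER_AGENT} (DiamondBot|WebarooBot) [NC]
RewriteRule ^robots\.txt$ /robots-none.txt [L]

# The major search engines get the partial-access file
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot|Teoma) [NC]
RewriteRule ^robots\.txt$ /robots-some.txt [L]

# Everyone else gets the allow-all file
RewriteRule ^robots\.txt$ /robots-all.txt [L]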

Note the required trailing blank line at the end of each record in those robots files. Some robots will again misbehave if it is missing, even on the last record.

Code carefully -- robots are fragile, and even the major search engines' robots have bugs. MSN in particular currently can't differentiate between records addressed to its various msnbots for search, media, news, products, academics, etc. -- pretty surprising, considering that the prefix matching used in robots.txt is the simplest kind of pattern matching to implement. So if the robots.txt Standard doesn't say you can do something, you must assume that you cannot. Add to that the fact that some robots are not fully compliant with the Standard, and you quickly reach the conclusion that coding to the lowest common denominator is the prudent choice.

Jim

JAB Creations
msg:3300272 - 5:23 pm on Apr 2, 2007 (gmt 0)

A more bullet-proof method is to write multiple robots.txt files, and then internally rewrite to one of them based on the requesting user-agent.

Thanks for the nice idea, though PHP isn't permitted to execute for *.txt extensions on my server. I'm assuming I would need Apache itself to achieve this form of cloaking. I can do it with PHP (I already use PHP to serve the same layout to everything from Netscape 4 up to browsers with proprietary CSS3 implementations), but I'm unsure how to implement this kind of cloaking with Apache.

- John

Brett_Tabke
msg:3300289 - 5:41 pm on Apr 2, 2007 (gmt 0)

Just try it once, JAB. Put up a .txt file (with a Perl script inside it) and set it as executable. See if it will run as a script.

See [webmasterworld.com...] for code examples.
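If the bare .txt will not execute, the usual missing piece is telling Apache to treat that one file as a CGI script. Assuming mod_cgi is available and .htaccess overrides are allowed on your host, something along these lines should do it:

# .htaccess sketch -- run robots.txt itself as a CGI script
<Files "robots.txt">
SetHandler cgi-script
</Files>
Options +ExecCGI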

JAB Creations
msg:3300393 - 7:35 pm on Apr 2, 2007 (gmt 0)

Thanks for the reply. I backed up my original robots.txt and basically made a copy of your linked version (just a Ctrl+A / Ctrl+C paste). I tested it on my live server without success (temporarily chmodding to 777 in all cases).

After testing with an already existing (and working) script, I know that my public server can execute Perl scripts with the .txt extension (though I'm not exactly sure about the public root)...

I kept getting script errors after copying the file you linked to (including the robots2 file), so I played around with it (as robots.txt, robots.txt.pl, and robots.pl, in various CHMOD 777 combinations). I also removed your includes (just in case) and added use CGI::Carp qw/fatalsToBrowser/;, though that never worked with the script (dang it), to test this as much as I could before posting. I've also tested it in my cgi-bin directory, in an existing script's directory, and in the root, with all of the above combinations of tweaks, all without success.

Here is what I hoped to be a minimal test case...
/cgi-bin/robots.txt [777]
#!/usr/bin/perl
use CGI::Carp qw/fatalsToBrowser/;
# (C) Copy and Copyright 2007 WebmasterWorld Inc. All Rights Reserved.
print "Content-type: text/plain\n\n";
$agent = $ENV{'HTTP_USER_AGENT'};

# Note: %FORM is only populated if something like cgi-lib.pl's &ReadParse is used;
# with the includes removed, this check simply never matches.
if ($FORM{"view"} eq "producecode") {open(FILE,"<robots.txt");print <FILE>;close(FILE);exit;}

# Simple agent check to keep the snoopy happy and to keep bad bots out and good bots in.
if ($agent =~ /slurp/i || $agent =~ /msnbot/i || $agent =~ /Jeeves/i || $agent =~ /googlebot/i) {
    # Recognized search-engine robot: serve the alternative robots file.
    open(FILE,"<robots2");
    print <FILE>;
    close(FILE);
}
else {
    # Everyone else gets a deny-everything robots.txt.
    print <<"ROBOTS";
# (C) Copy and Copyright 2007 WebmasterWorld Inc. All Rights Reserved. [webmasterworld.com...]

User-agent: *
Disallow: /
# (C) Copy and Copyright 2007 WebmasterWorld Inc. All Rights Reserved.
ROBOTS
}

1;

As a side note, to test your server's reactions I did run a 4- or 5-hit spoof test with Ask, Google, MSN, and Yahoo agents (amended with "(simple jabcreations fake spoofing test)" for your log analysis, just in case (71.180..)), and I could see the desired effects on your server just fine.

jdMorgan
msg:3300411 - 7:57 pm on Apr 2, 2007 (gmt 0)

The first line of the script above is the path to your Perl interpreter. Since it starts with a '#', the server may be treating it as just a comment, in which case you're depending on the Perl handler being defined at the server level. Are you sure it is?

Jim

JAB Creations
msg:3300415 - 8:01 pm on Apr 2, 2007 (gmt 0)

It is exactly the same for the linked script (on this server) and for a working script on mine.

rharri
msg:3300518 - 9:42 pm on Apr 2, 2007 (gmt 0)

Jim,
In the script, what purpose does data/varsv4.cgi serve?

Bob

Markus Klaffke
msg:3300574 - 11:06 pm on Apr 2, 2007 (gmt 0)

Yes, what purpose does data/varsv4.cgi serve?

OK, all in all, I don't understand a word.

Am I right that WebmasterWorld offers this Perl script to the community, but does not use it on its own servers?

See [webmasterworld.com...]

Maybe I missed or overlooked the how-to for this Perl script, but should it be placed in

/robots.txt
/robots2
/data/varsv4.cgi

and not in /cgi-bin/?

Or should somebody write a mod_rewrite rule for this?

Am I right?

JAB Creations
msg:3300607 - 11:45 pm on Apr 2, 2007 (gmt 0)

Markus, I was testing in my cgi-bin directory because I know that it specifically allows Perl scripts to execute (some hosts I've used limit Perl execution to that directory).

The /robots2 file holds the alternative robots.txt content (think of it as a PHP include file, if you only know PHP)... so when a good bot is detected, Perl serves that file instead of the default output below it.
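(Purely as an illustration -- this is a guess at its contents, not the actual WebmasterWorld file -- a robots2 for the good bots might be as simple as:)

# robots2 -- served to recognized search-engine robots
User-agent: *
Disallow: /cgi-bin/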

I'm just having a daisy of a time even getting an informative error message out of the script right now, which at least would give me an idea of what to Google for a possible fix.

- John

rharri
msg:3301033 - 2:33 pm on Apr 3, 2007 (gmt 0)

Hah!

The include is cgi-lib.pl (I had to download it and modify the script).
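Presumably the top of the modified script gains something like this (a sketch -- cgi-lib.pl's &ReadParse is what fills %FORM for the view=producecode check):

#!/usr/bin/perl
require "cgi-lib.pl";   # Steven Brenner's cgi-lib.pl, downloaded separately
&ReadParse(*FORM);      # parse the query string into %FORM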

You have to tell Apache to execute the file:
<Location /robots.txt>
SetHandler cgi-script
Options +ExecCGI
</Location>

Restart Apache.

It works! :-)

Bob
