yangboz

         

wilderness

3:20 am on Apr 17, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



128.2.207.z - - [16/Apr/2008:13:19:24 -0500] "GET /robots.txt HTTP/1.1" 206 4723 "-" "Mozilla/5.0 [cs.cmu.edu...]

The link does not provide any info on the bot's activity, nor does it offer an example of robots.txt exclusion.

Ocean10000

4:33 am on Apr 17, 2008 (gmt 0)

I have seen about 3 other bots with CMU.EDU addresses, all Nutch variants.

I don't have anything on this one though. My best guess is that it is a Nutch variant with a customized User-Agent, and you should be able to use the standard Nutch block in robots.txt.

wilderness

6:10 am on Apr 17, 2008 (gmt 0)

Nutch has been in my robots.txt for an eternity.

Hobbs

8:33 am on Apr 17, 2008 (gmt 0)

SetEnvIfNoCase User-Agent "\.edu" bad_bot
bad idea?
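
For context: on its own, the SetEnvIfNoCase line only sets an environment variable, and a separate access rule has to act on it. The thread does not show which rule Hobbs pairs it with, so the following is only a minimal sketch, assuming Apache 2.2-style access directives in .htaccess or a directory block:

```apache
# Flag any request whose User-Agent contains ".edu" (case-insensitive).
SetEnvIfNoCase User-Agent "\.edu" bad_bot

# Assumed Apache 2.2-style access control:
# allow everyone first, then deny requests carrying the flag.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Note that this tests only the User-Agent string, not the visitor's hostname, so ordinary browsers coming from .edu networks would not be affected.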

wilderness

1:58 pm on Apr 17, 2008 (gmt 0)

> bad idea?

Each webmaster needs to decide what is beneficial or detrimental to their own website(s).

Denying ".edu" is a bad idea for me, as I have far too many inquiries and communications with educational archive departments, not to mention some primary edu centers that provide links to my pages.

I do however have some 3rd party research centers denied access.

Some good archive reading is the "keebler cookie" company, which in my own 2003 instance was actually FoMoCo.
A few weeks ago, I visited the Benson Ford Research Center and returned "tit-for-tat" ;)

Don

Hobbs

2:08 pm on Apr 17, 2008 (gmt 0)

We're only talking about bots with .edu in their user agent, right?

incrediBILL

4:44 pm on Apr 17, 2008 (gmt 0)

The .edu's are usually somewhat well behaved, and out of the bazillion pages of blocked sites I have archived, the user agents with .edu in them only account for about 7K blocked pages total.

Hardly seems worth the trouble, eh?

What I do block, after allowing the whitelisted bots access, is anything with "http:" in the user agent. That nails just about every bot on the planet that embeds a path to its site, plus a few odd browser plug-ins. Those could be whitelisted to avoid blocking, but their advertising in my web logs annoys me.

The browser plug-ins only account for maybe 20-30 visitors a day out of about 20K visitors, but the block actually caused so many tech support hassles for one browser plug-in that they revised their code to remove that http: path from the UA string.

Who said a single website can't make a difference? LOL
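
For readers who want to try the "whitelist first, then block http:" idea in Apache, here is one possible sketch. It is not necessarily how incrediBILL implements it; the two whitelisted names are only illustrative assumptions, and the access directives assume Apache 2.2:

```apache
# Flag every User-Agent that embeds a URL.
SetEnvIfNoCase User-Agent "http:" bad_bot

# Illustrative whitelist (assumed names): clear the flag again
# for crawlers you want to keep. Later SetEnvIf lines override earlier ones.
SetEnvIfNoCase User-Agent "Googlebot" !bad_bot
SetEnvIfNoCase User-Agent "Slurp" !bad_bot

# Allow everyone first, then deny whatever is still flagged.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

A whitelist keyed on the User-Agent alone can be spoofed, so anyone relying on it may also want to verify the whitelisted bots by IP range or reverse DNS.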

jdMorgan

4:59 pm on Apr 17, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> Who said a single website can't make a difference? LOL

Good. Would you do me a favor, then? Make some tech-support trouble for those plugins that include their long CLASSID number as well. :)

I still haven't figured out what these suckers are:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; {C10F4731-13CF-17A6-BD0D-2DFED03246AE}; .NET CLR (etc.)

(I should add that Yangbo looks like a pleasant-faced Doctoral candidate who just didn't read up on robots.txt while working on his data-mining projects.)

Jim

wilderness

5:33 pm on Apr 17, 2008 (gmt 0)

> {C10F4731-13CF-17A6-BD0D-2DFED03246AE}

Jim,
I've never figured these out either, although I've had some variations of like terms denied for some time.

Perhaps some specific network or internet provider footprint.
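
One way to study these user agents without blocking anyone is to log them separately. The sketch below is an assumption-laden illustration, not something taken from the thread: the variable name and log path are hypothetical, the "combined" log format is assumed to be defined already, and CustomLog has to go in the server or virtual host config rather than .htaccess.

```apache
# Flag User-Agents containing a brace-wrapped GUID of the form
# {8 hex-4 hex-4 hex-4 hex-12 hex}, like the example quoted above.
SetEnvIfNoCase User-Agent "\{[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}\}" has_classid

# Write only the flagged requests to their own log (hypothetical path).
CustomLog logs/classid_ua.log combined env=has_classid
```

Comparing the IP ranges and referrers in that log might show whether the IDs cluster around one provider or one piece of software.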

incrediBILL

6:26 pm on Apr 17, 2008 (gmt 0)

> Make some tech-support trouble for those plugins that include their long CLASSID number as well.

I get a ton of those long CLASSIDs, so that wouldn't be good for business.

I figured annoying only 20-30 visitors a day was acceptable collateral damage compared to the 100s of bots it stops for the same reason.