Forum Moderators: open


I'm sure you've heard this before... list of all of 'em...

andrewrab

11:16 pm on Sep 9, 2003 (gmt 0)

10+ Year Member



Or something similar...

Hey all... could someone point me in the direction of a fairly comprehensive and updated list? One that doesn't read like some weird Unix Bible (ack, I'm a Windows guy!)...

Preferably one that lists the pain-in-the-ass crawlers AND one that updates with IPs/User-Agents for things like Google, MSNBot, and keeps things current...

... even if (oh no!) I have to pay for it.

Thanks... we've put 99% of our effort into Google, but are finally tired of all the NameProtects, and Archives, and the like... and further, we'd like to keep up with the Joneses (e.g. Gigabot, MSNBot, the new Yahoo bot, etc.)...

See ya.

wilderness

3:29 pm on Sep 10, 2003 (gmt 0)

andrewrab

2:42 pm on Sep 11, 2003 (gmt 0)

10+ Year Member



Hey Wilderness...

Thanks a lot... I had seen a link on here when I searched that was LIKE THIS ONE but included a lot more Unix stuff (.htaccess), so I didn't bother!

Can you clarify something for me?

If I see this, for instance, on the posting you sent me to:

RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]

Would I just add them to robots.txt like this?

User-agent: EmailSiphon
Disallow: /
User-agent: EmailWolf
Disallow: /
User-agent: ExtractorPro
Disallow: /

OR, do you think trying to use robots.txt is basically worthless because they'll just ignore it? If so, any thoughts on what to do on Windows machines? Including some cases where I may not have full root access to the entire box, as I do most of the time?

Thanks Wilderness!

SuzyUK

2:58 pm on Sep 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



andrewrab
(ack, I'm a Windows guy!)

I've been having this problem too, and thanks to some wonderful help on this thread [webmasterworld.com] I'm closer to a solution, as well as to understanding some of this "weird Unix" stuff ;)

There's a script at the end of that thread which can be modified to work like .htaccess..

Not quite the answer you're after, but it would mean you could use the .htaccess ban lists..
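For what it's worth, reusing those ban lists elsewhere mostly means pulling the user-agent patterns out of the RewriteCond lines. A minimal sketch (Python, purely illustrative; it assumes the common `RewriteCond %{HTTP_USER_AGENT} ^Name [OR]` layout seen in those threads):

```python
import re

# Matches lines like: RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
COND_RE = re.compile(
    r'RewriteCond\s+%\{HTTP_USER_AGENT\}\s+\^?(\S+?)\s*(\[OR\])?\s*$'
)

def extract_agents(htaccess_text):
    """Pull the user-agent patterns out of RewriteCond lines."""
    agents = []
    for line in htaccess_text.splitlines():
        m = COND_RE.match(line.strip())
        if m:
            agents.append(m.group(1))
    return agents

ban_list = """\
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro
"""

print(extract_agents(ban_list))
```

Once you have the bare names, you can feed them into whatever blocking mechanism your server supports.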

Suzy

wilderness

5:09 pm on Sep 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Andre,
the expressions (in some instances partial words) used in the regex for .htaccess rewrites may not be the full name the bot requires in order to comply with robots.txt.

The three examples you provided are a waste of time to add to your robots.txt; they are mischievous and non-compliant bots, and there are many more that fit into this non-compliant category.

I rarely use robots.txt these days, unless I happen to see an error in my logs from one of the few compliant bots, or unless I add a new subfolder.
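For anyone copying those lists: the RewriteCond lines quoted earlier are only the condition half of a rule. A complete block (assuming Apache with mod_rewrite enabled; note the [OR] flag is dropped on the last condition) would look something like:

```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro
RewriteRule .* - [F]
```

The [F] flag returns 403 Forbidden regardless of whether the bot honours robots.txt, which is the point of doing it server-side.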

I'm not sure if this link will help you (it's one I saved for IIS rewrites):
[webmasterworld.com...]
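In the meantime, the matching itself is simple enough to reproduce in whatever scripting you have on the Windows box. A rough sketch of the logic (Python here purely for illustration; the list is hypothetical), using prefix matching anchored at the start of the string, like the ^ in the rewrite rules:

```python
import re

# Hypothetical ban list; anchored at the start like ^ in mod_rewrite patterns.
BANNED_PREFIXES = ["EmailSiphon", "EmailWolf", "ExtractorPro"]

def is_banned(user_agent):
    """True if the user-agent starts with any banned prefix (case-insensitive)."""
    return any(re.match(re.escape(p), user_agent, re.IGNORECASE)
               for p in BANNED_PREFIXES)

print(is_banned("EmailSiphon/1.0"))   # a scraper you want to block
print(is_banned("Googlebot/2.1"))     # a crawler you want to keep
```

A filter like this would sit wherever your server lets you inspect the request headers before serving the page, and send back a 403 on a match.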