In the era of increasing numbers of bad bots is robots.txt irrelevant?
Why bother? Why not go the other way and white-list with controls?
Webwork

Msg#: 871 posted 9:29 pm on Feb 18, 2006 (gmt 0)

I just read through a number of posts in an attempt to understand what, if anything, can be done about bad bots.

I found the proposals for white-listing interesting: Throttle all bot type activity except bots with benefits. :)

Based upon what I've been reading I'm reduced to asking this about robots.txt:

Why bother?

If bother, how much?

 

bose

Msg#: 871 posted 9:14 pm on Feb 20, 2006 (gmt 0)

The problem with bad bots is only going to get worse with time.

It is already a whole lot worse than most would realize.

Banning those bad bots is not enough anymore. They just turn around and scrape content from the caches the SEs provide (MSN preview, Google Cache, the list goes on...). Those fetches never show up in our logs, so unless one goes looking for the copycats proactively, one would never know...

No wonder many are seriously considering adding no-archive to all their pages. Brett has been doing it for years.
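
For anyone who hasn't used the directive being referred to here, a minimal sketch of what adding it to every page amounts to; the helper function below is hypothetical, but the tag itself is the standard robots noarchive meta directive:

# Hypothetical helper: add a robots "noarchive" directive to every page so
# search engines may still index it but will not show a public cached copy.
NOARCHIVE_TAG = '<meta name="robots" content="noarchive">'

def add_noarchive(html: str) -> str:
    """Insert the noarchive meta tag just after the opening <head> element."""
    return html.replace("<head>", "<head>\n  " + NOARCHIVE_TAG, 1)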

incrediBILL

Msg#: 871 posted 9:45 pm on Feb 20, 2006 (gmt 0)

Ask, Looksmart

Forgot I have Ask whitelisted too but not Looksmart.

Before I started building the whitelist I looked to see who had been sending me traffic over the last 12 months and any SE with no meaningful ROI for the crawl got dropped off the list.

It's an easy judgement call if you give up 40,000 pages a month to a SE and only get 3 visitors a month in return. Blocking that crawler is a real no-brainer in my mind compared to allowing other crawlers that may send you hundreds or thousands of visitors a day in return for the crawl.
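
A minimal sketch of that judgement call, with a purely illustrative threshold; the function name and cut-off are assumptions, not anyone's actual script:

# Illustrative sketch of the "ROI for the crawl" test described above.
# The threshold is an assumption: tune it against your own logs.
def worth_whitelisting(pages_crawled: int, visits_referred: int,
                       min_visits_per_1000_pages: float = 1.0) -> bool:
    """True if a crawler sends back enough visitors to justify the pages it consumes."""
    if pages_crawled == 0:
        return True  # nothing crawled, nothing lost
    return visits_referred / (pages_crawled / 1000) >= min_visits_per_1000_pages

print(worth_whitelisting(40_000, 3))  # False: 40,000 pages a month for 3 visitors back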

No wonder many are seriously considering adding no-archive to all their pages

Yup, it's the only way to lock your content down, as the SEs can't even stop bots that hit their caches from what looks like a random series of anonymous proxy servers. That's why I'm also on a vendetta to stop all access from any anonymous proxy server I can detect: I've noticed crawls ping-ponging between several IPs that aren't even closely related, and it turned out to be anonymous proxies.
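
As a rough illustration of how some of those proxies give themselves away, a sketch that flags requests carrying the forwarding headers many proxies add; truly anonymous proxies strip these, so this is only a first filter, and both the header list and the function are assumptions:

# Rough sketch: flag requests that arrive with proxy-revealing headers
# (WSGI-style environ keys). "Elite" anonymous proxies strip these.
PROXY_HEADERS = ("HTTP_VIA", "HTTP_X_FORWARDED_FOR",
                 "HTTP_FORWARDED", "HTTP_PROXY_CONNECTION")

def looks_like_proxy(environ: dict) -> bool:
    """True if the request carries any of the common proxy-revealing headers."""
    return any(header in environ for header in PROXY_HEADERS)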

Besides, leaving old cached content on the SEs exposes your site in other ways that I'd prefer not to get into on this thread; that's a whole different debate.

wackybrit

Msg#: 871 posted 8:11 pm on Feb 21, 2006 (gmt 0)

Let me qualify the word OBEYS with respect to robots.txt: the bad bots grab just enough information to fly under the radar undetected, which is why I whitelist good bots by IP, and that stops the problem in its tracks.

Perhaps I've missed a post in this thread, but I can't see how any of this whitelisting stuff can work. Sure, it'll work against bots that use certain User-Agent strings, but what about those which use regular browser strings?

Not all bots are high impact, either: you aren't going to stop a bot that uses Internet Explorer's User-Agent string and makes only 100 requests a day.

It's better not to blacklist or whitelist, but to serve defective content to the people you don't want crawling. That way they think they're getting something useful when they're really not (and it'll take them a lot longer to move on to other techniques). Fail quietly.
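
A bare-bones sketch of that "fail quietly" idea, assuming you already have some way of classifying a request as an unwanted crawler; the classifier here is just a placeholder:

# Sketch of "fail quietly": a suspected scraper gets a page that looks like a
# successful fetch but carries little of the real content, so it has no signal
# that it has been spotted. is_unwanted_crawler() stands in for your detection.
def serve_page(request, real_html: str, is_unwanted_crawler) -> str:
    if is_unwanted_crawler(request):
        return real_html[: len(real_html) // 4]  # truncated, defective content
    return real_html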

incrediBILL

Msg#: 871 posted 9:21 pm on Feb 21, 2006 (gmt 0)

but I can't see how any of this whitelisting stuff can work

I bounce thousands of page requests a day from whitelisting alone: all the random bot names that come along just bounce off the wall and get nothing.

but what about those which use regular browser strings ... you aren't going to stop a bot with Internet Explorer's User-Agent string that makes only 100 requests in a day

A server-side script analyzes their behavior and challenges or bans them as well, so I'm stopping them, and people using AlexK's script might be as well.

[webmasterworld.com...]

It takes multiple techniques to squish the nonsense, but it can be squished for the most part.
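
A bare-bones sketch of those two layers together, a user-agent whitelist in front and a per-IP rate check behind it; every name, list, and threshold below is an illustrative assumption, not a description of incrediBILL's or AlexK's actual scripts:

import time
from collections import defaultdict, deque

WHITELISTED_UAS = ("googlebot", "msnbot", "slurp", "teoma")  # illustrative list only
WINDOW_SECONDS = 600           # look at the last ten minutes of activity
MAX_REQUESTS_IN_WINDOW = 120   # beyond this, challenge or ban the visitor

recent_hits = defaultdict(deque)

def allow_request(ip: str, user_agent: str) -> bool:
    """Whitelist check first, then a per-IP request-rate check for everyone else."""
    ua = user_agent.lower()
    if any(bot in ua for bot in WHITELISTED_UAS):
        return True  # in practice, verify the IP too, since UAs are trivially forged
    now = time.time()
    hits = recent_hits[ip]
    hits.append(now)
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) <= MAX_REQUESTS_IN_WINDOW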
