Forum Moderators: mack
In early October I implemented a bot-blocking routine [webmasterworld.com] that finally worked in blocking long-term scrapers (bots that take a page each and every second or so until they have scraped an entire (maybe 100,000-page) site). It worked very well. Too well! The Google Adsense-bot (21,109 in October) and Inktomi Slurp (55,726) were both caught taking more than the max 777 pages allowed in a 24-hour period. However, both seem to have learned--or been re-programmed--and it no longer happens.
A week or more ago the MSNBot started to get caught in the same snare. I upped the limit to 1,000 pages per day, but still it went on, and on, and on. This bot took 15,972 pages in October. In the first 28 hours of November it has taken 1,110 pages and tried to take another 1,767 (which were refused due to the bot-blocking). It:
</rant>
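The blocking routine itself isn't shown in the thread; a minimal sketch of that kind of rolling per-bot daily cap (the names and the in-memory store are hypothetical, not the poster's actual code) might look like:

```python
import time
from collections import defaultdict

WINDOW = 24 * 60 * 60      # rolling 24-hour window, in seconds
MAX_PAGES = 777            # the daily cap mentioned in the post

_hits = defaultdict(list)  # client id (e.g. user-agent + IP) -> request timestamps

def allow(client, now=None):
    """Return True if this client is still under its daily page cap."""
    now = time.time() if now is None else now
    # keep only the timestamps from the last 24 hours
    recent = [t for t in _hits[client] if now - t < WINDOW]
    _hits[client] = recent
    if len(recent) >= MAX_PAGES:
        return False       # over the cap: refuse the page (e.g. serve a 403)
    recent.append(now)
    return True
```

A request handler would call `allow()` before serving each page and refuse anything past the limit, which matches the "refused due to the bot-blocking" counts above.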
Microsoft should stop jerking around with their own search technology and buy a real search engine.
Regards...jmcc
# Does not use If-Modified-Since
Do you know any other web-scale crawlers that support If-Modified-Since? I think Googlebot might support it now, but it is certainly a very recent (2005) change. If-Modified-Since is trivial for browsers, but not for bots that have to deal with billions and billions of URLs.
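For what it's worth, the mechanics of If-Modified-Since are just an HTTP date comparison on the server side; a minimal sketch (the `respond` helper is made up for illustration, standard library only):

```python
from email.utils import parsedate_to_datetime, format_datetime
from datetime import datetime, timezone

def respond(last_modified, if_modified_since=None):
    """Return 304 if the client's cached copy is still fresh, else 200.

    A well-behaved bot would send If-Modified-Since and skip
    re-downloading unchanged pages whenever it gets a 304 back.
    """
    if if_modified_since:
        try:
            cached = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return 200     # unparsable header: fall back to a full response
        if last_modified <= cached:
            return 304     # Not Modified: no body sent
    return 200

page_time = datetime(2003, 10, 1, tzinfo=timezone.utc)
print(respond(page_time))                              # -> 200 (no header: full response)
print(respond(page_time, format_datetime(page_time)))  # -> 304 (client copy is current)
```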
AFAIK MSNbot supports the Crawl-Delay parameter, so you could easily have limited the number of requests per day -- say, make it 120 seconds or higher -- or, better, exclude some or all of your URLs via robots.txt and you won't have a problem with the bot(s).
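Assuming the bot honours the non-standard Crawl-Delay extension as described, the robots.txt entry would look something like this (the `/archive/` path is just an illustration):

```
# Ask MSNbot to wait 120 seconds between requests...
User-agent: msnbot
Crawl-delay: 120
# ...and/or keep it out of part of the site entirely
Disallow: /archive/
```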
Microsoft should stop jerking around with their own search technology and buy a real search engine.
Like which one? Who is much better than Microsoft at search engines? Google, Yahoo and then who? Pretty much nobody, and only a fool would license core technology from key competitors. It's not an option for Microsoft, and it's good that they invest the money, because there is nothing worse than a monopoly - be it Microsoft's or Google's.
Do you know any other web scale crawlers that support If-Modified-Since?
It is trivial for the search technology to implement. They simply need to have the will to do it.
It is trivial for the search technology to implement.
Nothing is trivial when you deal with billions of pages - MSNbot has been in development far less time than any other major bot, so it's not surprising it's not as advanced as the others.
Your main issue seems to be with the number of requests rather than support for If-Modified-Since; this can be controlled using the Crawl-Delay parameter, which MSNbot was (I think) the first to support.
The reason that some bots don't support compressed pages is most likely due to very few webservers actually supporting it.
Can't find info on Googlebot
Nothing is trivial when you deal with billions of pages
Your main issue seems to be with number of requests
The reason that some bots don't support compressed pages is most likely due to very few webservers actually supporting it.
PS The Teoma crawler (Ask Jeeves) also supports compression.
Nonsense (sorry).
AlexK - I have written a bot that has crawled 2 bln pages so far; it supports gzip, among other things, and from this very representative sample I concluded that the number of servers that support gzip is extremely low. I was very disappointed. The fact that it's easy to support, and that you have PHP, is not relevant to the simple fact that very few servers support gzip - that's a fact.
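For anyone curious what "supporting gzip" costs a bot, the client side is tiny; a sketch of the decode path a gzip-aware crawler would run on each response (the `decode_body` name is made up):

```python
import gzip

def decode_body(raw, content_encoding=None):
    # A bandwidth-conscious bot sends "Accept-Encoding: gzip" with each
    # request and decompresses the body when the server honours it.
    if content_encoding == "gzip":
        return gzip.decompress(raw)
    return raw             # server sent the body uncompressed

html = b"<html><body>" + b"the same markup repeats " * 50 + b"</body></html>"
wire = gzip.compress(html)
print(len(wire) < len(html))              # -> True: repetitive HTML compresses well
print(decode_body(wire, "gzip") == html)  # -> True: round-trips losslessly
```

Which is why bots would happily use it if more servers switched it on: the savings are almost free on the client side.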
I have to admit that there is merit in *your* POV; I hope that you can also see the merit in mine. My former hosts went goggle-eyed on me when I specified dual-multi-Gig-Xeons for the Linux box that they were going to build for me - after all, who needs such specs when Linux can run on a 486, right? Wrong! There is a reason that humankind is built with eyes that face forward.
It's OK, I feel better now, nurse.
"it is enacted, active and is working" point of view
True - I am merely commenting on the way things are. I wish more people like you switched on support for gzip because, believe me, bots want to use as little bandwidth as possible, and if any significant number of sites supported gzip then most bots would support it too.
It's OK, I feel better now, nurse.
That's good - invoice is in the post ;)