| 10:50 am on Dec 2, 2004 (gmt 0)|
|Just got hit by this today, 5000 pages (the whole site) in 5 hours! |
That's on average 1 page every 3.6 seconds, which sounds pretty reasonable to me... I suppose your problem is with the overall number of pages crawled in a day (i.e. total bandwidth used) rather than the rate per second (i.e. the risk of bringing the server down).
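For anyone who wants to check the arithmetic, here is a quick sketch. The function names are just mine for illustration; the 5000 pages and 5 hours are the figures quoted above.

```python
def seconds_per_page(pages: int, hours: float) -> float:
    """Average gap between requests when `pages` pages are fetched in `hours` hours."""
    return hours * 3600 / pages


def pages_per_day(gap_seconds: float) -> float:
    """How many pages a bot would fetch in 24 hours if it kept up that gap."""
    return 24 * 60 * 60 / gap_seconds


gap = seconds_per_page(5000, 5)
print(gap)  # 3.6
print(pages_per_day(gap))
```

So the per-second rate is gentle, but a bot that kept it up for a full day would pull roughly 24,000 pages, which is where the daily total starts to matter more than the delay.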
I am speculating here, but the likely reason they crawled all these pages in one go is that they intentionally (like myself) grouped pages for the same server in the same "bucket", rather than spreading them across many. The advantage of this approach is that it helps to minimise the chance of the same server being hit at the same time by distributed crawlers. There are other advantages for processing as well.
This raises the question: what are acceptable rates of crawling?
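A rough sketch of what I mean by bucketing, assuming each bucket is handed to exactly one crawler process. I have no idea how their bot actually does it; `bucket_for_host`, `group_urls_by_bucket` and the example hosts are made up for illustration.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def bucket_for_host(host: str, num_buckets: int) -> int:
    """Map a hostname to a stable bucket number (same host, same bucket, every run)."""
    digest = hashlib.sha256(host.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def group_urls_by_bucket(urls, num_buckets):
    """Group URLs so that all pages of one server land in a single bucket."""
    buckets = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        buckets[bucket_for_host(host, num_buckets)].append(url)
    return dict(buckets)


urls = [
    "http://golf.example/clubs.html",
    "http://golf.example/putters.html",
    "http://other.example/index.html",
]
for bucket, members in sorted(group_urls_by_bucket(urls, 4).items()):
    print(bucket, members)
```

Since only one worker ever owns a given host, two distributed crawlers can't hammer the same server at once, and the per-host delay can be enforced locally by that worker.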
1) 1 request per X seconds
2) max of Y requests per day
Some bots support the "Crawl-delay" param in robots.txt (in this case it would be set to 24*60*60/Y seconds), so they can be controlled to a degree.
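To make the 24*60*60/Y bit concrete, here is a small sketch that turns a daily cap into a `Crawl-delay` value and reads it back with Python's standard `urllib.robotparser`. Keep in mind that `Crawl-delay` is a non-standard directive that only some bots honour; the helper names and the "ExampleBot" agent are hypothetical, and I round up to whole seconds because the Python parser only accepts integer values.

```python
import math
from urllib.robotparser import RobotFileParser

SECONDS_PER_DAY = 24 * 60 * 60


def crawl_delay_for_daily_cap(max_requests_per_day: int) -> int:
    """Smallest whole-second delay that keeps a bot at or under the daily cap."""
    return math.ceil(SECONDS_PER_DAY / max_requests_per_day)


def robots_txt_with_delay(delay_seconds: int) -> str:
    """Build a robots.txt that allows everything but asks for a delay."""
    return f"User-agent: *\nCrawl-delay: {delay_seconds}\nDisallow:\n"


def read_crawl_delay(robots_txt: str, user_agent: str):
    """Parse robots.txt text and return the delay that applies to user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


delay = crawl_delay_for_daily_cap(5000)  # Y = 5000 requests per day
text = robots_txt_with_delay(delay)
print(text)
print(read_crawl_delay(text, "ExampleBot"))
```

With Y = 5000 the delay comes out at 18 seconds, i.e. the same 5000 pages would be spread over a full day instead of 5 hours.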
| 11:25 am on Dec 5, 2004 (gmt 0)|
I agree that a 3.6 second delay between pages is not a big problem, and it didn't use up too much bandwidth in the long run. But what bugs me is that our site has absolutely no relationship to engineering at all - it sells golf clubs - and yet they still crawled the entire site on their first visit.
Surely it would have been only polite to make an initial foray to determine whether or not we had any engineering-related pages before attempting a full crawl? After all, what's the point of a niche content engine if it can't identify niche content?
| 11:56 am on Dec 5, 2004 (gmt 0)|
|After all, what's the point of a niche content engine if it can't identify niche content? |
Yes, I agree with that, if not from the position of a webmaster then from the position of a search engine, which should naturally be interested in improving the performance of its bot by not wasting bandwidth on both sides crawling something that won't be needed. However, it requires less straightforward coding than just grabbing ALL pages and then throwing away those that appear to be irrelevant. So I suppose their argument will be that they need to analyse each page in order to decide whether it's relevant or not, rather than build some kind of probabilistic model that tries to guess whether the site overall is relevant based on a few pages (taking into account links from pages already known to be relevant).
I suspect that many engineers just prefer the easy way of coding when they can afford it (i.e. they have lots of available bandwidth, and what happens on the webmaster's side is sometimes not well considered). For example, how many bots that re-visit sites use the Last-Modified datestamp to ensure they don't re-download a document that hasn't changed? For some reason even Googlebot does not appear to use it (correct me if I am wrong here).
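Just to illustrate the "guess from a few pages" idea, here is a deliberately crude sketch: fetch a small sample, count how many pages look on-topic, and only go for a full crawl if the smoothed estimate clears a threshold. Everything here is an assumption of mine (the keyword test, the add-one smoothing, the 0.5 threshold), and it ignores the link evidence mentioned above.

```python
def looks_relevant(page_text: str, keywords: set) -> bool:
    """Crude topical test: does the page contain any of the niche keywords?"""
    words = set(page_text.lower().split())
    return any(keyword in words for keyword in keywords)


def estimate_site_relevance(sample_pages, keywords) -> float:
    """Estimate the share of relevant pages on the site from a small sample.

    Add-one smoothing keeps a tiny sample from producing a hard 0 or 1.
    """
    hits = sum(looks_relevant(page, keywords) for page in sample_pages)
    return (hits + 1) / (len(sample_pages) + 2)


def should_crawl_full_site(sample_pages, keywords, threshold=0.5) -> bool:
    """Decide on a full crawl only if the sampled estimate clears the threshold."""
    return estimate_site_relevance(sample_pages, keywords) >= threshold


engineering_terms = {"engineering", "turbine", "cad"}
golf_sample = [
    "Buy golf clubs online",
    "Putters and drivers on sale",
    "Golf bags and shoes",
]
print(estimate_site_relevance(golf_sample, engineering_terms))
print(should_crawl_full_site(golf_sample, engineering_terms))
```

Three sampled pages cost the golf site a few seconds of bandwidth instead of 5000 requests, which is the whole point; a real engine would obviously want a better classifier than keyword matching.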
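For reference, using Last-Modified on a re-visit is just a conditional GET: send back the datestamp from the previous fetch in an `If-Modified-Since` header and the server can answer 304 with no body. A minimal sketch with Python's standard `urllib`; the function names and the example URL are mine, and this says nothing about how any particular bot implements it.

```python
import urllib.error
import urllib.request


def build_conditional_request(url: str, last_modified: str) -> urllib.request.Request:
    """Build a GET that asks for the page only if it changed since last_modified.

    last_modified is the Last-Modified header value saved from the previous fetch.
    """
    return urllib.request.Request(url, headers={"If-Modified-Since": last_modified})


def fetch_if_changed(url: str, last_modified: str):
    """Return (body, new_last_modified), or (None, last_modified) on a 304."""
    request = build_conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the cached copy
            return None, last_modified
        raise


request = build_conditional_request(
    "http://www.example.com/", "Thu, 02 Dec 2004 10:50:00 GMT"
)
print(request.header_items())
```

A 304 costs a few hundred bytes of headers instead of the whole page, so for a 5000-page site that rarely changes the saving on re-crawls is substantial.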
| 5:02 pm on Dec 5, 2004 (gmt 0)|
Oops, my bad -- Googlebot does appear to support Last-Modified, so please ignore my rant above...
| 9:35 am on Dec 6, 2004 (gmt 0)|
My spider list for the entire month of December up to moments ago IoI...
ETS v5.1 translation service pre-fetch spider ¦www.freetranslation.com¦  524650.73 MB  06 Dec 2004 - 07:44
Oracle Ultra Search (Unknown Agent/Spider)  2036+35025.63 MB  06 Dec 2004 - 09:02
Fish search  174256.60 MB  05 Dec 2004 - 04:53
Google Spider ¦google.com¦  1178+3229.30 MB  06 Dec 2004 - 04:22
I use translation services, so ETS is user-dependent; forget that one.
Oracle is on crack! IOI... Fish had no robots.txt hits this month :-<