Search Engine Spider and User Agent Identification Forum

    
Ocelli/1.1
One aggressive spider!
jam13
10+ Year Member

Msg#: 2643 posted 12:47 am on Dec 2, 2004 (gmt 0)

66.194.55.242
Ocelli/1.1 (http://www.globalspec.com/Ocelli)

Just got hit by this today, 5000 pages (the whole site) in 5 hours!

Engineering content my a**e.

Banned.

 

Lord Majestic
WebmasterWorld Senior Member 10+ Year Member

Msg#: 2643 posted 10:50 am on Dec 2, 2004 (gmt 0)

Just got hit by this today, 5000 pages (the whole site) in 5 hours!

That's on average 1 page every 3.6 seconds (5 hours * 3600 seconds / 5000 pages = 3.6 seconds per page) -- sounds pretty reasonable to me. I suppose your problem is with the overall number of pages crawled in a day (i.e. total bandwidth used) rather than the rate per second (i.e. the risk of taking the server down).

I am speculating here, but the likely reason they crawled all these pages in one go is that they intentionally (as I do myself) group pages from the same server into the same "bucket" rather than spreading them across many. The advantage of this approach is that it minimises the chance of the same server being hit at the same time by several distributed crawlers, and it has other advantages for processing as well.
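
To show the sort of bucketing I mean, here is a rough sketch in Python -- the function name and the tiny frontier below are my own invention, not how GlobalSpec actually does it:

from collections import defaultdict
from urllib.parse import urlparse

def bucket_by_host(urls):
    # One "bucket" (queue) per hostname, so a single worker drains each host
    # and distributed workers never hammer the same server at the same time.
    buckets = defaultdict(list)
    for url in urls:
        buckets[urlparse(url).hostname].append(url)
    return buckets

frontier = [
    "http://www.example.com/drivers.html",
    "http://www.example.com/putters.html",
    "http://www.other-example.net/index.html",
]
for host, queue in bucket_by_host(frontier).items():
    print(host, "->", len(queue), "pages queued")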

This raises the question -- what are acceptable rates of crawling:
1) 1 request per X seconds
2) max of Y requests per day

Bots that support the "Crawl-delay" param in robots.txt (which in this case would be set to 24*60*60/Y seconds) can be controlled to a degree.
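
For instance, a site that wanted at most 5000 requests a day could publish "Crawl-delay: 17" (24*60*60/5000 is roughly 17 seconds). A rough sketch of how a polite bot could read that value with Python's standard library -- the URL and user-agent below are just placeholders, and I have no idea whether Ocelli honours the directive at all:

from urllib.robotparser import RobotFileParser

# Fetch and parse robots.txt; crawl_delay() returns the Crawl-delay value
# for the given user-agent, or None if the site does not specify one.
parser = RobotFileParser("http://www.example.com/robots.txt")
parser.read()

delay = parser.crawl_delay("Ocelli")
if delay is None:
    delay = 10  # conservative default pause (placeholder value)
print("seconds to wait between requests:", delay)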

jam13
10+ Year Member

Msg#: 2643 posted 11:25 am on Dec 5, 2004 (gmt 0)

I agree that a 3.6 second delay between pages is not a big problem, and it didn't use up too much bandwidth in the long run. But what bugs me is that our site has absolutely no relationship to engineering at all - it sells golf clubs - and yet they still crawled the entire site on their first visit.

Surely it would have been only polite to make an initial foray to determine whether or not we had any engineering related pages before attempting a full crawl? After all, what's the point of a niche content engine if it can't identify niche content?

Lord Majestic
WebmasterWorld Senior Member 10+ Year Member

Msg#: 2643 posted 11:56 am on Dec 5, 2004 (gmt 0)

After all, what's the point of a niche content engine if it can't identify niche content?

Yes, I agree with that -- if not from the position of a webmaster, then from the position of a search engine, which should naturally be interested in improving the performance of its bot by not wasting bandwidth (on either side) crawling something that won't be needed. It does, however, require less straightforward coding than just grabbing ALL the pages and then throwing away those that appear to be irrelevant. So I suppose their argument will be that they need to analyse each page in order to decide whether it is relevant, rather than build some kind of probabilistic model that guesses whether the site as a whole is relevant based on a few pages (and takes into account links from pages already known to be relevant).
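
Just to illustrate the kind of shortcut I mean, a toy sketch (my own invention, nothing to do with GlobalSpec's real code; the keyword list and the 30% threshold are made up): fetch a handful of pages first and only schedule the full crawl if enough of them look on-topic.

import re
import urllib.request

# Made-up list of "engineering" terms; a real engine would use a proper classifier.
ENGINEERING_TERMS = re.compile(r"\b(actuator|bearing|datasheet|tolerance|torque)\b", re.I)

def site_looks_relevant(sample_urls, threshold=0.3):
    # Fetch a small sample and return True if enough pages match the topic.
    hits = 0
    for url in sample_urls:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        if ENGINEERING_TERMS.search(html):
            hits += 1
    return hits / len(sample_urls) >= threshold

sample = ["http://www.example.com/", "http://www.example.com/products.html"]
if site_looks_relevant(sample):
    print("worth a full crawl")
else:
    print("skip the remaining pages")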

I suspect that many engineers simply prefer the easy way of coding when they can afford it (i.e. they have lots of available bandwidth, and what happens on the webmaster's side is sometimes not well considered). For example, how many bots that re-visit sites use the Last-Modified datestamp to make sure they don't re-download the same document more than once? For some reason even Googlebot does not appear to use it (correct me if I am wrong here).
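
What I mean is a conditional re-fetch along these lines (again just a rough Python sketch; the URL and the saved date are placeholders):

import urllib.request
import urllib.error

url = "http://www.example.com/page.html"
# Last-Modified value saved from the previous visit (placeholder date).
last_seen = "Thu, 02 Dec 2004 00:47:00 GMT"

req = urllib.request.Request(url, headers={"If-Modified-Since": last_seen})
try:
    body = urllib.request.urlopen(req, timeout=10).read()
    print("page changed, re-downloaded", len(body), "bytes")
except urllib.error.HTTPError as err:
    if err.code == 304:
        # Server says our stored copy is still current; no body was transferred.
        print("not modified, nothing re-downloaded")
    else:
        raise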

Lord Majestic
WebmasterWorld Senior Member 10+ Year Member

Msg#: 2643 posted 5:02 pm on Dec 5, 2004 (gmt 0)

Oops, my bad -- Googlebot does appear to support Last-Modified, so please ignore my rant above...

JAB Creations
WebmasterWorld Senior Member jab_creations is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Msg#: 2643 posted 9:35 am on Dec 6, 2004 (gmt 0)

My spider list for the entire month of December up to moments ago, lol...

ETS v5.1 translation service pre-fetch spider (www.freetranslation.com): 5246 hits, 50.73 MB, last visit 06 Dec 2004 - 07:44
Oracle Ultra Search (Unknown Agent/Spider): 2036+350 hits, 25.63 MB, last visit 06 Dec 2004 - 09:02
Fish search: 1742 hits, 56.60 MB, last visit 05 Dec 2004 - 04:53
Google Spider (google.com): 1178+322 hits, 9.30 MB, last visit 06 Dec 2004 - 04:22

I use translation services, so ETS is user-dependent; forget that one.

Oracle is on crack! lol... Fish had no robots.txt hits this month :-<
