I get hit by Crawl_Application about once a week, and each visit it requests one totally random page, ONE time:
129.34.20.19 - - [06/Oct/2003:13:28:52 -0700] "GET /Blah_EIGHTY_EIGHT.html HTTP/1.1" 200 2392 "-" "Crawl_Application"

What are your experiences with Crawl_Application?
Is this random selection of pages typical of C_A?
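If anyone wants to pull their own C_A visits out of an access_log, here's a rough Python sketch against the Apache combined log format (the log path and the exact UA string are just my assumptions; adjust for your setup):

import re

LOG = "access_log"              # wherever your combined-format log lives
UA = "Crawl_Application"

# combined format: host ident user [time] "request" status bytes "referer" "agent"
line_re = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

with open(LOG) as fh:
    for line in fh:
        m = line_re.match(line)
        if m and UA in m.group("agent"):
            print(m.group("time"), m.group("request"), m.group("status"))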
Has anyone heard more on IBM's Master Plan?
Thanks.
Pendanticist.
I read that earlier, but forgot to delete the question when I decided not to convolute the thread. <blush>
My site's been in DMOZ since '97 (I think) and in just about every directory/database/search engine (past and present), and I've been manually reading my access_log files ever since, too. In other words, I've seen the methodical patterns.
I see other bots run thru my stuff. Most do so in a seemingly pre-defined manner, and that's NOT counting the rippers! (Seen 'em nail 15-18 files per second and never blink an eye, and so have the rest of you.) So we know the technology to write fast, thorough rogue bots is out there to be downloaded and used by just about anyone, right?
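If you want to spot the rippers in your own logs, a quick-and-dirty Python sketch like this flags any client hammering the server, say 10+ requests inside the same second (threshold and log path are just examples, tune for your site):

import re
from collections import Counter

LOG = "access_log"
THRESHOLD = 10          # requests within a single second

line_re = re.compile(r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\]')

hits = Counter()
with open(LOG) as fh:
    for line in fh:
        m = line_re.match(line)
        if m:
            # timestamps like 06/Oct/2003:13:28:52 -0700 already have
            # one-second resolution, so the raw string works as a bucket
            hits[(m.group("host"), m.group("time"))] += 1

for (host, when), n in sorted(hits.items()):
    if n >= THRESHOLD:
        print(f"{host} made {n} requests at {when}")

A ripper nailing 15-18 files per second lights up immediately; C_A, at one request a week, never will.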
Based on the rate/pace with which C_A has been visiting me, I don't see their work reaching fruition until sometime around the next turn of the century.
Pendanticist.
Here's a post from back in 2001 about Clever:
[webmasterworld.com...] Here's another on the same subject: [domino.watson.ibm.com...]
The Focused Crawler is a high-end version of Clever and is described in this document on the WWW8 site:
The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date.
[www8.org...]
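In plain terms, the crawler keeps its frontier ordered by predicted topic relevance and only expands the most promising links, pruning everything off-topic. Here's a toy Python sketch of that idea; the word-list scorer and the little in-memory "web" are stand-ins for illustration, since IBM's real system classifies pages against exemplary documents rather than keywords:

import heapq

# Toy "web": url -> (page text, outgoing links). Stands in for real fetching.
WEB = {
    "a": ("focused crawler and hypertext taxonomy", ["b", "c"]),
    "b": ("cooking recipes and gardening tips", ["d"]),
    "c": ("crawler frontier notes on taxonomy pruning", ["d", "e"]),
    "d": ("more cooking and gardening", []),
    "e": ("hypertext crawler boundary analysis", []),
}

TOPIC = {"crawler", "hypertext", "taxonomy", "frontier"}   # example topic

def relevance(text):
    # Stand-in scorer: fraction of words that are on-topic.
    words = text.lower().split()
    return sum(w in TOPIC for w in words) / max(len(words), 1)

def focused_crawl(seeds, cutoff=0.1):
    frontier = [(-1.0, url) for url in seeds]   # max-heap via negated score
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        if relevance(text) < cutoff:
            continue        # prune an irrelevant region of the "web"
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                link_text, _ = WEB[link]
                heapq.heappush(frontier, (-relevance(link_text), link))
    return visited

print(focused_crawl(["a"]))     # visits a, c, e; the cooking pages get pruned

The priority queue is where the paper's claimed resource savings come from: the off-topic pages never get their links expanded, so the crawl stays small and on-topic.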
IBM's focus was to build a topic-specific library by crawling only a small fraction of the Web, rather than doing the massive crawls SEs do now. That article, though, makes it appear they are doing the same massive crawling.