Forum Moderators: open


Crawl_Application (from IBM?)

What's the latest?


pendanticist

12:25 am on Oct 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In "Maybe a spider? Crawl_Application" [webmasterworld.com] (July 9, 2003), WebmasterWorld member werty mentions being hit by Crawl_Application (note the UA strings), and fiestagirl's response indicates that IBM's crawler is in its fledgling stages.

I get hit by Crawl_Application about once a week, and each visit it requests one seemingly random index page a single time.



129.34.20.19 - - [06/Oct/2003:13:28:52 -0700] "GET /Blah_EIGHTY_EIGHT.html HTTP/1.1" 200 2392 "-" "Crawl_Application"



If this week's visit requested "Blah_ONE.html", next time it'll ask for "Blah_TWENTY_FOUR.html", the time after that "Blah_SIX_HUNDRED.html", and then it'll come back for "Blah_FIFTY_TWO.html".
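For anyone who wants to pull these visits out of their own logs, here's a minimal sketch that matches lines like the sample above. It assumes Apache's "combined" log format; the function name and regex are mine, not anything from IBM or Crawl_Application itself.

```python
import re

# Apache "combined" log format, matching the sample line above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def crawl_application_hits(lines):
    """Yield (timestamp, requested path) for every Crawl_Application request."""
    for line in lines:
        m = LOG_RE.match(line)
        if m and m.group("ua") == "Crawl_Application":
            yield m.group("time"), m.group("path")

sample = ('129.34.20.19 - - [06/Oct/2003:13:28:52 -0700] '
          '"GET /Blah_EIGHTY_EIGHT.html HTTP/1.1" 200 2392 '
          '"-" "Crawl_Application"')

hits = list(crawl_application_hits([sample]))
print(hits)
```

Feed it your access_log (one line per entry) and the requested paths over a few weeks would show whether the index selection really looks random.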

What are your experiences with Crawl_Application?

Is this random selection of indices indicative of C_A?

Has anyone heard more on IBM's Master Plan?

Thanks.

Pendanticist.

bakedjake

1:10 am on Oct 16, 2003 (gmt 0)

pendanticist

2:53 am on Oct 16, 2003 (gmt 0)




Thanks bakedjake :)

I read that earlier, but forgot to delete the question when I decided not to convolute the thread. <blush>

My site's been in DMOZ since '97 (I think) and in just about every directory/database/search engine (past and present), and I've been manually reading my access_log files ever since. In other words, I've seen methodical patterns.

I see other bots run thru my stuff. Most do so in a seemingly pre-defined manner, and that's NOT counting the rippers! (Seen 'em nail 15-18 files per second and never blink an eye, and so have the rest of you.) So we know the technology to write a fast, thorough rogue bot is out there to be downloaded and used by just about anyone, right?

Based on the rate/pace with which C_A has been visiting me, I don't see their work reaching fruition until sometime around the turn of the century.

Pendanticist.

BlueSky

2:53 am on Oct 16, 2003 (gmt 0)

10+ Year Member



That article, written in Sep 2003, is kinda interesting. It seems totally counter to the Clever and Focused Crawler research they've been doing since 1999.

Here's a post from back in 2001 about Clever:
[webmasterworld.com...] Here's another on the same subject: [domino.watson.ibm.com...]

Focused Crawler is a high-end version of Clever and is described in this WWW8 paper:
The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date.
[www8.org...]
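The idea in that passage boils down to a best-first crawl: keep a priority queue of URLs ordered by topic relevance, and don't expand pages that fall in irrelevant regions. Here's a toy sketch of that loop; the relevance() and fetch_links() callables are stand-ins I made up for IBM's trained classifier and an actual fetcher, not their implementation.

```python
import heapq

def focused_crawl(seed_urls, relevance, fetch_links, threshold=0.5, limit=100):
    """Toy best-first focused crawler.

    relevance(url) -> float in [0, 1]: stand-in for a topic classifier
    trained on exemplary documents. fetch_links(url) -> list of outlinks.
    Pages scoring below `threshold` are kept but not expanded, so the
    crawl avoids irrelevant regions of the web.
    """
    # Max-heap via negated scores; start from the seeds.
    frontier = [(-relevance(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    visited, crawled = set(seed_urls), []

    while frontier and len(crawled) < limit:
        neg_score, url = heapq.heappop(frontier)
        crawled.append(url)
        if -neg_score < threshold:      # irrelevant region: don't expand
            continue
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return crawled

# Tiny fake web graph to exercise the loop (hypothetical data).
graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
scores = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.9, "e": 0.7}
result = focused_crawl(["a"], scores.get, graph.get)
print(result)
```

In the fake graph, page "b" scores below the threshold, so its child "d" is never fetched — that's the "significant savings" the paper is talking about, compared to crawling everything.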

IBM's focus was to build a topic-specific library by crawling a small fraction of the Web, instead of the massive crawling SEs do now. That article makes it appear they are doing the same kind of massive crawling.