FAST index: stop words, size and coverage

"Careful about what we put into the catalog"

   
7:27 am on Jun 14, 2002 (gmt 0)

10+ Year Member



Discussion expanded from incoming links [webmasterworld.com] topic, with in-depth clarification:

Another point I'd love to have some clarification on:

Stopwords and size of index. FAST doesn't discard stopwords. Does that imply you have to store more indexed text than you would if you dropped them? Is that a limiting factor in increasing your index?


Hi Heini,

Stopwords are indexed as part of our index, but are given very little weight during ranking, of course. The reason to have them there is to offer TRUE phrase matching (like "to be or not to be", "the best of the who"). Our algorithms handle this in a clever way, and it has no impact on scaling at all. ;-)

- Knut Magne / FAST

[edited by: Marcia at 3:20 am (utc) on June 17, 2002]
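Editor's note: as a rough illustration of the idea above, here is a minimal Python sketch of a positional index that keeps stop words so exact phrases can be matched, while giving them little weight when scoring. This is not FAST's algorithm, which the post does not describe. The stop-word list, the weights, and all function names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical stop-word list and weights, chosen only for illustration.
STOP_WORDS = {"to", "be", "or", "not", "the", "of"}
STOP_WEIGHT = 0.1
TERM_WEIGHT = 1.0


def tokenize(text):
    """Lowercase and split on whitespace; real tokenizers do far more."""
    return text.lower().split()


def build_index(docs):
    """Map each term (stop words included) to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term][doc_id].append(pos)
    return index


def phrase_match(index, phrase):
    """Return the sorted ids of documents containing the exact phrase."""
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        # A start position matches if term i occurs at start + i.
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


def score(index, query, doc_id):
    """Sum term frequencies, down-weighting stop words."""
    total = 0.0
    for t in tokenize(query):
        weight = STOP_WEIGHT if t in STOP_WORDS else TERM_WEIGHT
        total += weight * len(index.get(t, {}).get(doc_id, []))
    return total


docs = {
    1: "to be or not to be that is the question",
    2: "not to be or to be",
    3: "the best of the who",
}
index = build_index(docs)
print(phrase_match(index, "to be or not to be"))
print(phrase_match(index, "the best of the who"))
print(score(index, "the question", 1))
```

Document 2 contains every word of the first phrase but in a different order, so only positional matching can tell it apart from document 1. An index that dropped stop words would have nothing left to match for a phrase made entirely of them.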

1:57 pm on Jun 14, 2002 (gmt 0)

WebmasterWorld Senior Member heini is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Thanks, Knut Magne

The true phrase matching is a cool feature. All the better if it doesn't drain on your resources.

Scalability is one of the features FAST always emphasizes - do you plan to enlarge your index even further over the next months?

11:18 am on Jun 16, 2002 (gmt 0)

10+ Year Member



Our scalability is a key feature in most of our large-scale enterprise installations (like FirstGov, eBay, Reuters). In the Web search arena, we focus more on reach and coverage than on the actual size number, and if you do studies on our index, you will find our coverage to be quite superior.

- Knut Magne / FAST

2:26 pm on Jun 16, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Hi Knut Magne,
Thanks for your answers :)

>reach and coverage

By this do you mean that you focus more on getting out into the corners of the web and finding special, unique content, rather than on the sheer number of pages?

To me, coverage in search engine terms has always been how much of the "audience" you cover - how many users you reach. Could you elaborate a bit on this?

3:00 pm on Jun 16, 2002 (gmt 0)

10+ Year Member



It's hard to reveal specific parts of our roadmap, of course. But our goal is to provide the best possible search service for our customers and users. That implies having a large enough catalog, but we need to be careful about what we put into the catalog. So being able to cover the most important parts of the web, at the level of detail that is optimal for our users - that's where our focus lies.

- Knut Magne / FAST

7:51 am on Jun 17, 2002 (gmt 0)

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Hey, great to have someone from FAST on the board! Hope you'll be sticking around, Knut, and thanks for the information ;)

Nick

9:06 am on Jun 17, 2002 (gmt 0)

WebmasterWorld Senior Member heini is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Well, size does matter. Sure, weeding out duplicates is an important factor for keeping the quality of an index.
Nevertheless, given the size of the web even the elite class indexes of FAST and Google only reflect a small part of what's really there.

And then there are alternative file formats. With the new PDF indexing (some 14 million indexed PDFs, if I'm correct), FAST has started to move in this direction.

Without any doubt the ability to index a variety of file formats is there - the corporate search technology from FAST indexes all kinds of files.

I certainly wonder: How large should an index ideally be, and what defines which files are worth indexing?

Tor

1:42 pm on Jun 17, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thank you for sharing some of your knowledge with us Knut Magne. I hope you will be tuned in to this discussion forum regularly. :)
 
