Forum Moderators: bakedjake
My background includes real-time processes. I'm also a sort of technical visionary, sometimes peering so far over the horizon that no one will listen.
For instance, my first vision of peer-to-peer parallel processing was in 1984. People thought I was joking. Google is playing with that now, and while I'm not up to date on this, I'm sure many others are too. Then, of course, the peer-to-peer file sharing networks like Napster, Kazaa, etc. are testimony to the potential of P2P wide area networks.
I saw the "information highway" forming while most people were still trying to decide whether a home PC was worth the investment, and for the most part, deciding against it.
I'll start by saying that the only place the resources exist for a next-generation search engine is on peer-to-peer networks. Server farms of 10,000+ processors just won't cut it. Think of a small application running behind a search toolbar on a few hundred thousand computers. Think of a hierarchical real-time resource manager managing the data acquisition, storage and processing as computers randomly enter and leave the network.
The concepts here aren't rocket science. In fact, they're old stuff, as anyone who has cooked up their own real-time multi-tasking operating system and designed distributed real-time processes can attest.
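To make the resource-manager idea a bit more concrete, here's a toy sketch in Python. All the names and structure here are mine, not anything that exists yet: a coordinator hands crawl shards to whichever peers happen to be online and reclaims them when a peer drops off.

class ResourceManager:
    """Toy coordinator: hands crawl shards to whichever peers are currently online."""

    def __init__(self, shards):
        self.unassigned = list(shards)   # work not yet handed out
        self.assignments = {}            # peer_id -> list of shards held by that peer

    def peer_joined(self, peer_id):
        self.assignments[peer_id] = []
        self._rebalance()

    def peer_left(self, peer_id):
        # Work held by the departing peer goes back into the pool.
        self.unassigned.extend(self.assignments.pop(peer_id, []))
        self._rebalance()

    def _rebalance(self):
        peers = list(self.assignments)
        while self.unassigned and peers:
            # Hand the next shard to the least-loaded peer.
            peer = min(peers, key=lambda p: len(self.assignments[p]))
            self.assignments[peer].append(self.unassigned.pop())

# Example: peers come and go, but the work stays covered.
mgr = ResourceManager(shards=["shard-%d" % i for i in range(8)])
mgr.peer_joined("node-a")                # node-a picks up all eight shards
mgr.peer_joined("node-b")
mgr.peer_left("node-a")                  # node-a's shards are re-handed to node-b

A real version would obviously need time-outs, redundancy and a hierarchy of managers, but the join/leave/rebalance loop is the core of it.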
One of the most challenging problems today for a search engine is coping with all the dynamic pages. There are plenty of sites that use dynamic pages to manage their very valuable content, but there is also a huge amount of crap.
A group of volunteers could be really helpful in this situation and provide address schemes that tell the engines where the good dynamic pages are.
This should result in some kind of a "white list", where the "good" dynamic URLs are listed as regular expression patterns:
www.site1.com/article.php?id=[0-9]{1,4}
www.site2.com/board/thread.cgi?forum=[0-9]+&id=[0-9]{1,5}
With such a list, the engines would be able to crawl a notable part of the hidden web, but avoid wasting their resources on garbage pages.
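Purely as a sketch of how an engine might apply such a whitelist at crawl time (the function names and file handling below are made up; note that the literal "." and "?" in the listed patterns would need escaping before being fed to a real regex engine):

import re

def load_whitelist(path):
    """Compile one pattern per line; anchor them so partial matches don't slip through."""
    patterns = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            patterns.append(re.compile(r"^(?:https?://)?" + line + r"$"))
    return patterns

def is_whitelisted(url, patterns):
    """True if a dynamic URL matches any whitelisted pattern."""
    return any(p.match(url) for p in patterns)

# Example whitelist line (with the literal '.' and '?' escaped):
#   www\.site1\.com/article\.php\?id=[0-9]{1,4}
# is_whitelisted("www.site1.com/article.php?id=123", patterns)         -> True
# is_whitelisted("www.site1.com/article.php?id=123&sid=xyz", patterns) -> False

Anchoring the patterns matters: otherwise a garbage URL that merely contains a whitelisted prefix (say, with a session ID tacked on) would slip through.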
Some of our theories and observations are:
- The web is 4 billion pages, but only 45 million domains exist.
- Utilize and leverage existing technologies.
- DMOZ is great; it should be used as a core (1.8 million domains)
- Yahoo is great (700K domains)
- Advanced relevancy technology is needed.
- Database approach is too slow.
- Distributed Toolbar approach is too slow.
- Need for SEO minded people as the core group to better understand the SE.
- C is needed for core processes.
- Perl or Python should be used for crunching.
Give me something to talk to.. Summarize everything in this thread.. We are looking to redo this search engine..
Wayne
wbienek@lvcm.com
[edited by: jeremy_goodrich at 9:12 pm (utc) on June 16, 2003]
[edit reason] snipped domain name - please see TOS thanks [/edit]
After all, I'm not about to go dropping the name of MY search tool for everybody to peruse - as much as I might like to.
Thanks much - great discussion, and I can see that there are potentially LOTS of great things that could come out of collaboration among the various parties here.
In any case, do we decide to use a cache to look up information, like Yahoo?
Are we going to use link popularity or page ranking?
Are we going to have any directory at all in it (even Google now has a directory on its web site)?
I do believe a serious discussion needs to involve people who are taking this seriously and not in it for the money.
Anyways, since it seems that most of the people I have stickied with are more in the middle to top end and not on the low level of SE work, I remembered a while ago that Google had some sort of programming contest where they gave out a bit of code and a dump of data, so that is probably the first route I will try. Like I have already said, spidering is probably the area where I need the most help. I can either put up something whereby people take the rawest of the raw spider output (which is mostly an XML header with an HTML document) *or* the parsed output (which I think would be more useful, especially for people who are into the whole indexing thing). Or better yet - both. :)
Anyway, I have provided mostly data files and descriptions of how I go from one place to another, with the hope that you guys can get creative out there. We'll see! So you don't have to download the whole thing, the 'readme'-like file is provided:
Jun 16 2003
You're free to do whatever you want with all of this stuff. As I've previously stated, it is pretty clear (to me) that most of the work needs to go into the spider and parser. Anyone who does any work that is used at xxx based on these files will be given appropriate credit.

xxx basic file format info:
First, the spider runs and generates:
fluffy.out:
This is a pseudo-xml doc which contains the url as I see it. Each "FluffyURL" section acts as a record separator as well. SEO's: don't try to put this into your page because the spider will toss it.
-----
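The actual markup isn't shown in this readme, so purely as a guess at what reading fluffy.out could look like (the tag form and file layout below are assumptions on my part):

import re

def iter_fluffy_records(path):
    """Yield one pseudo-XML record per "FluffyURL" section.

    The tag name comes from the description above; the exact file layout
    is an assumption.
    """
    with open(path, encoding="utf-8", errors="replace") as fh:
        data = fh.read()
    for match in re.finditer(r"<FluffyURL>(.*?)</FluffyURL>", data, re.DOTALL):
        yield match.group(1)

# for record in iter_fluffy_records("fluffy.out"):
#     ...  # hand the record to the metadata generator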
This file is then processed by a metadata generator which creates two files, fluffy.out.db and fluffy.out.log. The log is used to find other urls to crawl and also combined with other data (toolbar usage, links to xxx, etc) to generate the overall hipranking for a site.
-----
fluffy.out.db:
This is exactly the same format as fluffy.hrd (below), except that it doesn't have the hiprank. Also, the ID ends up getting reassigned.

fluffy.out.log:
1) Link type
2) If "1", this is a non-local link
3) Link destination
-----
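As a rough sketch of pulling crawl candidates out of that log (the tab separator is my assumption; the three columns follow the list above):

def outbound_links(log_path, sep="\t"):
    """Yield (link_type, destination) for non-local links found in fluffy.out.log."""
    with open(log_path) as fh:
        for line in fh:
            parts = line.rstrip("\n").split(sep)
            if len(parts) < 3:
                continue
            link_type, non_local, destination = parts[0], parts[1], parts[2]
            if non_local == "1":          # "1" marks a non-local link
                yield link_type, destination

# Non-local destinations could then be queued for the next crawl pass:
# to_crawl = [dest for _type, dest in outbound_links("fluffy.out.log")]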
Last of the metadata parsing/munging, the fluffy.out.db file is ranked and produces:
-----
fluffy.hrd:
1) RecID (assigned)
2) HipRank (int, absolute value, not geometrically scaled)
3) URL (fully normalised url)
4) DomRev (just the domain in reverse)
5) AdultFlag (int)
6) TimeStamp
7) Size (int)
8) PageTitle
9) PageText
10) PagePhrase (text, comma separated phrases, user submitted listing stuff)
11) OwnerID (text, user submitted listing)
12) Modifiers (for customisation, should always be blank here)
13) Meta Description
14) Meta Keywords
15) ImgAlt text, (backtick separated per image in order found on page)
16) Record Type (int)
17) Link texts (backtick separated, per link in order found on page)
18) EMBText (any text found with <EM> or <B>'s. backtick separated, per link in order found on page)
19) HText (any text found within <Hx>'s backtick separated, per link in order found on page)
20) CTURL (specialized clickthrough url)
-----
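A quick sketch of loading those records in Python (the tab separator is an assumption; the field names just mirror the list above):

import csv
from collections import namedtuple

# Field names follow the fluffy.hrd description above; the separator is an assumption.
FIELDS = [
    "RecID", "HipRank", "URL", "DomRev", "AdultFlag", "TimeStamp", "Size",
    "PageTitle", "PageText", "PagePhrase", "OwnerID", "Modifiers",
    "MetaDescription", "MetaKeywords", "ImgAlt", "RecordType",
    "LinkTexts", "EMBText", "HText", "CTURL",
]
HrdRecord = namedtuple("HrdRecord", FIELDS)

def read_hrd(path, sep="\t"):
    """Yield one HrdRecord per line of a fluffy.hrd-style file."""
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter=sep):
            if len(row) != len(FIELDS):
                continue  # skip malformed lines
            yield HrdRecord(*row)

# Backtick-separated fields (ImgAlt, LinkTexts, EMBText, HText) can be split further:
# record.LinkTexts.split("`")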
This file (fluffy.hrd in this example) serves as the input to yet another filter which produces a series of "tagmaps" for different parts of the data, sorted based on the "word rank". The file format here is:
1) Tag
2) WordRank (this is a geometrically scaled value, higher=better)
3) RecID (from fluffy.hrd)
4) AdultFlag (really from fluffy.hrd)
5) DomRev (again, from fluffy.hrd)

Ultimately, this is the inverted index. Note the coarseness -- no word positioning or flagging here; it's all been incorporated into the WordRank. How should it be searched?

If you have questions or comments, email to xxxxxxxxxxx
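One possible answer to the "how should it be searched?" question above, under the same tab-separator assumption: since each tagmap already carries a single WordRank per (tag, RecID), a naive multi-word query can merge the per-word lists and sum the ranks.

from collections import defaultdict

def load_tagmap(path, sep="\t"):
    """tag -> list of (wordrank, recid), preserving the pre-sorted order."""
    index = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            parts = line.rstrip("\n").split(sep)
            if len(parts) < 5:
                continue
            tag, wordrank, recid = parts[0], parts[1], parts[2]
            index[tag].append((int(wordrank), recid))
    return index

def search(index, query, limit=10):
    """Naive multi-word search: sum WordRank over matching tags per RecID."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for wordrank, recid in index.get(word, []):
            scores[recid] += wordrank
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]

# index = load_tagmap("tagmap.dat")
# for recid, score in search(index, "search engine"):
#     print(recid, score)   # RecID points back into fluffy.hrd

Summing geometrically scaled ranks is only one choice; a real query layer would probably want phrase handling and the AdultFlag/DomRev columns for filtering, but this shows where the data would plug in.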
To get the raw data files, go to "my site" /hiptest-20030616.tgz -- this file is about 2M.
If you can't figure out that magic URL, sticky me and I will email it to you.
I'll start by saying the only place the resources exist for a next-generation search engine is on peer-to-peer networks
Agreed, although if grub manages to attract the DC community to support them, then I'd see them as a formidable power in the future.
Maybe if Shippo and Gigablast teamed up with some sort of shared human index, then I would find that very exciting.
Rob
I read somewhere that you do not use a regular database like Oracle or Informix to store your data because it is too slow. Why? Is a flat data file processed faster?
A typical RDBMS provides a lot of functionality that isn't needed by a search engine. But this functionality makes it slower compared to a tailored solution. I wouldn't call it a "flat data file" though; think of it as a specialized version of a RDBMS that just does what the engine needs - in a very efficient way.
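As a toy illustration of what a "specialized version of a RDBMS" can mean in practice (everything below is just my own sketch): if the index is built offline and only ever read at query time, something as simple as a sorted array plus binary search stands in for the whole query planner, locking and transaction machinery.

import bisect

class StaticIndex:
    """Read-only key lookup: all the engine needs at query time, and nothing else.

    Keys are sorted once at build time; lookups are O(log n) with no
    query parsing, locking, or transaction overhead.
    """

    def __init__(self, pairs):
        pairs = sorted(pairs)                 # the build step happens offline
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None

# index = StaticIndex([("engine", [12, 87]), ("search", [12, 40, 87])])
# index.get("search")  -> [12, 40, 87]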