
Alternative Search Engines Forum

    
SearchHippo
a few thoughts
cornwall




msg:460604
 9:51 pm on Jul 1, 2003 (gmt 0)

I don't know how many of you have looked at SearchHippo [searchhippo.com]; I certainly had not until today. It came up in a thread that Jeremy started [webmasterworld.com].

That prompted me to take a look at SearchHippo (in spite of getting little traffic from it!). Here are a few things I learnt:

1. SearchHippo is the exclusive hobby of Kevin Marcus...currently Chief Software Architect at a large internet company (Kevin posts here from time to time..g'day Kevin ;) )

2. Rather than attempting to index the entire web, the SearchHippo spider (named "fluffy", after a pet tarantula) is first seeded with sites that are currently listed in the larger directories on the internet, such as DMOZ.

3. SearchHippo has several different distribution mechanisms. The free keyword listings only appear on traffic that goes directly to searchhippo.com, or about 10% of the total distribution. These results do not get syndicated through the XML feed.

4. They have an add URL page ;)
"Rather than focus on finding the most obscure pages on the web, we focus on providing the best quality results - which means we need your help! We have a simple mechanism for getting listed and are always looking to expand our database with quality entries"

5. When I looked today, the front page counter showed 1.3 million queries to date.

The main weakness that I saw was that the database did not contain enough sites to give a good spread of results. For example, if you look for a hotel in London, you do not get many relevant results, and very few for actual individual hotels. After the first couple of pages the results are not really very relevant.

The problem is compounded by the fact that users are allowed to input their own keywords (OK, webmasters scream for this), so the algo throws up sites that asked for "london" to be in their keywords but have nothing to do with hotels.

I appreciate that it is a hobby for Marcus, and it's a very good concept allowing input of sites the way he does, but a little tweaking might improve the algo. My own feeling is that "search phrases" rather than search "words" may be the direction to go.

(A second, smaller point: I apparently could not submit a second site without creating another account. It might be an idea to allow the submission of additional websites from the one account.)

SearchHippo is an interesting concept and deserves to succeed. It must be extremely time consuming, but I think input from WebmasterWorld could help Kevin tweak the algo.

 

SlowMove




msg:460605
 10:15 pm on Jul 1, 2003 (gmt 0)

SearchHippo is an interesting concept and deserves to succeed. It must be extremely time consuming, but I think input from WebmasterWorld could help Kevin tweak the algo.

SearchHippo discussed here:
[webmasterworld.com...]

Brad




msg:460606
 1:03 am on Jul 2, 2003 (gmt 0)

SearchHippo has crawled deeper recently, and as a result I've noticed some more hits.

SH seems to do best with generalized one- and two-keyword searches. But I think it has shown vast improvement over the last year.

I do think their providing a free, with-attribution search feed to anyone is a great marketing tool. It works very well as backup results on a small directory.

kmarcus




msg:460607
 1:35 am on Jul 2, 2003 (gmt 0)

I am currently working on "version 2.0", which is completely different from the version that is live now. The preliminary results look decent, but of course who knows. I need to buy some more disk space to finish up the build. However, once I get that together -- if it can handle the load I currently get -- then I will put the "algo source" out there as well, for people to play with given the sample data that I provided the other day. This is still some ways out, of course, and I'd consider it all prototype stuff for now.

Also, I guess I should update the 'about me' page -- I am no longer with that software company, nor in that position. ;)

Like I have said many a time, I'm all ears for people who want to help and get involved here. What can I give to you all that would help you get better involved?

papamaku




msg:460608
 8:37 am on Jul 2, 2003 (gmt 0)

How about a little story, all about a hippo :)

If you're really serious about opening things up, maybe some other people here who have been working on their own little projects might offer up what they've done, which could bring new aspects to yours.

Maybe you could give us the background to how you got to where SearchHippo is now (describing your infrastructure, technologies etc.) and what your plans are for the future.

cornwall




msg:460609
 9:01 am on Jul 2, 2003 (gmt 0)

Kevin

Let me take a point at a time.

Simplest place to start is with submissions.

Am I right in saying that every time I wish to add a site I need to set up a new account, complete with a new email address and password?

Like many here I have lots of sites that I could add. Having added one, I cannot see how to add another without opening another "account".

The second account will not let me use the same email address, therefore I need a second email address for a second account. Here is me trying to cut down on email addresses, and yet to add a lot of sites I need a lot of email addresses.

I know spam is a problem, but there is probably an easy way round this.

Dayo_UK




msg:460610
 9:15 am on Jul 2, 2003 (gmt 0)

I use SearchHippo as a directory backfill - cheers kmarcus ;) - and I believe that a lot of other directories do.

SearchHippo is ideal for this, and as Brad said it is a great marketing tool; hopefully version 2 will offer similar opportunities.

Another interesting feature that was touched on in the discussion about alternative search engines is the ranking boost (like Gigablast) that is applied to sites that link back.

SearchHippo also features a toolbar, and pages are given a HipRank - not sure how this works - obviously part of the algo.

Fischerlaender




msg:460611
 12:27 pm on Jul 2, 2003 (gmt 0)

On his SEO stuff page, kmarcus describes his ranking algorithm and the parts that influence HipRank. There is one parameter to HipRank I really love:
Sites requiring frames are penalized

If all the big engines started doing such things, chances are that a lot of bad techniques would disappear.

kmarcus




msg:460612
 3:53 pm on Jul 2, 2003 (gmt 0)


Am I right in saying that every time I wish to add a site I need to set up a new account, complete with a new email address and password?

The primary reason here is spam prevention -- that is, while it is possible for someone to come up with a system that creates hundreds of mailboxes and so forth for validation, I would be okay with that, because you still have to activate each account. Of the nearly 250K users who have submitted listings, only about 40% have activated their account, so I think it is safe to say that not only does it help prevent spam, it also deters legitimate user submissions (which is bad). The other piece of that puzzle was to try and encourage people to link back. That is, unless you validate the account's email address, your site is not active.

People also complain a lot about the keywords thing and don't really understand how it all works. The keywords must be comma separated, and the field is intentionally small (64 chars) to encourage people to choose wisely. Now, the fact is that after a month or so these sites are actually spidered and indexed anyway, so it's more or less moot at that point.
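Just to spell out what that field amounts to, here is a toy sketch of parsing a comma-separated keyword field capped at 64 characters (an illustration only, not the actual submission code):

#include <stdio.h>
#include <string.h>
#include <ctype.h>

#define MAX_KEYWORD_FIELD 64   /* the cap mentioned above */

/* Toy parser: truncate the field to the cap, split on commas, trim
   whitespace and print each keyword.  Not the real submission code. */
static void ParseKeywords (const char *input)
{
    char buf [MAX_KEYWORD_FIELD + 1];
    char *tok;
    char *save;
    char *end;

    strncpy (buf, input, MAX_KEYWORD_FIELD);
    buf [MAX_KEYWORD_FIELD] = '\0';

    for (tok = strtok_r (buf, ",", &save); tok; tok = strtok_r (NULL, ",", &save)) {
        while (isspace ((unsigned char) *tok)) tok++;                        /* leading space  */
        end = tok + strlen (tok);
        while (end > tok && isspace ((unsigned char) end [-1])) *--end = '\0';  /* trailing space */
        if (*tok) printf ("keyword: [%s]\n", tok);
    }
}

int main (void)
{
    ParseKeywords ("london hotels, cheap accommodation, b&b , airport transfers");
    return 0;
}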


The second account will not let me use the same email address, therefore I need a second email address for a second account. Here is me trying to cut down on email addresses, and yet to add a lot of sites I need a lot of email addresses.

I know spam is a problem, but there is probably an easy way round this.

People occasionally email in and ask, and I will happily take a file with a title/description/keyword list and add them to the index, but that isn't in the 24 hr queue -- it's in the monthly queue.


Rather than attempting to index the entire web, the SearchHippo spider (named "fluffy", after a pet tarantula) is first seeded with sites that are currently listed in the larger directories on the internet, such as DMOZ.

While I do use the *list of URLs* in DMOZ, I have a variety of other sources, including gTLD name-lookup server logs, toolbar user surfing, obviously links I find, etc. My original idea was to try and keep the ratio of domains to URLs in check. That is, sites like GeoCities wouldn't have a whole bunch of listings unless each site had a domain to go with it. I am slowly letting go of this idea, but I'm not quite there yet.
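As a toy illustration of that domains-to-URLs ratio idea (purely a sketch of the concept, not fluffy's seeding code), you could cap how many URLs each domain contributes to the seed list:

#include <stdio.h>
#include <string.h>

#define MAX_DOMAINS     1000
#define URLS_PER_DOMAIN 1      /* "at least one page in each domain" bias */

static char domains [MAX_DOMAINS][128];
static int  counts  [MAX_DOMAINS];
static int  ndomains;

/* Crude host extraction: skip "scheme://", stop at the first '/'. */
static void GetHost (const char *url, char *host, size_t len)
{
    const char *p = strstr (url, "://");
    size_t i;

    p = p ? p + 3 : url;
    for (i = 0; i + 1 < len && p [i] && p [i] != '/'; i++) host [i] = p [i];
    host [i] = '\0';
}

/* Return 1 if this URL should go into the seed list, 0 if its domain
   has already used up its quota of pages. */
static int AcceptUrl (const char *url)
{
    char host [128];
    int  i;

    GetHost (url, host, sizeof host);
    for (i = 0; i < ndomains; i++) {
        if (!strcmp (domains [i], host))
            return ++counts [i] <= URLS_PER_DOMAIN;
    }
    if (ndomains < MAX_DOMAINS) {
        strcpy (domains [ndomains], host);
        counts [ndomains++] = 1;
        return 1;
    }
    return 0;
}

int main (void)
{
    const char *urls [] = {
        "http://www.example.com/",
        "http://www.example.com/deep/page.html",
        "http://www.example.org/index.html",
    };
    size_t i;

    for (i = 0; i < sizeof urls / sizeof urls [0]; i++)
        printf ("%s -> %s\n", urls [i], AcceptUrl (urls [i]) ? "seed" : "skip (domain full)");
    return 0;
}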


The main weakness that I saw was that the database did not contain enough sites to give a good spread of results. For example, if you look for a hotel in London, you do not get many relevant results, and very few for actual individual hotels. After the first couple of pages the results are not really very relevant.

I'm always working on the relevance stuff! When you fix it for one type of query, it breaks for another. ;) SHippo 2.0 should be much better at a lot of these things, but there will probably be an intermediate version much sooner, to get some user feedback.


If you're really serious about opening things up, maybe some other people here who have been working on their own little projects might offer up what they've done, which could bring new aspects to yours.

Well, even though there is no source yet, I have decided the term I will use isn't going to be open source or closed source, but "ajar source". :) heh. I already posted up a large (2 MB) dataset which allows people to experiment and see how I do things, and there are also various documents within the about section. I will continue to open things up as stuff migrates out of this hack of a C/C++ arsenal of utilities into a streamlined PHP/C/MySQL version. Assuming all my preliminary tests are successful. :)


My issue with framed sites is that I end up indexing one of the frames as part of the frameset, and the link goes to the individual frame. Then the site owners email in and complain. I'd rather just not list framed sites than deal with their arrogance.


trillianjedi




msg:460613
 4:02 pm on Jul 2, 2003 (gmt 0)

Hi Kevin,

I have recently been looking at SHippo and I do think it is the basis of a great engine, so keep up the good work. It doesn't go unnoticed!

In terms of your "ajar source" idea, is there any way that you could release builds of "fluffy" so that other users could run spiders on their bandwidth, and have the collected data integrate with your database?

It seems to me the key to a successful engine is database size, once you're happy with the algo of course.

I'm not sure of your bandwidth capability, but that is going to be an obvious Achilles heel without the might and power (i.e. $$) of the big guys.

TJ

Brad




msg:460614
 4:28 pm on Jul 2, 2003 (gmt 0)

>>People also complain a lot about the keywords thing and don't really understand how it all works.

>>Now, the fact is that after a month or so these sites are actually spidered and indexed anyway, so it's more or less moot at that point.

Remember NBCi Live Directory? If the sites are going to be spidered anyway after a month (a much better thing), then how about just giving each new submission blank forms for 10-12 keywords, like Live Directory did in the old days? That will tide them over for a month until they can be spidered; it makes it simple for beginners and might eliminate some of the complaints and questions.

kmarcus




msg:460615
 5:01 pm on Jul 2, 2003 (gmt 0)


In terms of your "ajar source" idea, is there any way that you could release builds of "fluffy" so that other users could run spiders on their bandwidth, and have the collected data integrate with your database?

I run all of this out of my house, so I have two DS1s load balanced over BGP right now. I recognize this isn't the most cost effective, but it sure as heck is the most fun! Like I said in previous threads, the big issue with spidering isn't so much the bandwidth but rather the DNS lookups. You'd be amazed at the amount of time that DNS lookups take, because they often have to time out or whatever. So that causes me a lot of headaches. And of course once you've got a site working, you can't pound the heck out of it real fast, because then you cause grief for the webmasters. So it's a bit of a catch-22. I currently randomize the list and just let it go from there - that way I figure I don't hit anyone too hard, but then I have that DNS lookup issue. And again, the issue is that the number of domains is large, so there are a lot of lookups. Caching helps, but only when you have a lot of URLs in the same domain -- and like I said earlier, my original objective was to index at least one page in each domain rather than many pages in a smaller set of domains.
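To make the caching point concrete, here is a minimal sketch of a per-host lookup cache (a toy with a fixed-size table and a blocking gethostbyname, nothing like the real spider code):

#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define CACHE_SLOTS 4096

/* Toy per-host DNS cache (fixed-size, last-write-wins): repeat URLs on the
   same host never pay for the resolver round trip -- or its timeouts -- twice. */
struct DnsEntry {
    char           host [256];
    struct in_addr addr;
    int            valid;
};

static struct DnsEntry Cache [CACHE_SLOTS];

static unsigned HashHost (const char *s)
{
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char) *s++;
    return h % CACHE_SLOTS;
}

/* Resolve a hostname, consulting the cache first.  Returns 0 on success. */
static int LookupHost (const char *host, struct in_addr *out)
{
    struct DnsEntry *e = &Cache [HashHost (host)];
    struct hostent  *he;

    if (e -> valid && !strcmp (e -> host, host)) {    /* cache hit */
        *out = e -> addr;
        return 0;
    }

    he = gethostbyname (host);                        /* blocking lookup */
    if (!he || he -> h_addrtype != AF_INET) return -1;

    memcpy (&e -> addr, he -> h_addr_list [0], sizeof e -> addr);
    strncpy (e -> host, host, sizeof e -> host - 1);
    e -> host [sizeof e -> host - 1] = '\0';
    e -> valid = 1;

    *out = e -> addr;
    return 0;
}

int main (void)
{
    struct in_addr a;

    if (LookupHost ("www.example.com", &a) == 0)
        printf ("www.example.com -> %s\n", inet_ntoa (a));
    if (LookupHost ("www.example.com", &a) == 0)      /* second call hits the cache */
        printf ("www.example.com -> %s (cached)\n", inet_ntoa (a));
    return 0;
}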

You can easily buy the .com/.net/.org list, and you can download RIPE/APNIC/.ca lists etc. and go from there if you wanted, but I have enough stuff to do as is.

As for letting other people run it in general though, I have no problem with that (Grub-ish, I suppose). The current setup works with a list of URLs that are fed into the spider, which then goes off and does its thing. Then the output from that gets parsed, tagged and indexed. I spent a day working on a funny curl-based PHP version to try and do the same thing, but tried to build the parser in there as well. This seemed to work at least somewhat okay, although it would still need some work. In particular, PHP isn't the most ideal language for parsing HTML.
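As a rough illustration of just the fetch step (a toy C/libcurl version, not the PHP prototype mentioned above), something like this pulls a page into memory for the later parse/tag/index stages:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

/* Accumulate the fetched page body into a growable buffer. */
struct Page { char *data; size_t len; };

static size_t WriteCb (void *chunk, size_t size, size_t nmemb, void *userp)
{
    struct Page *p = userp;
    size_t n = size * nmemb;
    char *grown = realloc (p -> data, p -> len + n + 1);

    if (!grown) return 0;                  /* returning short aborts the transfer */
    p -> data = grown;
    memcpy (p -> data + p -> len, chunk, n);
    p -> len += n;
    p -> data [p -> len] = '\0';
    return n;
}

int main (void)
{
    struct Page page = { NULL, 0 };
    CURL *curl;

    curl_global_init (CURL_GLOBAL_ALL);
    curl = curl_easy_init ();
    if (curl) {
        curl_easy_setopt (curl, CURLOPT_URL, "http://www.example.com/");
        curl_easy_setopt (curl, CURLOPT_USERAGENT, "toy-spider/0.1");   /* hypothetical UA string */
        curl_easy_setopt (curl, CURLOPT_WRITEFUNCTION, WriteCb);
        curl_easy_setopt (curl, CURLOPT_WRITEDATA, &page);
        if (curl_easy_perform (curl) == CURLE_OK)
            printf ("fetched %lu bytes -- ready for the parse/tag/index steps\n",
                    (unsigned long) page.len);
        curl_easy_cleanup (curl);
    }
    curl_global_cleanup ();
    free (page.data);
    return 0;
}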

Remember, this generates huge wads of data, so you still have the issue of aggregating the data, and there is still the potential for spamming - so a new "spam filter" would need to be built. But I am open to that, of course. Also, you wouldn't want someone to do something bad and get the user agent banned all over the place.

Lastly, every so often I go through the domain list and pull the robots.txt files, and then use them as a filter to remove URLs from the "to be spidered" list -- this is not built into the spider itself; it is a separate process.
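For anyone who wants to play with that filtering pass, here is a bare-bones sketch of the prefix test at its core (my own simplification -- a real excluder also has to honour User-agent grouping, among other things):

#include <stdio.h>
#include <string.h>

/* Return 1 if the path is blocked by any of the Disallow prefixes pulled
   out of a robots.txt, 0 otherwise. */
static int IsDisallowed (const char *path, const char **disallow, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        if (disallow [i][0] == '\0') continue;                     /* empty Disallow allows everything */
        if (!strncmp (path, disallow [i], strlen (disallow [i])))  /* prefix match */
            return 1;
    }
    return 0;
}

int main (void)
{
    const char *rules [] = { "/cgi-bin/", "/private/" };   /* parsed from a fetched robots.txt */
    const char *urls  [] = { "/index.html", "/cgi-bin/search?q=hippo", "/private/notes.txt" };
    size_t i;

    for (i = 0; i < sizeof urls / sizeof urls [0]; i++)
        printf ("%-28s %s\n", urls [i],
                IsDisallowed (urls [i], rules, 2) ? "filtered out" : "kept in spider queue");
    return 0;
}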

The current spider runs on FreeBSD 4.x. I can give you the executables for that, and for the robot excluder, amongst other things, if that is an area you are interested in looking at.


Remember NBCi Live Directory? If the sites are going to be spidered anyway after a month (a much better thing), then how about just giving each new submission blank forms for 10-12 keywords, like Live Directory did in the old days? That will tide them over for a month until they can be spidered; it makes it simple for beginners and might eliminate some of the complaints and questions.

I will eventually just do away with this system in its entirety, put up an "enter your email and URL here" form, and let people do it as often as they like. It will make it easier for everyone, but this is low priority. Again, the high priority is getting the core indexer and ranker into MySQL/PHP so that I can get people to play with that, which is really where the fun is. The spider is not fun, because it is very, very complex and difficult to get right. ;)

cornwall




msg:460616
 6:12 pm on Jul 2, 2003 (gmt 0)

>>I'm always working on the relevance stuff! When you fix it for one type of query, it breaks for another.<<

I understand Google have the same problem ;)

If it is not commercially confidential, can you say how many sites/pages are in your database now?

kmarcus




msg:460617
 6:22 pm on Jul 2, 2003 (gmt 0)


If it is not commercially confidential, can you say how many sites/pages are in your database now?

That also is pretty outdated info on the site - almost a year old! There are "40M" 'known' URLs, but they're not all indexed, for one reason or another. The real number is closer to about 15M actual pages with content that are indexed. A lot of SEs will report numbers that are things like "we know about this many URLs but they're not actually indexed", or they will count the number of redirect URLs, etc. This is the actual number of indexed pages I am talking about.

papamaku




msg:460618
 8:42 am on Jul 3, 2003 (gmt 0)

Wow, only 15 million pages - that's not many!

Do you feel that you can offer the same level of relevance as SEs with 100M+ pages? If so, do you feel the pressure to get out there crawling, or is there something holding you back?

trillianjedi




msg:460619
 9:19 am on Jul 3, 2003 (gmt 0)

Do you feel that you can offer the same level of relevance as SEs with 100M+ pages?

I think not.....

What's important are the ideas, methodology and attitude behind SearchHippo.

The database size will come in the future.

TJ

cornwall




msg:460620
 10:16 am on Jul 3, 2003 (gmt 0)

>>What's important are the ideas, methodology and attitude behind SearchHippo.
The database size will come in the future.<<

Spot on, trillianjedi

I like the ideas behind SearchHippo. For those of us with (not very) long memories, GigaBlast was struggling with their first million or so entries. Their counter kept going back to zero with crash after crash... now it has 188 million pages indexed.

SearchHippo is a great idea and I wish Kevin luck with it. These things take (a lot of) time, and longer if you are working at it part-time!

Fischerlaender




msg:460621
 11:51 am on Jul 3, 2003 (gmt 0)

Wow, only 15 million pages - that's not many!

How useful an index with "just" 15 million pages is depends to a large extent on the crawling strategy. By crawling the "best" pages first, you can beat a, let's say, 100 million page index that doesn't have a smart crawling strategy. But obviously, crawling the "best" pages first isn't very easy ...

In other words: given enough hardware resources, it is easier to build an index with 100 million randomly found pages than one with 15 million high-quality pages.
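To make that concrete, here is a toy sketch of a score-ordered crawl frontier (illustrative only -- the hard part is assigning the scores, not popping the queue): the spider always takes the highest-scoring URL next.

#include <stdio.h>
#include <string.h>

#define FRONTIER_MAX 1024

/* Toy crawl frontier: hand back the highest-scoring URL first, so the
   "best" pages get crawled before the rest. */
struct Candidate { char url [256]; double score; };

static struct Candidate Frontier [FRONTIER_MAX];
static int NFrontier;

static void Push (const char *url, double score)
{
    if (NFrontier >= FRONTIER_MAX) return;
    strncpy (Frontier [NFrontier].url, url, sizeof Frontier [0].url - 1);
    Frontier [NFrontier].url [sizeof Frontier [0].url - 1] = '\0';
    Frontier [NFrontier].score = score;
    NFrontier++;
}

/* Linear scan for the best candidate; a heap would do this properly. */
static int PopBest (char *out, size_t outlen)
{
    int i, best = -1;

    for (i = 0; i < NFrontier; i++)
        if (best < 0 || Frontier [i].score > Frontier [best].score) best = i;
    if (best < 0) return 0;
    strncpy (out, Frontier [best].url, outlen - 1);
    out [outlen - 1] = '\0';
    Frontier [best] = Frontier [--NFrontier];      /* remove by swapping in the last entry */
    return 1;
}

int main (void)
{
    char next [256];

    Push ("http://www.example.com/",             0.9);   /* e.g. seeded from a directory listing */
    Push ("http://www.example.org/obscure.html", 0.1);
    Push ("http://www.example.net/",             0.6);

    while (PopBest (next, sizeof next))
        printf ("crawl next: %s\n", next);
    return 0;
}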

brotherhood of LAN




msg:460622
 12:17 pm on Jul 3, 2003 (gmt 0)

Spider breadth first ;)

Great thread - kmarcus, how are you storing your URLs? Do you split them up by domain/TLD/path? Also, how do you store the words - just a dictionary with word IDs?

Just wondering how you store the various elements of a page; is there a page about these things similar to your 'tech-overview.htm' page? :-)

Wouldn't mind knowing some of the more techy things. Nice work with what's already made, up and running ;)

Glacai




msg:460623
 2:19 pm on Jul 3, 2003 (gmt 0)

Hi kmarcus,

Sorry to jump in with a question, hope it's OK.

Can you give any more details on how you get PHP to interact with your main searcher program/server? I'm currently using system(), but of course I want the main searcher to already be in memory - what should I be reading up on?

Also, just want to say excellent stuff. Any chance of seeing the source :) especially the multi-word zigzag algo :)

Regards,
Marc.

kmarcus




msg:460624
 6:23 pm on Jul 3, 2003 (gmt 0)

Hi. To be honest, getting a large number of URLs in the database isn't a big deal. Let the spider run for a month and blammo, you have 50 or 100M. Actually, if you think about it, the relevance for the index size is pretty good, eh? ;) Besides, it makes it a hell of a lot easier to experiment. I can change things more quickly and make sure things are fine-tuned the way I want them without having to spend a lot of resources. Once that gets "perfected", then you move on to the next step.


How useful an index with "just" 15 million pages is depends to a large extent on the crawling strategy. By crawling the "best" pages first, you can beat a, let's say, 100 million page index that doesn't have a smart crawling strategy. But obviously, crawling the "best" pages first isn't very easy ...

Well, using DMOZ as a base for URLs is a nice way to start for "the best URLs", if you will. I also have a few other database sources that I use, such as Alexa's top xyz number of sites, "popular domain lists", people who use the toolbar, all that jazz, so there is a decent set of URLs. That is, I think of it as an 80/20 problem: I probably have 80% of the sites that you're likely to visit, but I haven't crawled them very deeply.


Great thread - kmarcus, how are you storing your URLs? Do you split them up by domain/TLD/path? Also, how do you store the words - just a dictionary with word IDs?

Just wondering how you store the various elements of a page; is there a page about these things similar to your 'tech-overview.htm' page? :-)

I am currently working on "version 2" of SearchHippo, which will be some C, some PHP and some MySQL. If it turns out that it works properly then great, and I think we are all going to have a lot of fun. ;) The current stuff is all in C/C++ and is very, very, very hackish. The search code was inspired by a project I worked on in college in 1995-1996 called OKRA, which was an email searching service that was more or less a full-text search engine. Of course this is not quite the same, which is why I have issues with it.

Anyway, if you go to the tech-overview.php (not .htm) page, I just added a "want to participate" section. There is a link there to a 2MB file which you can download that shows actual data from the spider and how it is manipulated and stored for the "up and coming" version. It is likely to continue to change and get further normalized to speed things up (like normalizing the reverse domains and normalizing the tags into IDs). But this should be plenty of eye candy for anyone who is interested in playing.

Again, keep in mind this is where I am heading, not what I am doing now. Right now the lexicon is absolutely huge. I figure it should probably be around 10-15M "words", but it is actually 45M "words" on the backend, so you can tell things are out of whack.


Can you give any more details on how you get PHP to interact with your main searcher program/server? I'm currently using system(), but of course I want the main searcher to already be in memory - what should I be reading up on?

Also, just want to say excellent stuff. Any chance of seeing the source :) especially the multi-word zigzag algo :)

I have two views (probably both of which are no good!) on how to search data. The first is the zigzag, and the second is a "weighted listlength merger" which I am currently experimenting with. The service up and running today uses the weighted listlength merger routine, not the zigzag. I dropped the zigzag a while ago just to play with other mechanisms.

The "weighted listlength merger" starts off similarly to the zigzag: grab all of the token:record lists. So:
a: 1, 2, 3, 6, 7, 10
b: 3, 5, 6, 7, 8
Then, by some magic formula based on the length of the list, the frequency of occurrence of the word and the weight of the word in a document, merge them together. The C code for doing this is incredibly obtuse and convoluted, because I hacked it together. However, the PHP code is quite nice and small, and I will post it up shortly. As a stopgap measure, we also want to make sure we don't get stuck in a death trap, so I weigh short lists heavier and don't traverse a long list for "too long". Right now this is dynamic, but I am probably going to make it static (i.e. only look at the first couple of thousand).
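In the meantime, here is a rough C sketch of just the merge idea as described above (a simplification for illustration, not the routine that is actually running): give every doc that appears in any list a score, and weight hits from shorter lists more heavily.

#include <stdio.h>

#define MAX_DOCS 32

/* The two sorted posting lists from the example above. */
static const int ListA [] = { 1, 2, 3, 6, 7, 10 };
static const int ListB [] = { 3, 5, 6, 7, 8 };

/* Add 'weight' to the score of every doc id in a posting list. */
static void Accumulate (const int *list, int len, double weight, double *score)
{
    int i;
    for (i = 0; i < len; i++) score [list [i]] += weight;
}

int main (void)
{
    double score [MAX_DOCS] = { 0 };
    int    lenA = sizeof ListA / sizeof ListA [0];
    int    lenB = sizeof ListB / sizeof ListB [0];
    int    doc;

    /* weight = 1/listlength -- a stand-in for the "magic formula" built from
       list length, term frequency and in-document weight */
    Accumulate (ListA, lenA, 1.0 / lenA, score);
    Accumulate (ListB, lenB, 1.0 / lenB, score);

    printf ("doc  score\n");
    for (doc = 0; doc < MAX_DOCS; doc++)
        if (score [doc] > 0.0) printf ("%3d  %.3f\n", doc, score [doc]);

    /* Docs 3, 6 and 7 (present in both lists) get the highest scores; a final
       sort by score would rank them first, but partial matches still appear. */
    return 0;
}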

Since you obviously understand the zigzag concept, the idea is also pretty easy to implement, but it gets very bad when you have large lists of common words that infrequently intersect. Plus it makes it more difficult to weigh any individual word within the lexicon, since ultimately its goal is to skip through as fast as possible. The problem you have there is when you have, say, three words. You do the zigzag on the three words and don't find anything that matches all three. But there are a ton that would have matched two. So now you have to keep track of all your zigs and zags and weigh based on how many matches you have.

I am sure this code will be mostly useless to everyone, but since you asked, here is the *old* code which I do not use anymore, and which originated from the OKRA school project (with some modifications, of course). You will find it interesting to look at the IsMatch function, where I experimented with all sorts of partial zigzag matching based on the number of terms and so forth. But anyway, you asked for it! :)


/* Note: the supporting pieces referenced below -- the sTextDB and sQC
   structures, the IFSeek() index-seek routine and the LF log-file handle --
   are defined elsewhere in the SearchHippo source; this fragment is just
   the query-matching core from the old OKRA-derived code. */

/* Return the largest value in an array of record ids. */
int GetMaxVal (int64_t *i, int NumVals)
{
int max = -1;

while (NumVals) {
if (*i > max) max = *i;
NumVals--;
i++;
}
return (max);
}

/* Zigzag step: repeatedly seek every token list forward to the current
   maximum record id until IsMatch accepts a record or a time/try limit
   is hit.  Returns the matching record id, or -1. */
int64_t GetMatch (sTextDB *TDB, sQC *QC)
{
int i;
int64_t k;
int stop;

#ifdef LOGGING_ON
fprintf (LF.fp, "GetMatch -- all rows --\n");
fprintf (LF.fp, "(MaxTime=%d, MaxTries=%d, MaxAllTries=%d)\n", QC -> MaxTime, QC -> MaxTries, QC -> MaxAllTries);
PrintAllRows (QC);
fprintf (LF.fp, "--------------\n");
#endif

stop = 0;
do {
if ((time (NULL) - (QC -> TimeStart)) > (QC -> MaxTime)) stop = 1;

if (InUpperBounds (QC)) {
(QC -> Ctr)++;
if ((QC -> Ctr) > (QC -> MaxTries)) stop = 1;
k = GetMaxVal ((QC -> CurVal), (QC -> NumToks));
#ifdef LOGGING_ON
fprintf (LF.fp, "\nMaxVal=%qd, Ctr=%d\n", k, (QC -> Ctr));
#endif
for (i = 0; i < (QC -> NumToks); i++) {
(QC -> CurVal [i]) = IFSeek (&(TDB -> Index),
&(QC -> Status [i]), (QC -> XlatVal [i]),
(QC -> ListLen [i]), k);
#ifdef LOGGING_ON
PrintRow (QC, i);
#endif
if (((QC -> Status [i]) == KEY_NOT_FOUND) &&
((QC -> CurVal [i]) == -1)) stop = 1;
}
if ((!stop) && (IsMatch (QC))) {
k = (QC -> CurVal [0]);
for (i = 0; i < (QC -> NumToks); i++) (QC -> CurVal [i])++;
#ifdef LOGGING_ON
fprintf (LF.fp, "Rec: [%d]\n", k);
#endif
return (k);
}
} else stop = 1;
} while (!stop);
return (-1);
}

/* A record matches when every required (non-QFlag) list points at the same
   record; x counts the QFlag terms that also point there, and for longer
   queries a partial match (one to three terms short) is accepted within
   the try limits. */
int IsMatch (sQC *QC)
{
int i;
int64_t k;
int x;

k = (QC -> CurVal [0]);

x = 1;
for (i = 0; i < (QC -> NumToks); i++) {
if (!(QC -> QFlag [i]) && (k!= (QC -> CurVal [i]))) {
x = 0;
}
}

if (x) {
if (QC -> QCtr) {
x = 0;
for (i = 0; i < (QC -> NumToks); i++) {
if ((QC -> QFlag [i]) && (k == (QC -> CurVal [i]))) x++;
}
}
}

#ifdef LOGGING_ON
fprintf (LF.fp, "IsMatch: x=%d, Ctr=%d, Max=%d, QCtr=%d\n", x, QC->Ctr, QC->MaxAllTries, QC->QCtr);
#endif

if (x) {
if (((QC -> Ctr) < (QC -> MaxAllTries)) && (x == (QC -> QCtr))) return (1);
if ((QC -> Ctr) < (QC -> MaxTries)) {
//fprintf (stderr, "[%d] of [%d]\n", x, QC->QCtr);
if ((QC -> QCtr > 2) && (x >= (QC -> QCtr) - 1)) return (1);
if ((QC -> QCtr > 4) && (x >= (QC -> QCtr) - 2)) return (1);
if ((QC -> QCtr > 6) && (x >= (QC -> QCtr) - 3)) return (1);
}
/*
if (((QC -> Ctr) < (QC -> MaxTries)) &&
((QC -> QCtr) > 2) &&
(x >= (QC -> QCtr) - 1)) return (1);
if (((QC -> Ctr) < (QC -> MaxTries)) &&
((QC -> QCtr) > 5) &&
(x >= (QC -> QCtr) - 2)) return (1);
if (((QC -> Ctr) < (QC -> MaxTries)) &&
((QC -> QCtr) > 8) &&
(x >= (QC -> QCtr) - 3)) return (1);
*/
/*
if (((QC -> Ctr) < (QC -> MaxAllTries) + TRY_STEP) &&
((QC -> QCtr) > 2) &&
(x >= (QC -> QCtr) - 1)) return (1);
if (((QC -> Ctr) < (QC -> MaxAllTries) + TRY_STEP * 2) &&
((QC -> QCtr) > 3) &&
(x >= (QC -> QCtr) - 2)) return (1);
if (((QC -> Ctr) < (QC -> MaxAllTries) + TRY_STEP * 3) &&
((QC -> QCtr) > 4) &&
(x >= (QC -> QCtr) - 3 )) return (1);
*/
}

return (0);
}

/*
int IsMatch (sQC *QC)
{
int i;
int64_t k;

k = (QC -> CurVal [0]);

for (i = 1; i < (QC -> NumToks); i++) if (k!= (QC -> CurVal [i])) break;

// (QC -> QCFlag [i]) = 1;
// (QC -> QCtr)++;

if (i == (QC -> NumToks)) return (1);
return (0);
}
*/

/* True while every token cursor is still within its list's bounds. */
int InUpperBounds (sQC *QC)
{
int i;

for (i = 0; i < (QC -> NumToks); i++) if ((QC -> CurVal [i]) > (QC -> MaxVal [i])) return (0);
return (1);
}

/* Run a query: start all cursors at the largest per-list minimum and
   collect up to NumToRetr matching record ids into MatchedRecords. */
int ProcessQuery (sTextDB *TDB, sQC *QC, int NumToRetr)
{
int i;
int k;

(QC -> NumMatched) = 0;

if ((QC -> NumToks) <= 0) return (0);

k = GetMaxVal ((QC -> MinVal), (QC -> NumToks));
if (k >= 0) {
for (i = 0; i < (QC -> NumToks); i++) (QC -> CurVal [i]) = k;
for (i = 0; i < NumToRetr; i++) {
(QC -> MatchedRecords [i]) = GetMatch (TDB, QC);
if ((QC -> MatchedRecords [i]) == -1) break;
}
(QC -> NumMatched) = i;
}
return (QC -> NumMatched);
}

int ProcessQueryContinue (sTextDB *TDB, sQC *QC, int NumToRetr)
{
int i;
int k;
int x;

x = (QC -> NumMatched);

if ((QC -> NumToks) <= 0) return (0);

for (i = 0; i < NumToRetr; i++) {
if (QC -> NumMatched < (MAX_RECS_TO_RETR - 1)) {
(QC -> MatchedRecords [i]) = GetMatch (TDB, QC);
if ((QC -> MatchedRecords [i]) == -1) break;
} else break;
}

(QC -> NumMatched) += i;

return (x);
}

int64_t ProcessQueryGetNextMatch (sTextDB *TDB, sQC *QC)
{
int i;
int k;

if ((QC -> NumToks) <= 0) return (0);

return (GetMatch (TDB, QC));
}


Chndru




msg:460625
 6:39 pm on Jul 3, 2003 (gmt 0)

Good luck SHippo!

papamaku




msg:460626
 9:02 pm on Jul 3, 2003 (gmt 0)

Even though I don't understand C in any way, it is really cool of you, kmarcus, to share in this way.

I really wish you the best of luck with your hippo, and I'm definitely going to start using it more on principle.

Glacai




msg:460627
 9:16 pm on Jul 3, 2003 (gmt 0)

Well I don't know what to say, except thank you kmarcus!

I'm just at the stage of trying to implement multi-word search (in C) and have been thinking of ways to do it. I've got the word lists loaded, which contain docid, weight, occurrence and positions. Of course, one word was easy, so now on to the harder stuff.

I'm off to get a coffee and then read through that code. Cheers, it's greatly appreciated.

Regards,

thewebboy




msg:460628
 8:00 am on Jul 6, 2003 (gmt 0)

Hi kmarcus, I'd like to say that I like SearchHippo a lot and plan to use your XML data to backfill my little engine.

You talked about MySQL/PHP; that's what my little engine uses right now. I also use the Alexa top 500 for basic data. I use the built-in MySQL fulltext search, but it is often hit and miss - sometimes really relevant, other times way off.

papamaku




msg:460629
 9:16 am on Jul 6, 2003 (gmt 0)

kmarcus,

How come SHippo doesn't show the total number of results found, just the number being displayed?

maku

kmarcus




msg:460630
 3:12 pm on Jul 6, 2003 (gmt 0)

SHippo stops counting after 100 (or some other number I give it). Most search engines will then make a projection (i.e. a guess) as to how many matches there are in total. I consider this a low priority item, though: how useful is it to get 2836462 results? I feel it is only useful for checking how popular words/links are -- and there are much better mechanisms for that. (Of course, I am open to persuasion in this area to add a projection-based system.) Or maybe you can do it. ;) In another few days I should have some of the code available to work with that data for searches, and you can do whatever you want. But remember, you have to have scalability, so an actual count probably isn't going to happen.

On the other hand, I feel a more useful and more interesting question, which I am pondering, is "with this query, where would this URL appear in the result set?" In other words, say you want to optimize for "fuzzy wuzzy". You type it into engine xyz and have to walk through pages and pages until you see you are the 89th link. It would be nicer to type "fuzzy wuzzy /urlpos:mydomain.com" or so, and get a result saying "mydomain.com appears with the terms 'fuzzy wuzzy' in position 89".
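What that boils down to is trivial on top of an existing result set -- here is a toy sketch (the /urlpos: syntax is just the hypothetical one above): run the query as normal, then scan the ranked URLs for the requested domain and report its position.

#include <stdio.h>
#include <string.h>

/* Given an already-ranked list of result URLs, report the 1-based position
   of the first result on the requested domain.  Toy version of the idea;
   a real implementation would match the host part properly. */
static int UrlPos (const char **results, int n, const char *domain)
{
    int i;

    for (i = 0; i < n; i++)
        if (strstr (results [i], domain)) return i + 1;
    return -1;   /* not in the portion of the result set we scanned */
}

int main (void)
{
    const char *results [] = {
        "http://www.example.org/fuzzy.html",
        "http://www.example.net/wuzzy/",
        "http://www.mydomain.com/fuzzy-wuzzy.html",
    };
    int pos = UrlPos (results, 3, "mydomain.com");

    if (pos > 0) printf ("mydomain.com appears at position %d for this query\n", pos);
    else         printf ("mydomain.com was not found in the results scanned\n");
    return 0;
}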

Just as an aside, my day job allows me to work with billions of records, which we then sell. (We call it infocommerce, in contrast to Amazon -- we sell information products, they sell physical goods.) Anyway, we find that the users who enter the dumb searches that return 8479827 documents don't ever buy anything. Rather, they start buying at around 20 results or fewer for their query. I had a plot once, and I don't remember it exactly, but it was a nice geometric curve: a query returning 1, 2 or 3 results always sold. One with 4-8 or so usually sold, then 10-12, etc. The more results returned, the less likely the user would buy.

So I use that same philosophy: people might like to read, but not through ten trillion screens of information.

papamaku




msg:460631
 3:20 pm on Jul 6, 2003 (gmt 0)

The "fuzzy wuzzy /urlpos:mydomain.com" query is an excellent idea, and I think it would be much appreciated by webmasters.

But wouldn't you have to be careful that it doesn't get abused by programs like WebPosition Gold et al.?
