thinking of coding my own semi in-site search engine

is this stupid, crazy, or both?

SubZeroGTS

4:23 am on Jan 29, 2003 (gmt 0)

i wanted to make an engine that does specific sites, and only spiders domains that i assign it to.

but i want it to work like google, with pagerank and whatnot, but i have a few other ideas i'm thinking of throwing into the mix

anyhow, any suggestions on existing scripts that i can look at for ideas?

i would estimate only a million or a few million pages indexed. drive space and bandwidth aren't a problem. server load would be, so i would really like ideas on the most efficient way to build a google-type indexing search engine. i have no idea where to begin as far as algorithms go. i was thinking of using php/MySQL, but i figure using MySQL's built-in search functions is probably suicide?
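For a sense of what the core data structure behind this kind of engine looks like, here is a minimal sketch of an inverted index. It is in Python rather than the php/MySQL mentioned in the post, purely for illustration. The names `tokenize`, `build_index` and `search`, and the sample documents, are made up for this example; an engine holding a few million pages would keep the postings on disk and add ranking on top.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, keyed by an arbitrary id.
docs = {
    1: "Building a search engine with an inverted index",
    2: "MySQL full-text functions",
    3: "A small search engine for a handful of sites",
}
index = build_index(docs)
print(sorted(search(index, "search engine")))
```

A lookup only touches the postings for the query terms instead of scanning every page, which is the main reason this structure is kinder to server load than a pattern match over raw text.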

as for the spider, i was just thinking of running it off my own systems (separate from server w/search engine) and transferring indexed info to the engine's server once every other week or something.
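Restricting the spider to assigned domains, as described at the top of this post, can be as simple as checking each discovered URL against an allowlist before queueing it. The following is a sketch under assumptions: `ALLOWED_DOMAINS`, `is_allowed` and the example hostnames are hypothetical, and subdomains are treated as part of the assigned domain, which may or may not be what is wanted.

```python
from urllib.parse import urlsplit

# Hypothetical list of domains the spider is assigned to.
ALLOWED_DOMAINS = {"example.com", "example.org"}


def is_allowed(url, allowed=ALLOWED_DOMAINS):
    """True if the URL's host is an assigned domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


print(is_allowed("http://www.example.com/page.html"))
print(is_allowed("http://example.com.evil.net/"))
```

Matching on the parsed hostname, not on a substring of the URL, keeps look-alike hosts such as `example.com.evil.net` out of the crawl.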

ukgimp

8:50 am on Jan 29, 2003 (gmt 0)

You could have a look at this for a few ideas:

http://www.onlamp.com/lpt/a/2753

jeremy goodrich

5:50 pm on Jan 29, 2003 (gmt 0)

There are a number of open source search engines out there. Search Tools [searchtools.com] has a lot of info that could help you get started.

But -> that bit about 'pagerank'...you know it's patented, ya? So you couldn't actually build your own engine to use it unless you get permission from the owner of the patent.

Kurupt

7:44 pm on Feb 11, 2003 (gmt 0)

Subzero,

I was searching around on Google to get some help with MySQL. I came across a page that demonstrated boolean search using MySQL 4. If I come across the link once I get home I will post it in the thread or get it to you somehow. I think this will help you a great deal.

jimbeetle

8:11 pm on Feb 11, 2003 (gmt 0)

>>i want it to work like google, with pagerank and whatnot, but i have a few other ideas i'm thinking of throwing into the mix

Even with just "a million or a few million pages indexed" I think you might consider just how much raw computing power is going to be needed to figure out which pages link to what and to assign some sort of number to that -- before getting into anything else thrown into the mix.
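To give a rough idea of the work involved, the sketch below runs a generic link-analysis power iteration of the kind described in public write-ups of PageRank. It is an illustration under assumptions, not Google's implementation: the function name `link_rank`, the damping value 0.85, the iteration count and the toy link graph are all chosen for this example.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages from a dict mapping each page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Pages with no outgoing links share their score with every page.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        # Every other page splits its score evenly among its link targets.
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = link_rank(links)
print(max(ranks, key=ranks.get))
```

Each pass visits every page and every link once, so the cost is roughly (pages + links) times the number of passes. For a few million pages that is a batch job, not something to compute at query time.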

And spidering a million pages at say a second each comes out to about 278 hours!

Whew!

WebGuerrilla

8:21 pm on Feb 11, 2003 (gmt 0)

>>And spidering a million pages at say a second each comes out to about 278 hours!

That's assuming that your system only runs a single connection at a time, which no system designed for that many docs does.

The average time (without racking up huge bandwidth costs) for a million docs is about one day.
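The arithmetic behind both figures can be checked with a few lines. This is an idealized model that assumes every connection fetches at the same steady rate; the function name `crawl_hours` and the figure of 12 parallel connections are assumptions picked to show how "about one day" can come out, not numbers from this thread.

```python
def crawl_hours(pages, seconds_per_page, connections=1):
    """Estimated crawl time in hours, assuming perfectly parallel fetching."""
    return pages * seconds_per_page / connections / 3600


# One connection at one second per page: the 278-hour figure quoted above.
print(round(crawl_hours(1_000_000, 1.0)))  # 278

# Twelve connections (an assumed number) bring it to roughly a day.
print(round(crawl_hours(1_000_000, 1.0, connections=12)))  # 23
```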

jimbeetle

8:42 pm on Feb 11, 2003 (gmt 0)

WebGuerrilla,

Sure. I just don't know what SubZero has to work with and if it's been taken into account. I tend to look at the simple stupid things first 'cause that's about the only stuff on my level.

Jim