but i want it to work like google, with pagerank and whatnot, but i have a few other ideas i'm thinking of throwing into the mix
anyhow, any suggestions on existing scripts that i can look at for ideas?
i would estimate only a million or a few million pages indexed. drive space and bandwidth aren't a problem. server load would be, so i would really like ideas on the most efficient way to build a google-type indexing search engine. i have no idea where to begin as far as algorithms go. i was thinking of using php/MySQL, but i figure using MySQL's built-in search functions is probably suicide?
as for the spider, i was just thinking of running it off my own systems (separate from the server w/ the search engine) and transferring the indexed info to the engine's server once every other week or something.
But -> that bit about 'pagerank'...you know it's patented, ya? So you couldn't actually build your own engine to use it unless you get permission from the owner of the patent.
i want it to work like google, with pagerank and whatnot, but i have a few other ideas i'm thinking of throwing into the mix
Even with just "a million or a few million pages indexed" I think you might consider just how much raw computing power is going to be needed to figure out how many pages link to what and assign some sort of number to that -- before getting into anything else thrown into the mix.
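For a rough feel of what "assigning some sort of number" to the link graph involves, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production method; the function name `pagerank`, the damping value of 0.85, the fixed iteration count, and the toy four-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank (illustrative sketch only).

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its score; scores sum to ~1.
    """
    # Collect every page, including ones that only appear as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with rank spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets the same small baseline each pass.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "c" is linked to by three of the four pages.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
```

Each pass touches every page and every link once, so the cost per pass grows with the size of the link graph, and the whole graph has to be available before the first pass can run.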
And spidering a million pages at, say, a second each comes out to about 278 hours!
Whew!
That's assuming that your system only runs a single connection at a time, which no system designed for that many docs does.
The average time (without racking up huge bandwidth costs) for a million docs is about one day.
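To put numbers on those two estimates, here is a small back-of-the-envelope sketch. The helper name `crawl_hours` is made up for illustration, and it assumes a flat one second per page (the figure used earlier in the thread) with perfectly parallel connections, ignoring politeness delays, retries and slow hosts.

```python
def crawl_hours(pages, seconds_per_page, connections=1):
    """Wall-clock hours to fetch `pages` documents, assuming every
    connection stays busy and each fetch takes `seconds_per_page`."""
    return pages * seconds_per_page / connections / 3600


# One connection, one second per page: the ~278 hour figure.
single_connection = crawl_hours(1_000_000, 1.0)

# Connections needed to finish the same crawl in 24 hours.
connections_for_one_day = 1_000_000 * 1.0 / (24 * 3600)

# With 12 connections the crawl fits inside a day.
twelve_connections = crawl_hours(1_000_000, 1.0, connections=12)
```

Under those assumptions, about a dozen simultaneous connections is enough to bring a million pages down to roughly one day.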