Forum Moderators: bakedjake
I'm looking for free or low-cost search engine software, something Google-like. It should have a spider, of course, and all the main features of a good search engine.
Do you know of any?
I know it's very difficult to find, but I want to try.
Thanks in advance,
Dany
[edited by: Marcia at 8:52 pm (utc) on Aug. 4, 2002]
[edit reason] no sig URLs, please [/edit]
You could try [xav.com...] - they have one or two scripts, but they are nowhere near as fast as they would need to be to index 1 billion+ pages. Great scripts, though! Auto installer and everything!
[google.com...]
I already knew that one, but it's too elementary....
What about [aspseek.com...] ?
The source code page is on [aspseek.org...]
Let me know what you think about this.
Thanks,
Dany
One more point. Ever considered how hard it would be to market a search engine on search engines? Scary thought.
First time I have heard the Google Search Appliance referred to as "elementary" :)
Grrrr, my current bugbear!
Pionseek - you are going to need a HUGE amount of server space to do any sort of crawling of the web. Unless, of course, you want to focus on one subject....
Thanks for support, guys, greetings from Italy!
Dany,
no sig urls, please, thanks
[edited by: jeremy_goodrich at 10:29 pm (utc) on Aug. 13, 2002]
[edit reason] sig url poster dropped twice against tos [/edit]
I built The Snewp [snewp.com] using only PHP with a MySQL DB. Its source data set is pretty specific (only 14,000 sources), but it IS a specialty search engine, and it does what it was created to do.
A fully spidering search engine is relatively easy to build - the issue that will bite you in the arse is almost always the scalability factor. Indexing gigabytes of data is almost painless -- storing and managing gigabytes of data is painful. Once a spider starts indexing pages, every layer grows exponentially.
For example, you start a spider on one page, let's say the main page of this forum. It has links to the forums, links in messages, etc. By the time you finish spidering this site alone, you are likely to end up with a thousand links. Then the spider starts on that thousand, each of which may lead to a couple hundred more. So you have 1000 * 250 -- and that is just two layers. Keep going (as Google does), and you are quickly into the millions.
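To make the layer growth concrete, here's a minimal breadth-first spider sketch in Python (the thread's examples are PHP/MySQL, but the idea is language-agnostic). fetch_links() is a hypothetical helper; a real crawler would add politeness delays, robots.txt checks, and per-domain limits:

```python
from collections import deque
from urllib.parse import urljoin

import requests                      # pip install requests
from bs4 import BeautifulSoup        # pip install beautifulsoup4

def fetch_links(url):
    """Hypothetical helper: return the absolute links found on one page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}

def crawl(seed, max_depth=2):
    """Breadth-first crawl from one seed page. The frontier grows by
    roughly (links per page) ** depth: 1000 links at layer one, times
    ~250 each at layer two, is already 250,000 URLs."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        url, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return seen
```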
Incidentally, in regard to ODP:
Last year, I ran a test on ODP's link list. After indexing only 250,000 sites - I had found over 25,000 of them dead or moved (10%).
If I had started with ODP's Adult base category (the invisible one), the percentage probably would have been much higher. Basically saying that ODP's data is getting archaic - and there isn't much hope of a rebuild.
It may be that your spider is just too sensitive; you need to revisit the same sites at least twice over a few days, or you will pick up many dead links that are not dead at all. ODP's combination of spider and editors ensures a site has failed on at least three visits before it's removed, and human beings review every spider-identified dead link - every one of them.
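A minimal sketch of that three-visit policy, assuming a simple per-link failure counter (the names are illustrative, not ODP's actual code):

```python
DEAD_AFTER = 3   # a link must fail this many consecutive visits

def record_visit(state, alive):
    """Fold one spider visit into a link's state dict. A single failed
    visit is treated as transient; only repeated failures over separate
    visits flag the link for human review."""
    if alive:
        state["failures"] = 0            # it answered: reset the count
    else:
        state["failures"] += 1
        if state["failures"] >= DEAD_AFTER:
            state["flagged"] = True      # now an editor takes a look
    return state

link = {"failures": 0, "flagged": False}
for alive in (False, False, False):      # three failed visits, days apart
    link = record_visit(link, alive)
print(link["flagged"])                   # True -> queued for human review
```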
Over the last year, I've edited large categories in Society, Shopping and Regional branches; the [not quite] monthly robot review has thrown up fewer than 5% deadlinks in "my" categories - in some stable areas, fewer than 1%.
There'll always be link rot, so long as the Internet exists; the trick is in how effectively it is dealt with. And I know that the vast majority of dead links are removed or suspended, ODP-wide, before the next check. Some, of course, get missed, but most forwarding (eg to domain sales or eBay) is picked up.
ODP's data is getting archaic - and there isn't much hope of a rebuild.
Simply silly; ODP is a living directory - why would it need a rebuild? I think you are confusing ODP with some of the directories using ODP data - some download it too infrequently, others once and never again.
But that's up to them, I'm afraid. If you use Google, you'll find the ODP results (those with directory details) much more rot-proof than Google's own results.
The search side is also conceptually easy but difficult to implement in a scalable fashion. Throwing things into MySQL (or whatever) and relying on the fulltext index almost certainly dooms you to failure. There are a lot of tricks you can use, though, to make things appear better. For example, I have several database tiers of varying sizes, which are both refreshed and searched in a prioritized fashion. That makes it easy to satisfy the common queries with fresh results.
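A toy sketch of that tiering idea (the tier names and in-memory "indexes" are made up; in practice each tier would be its own table or index shard): small, frequently refreshed tiers are searched first, and the query only falls through to the larger, staler tiers when more results are needed:

```python
# Tiers ordered from small/fresh to large/stale (hypothetical layout).
TIERS = [
    {"name": "hot",  "docs": {"u1": "search engine spider news"}},
    {"name": "warm", "docs": {"u2": "free search engine software"}},
    {"name": "cold", "docs": {"u3": "old archived search pages"}},
]

def search(query, want=2):
    """Walk the tiers in priority order, stopping as soon as enough
    results are found - so common queries are answered from the
    freshest, cheapest tier."""
    terms, hits = query.lower().split(), []
    for tier in TIERS:
        for url, text in tier["docs"].items():
            if all(t in text for t in terms):
                hits.append((tier["name"], url))
                if len(hits) >= want:
                    return hits
    return hits

print(search("search engine"))   # the hot tier is consulted first
```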
please read the TOS [webmasterworld.com]
[edited by: jeremy_goodrich at 10:38 pm (utc) on Aug. 13, 2002]
1) I did NOT include Adult category, which A) is of no interest to me for statistical purposes, and B) has hideous link rot - just by random checking.
2) Failure, as my test defined it: anything but a valid page. This includes any sort of HTTP "error" such as redirects, not found, not authorized, etc. - basically everything but the kitchen sink (see the sketch after this list). You would be surprised at the number of links whose domains aren't even registered anymore.
3) I used DMOZ's RDF file for the original source list - not a second hand source such as Google Directory.
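For the curious, a minimal sketch of that strict pass/fail test (a hypothetical helper; only a plain HTTP 200 counts as valid, while redirects, 4xx/5xx, timeouts, and unresolvable domains all count as failures):

```python
import requests

def check(url):
    """Strict validity test as defined above: anything but a direct
    HTTP 200 is a failure. Returns (ok, reason)."""
    try:
        resp = requests.get(url, timeout=10, allow_redirects=False)
    except requests.RequestException:
        # DNS failure, connection refused, timeout - including domains
        # that aren't even registered anymore.
        return False, "no response"
    if resp.status_code == 200:
        return True, "ok"
    return False, "HTTP %d" % resp.status_code   # redirect, 404, 401, ...
```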
----------
I don't have any of my statistical printouts around anymore, but I wouldn't mind running my DMOZ indexer again at some point - if people are interested in actual numbers.
I hope to one day get around to writing up a different sort of search engine - a cross of the DMOZ category and editor approach, and the Google spider approach - but fully web services based. This approach would allow people to add sites, editors to manage them, and the engine to poll them on a regular basis, tracking availability, reliability, etc. Writing it would be rather simple, but having the bandwidth and hardware to run it would be more complicated. In the end, it would be a pointless adventure, since the same information could be gained from cross-searching a few of the bigger engines, such as DMOZ, Google, AllTheWeb, etc.
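The polling half of that idea is simple enough to sketch - here's a hypothetical per-site reliability tracker (an exponential moving average; none of this is an existing system's API):

```python
import time

def update_reliability(site, alive, alpha=0.1):
    """Fold one poll result into a site's running reliability score.
    `site` is a plain dict; alpha controls how fast old history fades."""
    old = site.get("reliability", 1.0)
    site["reliability"] = (1 - alpha) * old + alpha * (1.0 if alive else 0.0)
    site["last_polled"] = time.time()
    return site

site = {"url": "http://example.com"}
for alive in (True, True, False, True):   # four polls, one failure
    site = update_reliability(site, alive)
print(round(site["reliability"], 3))      # drifts below 1.0 after the miss
```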
This idea is basically what RSS Engine does - but RSS Engine is limited to RSS/RDF (XML) data, and not standard HTML, etc. In just under three months, RSS Engine's list has grown from a few thousand to almost 15,000 - and growing still.
----------
PS: If you have any RSS/RDF feeds on your site, and you see RSS Engine, Snewp, or Syndic8 in your log files as a UserAgent - you are being indexed. If you don't, and want to be indexed, throw me a note - we will get you into the database (more the merrier, eh?).
What about the Aspseek search engine? At first glance it seems to be a good free search engine! But I haven't been able to try it, because of the hardware it needs!
An Aspseek demo is available at [aspseek.com...] (here you can see a working example of it running);
Moreover, at [aspseek.org,...] you can find the source files!
Let me know what you think about it!
Regards,
Dany
I don't have any of my statistical printouts around anymore, but I wouldn't mind running my DMOZ indexer again at some point - if people are interested in actual numbers.
I'd be fascinated, but I'm not sure it would be helpful. This discussion, and a parallel one at [webmasterworld.com...], have shown that all the vocal ODP editors (including me!) find a much lower level of link rot than you, however defined. The consensus appears to be 1-5% across the board, though all agree there's variation.
"Robozilla" calls every 6-8 weeks, so your 'random audit' of 10% has to be taken seriously.
However, we are clearly not comparing like with like, and until someone who understands your method - and who knows about Robozilla - cares to identify the problem, we're unlikely to progress!
For what it's worth, one possible cause of many instances is the ODP preference for listing "The Domain", with many .com domains now sniffing your browser and forwarding to any one of 32 nonsense URLs.
Now, we can debate how best to list such sites, but it would be severely unfair to refer to their forwarding as "Link Rot", especially as the idiots, sorry, progressive webmasters, probably paid many thousands for that gimmick (and so wasted that supa dupa .com). Such is progress!
At this stage of the Internet's life, many sites do crazy things, not for their benefit, or the visitors, but "because they can"; hence pop-up cancer that drives away visitors, and interminable flash that sends them to sleep; all of us, including ODP, have to find ways of coping with this - and any measure of 'Link Rot' must separate that from 'Brain Rot', a much more serious problem. :) :)
To build a specialized, smaller search engine that you post on your site, handling 100,000 searches per month, is probably under $50K.
To license somebody else's search engine (big name) and to add some special sauce is probably $100K.
Building a search engine to handle 100k searches a month will not cost you anywhere near $50k - even if you had a couple of leased servers to pay for each month.
I built The Snewp - both hardware and code, and have spent less than $1000, even including my hours. Granted, The Snewp only does about 10k queries a month, but it is specialized and relatively new. It can handle 100k queries per month without any major upgrading. In the end, I would still be under $1000 for the entire project.
Anyway, not picking a fight - just introducing a bit of perspective. :)
The Snewp is a great news search, but I think that is a very different challenge from an algorithmic search engine that crawls and indexes 100M+ documents and returns results in 500ms. I suspect that using a limited number of news feeds and doing a simple search-term lookup in the title or body of a news piece is much, much easier. This is not to take anything away from you. What you have is awesome for $1K and would probably cost your average internet company 10-100x that. (Think of all that time in meetings to get your concept into the product development queue, and more time devoted to QA than programming.) BTW, I am not a programmer or in product development.
A fairer comparison would be a search engine that some ex-Infoseek engineer has been working on. It probably has not cost a lot (other than his time) to build, and it has a large index (I think it was 70M docs), but it does not begin to look like the next Google yet, either. I can't remember the name of this search engine, but it shows up in some of the mass search engine submission software.
If you want to build the next Google, it is going to require a substantial amount of investment, on the magnitude I suggested above, because you will have to be so much better than Google to even get a mention, given their star power. Teoma did sell for something like $5M (and most of that was AskJeeves stock), and people debate whether they have a chance of being a serious player, but they will need to invest a lot if any major website is going to seriously ditch its existing search solution.
Just my two cents