Forum Moderators: bakedjake
I'm looking for free or low-cost search engine software, something Google-like. It should have a spider, of course, and all the main features of a good search engine.
Do you know of any?
I know it's very difficult to find, but I want to try.
Thanks in advance,
Dany
[edited by: Marcia at 8:52 pm (utc) on Aug. 4, 2002]
[edit reason] no sig URLs, please [/edit]
You could try [xav.com...] - they have one or two scripts, but they are nowhere near as fast as they would need to be to index 1 billion+ pages. Great scripts, though! Auto installer and everything!
[google.com...]
I already knew that one, but it's too elementary....
What about [aspseek.com...] ?
The source code page is on [aspseek.org...]
Let me know what you think about this.
Thanks,
Dany
One more point. Ever considered how hard it would be to market a search engine on search engines? Scary thought.
First time I have heard the Google Search Appliance referred to as "elementary" :)
Grrrr, my current bugbear!
Pionseek - you are going to need a HUGE amount of server space to do any sort of crawling of the web. Unless, of course, you want to focus on one subject....
Thanks for support, guys, greetings from Italy!
Dany,
no sig urls, please, thanks
[edited by: jeremy_goodrich at 10:29 pm (utc) on Aug. 13, 2002]
[edit reason] sig url poster dropped twice against tos [/edit]
I built The Snewp [snewp.com] using only PHP with a MySQL DB. Its source data set is pretty specific (only 14,000 sources), but it IS a specialty search engine, and it does what it was created to do.
A fully spidering search engine is relatively easy to build - the issue that will bite you in the arse is almost always the scalability factor. Indexing gigabytes of data is almost painless -- storing and managing gigabytes of data is painful. Once a spider starts indexing pages, every layer grows exponentially.
For example, you start a spider on one page, let's say the main page of this forum. It has links to the forums, links in messages, etc. By the time you finish spidering this site alone, you are likely to end up with a thousand links. Then the spider starts on that thousand, each of which may lead to a couple hundred more. So you have 1000 * 250 -- and that is just two layers. Keep going (as Google does), and you are quickly into the millions.
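To make the layer growth concrete, here's a minimal breadth-first spider sketch in Python (the thread's examples are PHP/MySQL, but the idea is language-agnostic). fetch_links() is a hypothetical helper; a real crawler would add politeness delays, robots.txt checks, and per-domain limits:

```python
from collections import deque
from urllib.parse import urljoin

import requests                      # pip install requests
from bs4 import BeautifulSoup        # pip install beautifulsoup4

def fetch_links(url):
    """Hypothetical helper: return the absolute links found on one page."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {urljoin(url, a["href"]) for a in soup.find_all("a", href=True)}

def crawl(seed, max_depth=2):
    """Breadth-first crawl from one seed page. The frontier grows by
    roughly (links per page) ** depth: 1000 links at layer one, times
    ~250 each at layer two, is already 250,000 URLs."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        url, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append((link, depth + 1))
    return seen
```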
Incidentally, in regard to ODP:
Last year, I ran a test on ODP's link list. After indexing only 250,000 sites - I had found over 25,000 of them dead or moved (10%).
If I had started with ODP's Adult base category (the invisible one), the percentage probably would have been much higher. Basically saying that ODP's data is getting archaic - and there isn't much hope of a rebuild.
It may be that your spider is just too sensitive; you need to revisit the same sites at least twice over a few days, or you will pick up many dead links that are not dead at all. ODP's combination of spider and editors ensures a site has failed on at least three visits before it's removed, and human beings review every spider-identified dead link - every one of them.
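A minimal sketch of that three-visit policy, assuming a simple per-link failure counter (the names are illustrative, not ODP's actual code):

```python
DEAD_AFTER = 3   # a link must fail this many consecutive visits

def record_visit(state, alive):
    """Fold one spider visit into a link's state dict. A single failed
    visit is treated as transient; only repeated failures over separate
    visits flag the link for human review."""
    if alive:
        state["failures"] = 0            # it answered: reset the count
    else:
        state["failures"] += 1
        if state["failures"] >= DEAD_AFTER:
            state["flagged"] = True      # now an editor takes a look
    return state

link = {"failures": 0, "flagged": False}
for alive in (False, False, False):      # three failed visits, days apart
    link = record_visit(link, alive)
print(link["flagged"])                   # True -> queued for human review
```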
Over the last year, I've edited large categories in Society, Shopping and Regional branches; the [not quite] monthly robot review has thrown up fewer than 5% deadlinks in "my" categories - in some stable areas, fewer than 1%.
There'll always be link rot, so long as the Internet exists; the trick is in how effectively it is dealt with. And I know that the vast majority of dead links are removed or suspended, ODP-wide, before the next check. Some, of course, get missed, but most forwarding (eg to domain sales or eBay) is picked up.
ODP's data is getting archaic - and there isn't much hope of a rebuild.
Simply silly; ODP is a living directory - why would it need a rebuild? I think you are confusing ODP with some of the directories using ODP data - some download it too infrequently, others once and never again.
But that's up to them, I'm afraid. If you use Google, you'll find the ODP results (those with directory details) much more rot-proof than Google's own results.
The search side is also conceptually easy but difficult to implement in a scalable fashion. Throwing things into MySQL (or whatever) and relying on the fulltext index almost certainly dooms you to failure. There are a lot of tricks you can use, though, to make things appear better. For example, I have several database tiers of varying sizes, which are both refreshed and searched in a prioritized fashion. That makes it easy to satisfy the common queries with fresh results.
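A toy sketch of that tiering idea (the tier names and in-memory "indexes" are made up; in practice each tier would be its own table or index shard): small, frequently refreshed tiers are searched first, and the query only falls through to the larger, staler tiers when more results are needed:

```python
# Tiers ordered from small/fresh to large/stale (hypothetical layout).
TIERS = [
    {"name": "hot",  "docs": {"u1": "search engine spider news"}},
    {"name": "warm", "docs": {"u2": "free search engine software"}},
    {"name": "cold", "docs": {"u3": "old archived search pages"}},
]

def search(query, want=2):
    """Walk the tiers in priority order, stopping as soon as enough
    results are found - so common queries are answered from the
    freshest, cheapest tier."""
    terms, hits = query.lower().split(), []
    for tier in TIERS:
        for url, text in tier["docs"].items():
            if all(t in text for t in terms):
                hits.append((tier["name"], url))
                if len(hits) >= want:
                    return hits
    return hits

print(search("search engine"))   # the hot tier is consulted first
```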
please read the TOS [webmasterworld.com]
[edited by: jeremy_goodrich at 10:38 pm (utc) on Aug. 13, 2002]
1) I did NOT include Adult category, which A) is of no interest to me for statistical purposes, and B) has hideous link rot - just by random checking.
2) Failure, as my test defined it: anything but a valid page. This includes any sort of HTTP "error" such as redirects, not found, not authorized, etc. - basically everything but the kitchen sink (see the sketch after this list). You would be surprised at the number of links whose domains aren't even registered anymore.
3) I used DMOZ's RDF file for the original source list - not a second hand source such as Google Directory.
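For the curious, a minimal sketch of that strict pass/fail test (a hypothetical helper; only a plain HTTP 200 counts as valid, while redirects, 4xx/5xx, timeouts, and unresolvable domains all count as failures):

```python
import requests

def check(url):
    """Strict validity test as defined above: anything but a direct
    HTTP 200 is a failure. Returns (ok, reason)."""
    try:
        resp = requests.get(url, timeout=10, allow_redirects=False)
    except requests.RequestException:
        # DNS failure, connection refused, timeout - including domains
        # that aren't even registered anymore.
        return False, "no response"
    if resp.status_code == 200:
        return True, "ok"
    return False, "HTTP %d" % resp.status_code   # redirect, 404, 401, ...
```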
----------
I don't have any of my statistical printouts around anymore, but I wouldn't mind running my DMOZ indexer again at some point - if people are interested in actual numbers.
I hope to one day get around to writing up a different sort of search engine - a cross of the DMOZ category and editor approach, and the Google spider approach - but fully web services based. This approach would allow people to add sites, editors to manage them, and the engine to poll them on a regular basis, tracking availability, reliability, etc. Writing it would be rather simple, but having the bandwidth and hardware to run it would be more complicated. In the end, it would be a pointless adventure, since the same information could be gained from cross-searching a few of the bigger engines, such as DMOZ, Google, AllTheWeb, etc.
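The polling half of that idea is simple enough to sketch - here's a hypothetical per-site reliability tracker (an exponential moving average; none of this is an existing system's API):

```python
import time

def update_reliability(site, alive, alpha=0.1):
    """Fold one poll result into a site's running reliability score.
    `site` is a plain dict; alpha controls how fast old history fades."""
    old = site.get("reliability", 1.0)
    site["reliability"] = (1 - alpha) * old + alpha * (1.0 if alive else 0.0)
    site["last_polled"] = time.time()
    return site

site = {"url": "http://example.com"}
for alive in (True, True, False, True):   # four polls, one failure
    site = update_reliability(site, alive)
print(round(site["reliability"], 3))      # drifts below 1.0 after the miss
```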
This idea is basically what RSS Engine does - but RSS Engine is limited to RSS/RDF (XML) data, and not standard HTML, etc. In just under three months, RSS Engine's list has grown from a few thousand to almost 15,000 - and growing still.
----------
PS: If you have any RSS/RDF feeds on your site, and you see RSS Engine, Snewp, or Syndic8 in your log files as a UserAgent - you are being indexed. If you don't, and want to be indexed, throw me a note - we will get you into the database (more the merrier, eh?).
What about the Aspseek search engine? At first glance it seems to be a good free search engine! But I haven't been able to try it, because of the hardware it needs!
An Aspseek demo is available at [aspseek.com...] (here you can see a working example of it running);
Moreover, at [aspseek.org,...] you can find the source files!
Let me know what you think about it!
Regards,
Dany
I don't have any of my statistical printouts around anymore, but I wouldn't mind running my DMOZ indexer again at some point - if people are interested in actual numbers.
I'd be fascinated, but I'm not sure it would be helpful. This discussion, and a parallel one at [webmasterworld.com...], have shown that all the vocal ODP editors (including me!) find a much lower level of link rot than you, however defined. The consensus appears to be 1-5% across the board, though all agree there's variation.
"Robozilla" calls every 6-8 weeks, so your 'random audit' of 10% has to be taken seriously.
However, we are clearly not comparing like with like, and until someone who understands your method - and who knows about Robozilla - cares to identify the problem, we're unlikely to progress!
For what it's worth, one possible cause of many instances is the ODP preference for listing "The Domain", with many .com domains now sniffing your browser and forwarding to any one of 32 nonsense URLs.
Now, we can debate how best to list such sites, but it would be severely unfair to refer to their forwarding as "Link Rot", especially as the idiots, sorry, progressive webmasters, probably paid many thousands for that gimmick (and so wasted that supa dupa .com). Such is progress!
At this stage of the Internet's life, many sites do crazy things, not for their benefit, or the visitors, but "because they can"; hence pop-up cancer that drives away visitors, and interminable flash that sends them to sleep; all of us, including ODP, have to find ways of coping with this - and any measure of 'Link Rot' must separate that from 'Brain Rot', a much more serious problem. :) :)
To build a specialized, smaller search engine that you post on your site, handling 100,000 searches per month, is probably under $50K.
To license somebody else's search engine (big name) and to add some special sauce is probably $100K.
Building a search engine to handle 100k searches a month will not cost you anywhere near $50k - even if you had a couple of leased servers to pay for each month.
I built The Snewp - both hardware and code, and have spent less than $1000, even including my hours. Granted, The Snewp only does about 10k queries a month, but it is specialized and relatively new. It can handle 100k queries per month without any major upgrading. In the end, I would still be under $1000 for the entire project.
Anyway, not picking a fight - just introducing a bit of perspective. :)
The Snewp is a great news search, but I think that is a very different challenge from an algorithmic search engine that crawls and indexes 100M+ documents and returns results in 500ms. I suspect that using a limited number of news feeds and doing a simple search-term lookup in the title or body of a news piece is much, much easier. This is not to take anything away from you. What you have is awesome for $1K and would probably cost your average internet company 10-100x that. (Think of all that time in meetings to get your concept into the product development queue, and more time devoted to QA than programming.) BTW, I am not a programmer or in product development.
A fairer comparison would be a search engine that some ex-Infoseek engineer has been working on. It probably has not cost a lot (other than his time) to build, and it has a large index (I think it was 70M docs), but it does not begin to look like the next Google yet, either. I can't remember the name of this search engine, but it shows up in some of the mass search engine submission software.
If you want to build the next Google, it is going to require a substantial amount of investment, on the magnitude I suggested above, because you will have to be so much better than Google to even get a mention, given their star power. Teoma did sell for something like $5M (and most of that was AskJeeves stock), and people debate whether they have a chance of being a serious player, but they will need to invest a lot if any major website is going to seriously ditch its existing search solution.
Just my two cents