This 78 message thread spans 3 pages.
|create a new search engine|
What are the main points in creating a new search engine like Google?
Is this technically difficult?
Can someone shed some light on this or point me to some resources?
Google for: Anatomy of a search engine
A tech paper from Stanford on Google; it will enlighten you :)
While we are all trying to court Google, do you sometimes feel tired of trying all your tricks to please the pretty girl Google?
Why don't we create our own mini version of a search engine for all of our webmasters' special purposes?
I am willing to devote my spare time ....
When people search, they will have the option to do:
1. commercial search
2. info resource search
A business rank and a resource rank will be assigned to each site, e.g.:
amazon.com: business rank = 9, research rank = 1
stanford.edu: business rank = 1, research rank = 10
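A minimal Python sketch of the two-rank idea just described: every site carries a hand-assigned business rank and a research rank, and the searcher picks which one to sort by. The site names and scores below are invented placeholders; how the ranks would actually be assigned (editorially or automatically) is the open question raised later in the thread.

```python
# Hypothetical two-score table; in a real engine this would live in the index.
SITES = {
    "amazon.com":   {"business": 9, "research": 1},
    "stanford.edu": {"business": 1, "research": 10},
    "example.org":  {"business": 5, "research": 5},
}

def rank_sites(mode):
    """Return site names sorted by the chosen rank, best first.

    mode is "business" (commercial search) or "research"
    (info resource search).
    """
    return sorted(SITES, key=lambda site: SITES[site][mode], reverse=True)
```

With this table, a commercial search would surface amazon.com first, while an info resource search would surface stanford.edu first.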
|brotherhood of LAN|
Since most mainstream algorithms look off the page as well as on it, reading papers on links, link text, and link structure would go some way toward any new SE, IMO :)
See here for the thread paynt started on the theory side of linking.
On that page are links to a good few related papers. The forum libraries have some of the past discussions too.
If you study what SEOs are doing as well as the engines, I think you'll find the karma in the middle ;)
A while ago I grabbed some Java code and compiled a spider. It retrieves all the content of any web site you point it at, and the content of linked sites too if you let it follow the links, including database files.
Basically it can grab everything from a web site.
I haven't done much with it.
Google is just a more complicated, powerful spider, right? You spider the web and then add various algorithms to sort the spidered results.
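The poster's spider was Java, but the core of what it does fits in a short sketch. Here is a hedged Python version: a link extractor built on the standard library's `html.parser`, plus a breadth-first crawl loop. The `fetch` function is deliberately injected as a parameter; in a real spider it would perform an HTTP request, honour robots.txt, and rate-limit itself, none of which is shown here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from every <a href=...> on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl starting at start_url.

    fetch(url) must return the page's HTML as a string.
    Returns a dict mapping url -> html for every page visited.
    """
    seen, queue, pages = {start_url}, [start_url], {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        pages[url] = html
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `seen` set is what keeps the spider from re-fetching pages when sites link back to each other.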
I use Amazon for lots of research on books, CDs, DVDs etc, often without any intention of purchasing them from Amazon. Does this not mean that Amazon is in fact a very useful resource for research as well?
And will this be assigned by a person or automatically?
The idea has already been started by Abrexa [abrexa.co.uk], where they ask visitors to choose from four kinds of search:
Find a company
Your first problem will be getting enough servers and bandwidth to spider all of the pages. Perhaps you are underestimating the size of the web?
Then you need a server powerful enough to search through that database and return useful results in under a couple of seconds. Not easy.
|you need a server powerful enough |
Good point. Google has the largest server farm in the world; a 56,000-machine Linux cluster runs it.
Can we do something like ODP/dmoz?
We ask for people to contribute their machines for this task.
We can start something small, not as huge as google?
Think simple, think storage.
1. Download the entire web crawl (minus images; complete .html, .pdf, .txt, whatever else you want) into storage on disk(s): this will be the "cache".
2. Parse the data from the cache (remove data inside selected tags, strip the tags, apply filters): this is how you tune the index for speed.
3. Index the parsed data.
4. Play with the algo for retrieval.
Call the cache a "feature", no sense in wasting all that data.
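The four steps above can be sketched end-to-end in a few lines of Python. This is a toy, not a design: the "cache" is an in-memory dict instead of disks, the parse step is a crude regex tag-stripper, and the index is a plain inverted index (word to set of URLs) with AND-only queries.

```python
import re

def parse(html):
    """Step 2: crude tag removal. A real parse step would also drop
    <script>/<style> bodies and apply the filters mentioned above."""
    return re.sub(r"<[^>]+>", " ", html)

def build_index(cache):
    """Step 3: build an inverted index from the cache (url -> raw html).
    Returns word -> set of urls containing that word."""
    index = {}
    for url, html in cache.items():
        for word in re.findall(r"[a-z0-9]+", parse(html).lower()):
            index.setdefault(word, set()).add(url)
    return index

def search(index, query):
    """Step 4: AND-query, i.e. pages containing every query word."""
    results = None
    for word in query.lower().split():
        postings = index.get(word, set())
        results = postings if results is None else results & postings
    return results or set()
```

Keeping the raw cache around, as suggested, also means the index can be rebuilt with a different parse step without re-crawling anything.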
"We ask for people to contribute their machines for this task. "
That's been done and is called Grub [grub.org]. It isn't a great solution.
"Can we do something like ODP/dmoz?"
Yes, although you will need to have a number of full time editors and/or lots of volunteer editors.
"We can start something small, not as huge as google?"
What's the point? Who is going to use something small when they can use Google or the ODP?
I think that you might be better off doing a specialised search facility, focusing on something that interests you. At least you stand a chance of both finishing it and having people actually use it!
|We can start something small, not as huge as google?" |
What's the point? Who is going to use something small when they can use Google or the ODP?
A couple of years ago you could have said: who is going to use a small, unknown student project called Google when they can use AltaVista or HotBot? Alta... what? Hot... where?
Anatomy of a Search Engine is a very outdated document by now, and Google is becoming a victim of its own success, as every other search engine in the past did. I would be very happy to try something new.
So stevegpan2, go ahead; start small as Google did, and you may become big.
I am working on my own search engine. Sort of a garage project. Quite tough, I can tell you, but the variety of subjects involved makes it exciting (I admit I am not a programmer).
I think it comes down to three equally important parts:
* crawler including the definition of which sites to crawl next
* sorting the sites (i.e. identifying the importance of each page by keyword)
* querying it in a mighty fast manner
If now a project is proposed here I would be happy to get involved.
|and then add various algorithms to sort the spidered results. |
This simple sentence hides the real secret of a successful search engine. I'm developing a small engine in my spare time. The main problems I came across were:
* Build a robust spider that doesn't crash on every third page because of the buggy HTML out there.
* Find a way to decide which page should be crawled next. (This is very important, because there are people out there, I think they call themselves SEOs, who try to mislead your poor robot into following their spam pages.)
* Find a way to store all the data.
* Find an algorithm that builds an index out of your data, gives good results, scales with O(n) or even better O(log n), and can be easily parallelized.
* Do all of these things very fast.
So be aware that it's not a matter of a big server or a fast connection, but a matter of some really good concepts and programs.
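The "easily parallelized" point in the list above is worth a sketch. One common shape for it (an assumption here, not what this poster necessarily built) is sharded indexing: split the pages into shards, build a partial inverted index per shard with no shared state, then merge. Because the shards are independent, the build step can run on as many machines as you have; only the merge is serial.

```python
def build_partial_index(shard):
    """Index one shard of pages (url -> plain text). Each call touches
    only its own shard, so these could run on separate machines."""
    index = {}
    for url, text in shard.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
    return index

def merge_indexes(partials):
    """Union the posting sets word by word; this is the only serial step."""
    merged = {}
    for partial in partials:
        for word, urls in partial.items():
            merged.setdefault(word, set()).update(urls)
    return merged
```

Doubling the page count with a fixed shard size doubles the number of independent build jobs, which is the O(n) scaling behaviour discussed further down the thread.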
I'm really interested in what you're saying, but not being in any way mathematical, what the hell does this mean:
|scales with O(n) or even better O(log n) and can be easily parallelized |
|Build a robust spider, that doesn't crash on every third page because of the buggy HTML out there |
The HTML parser that comes with Perl is wonderful (keeping an eye open while using it, of course). It also runs a lot faster than I had expected.
O(n) means the amount of work is linearly proportional to the number of pages n.
O(n log n) is more than linear but less than quadratic in n.
For example, to sort n numbers from small to large, you need more than a linearly proportional number of steps.
The part about O(n) or O(n log n) should tell you that you'll need a system which scales very well; in other words, to double the number of pages in your engine you should only need to double your resources. That behaviour is called O(n).
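A concrete way to see the difference is to count comparisons in two ways of looking a word up in a sorted list. Linear scan is O(n): the work grows in step with the list. Binary search is O(log n): doubling the list adds only about one more comparison. This is a toy illustration of the notation, not a claim about how any particular engine stores its index.

```python
def linear_search(sorted_words, target):
    """O(n): comparisons grow in step with the list size."""
    comparisons = 0
    for word in sorted_words:
        comparisons += 1
        if word == target:
            return comparisons
    return comparisons

def binary_search(sorted_words, target):
    """O(log n): halve the candidate range on every comparison."""
    comparisons, lo, hi = 0, 0, len(sorted_words)
    while lo < hi:
        comparisons += 1
        mid = (lo + hi) // 2
        if sorted_words[mid] < target:
            lo = mid + 1
        elif sorted_words[mid] > target:
            hi = mid
        else:
            return comparisons
    return comparisons
```

On a 1,000-word list, finding the last word costs 1,000 comparisons linearly but about 10 with binary search; at a billion words that gap becomes a billion versus about 30.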
The HTML-Parser modules for Perl have some bugs, e.g. a severe memory leak. So I use the external program lynx for the HTML-to-plain-text conversion, which runs without problems.
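For comparison with the lynx approach above, here is a hedged sketch of the same HTML-to-plain-text step in Python. The standard library's `html.parser` is deliberately tolerant, so malformed markup (the "buggy HTML out there" mentioned earlier) degrades the output rather than crashing the spider. This drops `<script>`/`<style>` bodies and collapses whitespace, and nothing more; it is not a substitute for a full renderer like lynx.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Keep only visible text; ignore script and style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

def html_to_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    # Collapse runs of whitespace left behind by removed tags.
    return " ".join(" ".join(extractor.parts).split())
```

Even an unclosed tag at the end of the input is simply buffered rather than raised as an error, which is exactly the forgiving behaviour a spider wants.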
After consulting, I have started my own little niche search engine. I bought pre-made software, which includes handing off work to moderators and such, and hopefully that goes well, because it will be primarily in education for uni students. I am graduating soon, and will try to get lecturers from my own uni to participate in their specialist fields.
My answer to you is: among spidering and all other methods, quality sites matter, not quantity.
Dmoz is a good example. I guess the more sites there are about making money, the more sites will try many different methods to cheat the system,
e.g. changing the page after being indexed in a category. Most likely this will happen with my index in the future, but as on dmoz, those individuals can be banned.
I would prefer to take a long time to get 10,000 sites listed, rather than spider crap.
I used to like the Yahoo index because of its accuracy and results. Even though the Google model finds a lot of relevant searches, I believe a directory model will be infinitely better in accuracy in the long run.
Changing how your page looks for Google, Inktomi, HotBot, MSN, etc. to spider it is really sad.
That's my two cents. Any advice on my own SE will be appreciated.
Putting aside questions about expensive hardware, could a search engine capable of competing with Google be developed as an open source project? Kind of like the Linux of search engines?
This ties in quite nicely with a pipedream we were discussing a few weeks ago - if one were to develop a brand new search engine from scratch, (search engine - not directory) what added functionality would people like to see from it?
I know it's a pretty rudimentary question. Google forms a pretty solid yardstick, but are there items that people would like to see included in their 'ideal' search engine?
GigaBlast's instant indexing is something to add to the wishlist, along with Teoma's nifty search refinement options. There's also the potential to include a comprehensive Flash indexing tool that's been developed - although I suspect this idea would meet with mixed reaction!
What would you guys like to see in an 'ideal' search engine?
Input much appreciated!
Ok lets think about this for a second.
Starting from the top, the users rudimentary requirements of a search engine are:
Speed, relevance, instinctive navigation and accessibility.
What do we mean by speed? The time it takes to type in the search, receive the results and select the site to view. If someone has to spend a long time trawling through badly written listings, then their perceived duration and effort are going to be higher. In terms of the speed of the results returned, the quicker the better.
relevance - The capability of a search engine to retrieve data appropriate to a user's needs. Why does a user use a search engine? They want to find something, and that can be sectioned in various ways. It's probably best to give an example, like the word "car". If you try this search in the various search engines, you'll see a hodgepodge of commercial and non-commercial listings that, while relevant to someone wanting an overall idea, requires further specification to have a better chance of finding what they want. What if certain filters could be put in place, so that when the word "car" is searched, options for the type of search you wanted appear, e.g.
1) car (purchases, sales etc etc)
2) car (rental)
3) car (information, makes, model types etc etc)
4) car (whatever)
Whichever numbers are deselected will filter out those sites, not only giving the user much more satisfactory listings, but also providing very good indicators of what users are really searching for. I realise that it isn't plausible to do this kind of refining for every keyword, so maybe the top 500/1000 would be the best approach. Things like a mistype corrector are valuable assets to have, as many users may not be aware of their mistakes and wonder why the data they require isn't available. I've barely scratched the surface of this subject though.
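The filter idea above can be sketched as a hand-built facet table for the popular queries (the "top 500/1000"), with results filtered by whichever senses the user leaves selected. Every category name and site below is an invented example; the point is only the deselection mechanic.

```python
# Hypothetical facet table: query -> sense -> matching sites.
FACETS = {
    "car": {
        "sales":  ["cardealer.example", "usedcars.example"],
        "rental": ["rentacar.example"],
        "info":   ["carmakes.example", "carmodels.example"],
    },
}

def refine(query, deselected):
    """Return results only for the facets the user has NOT deselected."""
    results = []
    for name, sites in FACETS.get(query, {}).items():
        if name not in deselected:
            results.extend(sites)
    return results
```

A pleasant side effect, as noted above, is that the set of deselected facets itself is data about what users really mean by a query.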
instinctive navigation - The psychology of Human Computer Interaction is a complex and fascinating subject that needs to be in the interface designer's mind when they create the search front end. I still find some search engines lack this vital understanding of how to help the user do what they need to do without requiring an explanation. The art is to design simple functions without losing the power for the user to make very detailed searches and refinements. I think the advanced features are underused; if users understood how effective they were, they would use them, provided the information is conveyed in the right way. I've had the pleasure of watching a librarian's lecture on ways to research a dissertation, and if you compare their search methods to normal users', you realise that if only ordinary users could be "smartened up", they'd find their information much more efficiently.
accessible - Your pet Fido should technically be able to use this search engine if he really wanted to. Joking aside, thinking about the less able among us (disabled, aged) lets us create a better environment for everyone. Things like the Google shortcuts in the Labs section will be tremendously useful for people with manual dexterity problems, or just those who want to search quickly in their own way. Here's some food for thought: the 50-60 year old category are the biggest spenders, drinkers, car buyers and leisure users, which surprises some people. Sites that don't cater to their slight impediments, like arthritis and presbyopia (hardening of the lenses in the eyes, causing reduced sight), will miss out on them and their money. By having customisable style sheets and quick keys, they can access the site with ease and will keep coming back as long as everything else keeps them happy.
One final note: I haven't really talked about the webmaster's rudimentary requirements of a search engine. The webmaster has to have a love-hate relationship with the search engine, since if they just like it, then you've allowed it to be manipulated. This is the biggest drawback of the GigaBlast engine and why webmasters like it (amongst other things). I could almost instantly alter my position in Ixquick by optimising for GigaBlast until I'm happy. However, if the search engine is good to the "good" webmaster, the webmaster will be good to the search engine.
How about a fully customizable engine? The whole thing could fit on a five dollar CD. Anyone with an array of ten thousand computers could quickly install and modify the software.
The problem with search engines is that they are only as good as they are secret. If a person knows the search engine's algorithm, then they know exactly how to optimise for it and manipulate its results. This causes poor output and is why search engines must be developed in small projects.
Linux isn't a secret. Yet you can customize a copy of the software and run all sorts of secret algorithms behind a firewall.
I would appreciate something that would be able to benefit from the context. Creating an open-source 'googleblast' type of engine that indexes quickly and has simplified access or scalability suggested above would not be really 'new', in my opinion.
When posing a natural-language question, you could get a real answer, rather than the usual bunch of sites that you can get anywhere: now this would be worth the effort.
This may be a distant goal, but the IT field advances very fast, so it will certainly be reached in a while.
I know there are open-source engines out there. How much can you really do with them?
that sounds like an "expert system", poluf.
If finding a good and unique algorithm and keeping it secret is so difficult, why not try the open-source philosophy on the algorithm itself? Something simple, easy to understand and honest, that cannot be manipulated by webmasters because of its very nature/design... (Sometimes the simplest things are the best.)
waldemar: It may be an expert system, I don't know. Nevertheless I think this is, or will be, the 'real' next step. Meanwhile, of course, there may be newer engines; they might even crawl the entire web with only a day or two of staleness (just based on the recent news from Caltech about the improved protocol).
This topic of creating a "search engine" is one of my favourites:
From reading posts in the Webmaster Forum, it is fairly clear from the "pool" of talent shown by many webmasters that the task of creating a new search engine or directory is well within their joint technical capability and capacity. This topic reappears every so often, and in my opinion the reason it hasn't been done stems from the obstacle of "how to run with this concept". I don't think a single individual can achieve this, but with a "joint project" I believe it could succeed.
Is it really necessary to have a "highly technical" sophisticated search engine?
Why not use a reasonably simple "directory"?
Why not let the big search engines help? - A "directory" could be listed and have thousands of static html pages listed and found on Google, MSN, Yahoo etc.....
Use the search engines to bring the visitors to the directory......
I believe the key to success will be the marketing of a new - directory!
Establishing a new "Brand" - imagine if 100 webmasters had created their own version of a Yahoo-type directory?
With a well-promoted "directory", businesses will want to be listed - pay to be listed - annual subscription - generating revenue.
With more businesses listed in the directory, the directory will be found by increasing numbers of visitors and keywords.
Some thoughts for a basic "joint project":
Develop and build a website "A" - search engine - directory
Establish company "B" (eg 100 webmasters with equal shares) - promote website "A"
Exclusive contract whereby all revenue from website "A" goes to company "B" (100 webmasters with equal shares)
Assuming the technical and commercial details can be resolved (objective, organisation, operation and ownership)
Assuming that a "joint project" is the answer:
What structure would be suitable for a "joint project"? Assuming a commercial venture, such a project would need defined elements:
Creation of a new search engine - directory, with a commercial business model
What structure would be most suitable for the mutual benefit of participating individuals
How will the project operate from the initial planning stages, through development into functional operation.
This could be one of the greatest hurdles to overcome in a project involving "independent" team members, given the natural resistance to a project that does not have a clearly defined ownership agreement outlining clearly defined mutual benefits.