Forum Moderators: bakedjake
See here for the thread paynt started on the theory side of linking
[webmasterworld.com...]
On the page is links to a good few related papers. The forum libraries have some of the past discussions too.
If you study what SEO's are doing as well as the engines I think you'll find the karma in the middle ;)
I did not do much about it.
Google is just a more complicated, powerful spider, right? Crawl everything, and then add various algorithms to sort the spidered results.
And will this be assigned by a person or automatically?
The idea has already been started by Abrexa [abrexa.co.uk], where they ask visitors to choose from four kinds of search:
Buy Something
Find Information
Find a company
Be entertained
Your first problem will be getting enough servers and bandwidth to spider all of the pages. Perhaps you are underestimating the size of the web?
Then you need a server powerful enough to search through that database and return useful results in under a couple of seconds. Not easy.
* Download the entire web crawl to disk(s) (less images; complete .html, .pdf, .txt, whatever else you want) into storage: this will be the "cache"
* Parse data from the cache (remove data inside selected tags, remove tags, apply filters): this is how you tune the index for speed
* Index the parsed data
* Play with the algo for retrieval
Call the cache a "feature", no sense in wasting all that data.
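Those steps (cache > parse > index > retrieve) can be sketched in a few lines of Python. The regex-based tag stripping and the toy document cache are illustrative assumptions, not a production parser or a real crawl:

```python
import re

# Toy stand-in for the crawl cache: raw fetched documents kept in storage.
cache = {
    "http://example.com/a": "<html><title>Red cars</title><body>Buy red cars here</body></html>",
    "http://example.com/b": "<html><body>Car rental and car hire</body></html>",
}

def parse(html):
    """Parse step: drop contents of selected tags, strip tags, tokenise."""
    html = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", html)           # remove remaining tags
    return re.findall(r"[a-z0-9]+", text.lower())  # simple token filter

def build_index(cache):
    """Index step: map each term to the set of URLs containing it."""
    index = {}
    for url, html in cache.items():
        for term in parse(html):
            index.setdefault(term, set()).add(url)
    return index

def retrieve(index, query):
    """Retrieval step: intersect the posting sets for each query term."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()
```

Keeping the raw cache around really is a feature: you can re-parse and re-index with different filters without re-crawling.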
That's been done and is called Grub [grub.org]. It isn't a great solution.
"Can we do something like ODP/dmoz?"
Yes, although you will need to have a number of full time editors and/or lots of volunteer editors.
"We can start something small, not as huge as google?"
What's the point? Who is going to use something small when they can use Google or the ODP?
I think that you might be better off doing a specialised search facility, focusing on something that interests you. At least then you stand a chance of both finishing it and having people actually use it!
"We can start something small, not as huge as google?"
"What's the point? Who is going to use something small when they can use Google or the ODP?"
A couple of years ago you could have said: who is going to use a small, unknown student project called Google when they can use AltaVista or HotBot? Alta... what? Hot... where?
"The Anatomy of a Search Engine" is by now a very outdated document, and Google is becoming a victim of its own success, as every other search engine did before it. I would be very happy to try something new.
So stevegpan2, go ahead: start small, as Google did, and you may become big.
I think it gets down to three parts equally important:
* crawler including the definition of which sites to crawl next
* sorting the sites (i.e. identifying the importance of each page by keyword)
* querying it in a mighty fast manner
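The first part, deciding which sites to crawl next, is usually a priority queue over unseen URLs. A minimal sketch, where the priority function (shallower pages first) is only an illustrative assumption; real crawlers weigh inbound links, freshness and politeness too:

```python
import heapq

class Frontier:
    """Toy crawl frontier: pop the highest-priority unseen URL next."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def priority(self, url):
        # Assumption for illustration: fewer path segments = crawl sooner.
        return url.count("/")

    def add(self, url):
        # Never enqueue a URL twice, even if many pages link to it.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (self.priority(url), url))

    def pop(self):
        # Lowest priority value first; ties break alphabetically.
        return heapq.heappop(self._heap)[1]

f = Frontier()
for u in ["http://a.com/deep/page", "http://a.com/", "http://b.com/x"]:
    f.add(u)
```

The `_seen` set doubles as the "already crawled or queued" check, which matters at web scale just as much as the ordering.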
If a project is proposed here now, I would be happy to get involved.
"and then add various algorithms to sort the spidered results."
So be aware that it's not a matter of a big server or a fast connection, but a matter of some real good concepts and programs.
The HTML-Parser modules of Perl have some bugs, e.g. a severe memory leak. So I use the external program lynx to do the HTML-to-PlainText conversion, which runs without problems.
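If you'd rather not shell out to an external program like lynx, a rough HTML-to-plain-text conversion can be done with a standard library parser. A sketch in Python (not the Perl setup described above; the class and its skip list are assumptions for illustration):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def html_to_text(html):
    """Feed markup through the parser and normalise whitespace."""
    p = TextExtractor()
    p.feed(html)
    return " ".join(" ".join(p.chunks).split())
```

An in-process parser avoids the fork-per-document cost of calling lynx, which adds up over a large crawl.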
My answer to you is that, among spidering and all the other methods, quality of sites matters, not quantity.
Dmoz is a good example. I guess the more a site is about making money, the more likely it is to try many different methods to cheat the system,
aka changing the page after being indexed in a category. Most likely this will happen with my index in the future, but, as with Dmoz, those individuals can be banned.
I would prefer to take a long time to get 10,000 sites listed, rather than spider crap.
I used to like the Yahoo index because of its accuracy and results. Even though the Google model finds a lot of relevant searches, I believe a directory model will be infinitely better in accuracy in the long run.
Changing how your page looks so that Google, Inktomi, HotBot, MSN, etc. will spider it is really sad.
That's my two cents; any advice on my own SE will be appreciated.
It's a pretty rudimentary question, that I know. Google forms a pretty solid yardstick, but are there items that people would like to see included in their 'ideal' search engine?
GigaBlast's instant indexing is something to add to the wishlist, along with Teoma's nifty search refinement options. There's also the potential to include a comprehensive Flash indexing tool that's been developed - although I suspect this idea would meet with mixed reaction!
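"Instant indexing" mostly comes down to updating only the postings of the document that changed, instead of rebuilding the whole index. A toy illustration of the idea (the data structures here are assumptions, not GigaBlast's actual internals):

```python
index = {}   # term -> set of document ids (the inverted index)
docs = {}    # doc id -> raw text, kept so we can undo a document's postings

def add_document(doc_id, text):
    """Incrementally index one document: only its own terms are touched,
    so a new page is searchable without a full index rebuild."""
    docs[doc_id] = text
    for term in set(text.lower().split()):
        index.setdefault(term, set()).add(doc_id)

def remove_document(doc_id):
    """Remove a document's postings, e.g. when a page changes or vanishes."""
    for term in set(docs.pop(doc_id).lower().split()):
        index[term].discard(doc_id)
```

An update is then just `remove_document` followed by `add_document`, which is why a changed page can appear in results within minutes rather than after the next full crawl cycle.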
What would you guys like to see in an 'ideal' search engine?
Input much appreciated!
R.
What do we mean by speed? - The time it takes to type in the search, receive the results and select a site to view. If someone has to spend a long time wading through badly written listings, their perceived duration and mental effort are going to be higher. In terms of the speed of the results returned, the quicker the better.
relevance - The capability of a search engine or function to retrieve data appropriate to a user's needs. Why does a user use a search engine? They want to find something, and this can pretty much be sectioned in various ways. To tackle this, it's probably best to give an example like the word "car". If you look at this search in the various search engines, you'll see a hodgepodge of commercial and non-commercial listings which, whilst relevant to someone wanting an overall idea, require further specification in order to have a better chance of finding what they want. What if certain filters could be put in place so that when the word "car" is searched, options for the type of search you wanted appear, e.g.
1) car (purchases, sales etc etc)
2) car (rental)
3) car (information, makes, model types etc etc)
4) car (whatever)
Whichever number/s are deselected will filter out those sites, not only giving the user much more satisfactory listings, but also providing very good indicators of what users are really searching for. I realise that it isn't plausible to do this kind of refining for every keyword, so maybe the top 500/1000 will be the best approach. Things like a mistype corrector are valuable assets to have, as many users may not be aware of their mistakes and wonder why the data they require isn't available. I've barely scratched the surface of this subject though.
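A hand-curated table over the top 500/1000 head terms is one plausible way to wire this up. A sketch, where the category names and the tagged results are invented examples, and the tagging itself would come from editors or a classifier:

```python
# Refinement options for the most ambiguous head terms only.
REFINEMENTS = {
    "car": ["purchase", "rental", "information", "other"],
}

# Each listed result carries the categories it was tagged with.
RESULTS = [
    ("cheap-car-hire.example", {"rental"}),
    ("car-specs.example", {"information"}),
    ("buy-a-car.example", {"purchase"}),
]

def search(term, deselected=frozenset()):
    """Return results whose categories survive the user's deselections."""
    if term not in REFINEMENTS:
        return [url for url, _ in RESULTS]   # no refinement for this term
    keep = set(REFINEMENTS[term]) - set(deselected)
    return [url for url, cats in RESULTS if cats & keep]
```

Logging which categories users deselect would also give exactly the "what are people really searching for" signal mentioned above.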
instinctive navigation - The psychology of Human Computer Interaction is a complex and fascinating subject that needs to be in the interface designer's mind when they create the search front end. I still find some search engines lack this vital understanding of how to help users do what they need to do without requiring an explanation. The art is to design simple functions without losing the power for the user to make very detailed searches and refinements. I think the advanced features are underused; if users understood how effective they were, they would use them, provided the information is conveyed in the correct way. I've had the pleasure of viewing a librarian's lecture on ways to research for a dissertation, and if you compare their search methods to those of normal users, you realise that if only the users could be "smartened up", they'd find their information much more efficiently.
accessible - Your pet Fido should technically be able to use this search engine if he really wanted to. Joking aside, thinking about the less able among us (disabled, aged) allows us to create a better environment for everyone. Things like the Google shortcuts in the Labs section will be tremendously useful for people with manual dexterity problems, or just those who want to search quickly in their own technique. Here's some food for thought though: the 50-60 year old category are the biggest spenders, drinkers, car buyers and leisure users, which surprises some people. Sites that don't tend to their slight impediments, like arthritis and presbyopia (hardening of the lenses in the eyes causing reduced sight), will miss out on them and their money. By having customisable style sheets and quick keys, they can access the site with ease and will keep coming back as long as everything else keeps them happy.
One final note: I haven't really talked about the webmaster's rudimentary requirements of a search engine. The webmaster has to have a love-hate relationship with the search engine, since if they just like it, then you've allowed it to be manipulated. This is the biggest drawback with the GigaBlast engine, and why webmasters like it (amongst other things): I could almost instantly alter my position in Ixquick by optimising for GigaBlast until I'm happy. However, if the search engine is good to the "good" webmaster, the webmaster will be good to the search engine.
Rob
If you could pose a question in real language and get a real answer, rather than the usual bunch of sites that you can get anywhere, now that would be worth the effort.
This may be a distant goal, but the IT field advances very fast, so it will surely be reached before long.
If finding a good and unique algorithm and keeping it secret is so difficult, why not try the open-source philosophy on the algorithm itself? Something simple, easy to understand and honest, that cannot be manipulated by webmasters because of its nature/design... (sometimes the simplest things are the best.)
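To make the idea concrete, here is one toy example of a fully published ranking rule: raw term frequency damped by document length, so stuffing a page with keywords gives diminishing returns. This is purely illustrative, an assumption of what "simple and honest" might look like, and not any real engine's algorithm:

```python
import math

def score(doc, query_terms):
    """Openly published toy ranking: term-frequency over log document
    length. Every webmaster can read it, but gaming it by repeating
    keywords inflates the denominator too."""
    words = doc.lower().split()
    if not words:
        return 0.0
    tf = sum(words.count(t) for t in query_terms)
    return tf / math.log(len(words) + 2)
```

Whether such transparency actually resists manipulation is exactly the open question; the point is only that the formula itself has no secrets to reverse-engineer.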
From reading posts in the Webmaster Forum, it is fairly clear from the pool of talent shown by the many posts from many webmasters that the task of creating a new search engine or directory is well within the joint technical capability and capacity. This topic reappears every so often, and in my opinion the reason it hasn't been done stems from the obstacle of how to run with the concept. I don't think a single individual can achieve this, but with a "joint project" I believe it could succeed.
Is it really necessary to have a "highly technical" sophisticated search engine?
Why not use a reasonably simple "directory"?
Why not let the big search engines help? - A "directory" could have thousands of static html pages listed and found on Google, MSN, Yahoo etc.....
Use the search engines to bring the visitors to the directory......
I believe the key to success will be the marketing of a new - directory!
Establishing a new "Brand" - imagine if 100 webmasters had created their own version of a Yahoo type directory?
With a well promoted "directory", businesses will want to be listed - pay to be listed - annual subscription - generating revenue.
With more businesses listed in the directory, the directory will be found by increasing numbers of visitors and under increasing numbers of keywords.
Some thoughts for a basic "joint project":
Develop and build a website "A" - search engine - directory
Establish company "B" (eg 100 webmasters with equal shares) - promote website "A"
Exclusive contract whereby all revenue from website "A" goes to company "B" (100 webmasters with equal shares)
Assuming the technical and commercial details can be resolved, and assuming that a "joint project" is the answer: what structure would be suitable? For a commercial venture, such a project needs to have clearly defined elements:
Objective:
Creation of a new search engine - directory, with a commercial business model
Organisation:
What structure would be most suitable for the mutual benefit of participating individuals
Operation:
How will the project operate, from the initial planning stages through development into functional operation?
Ownership:
This could be one of the greatest hurdles to overcome in a project involving "independent" team members. There will be basic resistance to any project that does not have a clearly defined ownership agreement outlining clearly defined mutual benefits.