| 6:09 am on Dec 24, 2004 (gmt 0)|
why not go ask InfoSpace for a feed instead?
| 4:39 pm on Dec 30, 2004 (gmt 0)|
I've run a meta-search for some years now that searches 20 SEs (10 in 2 countries), combines the results and sorts using its own algo. I'm busy creating a full-blown SE with its own index to replace the meta search, mostly because of load considerations.
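A minimal sketch of one way to "combine and sort" when only each engine's result order is known and not its internal scores. This uses reciprocal rank fusion, a generic technique swapped in for illustration; it is not necessarily the algo mentioned above. The name `fuse_rankings`, the constant `k=60` and the sample URLs are all assumptions.

```python
from collections import defaultdict

def fuse_rankings(ranked_lists, k=60):
    """Merge several ranked URL lists using reciprocal rank fusion.

    Each URL scores sum(1 / (k + rank)) over the lists it appears in,
    so URLs that rank well in several engines rise to the top.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        seen = set()
        for rank, url in enumerate(results, start=1):
            if url in seen:  # ignore duplicates within one engine's list
                continue
            seen.add(url)
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stable output
    return sorted(scores, key=lambda u: (-scores[u], u))

# Hypothetical result lists from three engines
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "a.example", "d.example"]
engine_c = ["b.example", "c.example", "a.example"]

merged = fuse_rankings([engine_a, engine_b, engine_c])
print(merged)
```

Because it only looks at positions, this kind of merge needs nothing from the engines beyond the order they returned.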
Take shak's advice. Unless you're seriously planning to compete in the SE space, writing a truly parallel meta-search from the ground up is not for the faint of heart. However, if you know what a non-blocking socket is, then perhaps it's for you after all.
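To give a feel for the "parallel" part, here is a rough sketch using Python's asyncio, which sits on top of non-blocking sockets. It assumes stand-in engines that only sleep to simulate latency; a real version would issue non-blocking HTTP requests instead. The names `query_engine` and `meta_search`, the delays and the timeout value are all hypothetical.

```python
import asyncio

async def query_engine(name, delay, results):
    """Stand-in for a non-blocking request to one engine."""
    await asyncio.sleep(delay)  # simulates network latency without blocking
    return name, results

async def meta_search(engines, timeout):
    """Query all engines concurrently; give up on any that exceed the timeout."""
    async def guarded(name, delay, results):
        try:
            return await asyncio.wait_for(
                query_engine(name, delay, results), timeout
            )
        except asyncio.TimeoutError:
            return name, None  # engine too slow, treated as no answer

    tasks = [guarded(*engine) for engine in engines]
    return dict(await asyncio.gather(*tasks))

# Hypothetical engines: (name, simulated latency in seconds, canned results)
engines = [
    ("fast", 0.01, ["a.example"]),
    ("medium", 0.05, ["b.example"]),
    ("slow", 2.0, ["c.example"]),
]

responses = asyncio.run(meta_search(engines, timeout=0.5))
print(responses)
```

The total wait is bounded by the timeout rather than by the sum of the engines' response times, which is the main point of doing the requests in parallel.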
| 5:59 pm on Dec 30, 2004 (gmt 0)|
But how would I get the search results?
By opening a connection to Google (for example), requesting a search, and parsing the results?
If so, that sounds illegal, since I'd be stealing somebody else's search results.
| 10:06 pm on Dec 30, 2004 (gmt 0)|
|If so that sounds illegal since I'm stealing somebody else's search results. |
It's not illegal per se, but it's against their T&Cs, and from their point of view denying you access to their site would be a perfectly legitimate response.
|combines the results and sorts using its own algo. |
I have always been curious how that can be done. Search engines sort their results based on parameters they never expose to you. You then take the sorted outputs from X search engines and try to re-sort them yourself, while lacking the myriad important signals that the search engines took into account and you did not.
| 6:02 pm on Jan 7, 2005 (gmt 0)|
The meta search I run is a job search engine. We meta search guys like monster.com, Yahoo's HotJobs etc. The ranking algos they (the job SEs) use are simply keyword based. One problem we've had is that we don't have access to all the keywords for every job. We just get the title and summary info. We also don't have access to the location info (zip code for example) so we can't do fancy things like showing all jobs within a radius from a zip code. It's been very frustrating *sigh*.
But all that has been solved by rearchitecting our SE as a crawler rather than a meta search and indexing the full data for every job. The downside is we now have a 6-hour delay before a newly posted job appears on our SE. We've also developed a way to figure out the longitude and latitude of a job, which gives us the ability to do radius searches and distance filters.
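Once each job has a latitude and longitude, a radius search can be sketched with the haversine great-circle formula. This is only an illustration under assumptions: how coordinates are derived for a job is not shown here, and the function names, job records and the 50 km radius are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def jobs_within(jobs, lat, lon, radius_km):
    """Keep only the jobs whose coordinates lie within radius_km of (lat, lon)."""
    return [
        job for job in jobs
        if haversine_km(lat, lon, job["lat"], job["lon"]) <= radius_km
    ]

# Hypothetical job records with already-resolved coordinates
jobs = [
    {"title": "Job in Newark", "lat": 40.7357, "lon": -74.1724},
    {"title": "Job in Philadelphia", "lat": 39.9526, "lon": -75.1652},
    {"title": "Job in Boston", "lat": 42.3601, "lon": -71.0589},
]

# Search centred on New York City
nearby = jobs_within(jobs, 40.7128, -74.0060, radius_km=50)
print([job["title"] for job in nearby])
```

With a large index, a linear scan like this would normally be preceded by a cheap bounding-box filter, but the distance test itself stays the same.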
We'll be launching later this month. msg me if you want the URL.
| 6:17 pm on Jan 7, 2005 (gmt 0)|
|But all that has been solved by rearchitecting our SE as a crawler rather than a meta search and indexing the full data for every job. |
Ah, let me just get this right: effectively, you use other search engines to narrow down the set of pages, which you then crawl yourself and rank using your own algorithms?
| 7:19 am on Jan 8, 2005 (gmt 0)|
No. We have our own crawler that works the same way any other SE's does. We go out and index everything we can find on a remote website while respecting robots.txt, and then search that index.
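For anyone wondering what "respecting robots.txt" looks like in practice, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleJobBot` user agent and the URLs are made up for illustration, and the rules are parsed from a string so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a crawler would download this from the site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(url, agent="ExampleJobBot"):
    """Return True if the parsed rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)

public_ok = allowed("https://jobs.example/listing/42")
private_ok = allowed("https://jobs.example/private/admin")
print(public_ok, private_ok)
```

A crawler would run a check like this before every fetch and simply skip any URL that is disallowed.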
| 2:48 pm on Jan 8, 2005 (gmt 0)|
I've built two metasearch engines.
One in Java and the other in Python. The Java one is threaded and can be run on multiple servers. The Python one is simply a foundational class taken from the Java application as a proof of concept.
Anyway, you'll need the following:
a regex class
a class for each engine
a class to gather the results
a class to perform various algorithms on the results gathered
a class to display the results to the user
a lot of JSP and servlets to handle the above
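A very rough sketch of how the pieces in that list could fit together, written in Python rather than JSP and servlets. Everything here is an assumption for illustration: the class names, the HTML markup and the regex patterns are invented, and fetching is left out so that the pages are passed in as already-downloaded strings.

```python
import re

class Engine:
    """Base class: one subclass per engine, each with its own result pattern."""
    name = "base"
    pattern = None  # subclasses supply a compiled regex with url/title groups

    def parse(self, html):
        # Returns (url, title) pairs in the order the engine listed them
        return [(m.group("url"), m.group("title"))
                for m in self.pattern.finditer(html)]

class ExampleEngineA(Engine):
    name = "engine-a"
    pattern = re.compile(r'<a class="r" href="(?P<url>[^"]+)">(?P<title>[^<]+)</a>')

class ExampleEngineB(Engine):
    name = "engine-b"
    pattern = re.compile(r'<li data-href="(?P<url>[^"]+)">(?P<title>[^<]+)</li>')

class Gatherer:
    """Collects parsed results from every engine and drops duplicate URLs."""
    def __init__(self, engines):
        self.engines = engines

    def gather(self, pages):
        # pages maps engine name -> raw HTML that was already fetched
        merged, seen = [], set()
        for engine in self.engines:
            for url, title in engine.parse(pages.get(engine.name, "")):
                if url not in seen:
                    seen.add(url)
                    merged.append({"url": url, "title": title, "source": engine.name})
        return merged

def render(results):
    """Display step: format the merged results as a numbered plain-text list."""
    return "\n".join(f"{i}. {r['title']} ({r['url']})"
                     for i, r in enumerate(results, 1))

# Hypothetical raw pages from two engines
pages = {
    "engine-a": '<a class="r" href="http://one.example/">One</a>'
                '<a class="r" href="http://two.example/">Two</a>',
    "engine-b": '<li data-href="http://two.example/">Two again</li>'
                '<li data-href="http://three.example/">Three</li>',
}

results = Gatherer([ExampleEngineA(), ExampleEngineB()]).gather(pages)
page = render(results)
print(page)
```

The per-engine patterns are the fragile part: each one breaks whenever that engine changes its markup, which is why keeping them isolated in their own classes helps.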
You'll need to carefully consider timeouts on the various engines.
You'll need to consider how long your user must wait, and what they can do to adjust the wait.
Finally, you'll need to consider whether you even want to do it. I'd like to go into detail about my background but I can't. Virtually every search engine's TOS says you can NOT metasearch their results. You will be BLOCKED. Google in particular will NOT allow you to display their results ANYWHERE unless you pay... and the fee is quite high. Bottom line: metasearch as we once knew it is probably dead, if done 100% by the TOS. Most metasearch sites are not following the TOS of the various search engines, or else they are paying big bucks for permission.