Forum Moderators: open
With the /ie interface, you can still add num=100 in the search command line and get 100 results instead of 10. That way you could even strip out the "Next" link. The snippets are there for your use, if you want to parse them out for non-IE users and highlight the keywords. I would strip out the Google logo, as well as their plug for their toolbar.
(This is not much different from what Google does when they cache our pages. They slap their own brand on their copy of our stuff, alter the coding, and recommend a bookmark that keeps folks away from us forever.)
An extremely simple interface would be to shell to a "lynx -source" dump to a file, and then read this file, parse out what you need, and spit it back out to the searcher. If you fork from your Linux CGI program to use lynx, you may need to escape the '&' in the long URL that you pass to lynx. Of course, you can just open a socket and parse on the fly, assuming that you are eager to figure out all that handshaking that even a browser like lynx probably does to make Google happy. (I tried a wget dump but got a Forbidden from Google.)
Also, delete the .lynx_cookies file first, if lynx is set up to accept cookies, so that Google has to issue a new ID for every search.
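A minimal Python sketch of that shell-out (the lynx invocation and the cookie-file location are assumptions, and lynx itself isn't actually run here):

```python
import shlex
from contextlib import suppress
from pathlib import Path

# Delete the lynx cookie jar first (if it exists) so Google has to
# issue a fresh ID for every search.
with suppress(FileNotFoundError):
    (Path.home() / ".lynx_cookies").unlink()

def lynx_source_cmd(url):
    # shlex.quote() wraps the URL in single quotes so the shell does
    # not treat the '&' characters as background operators
    return "lynx -source %s" % shlex.quote(url)

cmd = lynx_source_cmd("http://www.google.com/ie?q=viagra&num=100&hl=en")
# html = subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout
```

Quoting the whole URL sidesteps having to escape each '&' by hand.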
No advertising! No cookie for the user! No cache links! -- I like that, because I'm opposed to Google's cache, although a lot of searchers wouldn't like it. I would use POST instead of GET for the user's search terms, so that the terms don't end up in my httpd log, and I can brag that we do no logging of search terms.
A Google adbuster and anonymizer! It's just a gimmick, really, because if it started becoming popular, I'd have to take it down. Even if Google didn't take me down first, who'd want to waste much bandwidth on such a gimmick? Calling a shell or forking doesn't come cheap either.
I wonder what Google uses this /ie interface for, and whether they're likely to keep it going. It seems to me that handing out the raw data like that, without ads and without all that complex XML and SOAP stuff, is too tempting a target for Google to leave open, now that they're getting so heavy into piling new stuff onto their SERPs.
As Chiyo says, it's for IE's search pane.
I think I'd have to limit searches from any single IP number to ten per hour. Wouldn't want any SEOs using it for their own purposes!
Open a socket to www.google.com and make this request:
GET /ie?q=viagra&hl=en&num=100&lr=&ie=ISO-8859-1 HTTP/1.0\r\n\r\n
The first 100 hits for Viagra come back in under a second. Zero handshaking required. The headers arrive on top, with the cookie, but ignore them.
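In Python, that socket approach might sketch out like this (build_request mirrors the request line above; fetch() is defined but not called here, since it needs a live connection to www.google.com):

```python
import socket

def build_request(query, num=100):
    # Raw HTTP/1.0 request for the /ie interface; the blank line at
    # the end terminates the request, and CRLF is the spec-correct
    # line ending
    path = "/ie?q=%s&hl=en&num=%d&lr=&ie=ISO-8859-1" % (query, num)
    return ("GET %s HTTP/1.0\r\n\r\n" % path).encode("latin-1")

def fetch(query):
    s = socket.create_connection(("www.google.com", 80))
    s.sendall(build_request(query))
    chunks = []
    while True:
        data = s.recv(4096)
        if not data:
            break
        chunks.append(data)
    s.close()
    raw = b"".join(chunks)
    # Headers (including the cookie) arrive on top; ignore them
    _headers, _, body = raw.partition(b"\r\n\r\n")
    return body

req = build_request("viagra")
```

No handshaking beyond the TCP connect itself.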
The 100 hits arrive nicely set off on their own lines, but if you read them line by line, you'll want a fairly big buffer (maybe 4K) in any language where a fixed-size buffer can overflow, in case some fancy tables arrive without line terminations some day.
The hits would not be difficult to parse.
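For instance, a rough first pass at pulling out the hits (the regex and the sample line here are purely hypothetical, since the real /ie markup may differ):

```python
import re

# Match anchor tags with an absolute http URL; strictly illustrative,
# not tuned against Google's actual /ie output
HIT_RE = re.compile(r'<a href=(?:")?(http[^" >]+)(?:")?[^>]*>(.*?)</a>', re.I)

sample = '<p><a href=http://www.example.com/>Example <b>Result</b></a>'
hits = [(url, re.sub(r"<[^>]+>", "", text))  # strip inner markup from titles
        for url, text in HIT_RE.findall(sample)]
```

Each hit comes out as a (URL, plain-text title) pair, ready for re-templating.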
Wget probably got the "Forbidden" because it was sending its own User-agent header; I can't think of any other reason.
I guess the main difference between Topclick and your service would be that Topclick was willing to pay for the searches that they did? ;)
>maybe not enough people are willing to pay for privacy?
There is a market there, no question; people will pay, but it has to be a small charge. If any one company could develop a profitable revenue stream from this, imho it is Google. It would be a good trial for a subscription-based, ad-free Google. Why not throw a week's programming at it and see if it flies?
It's got to be better than Google Answers, surely?
>I guess the main difference between Topclick and your service would be that Topclick was willing to pay for the searches that they did?
Another difference is that we're not trying to be a serious alternative to web searching, and we're not expecting people to pay for anything. We're just trying to make a modest point with a very modest gadget.
You can legally deduct our proxy searches from Google's taxes. You can also opt out by placing a META YOU-HAVE-ZERO-PRIVACY-ANYWAY in the header of all SERPs. (Credit is due here to Scott McNealy, and to your META NOARCHIVE opt-out model.)
I'm finishing up our interface within a couple of weeks, and then I plan to install this proxy on our site with an obscure inside link. I'm limiting it to 10 searches per IP number per hour. If I get more than 50 different IPs per day, I'll put additional limits on it. I think your bandwidth can handle this. Our new site gets very little traffic; it's mainly intended as a point of reference for any journalists or bloggers who might be interested in the issues raised.
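The per-IP throttle could be as simple as this sketch (the limit and window match the figures above; timestamps are passed in explicitly here just to make the behavior easy to check):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` searches per IP per `window` seconds."""

    def __init__(self, limit=10, window=3600):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent searches

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] >= self.window:
            q.popleft()  # drop searches older than the window
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True

rl = RateLimiter(limit=10, window=3600)
results = [rl.allow("1.2.3.4", now=t) for t in range(11)]  # 11th is refused
```

A deque per IP keeps the memory cost proportional to recent traffic, which should be tiny at 50 IPs a day.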
I don't buy the cookie disabling argument. It's too much trouble to click seven times in Explorer to disable cookies for a Google search, then click seven times to enable them to read the New York Times, then click seven times for another Google search, and then click seven times to log into WebmasterWorld.
But cookies are only half of the problem. The other half is that you log the search terms. Most likely they're logged in two places by Google -- in the httpd log because it's a QUERY_STRING, and in the log that you maintain for the unique cookie ID numbers and IP addresses.
On our proxy, the search terms will be received by POST instead of GET, and will never even touch the surface of our hard disk. Additionally, we have a policy of destroying our httpd logs after they are 60 days old.
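Our actual script isn't shown here, but the POST-reading idea sketches out like this in Python (the field name "q" is an assumption):

```python
import io
import os
import sys
import urllib.parse

def read_post_terms(stdin=sys.stdin, environ=os.environ):
    """Read search terms from a POST body in a CGI script.

    With POST, the terms arrive on stdin rather than in QUERY_STRING,
    so they never appear in the httpd access log.
    """
    length = int(environ.get("CONTENT_LENGTH", 0))
    body = stdin.read(length)
    fields = urllib.parse.parse_qs(body)
    return fields.get("q", [""])[0]

# Simulated CGI request (no real server involved):
terms = read_post_terms(io.StringIO("q=privacy+search"),
                        {"CONTENT_LENGTH": "16"})
```

The access log then records only the path of the proxy script, never the query.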
By the way, I'm rather fond of your /ie interface. You accidentally did the right thing on this one. You should consider it a service that you're providing for other future proxies out there, rather than a threat to your ad revenue stream. The point at which you perceive it as a threat, is the point at which you've turned the corner from being primarily a search engine, to being primarily an ad agency that lifts nonprofit content for backfill in order to support your revenue stream. I don't believe you've reached that point yet, but time will tell.