Running a search Proxy?


Everyman

1:39 pm on Aug 5, 2002 (gmt 0)



I like the /ie interface. I'm thinking that I can create a very simple Google proxy for my site. It wouldn't get more than 50 users a day, I predict, but it would draw attention to the privacy issues I'm trying to highlight. I doubt that Google would block me, because that would only give my issues more attention. I'm nonprofit, zero budget, and just want to raise some issues.

With the /ie interface, you can still add num=100 in the search command line and get 100 results instead of 10. That way you could even strip out the "Next" link. The snippets are there for your use, if you want to parse them out for non-IE users and highlight the keywords. I would strip out the Google logo, as well as their plug for their toolbar.

(This is not much different than what Google does when they cache our pages. They slap their own brand on their copy of our stuff, alter the coding, and recommend a bookmark that keeps folks away from us forever.)

An extremely simple interface would be to shell to a "lynx -source" dump to a file, and then read this file, parse out what you need, and spit it back out to the searcher. If you fork from your Linux CGI program to use lynx, you may need to escape the '&' in the long URL that you pass to lynx. Of course, you can just open a socket and parse on the fly, assuming that you are eager to figure out all that handshaking that even a browser like lynx probably does to make Google happy. (I tried a wget dump but got a Forbidden from Google.)

Also, delete the .lynx_cookies file first, if lynx is set up to accept cookies, so that Google has to issue a new ID for every search.
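In modern Python (which postdates this thread), the two steps above might be sketched like this; lynx and the ~/.lynx_cookies path are assumed to be present, and passing the URL as a separate argv element means no shell is involved, so the '&' characters need no escaping at all:

```python
import os
import subprocess
import urllib.parse

GOOGLE_IE = "http://www.google.com/ie"
COOKIE_FILE = os.path.expanduser("~/.lynx_cookies")

def build_url(terms):
    # urlencode escapes the search terms; the '&' separators are added for us
    return GOOGLE_IE + "?" + urllib.parse.urlencode(
        {"q": terms, "num": "100", "hl": "en"})

def fetch_via_lynx(terms):
    # Drop lynx's cookie jar first so Google has to issue a fresh ID each time
    if os.path.exists(COOKIE_FILE):
        os.remove(COOKIE_FILE)
    # argv-style invocation (no shell), so the URL's '&'s need no escaping
    out = subprocess.run(["lynx", "-source", build_url(terms)],
                         capture_output=True, text=True, check=True)
    return out.stdout
```

Only build_url is exercised below; the fetch itself obviously depends on lynx being installed.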

No advertising! No cookie for the user! No cache links! -- I like that, because I'm opposed to Google's cache, although a lot of searchers wouldn't like it. I would use POST instead of GET for the user's search terms, so that the terms don't end up in my httpd log, and I can brag that we do no logging of search terms.

A Google adbuster and anonymizer! It's just a gimmick, really, because if it started becoming popular, I'd have to take it down. Even if Google didn't take me down first, who'd want to waste much bandwidth on such a gimmick? Calling a shell or forking doesn't come cheap either.

I wonder what Google uses this /ie interface for, and whether they're likely to keep it going. It seems to me that raw results like that, without ads and without all that complex XML and SOAP machinery, are just too tempting a target for Google to keep offering, now that they're getting so heavy into piling new stuff onto their SERPs.

chiyo

1:43 pm on Aug 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think the /ie interface is for one of the IE 5.5 "features" - the search sidebar or something like that.

ciml

3:01 pm on Aug 5, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I really would not advise that, Everyman. As Chiyo says, it's for IE's search pane.

Everyman

4:46 pm on Aug 5, 2002 (gmt 0)



As Chiyo says, it's for IE's search pane.

Well then, shame on DMOZ [dmoz.org] for listing this interface as "Google Results - Bare-bones interface which returns only page titles." There are a number of Google software spinoffs listed on that DMOZ page. I agree that if it were a serious alternative to Google's interface and could handle serious traffic, Google would take action. But if it's done to illustrate a point and has fewer than 50 users a day, and is done by a recognized nonprofit with a record of pursuing privacy issues, I very much doubt that Google would retaliate.

I think I'd have to limit searches from any single IP number to ten per hour. Wouldn't want any SEOs using it for their own purposes!
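The ten-per-hour throttle could be sketched like this in Python (a hypothetical in-memory version; a real CGI setup gets a fresh process per request, so the timestamps would have to be persisted to a file, but the logic is the same):

```python
import time
from collections import defaultdict

LIMIT = 10         # searches allowed per IP
WINDOW = 3600.0    # per one-hour window, in seconds

_hits = defaultdict(list)   # ip -> timestamps of recent searches

def allow(ip, now=None):
    """Return True if this IP may search now, recording the hit if so."""
    now = time.time() if now is None else now
    # Keep only the timestamps that still fall inside the window
    recent = [t for t in _hits[ip] if now - t < WINDOW]
    if len(recent) >= LIMIT:
        _hits[ip] = recent
        return False
    recent.append(now)
    _hits[ip] = recent
    return True
```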

Everyman

6:32 pm on Aug 5, 2002 (gmt 0)



Wow, this is easier than I thought it would be. No lynx required.

Open a socket to www.google.com and make this request:

GET /ie?q=viagra&hl=en&num=100&lr=&ie=ISO-8859-1 HTTP/1.0\n\n

The first 100 hits for Viagra come back in under a second. Zero handshaking required. The headers arrive on top, with the cookie, but ignore them.
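A minimal Python sketch of that socket approach might look like the following. Note this uses the strict CRLF ("\r\n") line endings that HTTP/1.0 formally requires - the bare "\n\n" above apparently worked too, but CRLF is the safe choice - and it adds a Host header, which HTTP/1.0 doesn't strictly demand:

```python
import socket

HOST = "www.google.com"

def build_request(path):
    # A bare HTTP/1.0 request: request line, Host header, blank line
    return ("GET %s HTTP/1.0\r\n"
            "Host: %s\r\n"
            "\r\n" % (path, HOST)).encode("iso-8859-1")

def fetch(path):
    """Send the request and read the raw response, headers and all."""
    with socket.create_connection((HOST, 80)) as s:
        s.sendall(build_request(path))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Only build_request is exercised below; fetch opens a live connection, so treat it as a sketch.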

The 100 hits are nicely set off on their own lines, but if you read line by line, use a fairly large buffer (maybe 4K) in any language that's subject to buffer overflows, in case some fancy tables arrive without line terminations some day.

The hits would not be difficult to parse.
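Something along these lines would do it in Python - though the exact /ie markup isn't documented here, so the assumption that each hit is an ordinary <a href=...>title</a> pair is a guess:

```python
import re

# Hypothetical hit format: a plain anchor tag, title possibly containing markup
HIT = re.compile(r'<a href="?([^">]+)"?>(.*?)</a>', re.IGNORECASE)

def split_body(raw):
    """Drop the HTTP headers: everything up to the first blank line."""
    head, sep, body = raw.partition("\r\n\r\n")
    if not sep:  # tolerate bare-LF responses
        head, sep, body = raw.partition("\n\n")
    return body

def parse_hits(body):
    """Return (url, title) pairs, with any markup stripped from the titles."""
    return [(url, re.sub(r"<[^>]+>", "", title))
            for url, title in HIT.findall(body)]
```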

Wget must have gotten the "Forbidden" because it announces itself with its own User-agent header. I can't think of any other reason.

GoogleGuy

5:58 pm on Aug 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Everyman, Topclick was a partner of ours. They did a Google search without a cookie, without banner ads, etc. Topclick had a strong privacy emphasis as well. They went bankrupt, though; maybe not enough people are willing to pay for privacy? That's a shame, in the same way I lament that SafeWeb and ZeroKnowledge backed away from their privacy-protecting systems. That's why I'm glad that you can get privacy-protected searching from Google just by disabling cookies. Google doesn't allow third-party banners, images, or web bugs that would let other companies track our users' searches.

I guess the main difference between Topclick and your service would be that Topclick was willing to pay for the searches that they did? ;)

NFFC

7:22 pm on Aug 6, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



roflmao, great thread, nice to see you "two" getting along so well :)

>maybe not enough people are willing to pay for privacy?

There is a market there, no question. Sure, people will pay, but it has to be a small charge. If there is one company that could develop a profitable revenue stream here, then imho it is Google. It would be a good trial for a subscription-based, ad-free Google. Why not throw a week's programming at it and see if it flies?
Got to be better than Google Answers, surely?

Everyman

8:44 pm on Aug 6, 2002 (gmt 0)



I guess the main difference between Topclick and your service would be that Topclick was willing to pay for the searches that they did?

Another difference is that we're not trying to be a serious alternative to web searching, and we're not expecting people to pay for anything. We're just trying to make a modest point with a very modest gadget.

You can legally deduct our proxy searches from Google's taxes. You can also opt out by placing a META YOU-HAVE-ZERO-PRIVACY-ANYWAY in the header of all SERPs. (Credit is due here to Scott McNealy, and to your META NOARCHIVE opt-out model.)

I'll be finishing up our interface within a couple of weeks, and then I plan to install this proxy on our site behind an obscure inside link. I'm limiting it to 10 searches per IP number per hour. If I get more than 50 different IPs per day, I'll put additional limits on it. I think your bandwidth can handle this. Our new site gets very little traffic; it's mainly intended as a point of reference for any journalists or bloggers who might be interested in the issues raised.

I don't buy the cookie disabling argument. It's too much trouble to click seven times in Explorer to disable cookies for a Google search, then click seven times to enable them to read the New York Times, then click seven times for another Google search, and then click seven times to log into WebmasterWorld.

But cookies are only half of the problem. The other half is that you log the search terms. Most likely they're logged in two places by Google -- in the httpd log because it's a QUERY_STRING, and in the log that you maintain for the unique cookie ID numbers and IP addresses.

On our proxy, the search terms will be received by POST instead of GET, and will never even touch the surface of our hard disk. Additionally, we have a policy of destroying our httpd logs after they are 60 days old.
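In a modern Python CGI sketch (hypothetical - the proxy's actual language isn't stated here, and the field name "q" is assumed), reading the terms from the POST body instead of the query string looks like this; since the terms travel in the request body rather than the URL, the httpd access log never sees them:

```python
import os
import sys
import urllib.parse

def read_search_terms(stdin=sys.stdin, environ=os.environ):
    """Read the POSTed form body and pull out the search terms.

    With POST, the terms are in the request body, not the URL, so they
    never appear in the httpd access log the way a QUERY_STRING would.
    """
    length = int(environ.get("CONTENT_LENGTH", 0) or 0)
    body = stdin.read(length)
    form = urllib.parse.parse_qs(body)
    return form.get("q", [""])[0]
```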

By the way, I'm rather fond of your /ie interface. You accidentally did the right thing on this one. You should consider it a service you're providing for other future proxies out there, rather than a threat to your ad revenue stream. The point at which you perceive it as a threat is the point at which you've turned the corner from being primarily a search engine to being primarily an ad agency that lifts nonprofit content for backfill to support your revenue stream. I don't believe you've reached that point yet, but time will tell.