Future of the web: Smart Browsing

A spider on every desktop?

         

webdevsf

3:59 pm on Jun 16, 2003 (gmt 0)

10+ Year Member



The browser is relatively dumb right now. You click a link and it shows a page. There are a few browsers (I think one version of Mozilla) that are extending the browsing concept by preloading pages from a site. Most of the work done on browsers has been to show fancier and fancier pages.

Why can't this preloading concept extend to the whole site? And if it extends to the whole site, why can't it extend to every site? I.e., why can't I just leave my PC on and let it surf all the sites I like, all the time, and then surf other sites according to the same algorithm Google uses?

It would create my own personal index and page rank according to my own preferences, and I couldn't care less what the engineers at Google think I want. (Sorry GoogleGuy) These absurd "named" Google index updates will not matter. Index updates will happen when I sleep or am on the phone. They'll pause when I have to finish what I'm doing on my PC.

I know that Google has 56,000 linux boxes in a cluster or whatever, but when I have 2 terabytes of disk in 5 years and a 10,000 GHz Pentium 7 with a big fat Internet pipe, why can't I just do this on my own? It might not contain every site in the world, but the truth is, I doubt I need that anyway.
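
The core loop wouldn't even need to be exotic. Here is a toy sketch of a personal spider and index in Python - the agent name, seed list, and politeness settings are placeholders I made up, not anyone's actual bot:

    # Toy sketch of a personal desktop spider: fetch, index words, follow links, wait.
    # Everything here (agent name, delay, page cap) is a made-up placeholder.
    import time
    import urllib.request
    import urllib.robotparser
    from collections import defaultdict
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class PageParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links, self.words = [], []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)
        def handle_data(self, data):
            self.words.extend(data.lower().split())

    def allowed(url, agent="MyPersonalSpider"):
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(urljoin(url, "/robots.txt"))
        try:
            rp.read()
        except (OSError, ValueError):
            return False
        return rp.can_fetch(agent, url)

    def crawl(seeds, max_pages=50, delay=5.0):
        index = defaultdict(set)            # word -> set of URLs containing it
        queue, seen = list(seeds), set()
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen or not url.startswith("http") or not allowed(url):
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except OSError:
                continue
            page = PageParser()
            page.feed(html)
            for word in page.words:
                index[word].add(url)
            queue.extend(urljoin(url, link) for link in page.links)
            time.sleep(delay)               # one request every few seconds, for the site owner's sake
        return index

    # index = crawl(["http://example.com/"])
    # print(sorted(index.get("browsers", set())))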

Google will still have a place in the world, but the desktop will return as the source of power, and you won't have to rely on somebody else's algorithms to find the pages you want.

BlobFisk

4:27 pm on Jun 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sounds like you're advocating the idea of a Semantic Web, where people would run their own agents from their own machines!

Brett_Tabke

4:36 pm on Jun 16, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Note to self: order more bandwidth and send bill to members.
Note to self2: study those Perfect htaccess ban list threads closer.

webdevsf

5:08 pm on Jun 16, 2003 (gmt 0)

10+ Year Member



I just want to run the GoogleBot from my machine. :)

DrDoc

6:25 pm on Jun 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"I know that Google has 56,000 linux boxes in a cluster or whatever, but when I have 2 terabytes of disk in 5 years and a 10,000 GHz Pentium 7 with a big fat Internet pipe, why can't I just do this on my own?"

Because by then there will be 999.999.999.999.999.999 pages...

Why would you want to do that though? Personally I think it would suck to get a bunch of sites I'm really not interested in.

Also, it would suck to be the owner of a site, since I couldn't possibly tell whether it was a real page view or just a dumb download.

jdMorgan

6:40 pm on Jun 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ugh. I'm with BT on this one...

Notes to self:

  • Write up and post the access-by-permission-only user-agent blocking code now in testing (general idea sketched below).
  • Check with KM again next week to see how his 'bot-control project is going.

Jim
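
(For illustration only - the real thing is presumably htaccess work, but "access by permission only" boils down to a user-agent whitelist, something like this Python sketch with made-up entries:)

    # Illustration only, with made-up entries: anything that doesn't match an
    # approved agent is treated as an unapproved robot.
    APPROVED_AGENTS = [
        "Mozilla",       # mainstream browsers (many bots spoof this, so it's only a first pass)
        "Googlebot",     # crawlers you've explicitly decided to allow
    ]

    def is_permitted(user_agent):
        """Allow the request only if the user agent contains an approved token."""
        if not user_agent:
            return False
        return any(token in user_agent for token in APPROVED_AGENTS)

    # is_permitted("Mozilla/4.0 (compatible; MSIE 6.0)")  -> True
    # is_permitted("ShinyNewSpider/0.1")                   -> False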

webdevsf

7:13 pm on Jun 16, 2003 (gmt 0)

10+ Year Member



I'm just saying - I don't see how you are going to tell the difference between a bot and a browser. Why shouldn't I be able to download your whole site? I may want to read parts of it. Heck, I might even buy from you.

I'll get around seeing lots of sites I don't want by having an index and searching it the same way I search Google. I'll have an automated routine that immediately tosses stuff I don't think is relevant.

The problem is that there are a lot of entrenched experts who think the website should define the user experience, instead of the user.

Like it or not, information is a commodity and bandwidth gets cheaper by the day. The htaccess lists won't work for much longer. There are a lot of distributed projects on the web, and it'll be easy to get around almost any kind of ban - IP, user agent, whatever - with massively distributed p2p technology.

john316

7:22 pm on Jun 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well... someone could just create an "open index" and allow queries. If I value my privacy I might even pay a small fee to access it, or the cost of access could already be bundled into the desktop search application.

Search engines are stuck right now not because of technology but because of social issues; the big barrier is: do I trust them with my personal info?

webdevsf

7:36 pm on Jun 16, 2003 (gmt 0)

10+ Year Member



The problem with a central index is that the structure of the index affects the results. So if I think PageRank is stupid, I want to substitute MyNewFancyRank instead. But that stuff is deep, deep in there - i.e., when scanning the index for results, it checks only the top-ranked stuff. It can't go through the entire result set or else searches would take forever.

So an open index wouldn't really solve it. You'd need the desktop app to be able to build the index itself to get different kinds of search results.

Somebody could easily build an add-on to Mozilla that indexes sites every day using a custom filter and adds a special search box.
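
Something along these lines (a toy sketch; MyNewFancyRank and the "signals" are made up, and it assumes a simple local word -> URLs index like the one described earlier in the thread):

    # Toy sketch: the same local index, but the ranking function is swappable.
    # "signals" is whatever per-URL data you choose to collect (visits, bookmarks, ...).
    def keyword_rank(url, signals):
        """Plain rank: how many of the query words the page matched."""
        return signals[url]["matched_words"]

    def my_new_fancy_rank(url, signals):
        """Made-up personal rank: favour pages you actually revisit."""
        return 3 * signals[url].get("visits", 0) + signals[url]["matched_words"]

    def search(index, signals, query, rank=keyword_rank, limit=10):
        words = query.lower().split()
        candidates = set().union(*(index.get(w, set()) for w in words)) if words else set()
        for url in candidates:
            signals.setdefault(url, {})
            signals[url]["matched_words"] = sum(1 for w in words if url in index.get(w, set()))
        return sorted(candidates, key=lambda u: rank(u, signals), reverse=True)[:limit]

    # Same index, different ordering, no rebuild needed:
    # search(index, {}, "open source browsers", rank=my_new_fancy_rank)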

john316

7:48 pm on Jun 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You could utilize more than one search service. I was looking at something like this the other day, and although it is slow and kludgy, if it were refined the concept would turn search engines into commodity search suppliers.

[snowtide.com...]

Zap the link if you like; I have no interest in the company other than the fact that I think it may be relevant to the thread.

jdMorgan

7:53 pm on Jun 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



webdevsf,

The concept of a web search spider on every desktop (and even every laptop) is a sure-fire recipe for an internet meltdown; the internet is not infinitely scalable at no cost. In most cases, my access rules and scripts *can* tell the difference between a robot and a surfer, and since I pay the bandwidth bills, I *will* continue to decide what is abusive and what is not. Even in the case of a perfectly implemented spider with human behavioural traits, the very least I'm going to do is throttle its resource access rate to something sustainable.
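
Just to make the throttling point concrete - a toy Python sketch, not my actual access rules, with an arbitrary two-second limit:

    # Toy sketch of per-client throttling (not anyone's real rules; the limit is arbitrary).
    import time

    MIN_INTERVAL = 2.0      # seconds allowed between requests from the same client
    last_seen = {}          # client IP -> timestamp of its previous request

    def should_serve(client_ip):
        """Serve the request unless this client is coming back faster than the limit."""
        now = time.time()
        previous = last_seen.get(client_ip)
        last_seen[client_ip] = now
        return previous is None or now - previous >= MIN_INTERVAL

    # A burst from one address: the first request passes, the rapid repeats don't.
    # should_serve("10.0.0.5")  -> True
    # should_serve("10.0.0.5")  -> False (came back too fast)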

I am all for an open web. I am not, however, in favor of an open bandwidth-overage-fee pipe to my wallet.

The site where my restrictions are tightest is a non-commercial site, and so it is very sensitive to hosting costs.

I think any responsible 'bot programmer or scripter should spend a moment thinking about the effects their creation will have if multiplied a thousandfold or several millionfold. If it still seems responsible to release the code after that thought experiment, then fine, release it. But don't blame me for blocking it if the thinking wasn't quite deep enough and the 'bot provides no benefit for my visitors or my site in return for the bandwidth it uses.

MHO,
Jim

martinibuster

8:43 pm on Jun 16, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



webdevsf has a good idea - this is something that many search industry leaders have been considering: Personalized Search.

The Search Engine that Can Read Your Mind
A search engine that is customized to your sex, geographic location, age, interests, hobbies, etc. I heard a presentation by Yahoo! in which this was discussed. Although desktop implementation was not mentioned specifically, a desktop search engine would be the ideal way to provide a highly accurate search experience.

The Google, Yahoo!, and AskJeeves toolbars are but a primitive implementation of desktop search. Naturally, privacy issues would have to be addressed. But if millions of users can be persuaded to use toolbars and MS Passport, then having a personalized Google search engine on your desktop may not be that great a leap for most people.

Desktop Search: Is it really so far-fetched?
Desktop search could be the next wave 5-10 years down the road. For example, Microsoft is already integrating search into its next operating system, codenamed Longhorn. Data storage and file retrieval are a big part of the next OS: seamlessly finding stuff on your hard drive or on the web is a high priority for Longhorn. As far as I know, this isn't a true desktop search engine, but it's a step in that direction.

I don't see bandwidth issues as much of a problem if you limit the amount of the web you spider to the user's interests. For instance, if you are interested in baseball, why would you need a copy of the USDA web site?
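
Nothing any of the engines has announced, but that "only spider what I care about" filter could be as simple as this made-up sketch: before a page gets indexed or its links followed, check it against an interest profile.

    # Made-up sketch: restrict a personal spider to pages that match an interest profile.
    INTERESTS = {"baseball", "pitching", "world series", "batting average"}

    def matches_interests(page_text, interests=INTERESTS, threshold=2):
        """Index (and follow links from) a page only if it mentions enough interest terms."""
        text = page_text.lower()
        hits = sum(1 for term in interests if term in text)
        return hits >= threshold

    # A box-score page passes; the USDA home page (presumably) gets skipped.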

jeremy goodrich

9:08 pm on Jun 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Think about Grokker by Groxis - and think about the research that was done in the late 1990s by at least one university on an agent that 'surfed with you' to learn what type of pages you like and which you don't.

Now consider that the tool john316 pointed out is just one of many available right now that provide powerful aggregation & filtering mechanisms for more 'personalized search'.

One of the biggest problems with this - aside from the bandwidth cost potential, as has been pointed out several times - is simply a lack of diverse data on the net. Sure, you can get a set of personalized 'widget results', but what if there are only 10K pages with that word string? This happens quite often, even when you do a 'find all words in any document' search instead of a 'find this literal string' search.

Also, people (joe public...) still don't even use the advanced search page on most engines, and those are way more powerful than the default options.

It will take, first, a 'new revolution' in user sophistication for people to 'buy in' to these ideas, imho, before there is the market demand for them.

DrDoc

10:59 pm on Jun 16, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Personalized search - yes
Billions of search engines - no

You might want to buy from me? Oh, too bad I run a non-profit Web site that has a ton of nifty information on it then, eh?

The idea is good, but it would never work in reality.

What about a freshly installed computer? Should it be shipped with the entire Internet on it? You would initially need a search engine, or else you would have to wait a week after connecting to the net before you could start using it.

Also, why would you want to do this on every single computer when there are centralized services that won't need a few years to come up with something like this - and without you having to pay for it? Part of the problem is technology - it is difficult to make such a search engine efficient enough. But, trust me, it will come!

john316

12:21 pm on Jun 17, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I could see some major OS vendors coming up with substantial indexes and APIs similar to Google's and offering them to the developer community as "value added".

That would put search on the desktop quickly.
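
If that happened, the desktop piece could be as thin as an HTTP call to the vendor's index. A sketch, with an invented endpoint and response shape:

    # Sketch of a desktop client for a hypothetical OS vendor's search API.
    # The endpoint, parameters, and response shape are invented for illustration.
    import json
    import urllib.parse
    import urllib.request

    def desktop_search(query, endpoint="https://search.example-vendor.com/api/query"):
        url = endpoint + "?" + urllib.parse.urlencode({"q": query, "max": 10})
        with urllib.request.urlopen(url, timeout=10) as response:
            return json.load(response)       # e.g. {"results": [{"url": ..., "title": ...}]}

    # for hit in desktop_search("smart browsing")["results"]:
    #     print(hit["title"], hit["url"])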