Gridzoom Throws Down the Gauntlet

gridBot/0.3alpha

pendanticist

9:19 pm on Oct 19, 2004 (gmt 0)

209.123.8.** - - [19/Oct/2004:11:52:21 -0700] "GET /robots.txt HTTP/1.1" 200 1705 "-" "gridBot/0.3alpha (+ [gridzoom.com...]
209.123.8.** - - [19/Oct/2004:11:52:21 -0700] "GET / HTTP/1.1" 200 20402 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

From their Mission Statement:

- So what will make Gridzoom different?
First of all, our spider rates a site by certain quality factors as a human reader would perceive them. Sites with no other purpose than getting at your wallet will be ranked respectively. Then sites will be ranked only by those factors a webmaster cannot influence. Of course, in an ideal world search engines with sophisticated full text algorithms would deliver better results. Unfortunately, this isn't an ideal world, but a web full of cheaters looking for a quick buck. We are out to clean up that mess.

Let the games begin!

jmccormac

9:55 pm on Oct 22, 2004 (gmt 0)

Since the IP is temporary and the final set of IP addresses will be hard to figure out, you'd be better off using the robots methods to block the bot.

This is the part that is rather confusing. Unless Gridzoom is going to use dialup IP space, it will be extremely simple for some people to track down its IPs.

The other aspect is the question of Gridzoom's algorithm. The talk of a "white paper" on the algorithm it is using sounds impressive. However, based on what I've read here, it appears that Gridzoom is just using pre-selection and as such is moving along the SE/Directory path. The pre-selection, at a guess, is probably using regexps targeted at linkswamps and directories generated from SERPs. This is not a difficult thing to do. However, will this give Gridzoom the edge over Google/Yahoo/MSN?
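To illustrate why regexp-based pre-selection is not hard, here is a minimal sketch of one possible heuristic: flag a page when it has many links and very little text per link. This is purely a guess at the kind of filter meant above, not anything Gridzoom has described. The function name and both thresholds are hypothetical.

```python
import re

# Matches the start of an anchor tag that carries an href attribute.
ANCHOR_RE = re.compile(r"<a\s[^>]*href\s*=", re.IGNORECASE)
# Crude tag stripper; good enough for counting visible words.
TAG_RE = re.compile(r"<[^>]+>")


def looks_like_linkswamp(html, min_links=50, max_words_per_link=3.0):
    """Return True when a page is mostly links with little surrounding text.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    links = len(ANCHOR_RE.findall(html))
    if links < min_links:
        return False
    words = len(TAG_RE.sub(" ", html).split())
    return words / links <= max_words_per_link
```

A real filter would need far more than this (an HTML parser, whitelists for legitimate directories, and so on), which is the point: the simple version is easy, and easy to fool.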

Regards...jmcc

Thunderstorm

1:39 pm on Oct 23, 2004 (gmt 0)

The penalizing of bad netcitizens was the controversy here, but it's only part of the concept. It's used to keep certain sites low in the ranking.

But that part of the spider isn't really implemented yet at all, except for a detection of simple-straightforward popups and exit pages.

What really makes gridzoom unique (and imho better than most search engines) is the ranking mechanism, which is a bit like Google's PageRank used to be: Simple, yet effective. And best of all, it's much harder to influence than PageRank. However, I cannot talk about that in detail yet.

jmccormac

1:55 pm on Oct 23, 2004 (gmt 0)

What really makes gridzoom unique (and imho better than most search engines) is the ranking mechanism, which is a bit like Google's PageRank used to be: Simple, yet effective. And best of all, it's much harder to influence than PageRank. However, I cannot talk about that in detail yet.

No system is unbreakable. Google's PageRank system was fundamentally simplistic in that it worked well in theory but it was possible to easily exploit the weaknesses of the system. However, its quality was well above that of the opposition at the time. Gridzoom's process still sounds like a human pre-selector system.

Regards...jmcc

Larryhat

2:25 pm on Oct 23, 2004 (gmt 0)

Hello all: Some simple questions re: Gridzoom.

1) Does/will Grid-zoom have a Submit your Site for Free page?
2) IF so, will we have to copy a code from one of those scrambled schnitzel images to prevent machine spamming?
3) IF IF so, might they disallow zeroes / letters 'O' so those are not confused?
4) IF IF IF so, how about upper and lower case letters? some of those are indistinguishable!

I had a non-commercial and totally white-hat site banned from Altavista because of this once.

Just curious. - Larry

Thunderstorm

2:57 pm on Oct 23, 2004 (gmt 0)

No system is unbreakable.

That is a very popular and also true quote, which completely neglects an important factor: Many systems are extremely hard to break. Yes, the gridzoom system is breakable, but breaking it requires a lot of work and money.

The reason Google does not disclose their algorithm is that they know its weaknesses quite well. We will publish our algorithm (as soon as it's patented) and invite everyone to try to cheat it, so we can make it better.

There are those people trying to break something for their own gain, but there are also those trying to break it for fun and even those wanting to help improve things with their own ideas. If you are being secretive about your methods, the first of these three groups will be in the majority. Imho that's the same reason why Windows is far more buggy than Linux or FreeBSD. ;)

Does/will Grid-zoom have a Submit your Site for Free page?

Currently we do have such a page, which even immediately triggers the spider. That is mostly for testing purposes though, as it gives away the IP address of the spider.

In the future, there are arguments for and against a submit page:
For one, it helps those webmasters putting effort into promoting their page. The philosophy of GridZoom is to take any influence over the search engine out of webmasters' hands, giving those the best ranking who put the most effort into their page.
On the other hand no search engine can repeatedly crawl the whole internet. Even Google only returns often to pages with a high update cycle or a high ranking. So a good site might take quite a while to get into the index, if it isn't linked from a popular site. (In our case, that would be a site with a high update cycle, as we don't measure a site's "importance".) Taking submissions would help with finding new pages quickly.

In other words: We are kinda undecided there. ;)

IF so, will we have to copy a code from one of those scrambled schnitzel images to prevent machine spamming?

I have no idea what you are talking about. ;) But as a basic rule: If a site intentionally does something to influence or break our spider, we will derive a way to identify the method and automatically penalize sites using it. We will not ban sites individually.
If a site does something like the above for a valid reason, which incidentally happens to be a problem for the spider, we'll find a way to work around that.
Of course, if it turns out a serious problem we will probably skip such sites until we can fix things on our end.

jmccormac

4:58 pm on Oct 23, 2004 (gmt 0)

No system is unbreakable.
That is a very popular and also true quote, which completely neglects an important factor: Many systems are extremely hard to break. Yes, the gridzoom system is breakable, but breaking it requires a lot of work and money.

If you have enough experience breaking and making hard to break systems, then you may consider Gridzoom hard to break. (I have a bit of a history of breaking "unbreakable" systems. [1] ;) ). Until then the only time that you can be sure that you have a tough system is when the real SEO players in the field start going after it. The Catch 22 is that gridzoom has to become a significant player in the SE business. The danger is that there may be someone out there that thinks in a totally different way to yourself who sees a gaping hole in your system that could simply be exploited. Google's weakness was that their system was great in a finite situation where money was not at stake. The reality was that once Google became a player, every SEO worth his or her salt was trying to game the algorithm.

The philosophy of GridZoom is to take any influence over the search engine out of webmasters' hands, giving those the best ranking who put the most effort into their page.

This effectively puts Gridzoom in an editorial position. The decisions you make effectively determine the "value" of a website in your SE. With roughly 43 million domains in the gtlds, and at a guess 20 million in the cctlds, you've a starting point of 63 million domains. Now only 40 to 50% of these may have associated websites. That's still about 30 million sites. And these sites may not just be using the English language. How does your site value system work in other languages?

There is also another more fundamental problem that I don't think Gridzoom has solved yet - website acquisition. This is almost a science in itself. Most of the search engines that I've seen over the last few years have no clear site acquisition strategies and tend to rely on blind crawling.

On the other hand no search engine can repeatedly crawl the whole internet. Even Google only returns often to pages with a high update cycle or a high ranking.

Deep crawls occur but in a longer timeframe. The Deepcrawl/Fresh crawl strategy is a logical one. However a deep crawl is a necessary part of the update process. Once you have done a few deepcrawls, you can begin to select sites that are active and sites that are just static brochureware that rarely updates in a year. This deep/fresh approach is essential for search engines. It is a bit simplistic to think that the big players do not spider repeatedly.

So a good site might take quite a while to get into the index, if it isn't linked from a popular site. (In our case, that would be a site with a high update cycle, as we don't measure a site's "importance".) Taking submissions would help with finding new pages quickly.

From this it seems that Gridzoom may have something with the overall idea and the execution of the search engine but it has a significant weakness in the acquisition of new sites. I think that the big players have already solved this one.

Regards...jmcc
[1] For some reason, my name appears in the US Patents db but most of the patents seem to be quoting some of this "work" and or coming up with systems to stop it working. :)

Thunderstorm

5:55 pm on Oct 23, 2004 (gmt 0)

If you have enough experience breaking and making hard to break systems, then you may consider Gridzoom hard to break.

A simple system is easy to judge. The very basic idea of the Google PageRank was a good idea based on the internet as it used to work. It completely neglected the idea of outside influence. A system using a link as a vote is flawed by definition, as everyone can vote plenty of times. In addition to that, there are some well known methods of deriving the weight of a fulltext search. Google has built countless exclusions and keeps tuning their algorithm, but it boils down to a propagated link weighting interacting with standard fulltext algorithms.

I really cannot tell you what exactly we do yet. But once I can, I will appreciate any constructive criticism. I don't claim to know everything. :)

This effectively puts Gridzoom in an editorial position.

EVERY search engine has to rank sites and with that effectively tries to determine both relevance and quality. Gridzoom does the same thing, just with a different approach.
We DO NOT judge your content in an editorial kind of way - we don't care if your opinion differs from our own. ;) The whole system is completely automated.

For that same reason, language doesn't matter either.

Our measure of quality has been discussed a lot in the thread. Our measure of relevance is going to be a lot different from current methods.

Deep crawls occur but in a longer timeframe.

Yes, which is sometimes over half a year for Google and much longer for other engines. If we can pick up new lonely sites faster by allowing people to submit a URL, it would help.

It is a bit simplistic to think that the big players do not spider repeatedly.

That's not what I meant to say. They don't do it often. Well, they probably do it all the time, but it takes quite a while. We won't really do "deep crawls", and I doubt any other search engine really does them either in the classic sense. We rank sites internally, mostly by their update cycle. Different spiders will take care of different levels. For example, there will be one spider crawling maybe the 10,000 most up to date sites, effectively crawling those almost every hour. The next spider will then take care of maybe the sites ranked 10,001 to 100,000, visiting them every day. So the bottom ranked sites will be visited maybe once every month (at best). Newly discovered sites will be added with a high update value, but will then drop if they aren't being updated.
I daresay there isn't one worthwhile page on the internet, which isn't linked anywhere. We are considering dedicating one spider to finding new sites from zone files.
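The tiered schedule described above can be sketched roughly as a lookup from a site's update-cycle rank to a revisit interval. This is only an illustration using the round figures from the post. The function name is hypothetical, and treating everything below rank 100,000 as a single 30-day tier is an assumption, since the post does not say how many levels sit in between.

```python
from datetime import timedelta

# (highest rank in tier, revisit interval), using the figures from the post.
TIERS = [
    (10_000, timedelta(hours=1)),   # "almost every hour"
    (100_000, timedelta(days=1)),   # "visiting them every day"
]
# Assumption: one catch-all bottom tier, "once every month (at best)".
BOTTOM_INTERVAL = timedelta(days=30)


def crawl_interval(rank):
    """Map a 1-based update-cycle rank to how often the site is revisited."""
    if rank < 1:
        raise ValueError("rank is 1-based")
    for highest_rank, interval in TIERS:
        if rank <= highest_rank:
            return interval
    return BOTTOM_INTERVAL
```

Under this scheme a newly discovered site would simply be inserted with a rank inside the top tier and slide down as later visits find it unchanged.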

Site acquisition really isn't a problem. Once we have an index the size of google and enough spare storage to grab the missing 10% we'll start thinking about it again. ;)

but it has a significant weakness in the acquisition of new sites. I think that the big players have already solved this one.

We don't have the answer to everything yet, but every problem can be solved. Except for Google I don't think any other search engine is capable of acquiring new sites in a timely manner. I have somewhat new sites, which were in the Google index after a day, which Yahoo hasn't even picked up after 6 months. They did spider linking sites plenty of times in the meantime. Being able to update your index in a timely manner is much more important.

jmccormac

6:28 pm on Oct 23, 2004 (gmt 0)

For example, there will be one spider crawling maybe the 10,000 most up to date sites, effectively crawling those almost every hour.

Even taking about 100KB of webpages from each of these sites would result in approximately 1GB of data per hour. That's a lot of bandwidth.

The next spider will then take care of maybe the sites ranked 10,001 to 100,000, visiting them every day.

Approximately 90K sites. Again applying the approximate 100KB of webpages from each of these sites would result in a bandwidth requirement of approximately 72Gbits a day.
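For anyone who wants to check the two estimates, the arithmetic is short enough to write out. This sketch assumes decimal units (1KB = 1,000 bytes) to match the round figures, and the helper name is made up for the example.

```python
KB = 1_000  # decimal units, matching the round numbers in the estimate


def bytes_per_pass(sites, kb_per_site=100):
    """Bytes fetched in one pass over `sites` sites at `kb_per_site` KB each."""
    return sites * kb_per_site * KB


hourly_bytes = bytes_per_pass(10_000)       # top tier, fetched every hour
daily_bytes = bytes_per_pass(90_000)        # ranks 10,001 to 100,000, fetched daily
daily_gbits = daily_bytes * 8 / 1e9         # convert bytes to gigabits

print(hourly_bytes / 1e9, "GB per hour")    # 1.0 GB per hour
print(daily_gbits, "Gbit per day")          # 72.0 Gbit per day
```

So the second tier is 9GB a day, which is where the 72Gbit figure comes from.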

We are considering dedicating one spider to finding new sites from zone files.

That is a lot more difficult than it sounds. Part of the work I do is generally tracking domain name usage across hosters in the gtlds and specifically identifying and tracking Irish owned domains and websites. It is a complex process and typically involves a resultant database that varies from 30GB to 80GB. For any "fresh" spidering process (and to keep this new spidering to a reasonable level) the detection process would have to run daily. It is not impossible (I run the processes on a weekly basis for the gtlds and daily for .ie). The other problem is that the new sites often do not go active for a while after being detected. Others never go active. In effect you have to split the acquisition/pre-index phase from the SE itself.
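The core of the daily detection step is a set difference between two zone file snapshots. The sketch below is a heavily simplified illustration, not the process described above: it assumes a TLD zone where each delegation appears as a line with the domain first and an NS record type somewhere after it, and it ignores multi-line records, includes and other master-file details. The function names and the sample data are hypothetical.

```python
def domains_in_zone(lines, origin="com"):
    """Collect delegated domain names from NS records in a zone snapshot."""
    found = set()
    for line in lines:
        fields = line.split(";", 1)[0].split()      # drop comments, tokenize
        # Accept "name NS target" as well as "name ttl IN NS target".
        if len(fields) < 3 or "NS" not in (f.upper() for f in fields[1:-1]):
            continue
        name = fields[0].rstrip(".").lower()
        if name in ("@", origin):                   # skip the zone's own NS set
            continue
        if not name.endswith("." + origin):         # relative name: add origin
            name = name + "." + origin
        found.add(name)
    return found


def newly_delegated(previous, current, origin="com"):
    """Domains present in the current snapshot but not in the previous one."""
    return domains_in_zone(current, origin) - domains_in_zone(previous, origin)
```

The diff only says that a domain now has nameservers. Whether anything answers on port 80, and whether it is more than a placeholder, still has to be checked separately, which is why acquisition ends up as its own phase.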

Regards...jmcc

Thunderstorm

6:57 pm on Oct 23, 2004 (gmt 0)

Even taking about 100KB of webpages from each of these sites would result in approximately 1GB of data per hour. That's a lot of bandwidth.

Remember, we don't load images when spidering. Very few pages actually have 100kb of HTML. From a quick glance at the log I'd say it's more an average of 20KB/page.

But yeah, we are expecting a couple of terabytes to go into spidering each month. You can't maintain a somewhat up to date index over 100TB of data without expecting to crawl at least 10% of that every month. This will probably become a lot more in the long run.

While traffic isn't that cheap, it's really not a major cost factor anymore. (About $200 / TB)
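As a rough worked example of those numbers, assuming the 100TB index size, the 10% monthly recrawl share and the $200 per TB price quoted above (the variable names are only for illustration):

```python
index_tb = 100            # index size from the post, in TB
recrawl_percent = 10      # minimum share recrawled each month
usd_per_tb = 200          # quoted traffic price

monthly_tb = index_tb * recrawl_percent / 100
monthly_cost_usd = monthly_tb * usd_per_tb

print(monthly_tb, "TB per month")       # 10.0 TB per month
print(monthly_cost_usd, "USD per month")  # 2000.0 USD per month
```

That is a lower bound for an index of that size, since the 10% is stated as a minimum.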

That is a lot more difficult than it sounds.

Yeah, it probably is. It's just something we are considering. Right now finding sites to index is really low on our priority list. We have a couple of billion to go first. ;)

While it is said that a robot "follows links", ours really doesn't. The actual spider just reads one page. It is being fed pages by a small script from the database, selected by certain criteria. So anything working from zone files wouldn't trigger the spider at all, but just add new pages to the DB.
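A minimal sketch of that split, assuming an SQLite table as the page database: a feeder query picks pages that are due, the spider handles one page at a time, and any links it finds are written back to the table instead of being followed. The schema, the function names and the "due time" criterion are all assumptions made for this example, not details from the actual system.

```python
import sqlite3


def make_frontier():
    """Create an in-memory page database (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE pages ("
        " url TEXT PRIMARY KEY,"
        " next_due REAL NOT NULL,"
        " fetched INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def add_page(db, url, next_due=0.0):
    """Add a URL if it is not known yet; zone-file tools would call this too."""
    db.execute(
        "INSERT OR IGNORE INTO pages (url, next_due) VALUES (?, ?)",
        (url, next_due),
    )


def due_pages(db, now, limit=100):
    """The feeder: pick the pages whose revisit time has arrived."""
    rows = db.execute(
        "SELECT url FROM pages WHERE next_due <= ? ORDER BY next_due, url LIMIT ?",
        (now, limit),
    )
    return [row[0] for row in rows]


def record_fetch(db, url, links, now, interval):
    """After one page is read: reschedule it and store, not follow, its links."""
    db.execute(
        "UPDATE pages SET next_due = ?, fetched = fetched + 1 WHERE url = ?",
        (now + interval, url),
    )
    for link in links:
        add_page(db, link, next_due=now)
```

With this layout the spider never decides where to go next. Anything that can insert rows, a submit form, a zone file diff or the spider's own link extraction, becomes a source of new pages.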

But you are right, many new domains probably have only placeholders, if any http service at all.

jmccormac

7:35 pm on Oct 23, 2004 (gmt 0)

Remember, we don't load images when spidering. Very few pages actually have 100kb of HTML. From a quick glance at the log I'd say it's more an average of 20KB/page.

Yes. I do know how spiders work after running a country-level search engine for a few years. :) However if these sites are dynamically generated, and do not have 304 provisions, the spider will end up downloading each changed page hourly. If they have 304 provisions, then the spider will be downloading changed pages (estimating about 5 changed pages per hour). On a busy news or bbs site, then this figure will be a lot higher.
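For readers unfamiliar with the "304 provisions": the spider sends the validators it saved from the last visit, and a server that supports conditional requests answers 304 Not Modified with no body when nothing changed. Below is a minimal sketch using Python's standard urllib. The user agent string and function names are placeholders invented for this example. Note that urllib reports a 304 as an HTTPError, so it is caught explicitly.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None, etag=None):
    """Build a GET that asks the server to skip the body if nothing changed."""
    headers = {"User-Agent": "example-spider/0.1"}  # placeholder agent name
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(url, last_modified=None, etag=None, timeout=10):
    """Return (body, last_modified, etag); body is None on 304 Not Modified."""
    request = build_conditional_request(url, last_modified, etag)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (
                response.read(),
                response.headers.get("Last-Modified"),
                response.headers.get("ETag"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Unchanged: keep the stored validators, nothing to re-index.
            return None, last_modified, etag
        raise
```

A dynamically generated site that never sends Last-Modified or ETag gives the spider nothing to send back, so every hourly visit is a full download.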

But you are right, many new domains probably have only placeholders, if any http service at all.

I know - this is what I see every day.

Regards...jmcc

Thunderstorm

8:05 pm on Oct 23, 2004 (gmt 0)

We are looking into ways to distinguish dynamic and user submitted content from editorial content.

We cannot spider every busy forum on the internet every hour, so we want the top pages to be news pages.

Example: The slashdot main pages would be something to spider periodically, but we really can't index every comment page.

This might need a manual list to work off. Forums will probably get their own spider, so will news sites.

Unlike Google we aren't too much interested in blogs either. We will index them, but we don't need to have the opinion of a few hundred thousand bloggers in the index the moment they post something. Those (few) blogs really relevant to a certain topic will get a good ranking on that topic no matter how often we spider them.

That's really the whole idea: Get the best results for general topics on the first few result pages. No one will be able to beat Google's datamining capabilities for a while. Unless someone here wants to sponsor hardware for 50 million (a few primepower 2500 should do), we'll settle for the 90% of users looking for general info. :)

This 41-message thread spans 2 pages.