Forum Moderators: bakedjake

Gigablast now searches more than 200M pages

takagi

3:58 am on Aug 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google and AlltheWeb only change the number of indexed pages on their home pages a few times per year (AlltheWeb updated it in March and August, Google last changed it November last year). On www.gigablast.com you can see a fresh number every time you visit the home page. Today, for the first time, I saw a value bigger than 200 million:

200,129,632 pages indexed

and still counting. Sure, it is much smaller than ATW or Google, but with his limited resources Matt Wells can be proud of it.

moltar

4:53 am on Aug 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Way to go Matt! :)

msr986

6:42 am on Aug 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's nice to see a search engine grow, but I have noticed that Gigablast is not updating the SERPs like it used to.

I've got a site that was completely changed about 6 months ago, including the links and directory structure.

The main index page has been updated, but not any of the interior pages.

I remember 'the good old days' when adding a URL got your site completely spidered almost immediately!

I've got pages that are showing an index date that is over a year old! :(

takagi

8:20 am on Aug 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google last changed it November last year

Just today Google updated the home page:

Searching 3,083,324,652 web pages

into:

Searching 3,307,998,701 web pages

I've got pages that are showing an index date that is over a year old!

Matt Wells stated that the system was designed for 200 - 250 million pages, so my guess is that the focus will soon shift to respidering the old pages.

moltar

3:56 pm on Aug 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



msr986: if you submit every page manually, it will spider it right away.

takagi

4:27 pm on Aug 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> I've got pages that are showing an index date that is over a year old! :(

You may notice pages that have index dates from a long time ago, but that may mean that the spider visited them recently and found them unchanged so it did not reindex them. To make things less confusing, I may soon change the index date to a last visited date, ..
Matt Wells in message 16 of gigablast management [webmasterworld.com]
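
A minimal sketch of the distinction Matt describes, assuming the spider compares a hash of the fetched page against the stored one. The `PageRecord` fields and the `revisit` function are hypothetical names for illustration, not Gigablast's actual code:

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PageRecord:
    content_hash: str   # hash of the page body as last indexed
    index_date: str     # last time the content was (re)indexed
    last_visited: str   # last time the spider fetched the page


def revisit(record: PageRecord, fetched_body: bytes, today: str) -> PageRecord:
    """Return the updated record after the spider fetches the page again."""
    new_hash = hashlib.sha256(fetched_body).hexdigest()
    if new_hash == record.content_hash:
        # Page unchanged: skip reindexing, so the index date stays old
        # even though the spider was just there.
        return PageRecord(record.content_hash, record.index_date, today)
    # Page changed: reindex, so both dates move forward.
    return PageRecord(new_hash, today, today)
```

Under this assumption, an unchanged page keeps an index date that may be over a year old while its last visited date is current, which is why showing the last visited date would be less confusing.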

In the same message he also wrote:

With only about 1.5Mbps of bandwidth, $8k of hardware and while serving 500,000 queries per day it is challenging to keep a two hundred million page index fresh ..
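
To put those numbers in perspective, here is a rough back-of-the-envelope calculation. The average page size of 10 KB is purely an assumption (it is not given in the thread), and the sketch pretends that all 1.5 Mbps goes to spidering, which it does not since queries share the line:

```python
def days_to_recrawl(pages: int, bandwidth_bps: float, avg_page_bytes: int) -> float:
    """Days needed to fetch every page once at the given bandwidth."""
    bytes_per_second = bandwidth_bps / 8
    pages_per_second = bytes_per_second / avg_page_bytes
    return pages / pages_per_second / 86_400  # 86,400 seconds per day


# 200M pages, 1.5 Mbps, assumed 10 KB per page
days = days_to_recrawl(200_000_000, 1_500_000, 10_000)
print(f"{days:.0f} days")  # prints "123 days"
```

Even under these generous assumptions, a single full pass over the index takes about four months.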

Brad

5:01 pm on Aug 26, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



200M is pretty darn good for a three-man search engine company. Consider that AV only has about 500 - 600M pages.

If Gigablast can switch to having an "and" statement as the default instead of "or", they will really be something. Hats off to them for some wonderful progress.

mattdwells

12:43 am on Aug 27, 2003 (gmt 0)

10+ Year Member



Brad,

It is a popular misconception that Gigablast is default OR.
In reality, Gigablast is default AND and default OR combined. You get the best of both worlds. Default AND results are always displayed before the default OR results.

These two sets of results are separated by a clearly displayed blue bar. This way is better than regular default AND because if you misspell a word or enter a long query that has no results, there's a good chance you will get something relevant back without having to do anything else.
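
The combined behaviour can be sketched roughly as follows. This is only an illustration of "AND results first, OR results after", assuming simple whitespace tokenization and no ranking; the `search` function and the toy `docs` mapping are hypothetical and not how Gigablast is actually implemented:

```python
def search(query: str, docs: dict[str, str]) -> tuple[list[str], list[str]]:
    """Split matches into (AND hits, OR-only hits), in that display order."""
    terms = query.lower().split()
    if not terms:
        return [], []
    and_hits, or_hits = [], []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        matched = sum(term in words for term in terms)
        if matched == len(terms):
            and_hits.append(doc_id)   # contains every query term
        elif matched > 0:
            or_hits.append(doc_id)    # contains some terms, shown below the bar
    return and_hits, or_hits


docs = {
    "a": "fast search engine",
    "b": "search tips",
    "c": "cooking recipes",
}
and_hits, or_hits = search("search engine", docs)
print(and_hits, "| blue bar |", or_hits)
```

A misspelled term simply empties the first list, and the second list still gives the searcher something to look at.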

btw, Gigablast should be a little faster now since I finally upgraded most servers to kernel 2.4.21 (right before 2.4.22 came out, sigh...). So now it doesn't swap out my processes for absolutely no apparent reason. yoo-hoo! i've noticed good speed increases as a result.

Matt Wells

moltar

12:51 am on Aug 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Great! Very noticeable increase in speed. I just used it the other day and response time was about 1-2 seconds. I don't know if that was because of heavy traffic at the time or that's how it always was, but today it is definitely faster!

Thank you Matt for your great service!

mattdwells

12:53 am on Aug 27, 2003 (gmt 0)

10+ Year Member



takagi,

yes indeed there are some old pages still in the index. it is my top priority to take care of that asap. i am currently working on some ways to increase the spider rate by about a factor of 10.

speaking of spidering, it is interesting to note that some financially-richer search engines seem to be following Gigablast's lead in the field of continuous index updating. Gigablast, to the best of my knowledge, was the first search engine to continually refresh its entire index automatically. I consider this to be one of, if not the, most complicated technologies in a modern day search engine. There are many many details you have to worry about to make it work and there are many many more things that could go wrong and have disastrous consequences. Once my competitors have a document count that changes in real-time we'll know they've pulled it off, but until then, it's probably not true continuous updating.

matt wells

Brad

1:01 am on Aug 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Matt,

I'm glad to be corrected on that. :) You are on to something very good here.

Chndru

1:05 am on Aug 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



These results were cached 174 minutes ago. [Info]

On the bottom of a SERP this appeared. Looks kinda cool :-)

mahlon

1:40 am on Aug 27, 2003 (gmt 0)

10+ Year Member




Ahh, no Amazon results either ;)

Visit Thailand

1:51 am on Aug 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How do we stop GigaBlast caching the pages?

I do not want to add the robots noarchive tag, as I do want some services, like the Wayback Machine, to archive the pages.

But I do not want G, Gigablast, etc. caching the pages.

Congrats on the 200M!

sidyadav

7:50 am on Aug 27, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I use Gigablast almost as "my own" search engine, as it spiders a URL in seconds. I have added many sites that weren't there before. This is a good way of helping Gigablast even if you don't have your own site!

mattdwells

7:27 am on Aug 29, 2003 (gmt 0)

10+ Year Member



VT,

[gigablast.com...]
tells how to prevent the caching of your pages.

Matt Wells

Visit Thailand

8:29 am on Aug 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Matt. I was going to post here but instead decided I would start another thread about that here:

[webmasterworld.com...]

I do like GigaBlast but am a little disillusioned by all the caching being done by all the SEs.

papamaku

8:40 pm on Aug 30, 2003 (gmt 0)

10+ Year Member



VT,

how come you are so against caching?