Forum Moderators: open

Message Too Old, No Replies

They want it faster!

Google striving for realtime updates.

         

jtoddv

8:44 pm on Jul 15, 2002 (gmt 0)

10+ Year Member



Today, I noticed the link under the search field on Google said:

You're brilliant. We're hiring.

Click on it. On the next page, third bullet under:

Large-scale computer systems problems, such as:

Developing algorithms and heuristics to keep our index up to the minute by finding and reindexing almost all web pages within minutes of when they change or they are created.

What will SEO be like if Google ever updates in complete realtime?

JamesR

9:00 pm on Jul 15, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



whoa, I missed that when I first read it. Will this be the reincarnation of Infoseek?

chameleon

9:38 pm on Jul 15, 2002 (gmt 0)

10+ Year Member



Imagine the stress that will put on every web server in the world! That would mean that they would be polling every web site every few minutes.

A lofty goal, but one I doubt they'll ever reach...

Beachboy

9:48 pm on Jul 15, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Realtime spidering and indexing is going on right now at gigablast.com. Fun to play with that.

chiyo

10:17 pm on Jul 15, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm.. pay for instant crawl on the horizon?

enotalone

10:25 pm on Jul 15, 2002 (gmt 0)

10+ Year Member



chameleon, I believe updating more often, on a daily basis or even hourly, does not mean you have to crawl every single page in your database or on the WWW. That would be childish. In reality, I am sure they will use logic to determine every page's average update rate and crawl it according to that measure. Otherwise, it won't only cause server load but waste energy for the Googlebots as well. Why crawl a page hourly if it hasn't changed since 1995? And why not crawl CNN hourly?

mbauser2

10:27 pm on Jul 15, 2002 (gmt 0)

10+ Year Member



Imagine the stress that will put on every web server in the world! That would mean that they would be polling every web site every few minutes.

*Sigh*

Google didn't say they want to poll every site constantly, they said they want algorithms and heuristics to predict which sites need to be polled more often. Algorithms and heuristics use input to make decisions; what Google wants is a reasonably accurate way to identify which pages are changing more often than average (and/or which pages change at set times, like "every Monday"), so it can visit them more often than average.

The idea isn't complex; the implementation is. If every server in the world actually issued proper "Last Modified" headers, the solution would be trivial, if somewhat resource-intensive: Keep track of modification dates during monthly spiderings, figure out which pages are modified every single month, then start visiting those twice a month. If any of those pages appears to be updated twice a month, visit them three times a month. And so on, until the number of spiderings per month approximately equals the number of updates. Throw in a heuristic to spot patterns like "every Monday" or "every weekday", and you've got the magic engine everybody wants.
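The scheme described above can be sketched in a few lines. Everything here (the function names, the one-step rate adjustment) is an illustrative assumption for this thread, not anything Google has published:

```python
def observed_changes(last_modified_history):
    """Count how many times a page changed across successive crawls,
    judged purely by transitions in its Last-Modified header value."""
    return sum(1 for prev, cur in zip(last_modified_history,
                                      last_modified_history[1:])
               if cur != prev)


def next_crawl_rate(changes_last_month, current_rate):
    """Nudge crawls/month toward changes/month one step at a time,
    until the number of spiderings approximately equals the number
    of updates."""
    if changes_last_month > current_rate:
        return current_rate + 1        # page changes more often: crawl more
    if changes_last_month < current_rate and current_rate > 1:
        return current_rate - 1        # page is staler than assumed: back off
    return current_rate


# Example: four monthly crawls of one page, with two observed changes.
history = ["Mon, 01 Jul 2002", "Mon, 08 Jul 2002",
           "Mon, 08 Jul 2002", "Mon, 15 Jul 2002"]
changes = observed_changes(history)    # counts 2 transitions
```

The "every Monday" heuristic would sit on top of this, replacing the blind rate with a schedule fitted to the observed change timestamps.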

Realistically, most sites that update often enough to get spidered "every few minutes" would be sites that are updating every few minutes (like CNN), and probably getting so much traffic that one more bot won't hurt them.

But, unfortunately, not every server in the world sends out useful headers, so the solution will be more complex. That's why Google needs people smarter than you and me.

Axacta

10:43 pm on Jul 15, 2002 (gmt 0)

10+ Year Member



>finding and reindexing almost all web pages within minutes of when they change or they are created<

How are they going to know when any given site will add a new page, and be able to index it within minutes? This would take continuous spidering of "almost all" sites. Maybe they meant pages submitted. Or maybe this is just advertising copy, and we should not read too much into it.

ciml

12:16 pm on Jul 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



From "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page:
Another area which requires much research is updates. We must have smart algorithms to decide what old web pages should be recrawled and what new ones should be crawled.

That was 1998. I'm surprised that there hasn't been more movement in this direction over the last four years.

Why do I have pages fetched every couple of days, even though they haven't changed this year?

Maybe GoogleGuy and his colleagues are working on this problem right now?

Marcia

12:39 pm on Jul 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Brilliant! Either it's true or it's been concocted as yet another ploy by the Google PR people to keep them in the limelight constantly.

>What will SEO be like if Google ever updates in complete realtime?

It would be total SEO madness, constant freaking and tweaking. We'd all be sitting watching the update in one browser window while the searches shuffled and making changes to pages in the website control panels by the minute. Using FTP would be too slow.

agerhart

12:41 pm on Jul 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It would definitely be crazy, but it would be a lot more fun, and it would make all of our theories and strategies a lot more testable. It would also force some of the other search engines to follow suit and improve their indexing and refresh rates.

chris_f

12:51 pm on Jul 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree with mbauser2.

I've just had a thought. They could use the Google Toolbar to help. We already know the Toolbar phones home so think of this.

1. You have the Google Toolbar installed.
2. You visit a site.
3. The Toolbar sends the Last-Modified date (from the page's HTTP header) to Google.
4. If the page is newer than the one indexed then the page is updated in the index.
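A hedged sketch of steps 3 and 4 above: the index keeps the Last-Modified value recorded at crawl time, and a Toolbar ping flags the page for re-crawl when the visitor saw a newer one. The names and the in-memory queue are hypothetical stand-ins for whatever Google's real infrastructure would use:

```python
from email.utils import parsedate_to_datetime

indexed_last_modified = {}   # url -> datetime recorded at last crawl
recrawl_queue = []           # urls flagged as newer than the index

def toolbar_ping(url, last_modified_header):
    """Compare the header the visitor's browser saw with the value
    in the index; queue a re-crawl if the live page is newer."""
    seen = parsedate_to_datetime(last_modified_header)
    known = indexed_last_modified.get(url)
    if known is None or seen > known:
        recrawl_queue.append(url)
        indexed_last_modified[url] = seen
```

A real deployment would also have to distrust the header, since anything the Toolbar reports can be spoofed by the site or the user.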

I could code this for them in under a day. I've already coded a similar application to monitor my sites.

Chris.

jtoddv

1:05 pm on Jul 16, 2002 (gmt 0)

10+ Year Member



chris_f

Apply brother, if you can do it. Maybe you are the one they are looking for. The "Chosen One".

chris_f

1:26 pm on Jul 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You wouldn't believe the app I've built based around the Google Toolbar.

1. It lists the domain PR
2. It lists all the pages of my site (with PR)
3. It tells a history of when Google last crawled the page
4. It monitors PR changes on the page and domain
5. It alerts me of pages which haven't been updated within a set time.

6ish. I'm testing another function whereby I can get the PR value from www2 and www3.

It's the only way I can manage my sites. However, although I have tried, I can't seem to get it working on any machine other than the one I've developed it on.

>>> Apply brother, if you can do it. Maybe you are the one they are looking for. The "Chosen One".

Three problems,
1. No spare time if I go back to a full-time job (I like my freelance work)
2. I'm in the UK and don't want to emigrate
3. You should never meet your God ;).

Chris.

mack

1:57 pm on Jul 16, 2002 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Go on admit it...

We would all miss the update far too much!

It's like being a kid waiting for Christmas.

caine

2:06 pm on Jul 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Then slowly Christmas becomes less important, and you're still thinking good or bad thoughts about your own or your competitors' sites for the various relevancies that Google has attributed.

If the crawl/update can occur at that level of frequency, it would be very beneficial, either to an SEO driving a site to number 1, or to Google's index becoming that much more relevant.

Though, like Marcia, I am more inclined to believe that this is a PR bash at the limelight. From my point of view, a media rebuke to Fast's 7-10 day reindexing announcement.

My only concern, if the situation becomes a reality, is that, assuming no more crawl capacity is used beyond the current amount, extremely frequently updating sites will swallow up all the crawls, leaving less frequent sites to die a slow death.

chris_f

2:30 pm on Jul 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Go on admit it...
We would all miss the update far too much!

Too true.

Chris.

Axacta

3:18 pm on Jul 16, 2002 (gmt 0)

10+ Year Member



It seems to me that if Google doesn't really like SEO that much, realtime updates would just exacerbate what they see as a problem. Sort of counterproductive, in a way.

Digimon

5:37 pm on Jul 16, 2002 (gmt 0)

10+ Year Member



Hey Chris, if you get your application working on other machines, keep in touch with me!

chris_f

5:48 pm on Jul 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Trust me. As soon as it works, I'll post here.

ggrot

6:10 pm on Jul 16, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Actually, I'm pretty sure using the Toolbar is what they plan on doing. If I remember right, there was some variable I saw in the PR XML document at one point that seemed like a hash value of the HTML of the page in the index. The Toolbar could easily calculate its own hash value and compare it to the one it downloads, sending a 'needs respidering' message to Google if the page has changed. It didn't seem to actually phone home this information at the time, though.

This, however, leads to different problems. Namely, what if the page is unique for the visitor, i.e. it displays the visitor's IP on the page or uses a cookie value to look up their name? That is the trick that Google has to overcome.
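That hash comparison, plus one naive dodge for visitor-specific content, could be sketched as follows. Blanking volatile fragments (an IP address, a clock time) before hashing makes two fetches of the same underlying page compare equal; the `VOLATILE` patterns here are illustrative assumptions only, not a complete solution:

```python
import hashlib
import re

# Fragments likely to differ between two visitors viewing the same page.
VOLATILE = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),    # bare IPv4 addresses
    re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"),   # clock times
]

def page_fingerprint(html):
    """Hash the page after blanking visitor-specific fragments."""
    for pattern in VOLATILE:
        html = pattern.sub("", html)
    return hashlib.md5(html.encode("utf-8")).hexdigest()

def needs_respidering(indexed_hash, live_html):
    """True if the live page no longer matches the hash in the index."""
    return page_fingerprint(live_html) != indexed_hash
```

Cookie-personalised names are the harder case: no fixed pattern identifies them, which is exactly the trick described above.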