Forum Moderators: open
You're brilliant. We're hiring.
Click on it. On the next page, third bullet under:
> Large-scale computer systems problems, such as:
> Developing algorithms and heuristics to keep our index up to the minute by finding and reindexing almost all web pages within minutes of when they change or they are created.
What will SEO be like if Google ever updates in complete realtime?
Imagine the stress that would put on every web server in the world! It would mean polling every web site every few minutes.
*Sigh*
Google didn't say they want to poll every site constantly, they said they want algorithms and heuristics to predict which sites need to be polled more often. Algorithms and heuristics use input to make decisions; what Google wants is a reasonably accurate way to identify which pages are changing more often than average (and/or which pages change at set times, like "every monday"), so it can visit them more often than average.
The idea isn't complex; the implementation is. If every server in the world actually issued proper "Last-Modified" headers, the solution would be trivial, if somewhat resource-intensive: Keep track of modification dates during monthly spiderings, figure out which pages are modified every single month, then start visiting those twice a month. If any of those pages appears to be updated twice a month, visit them three times a month. And so on, until the number of spiderings per month approximately equals the number of updates. Throw in a heuristic to spot patterns like "every Monday" or "every weekday", and you've got the magic engine everybody wants.
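The scheme described above (speed up when a page keeps changing, back off when it doesn't) could be sketched roughly like this. The class name, intervals, and halving/doubling policy are my own invention for illustration, not anything Google has published:

```python
from datetime import timedelta

class CrawlScheduler:
    """Toy recrawl scheduler: visit a page more often the more often
    its Last-Modified date turns out to have changed between visits."""

    def __init__(self, base_interval=timedelta(days=30)):
        self.base_interval = base_interval
        self.intervals = {}      # url -> current revisit interval
        self.last_modified = {}  # url -> Last-Modified seen on last crawl

    def record_crawl(self, url, last_modified):
        """Call after each crawl; returns the next revisit interval."""
        prev = self.last_modified.get(url)
        interval = self.intervals.get(url, self.base_interval)
        if prev is not None and last_modified != prev:
            # Page changed since the last visit: halve the interval
            # (visit twice as often), down to an arbitrary floor.
            interval = max(interval / 2, timedelta(minutes=5))
        elif prev is not None:
            # No change: back off again, up to the base interval.
            interval = min(interval * 2, self.base_interval)
        self.intervals[url] = interval
        self.last_modified[url] = last_modified
        return interval
```

A real version would also need the "every Monday" pattern-spotting the post mentions; this sketch only does the speed-up/back-off half.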
Realistically, most sites that update often enough to get spidered "every few minutes" would be sites that are updating every few minutes (like CNN), and probably getting so much traffic that one more bot won't hurt them.
But, unfortunately, not every server in the world sends out useful headers, so the solution will be more complex. That's why Google needs people smarter than you and me.
How are they going to know when any given site will add a new page, and be able to index it within minutes? This would take continuous spidering of "almost all" sites. Maybe they meant pages submitted. Or maybe this is just advertising copy, and we should not read too much into it.
> Another area which requires much research is updates. We must have smart algorithms to decide what old web pages should be recrawled and what new ones should be crawled.
That was 1998. I'm surprised that there hasn't been more movement in this direction over the last four years.
Why do I have pages fetched every couple of days, even though they haven't changed this year?
Maybe GoogleGuy and his colleagues are working on this problem right now?
>What will SEO be like if Google ever updates in complete realtime?
It would be total SEO madness, constant freaking and tweaking. We'd all be sitting watching the update in one browser window while the searches shuffled and making changes to pages in the website control panels by the minute. Using FTP would be too slow.
I've just had a thought. They could use the Google Toolbar to help. We already know the Toolbar phones home, so think of this:
1. You have the Google Toolbar installed.
2. You visit a site.
3. The Toolbar sends the last-modified date (from the page's HTTP header) to Google.
4. If the page is newer than the one indexed then the page is updated in the index.
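The check in steps 3 and 4 could be sketched like this. The header fetch and the comparison function are my own guesses at how such a monitor might be structured, not how the Toolbar actually works:

```python
import urllib.request

def fetch_headers(url):
    """HEAD request: retrieve only the HTTP headers, not the page body."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return dict(resp.headers)

def page_changed(headers, indexed_last_modified):
    """Compare the page's Last-Modified header against the date stored
    in the index. Returns None when the server sends no Last-Modified
    at all -- the unhelpful-server case discussed earlier in the thread."""
    current = headers.get("Last-Modified")
    if current is None:
        return None
    return current != indexed_last_modified
```

For example, `page_changed(fetch_headers(url), date_in_index)` would tell you whether the page needs reindexing, provided the server plays along and sends the header.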
I could code this for them in under a day. I've already coded a similar application to monitor my sites.
Chris.
1. It lists the domain PR
2. It lists all the pages of my site (with PR)
3. It keeps a history of when Google last crawled the page
4. It monitors PR changes on the page and domain
5. It alerts me of pages which haven't been updated within a set time.
6ish. I'm testing another function whereby I can get the PR value from www2 and www3.
It's the only way I can manage my sites. However, although I have tried, I can't seem to get it working on any machine other than the one I've developed it on.
>>> Apply brother, if you can do it. Maybe you are the one they are looking for. The "Chosen One".
Three problems:
1. No spare time if I go back to a full time job (I like my freelance work)
2. I'm in the UK and don't want to emigrate
3. You should never meet your God ;).
Chris.
If the crawl/update can occur at a high frequency, that is very beneficial both to SEOs driving a site to number 1, and to Google's index becoming more relevant.
Though, like Marcia, I am more inclined to believe that this is a PR grab at the limelight: from my point of view, a media rebuke to Fast's 7-10 day reindexing announcement.
My only concern, if the situation becomes a reality, is that, assuming crawl capacity stays at its current level, extremely frequently updating sites will swallow up all the crawls, leaving less frequently updated sites to die a slow death.
This however leads to different problems. Namely, what if the page is unique for each visitor, e.g. it displays the visitor's IP on the page, or uses a cookie value to look up their name? That is the trick Google has to overcome.
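One conceivable way a crawler might cope with per-visitor content is to fetch the page twice and treat only the parts common to both fetches as the "real" page, so an embedded IP address or cookie-derived name doesn't register as a change. A toy sketch of that idea, not anything Google has described:

```python
def stable_content(fetch_a, fetch_b):
    """Given two fetches of the same page, return the lines that are
    identical across both, in order. Visitor-specific lines (IPs,
    personalised greetings) differ between fetches and are dropped."""
    lines_b = set(fetch_b.splitlines())
    return [line for line in fetch_a.splitlines() if line in lines_b]
```

The change-detection algorithms above would then hash or compare only this stable portion, at the cost of doubling the fetches for pages suspected of being dynamic.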