| 4:35 pm on May 22, 2003 (gmt 0)|
This past month has been very different, so freshie and deepie may be used differently than in the past. Freshbot has always gone deep into my site though - after the deep crawl was over.
| 4:40 pm on May 22, 2003 (gmt 0)|
One of my brand new sites is getting hit hard by freshbot so, maybe.....
| 4:45 pm on May 22, 2003 (gmt 0)|
Freshbot has been as thorough and deep as deep bot at my sites for several months now.
| 4:47 pm on May 22, 2003 (gmt 0)|
I'm a bit confused as to what you mean as Freshbot has always "acted" like deepbot on my pages - crawling them all?
Is it because I have only a few dozen pages?
| 4:51 pm on May 22, 2003 (gmt 0)|
So far fresh only got my home page...but that was only around 45 minutes ago, so it may be back...I'll let you know.
For the record, my site has only been online since February. This is the first time I've ever been visited by fresh, even though deep's visited me for two months now.
| 4:59 pm on May 22, 2003 (gmt 0)|
Well, I launched three new web sites about 10 days ago, and about 24 pages from each of them now show on www and the other google indexes. I thought that was only possible with a deep crawl, but I don't know that much.
| 5:04 pm on May 22, 2003 (gmt 0)|
Freshbot came today on a brand new site (online since last week) and grabbed every single page :)
| 5:14 pm on May 22, 2003 (gmt 0)|
While there are certainly some big changes underway, I think this might be how google chooses to update the current index.
If you think about it, freshbot is designed to insert changes into the current index, whereas deepbot builds a changeset that gets merged all at once.
In the past few months I have seen many examples of fresh pages that seem to be "sticky", staying in the index, even after their fresh date disappears.
It might make sense for them to set up freshbot to crawl deep and set all the pages to sticky. Then shut it off for a few days while backlinks and PR are calculated.
I suspect that this method will take a little longer than doing a normal deep update cycle, but it will bring back in the missing sites and pages quicker. An additional disadvantage is that freshbot will be busy crawling deep instead of keeping the normal fresh pages fresh.
They might be writing off ever doing deep crawls again, but it is too early to make that call, while everything is still in flux.
| 5:19 pm on May 22, 2003 (gmt 0)|
We found new sites were sticking last month if linked to by high pr sites. I suspect this will become common for all new sites from now on.
| 5:21 pm on May 22, 2003 (gmt 0)|
freshie has always added new pages to the index, but these had a date formatted in green (hence the name fresh) and did not necessarily "stick" until after deepbot found them. I haven't been watching closely, but I haven't noticed those green fresh dates recently on the sites I normally watch.
I think this last month has changed a lot of "rules" and we will just have to wait awhile to find out the new ones, that is if the new ones are ever discernible. My gut says that it will be more difficult now and, hopefully, harder for the unethical to skew the index.
| 5:25 pm on May 22, 2003 (gmt 0)|
I don't know about not keeping things fresh BigDave. Why couldn't fresh come by and get all the pages for a site every two/three days?
| 5:43 pm on May 22, 2003 (gmt 0)|
Fresh could cover the entire web every couple of days, but they would need a lot more machines to handle that. I have never had fresh go deeper than PR4 for established pages.
If fresh is only picking up PR4+ pages, this is a very small minority of pages.
Right now deepbot does the portion of the web that it covers in about a week. That just isn't that easy to compress into every 3 days. Google also has a stated goal of increasing the number of pages in their index this year from 3G to 10G. That is a major increase in required processing power to fresh all those pages when a relatively small percentage of them change that often.
| 5:44 pm on May 22, 2003 (gmt 0)|
From my site (~50 pages) freshbot hasn't acted like deepbot at all.
It came two days ago and grabbed a few pages, came last night and grabbed a few more. Typically, deepbot comes and grabs them all within a day.
| 5:52 pm on May 22, 2003 (gmt 0)|
With three billion pages to crawl every three days, that comes to approximately 11,500 pages per second, with bandwidth usage around 113 MB/s (approx 1 Gb/s), assuming 10K of text per page (probably a little high) --well within reach of a distributed solution. *My* server in the basement can serve up over 3,000 pages *per second*.
Given 10,000 machines for storage and distribution, you're only hitting each machine about once per second, on average, to crawl the entire web.
This is, of course, simplified; but crawling the entire web in three days is entirely doable from a bandwidth/retrieval standpoint with a distributed solution.
If they're gonna do 10 billion pages then multiply the above by about 3.3 :)
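The arithmetic above can be checked with a quick back-of-the-envelope script (the 10 KB/page figure is the poster's assumption):

```python
SECONDS_PER_DAY = 86_400

def crawl_rate(pages, days, kb_per_page=10):
    """Return (pages/sec, MB/sec) needed to crawl `pages` pages in `days` days."""
    pps = pages / (days * SECONDS_PER_DAY)
    mbps = pps * kb_per_page / 1024  # KB/s -> MB/s
    return pps, mbps

# 3 billion pages every 3 days:
pps, mbps = crawl_rate(3_000_000_000, 3)
print(f"{pps:,.0f} pages/sec, {mbps:,.0f} MB/s")  # -> 11,574 pages/sec, 113 MB/s
```

At 10,000 machines that works out to roughly one page per machine per second, which is the point being made.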
| 5:57 pm on May 22, 2003 (gmt 0)|
freshbot is acting a little like deepbot for me (PR5 site) - I have watched him follow new links up to 3 deep. I just opened a new section of my site and freshbot has already grabbed 1500 new pages within 24 hours of launch. But it seems to be a little picky and not grabbing it all... Deepbot would normally follow every link and grab every page on my site.
It would be nice to see freshbot take the place of deepbot, continuously crawling the net for pages and updating as it goes, but it would require more resources on google's side to do a complete job... Here's to hoping :)
| 6:11 pm on May 22, 2003 (gmt 0)|
You are absolutely right on the bandwidth/retrieval part, but as you yourself admit it is an over-simplification. Now go install a copy of htdig (an open source search engine for websites) on your computer. Find a nice little 10,000 page site to run it on. See how long it takes. Compare those numbers to running wget on that same site.
Building the index is not just fetching pages, it is processing those pages. Fitting those pages into an already active index is even tougher than rebuilding the index.
I'm not saying that it is impossible. I am just saying that it is not as easy as everyone who doesn't actually have to implement it seems to think.
Just because it looks like google might be changing tactics right now doesn't mean anything. Right now their top priority is getting caught up, and they will be doing that using whatever means they consider best. It is best not to assume that anything happening right now will have anything to do with how Google will be operating in July. Freshbot going deep is just an interesting datapoint.
| 6:18 pm on May 22, 2003 (gmt 0)|
Point 1/ On a 10,000 page site my crawler would retrieve 1 page every 2-3 seconds or so, finishing up in under a day. While I was waiting out the 2-3 seconds I'd pull in pages from a couple hundred (or thousand) other, different, domains. A modern workstation could easily keep its (FIFO) crawl schedule for 5K to 10K domains in memory.
Point 2/ For the index I'd make it a distributed index across the backend servers, treating each server as a hash bucket for a machine-synchronized hash function on my crawlers--I'd have one set of backend servers for the forward index, and another for the backward index. By implementing the same hash function in my search machines I can query the appropriate server across the (very fast switched) network to get the information I need.
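Point 2 hinges on the crawlers and search machines agreeing on the same hash function, so both route a given term to the same backend. A minimal sketch of that routing (the backend names are hypothetical, and real systems would use something more robust than a plain modulo):

```python
import hashlib

# Hypothetical backend index servers (one set like this for the forward
# index, another for the backward index, per the post above).
BACKENDS = [f"index-{i:02d}" for i in range(16)]

def bucket_for(term: str, servers=BACKENDS) -> str:
    """Machine-synchronized hash: crawler and query front-end both run this,
    so postings written by the crawler land on the server the searcher asks."""
    h = int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16)
    return servers[h % len(servers)]

# The crawler stores postings for "widget" on the same server
# the query front-end will later contact:
assert bucket_for("widget") == bucket_for("widget")
```

The weakness of plain modulo hashing is that adding a backend remaps nearly every term; consistent hashing fixes that, but the synchronization idea is the same.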
| 6:41 pm on May 22, 2003 (gmt 0)|
I just looked at one of my logs and freshie 64.68 spidered my site every day this month. The weird part is that it was looking for pages that are on one of my other sites from December. None of those pages exist anywhere on any of my sites. Almost all the spider activity from 64.68 was a 404 object not found error. The pages are the pages that I get when I go to google and do site:domain.com -asdf for my other site. I must have had those sites pointed to each other at some point. The site that is being spidered is a gray bar. I think that Google is stuck in the way back machine right now.
| 6:44 pm on May 22, 2003 (gmt 0)|
Sorry to drag you guys back on topic (BG)... but quick update.
Freshie is still in there and I can say that it is *deep*-crawling.
At least it is *deeper*-crawling than the old freshie we all know and love. It's freshie on steroids if you like.
It's not just going for high PR pages - I can't be specific I'm afraid, this is a new site and currently is PR0, although based on inbounds I would hazard a guess it's about a PR5 or so under the old system (remember the old days?! Like maybe March!). But it's digging up some internals.
It's followed, as far as I can tell at the moment (I will try and go through the logs later) pretty much every link and even been completely through our forum. Deepbot has done that before, never seen freshie in there unless there was a link from the front page to a particular thread.
Very odd but I'm quite enjoying a frolic in the long grass with freshie.
| 7:51 pm on May 22, 2003 (gmt 0)|
What you are describing is what I have seen in the past. PR only seems to make a difference with which "old" pages get hit by freshbot. Once freshbot finds *new* content it will follow those links for a while from the point at which it found them.
If I have
PR5 -> PR4 -> PR3 -> new -> new
freshbot never seems to find it.
On the other hand
PR5 -> new -> new -> new -> new
Could very easily get all the new pages listed. It doesn't always work that way, sometimes it will only go one or two pages from that PR5.
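The crawl behavior described above (skip established pages below a PR cutoff, but keep following links once *new* content is found) can be sketched as a toy simulation. Everything here is hypothetical: the PR cutoff of 4, the link graph, and the page names are all made up to mirror the two examples in the post.

```python
def fresh_crawl(links, pr, known, pr_cutoff=4, start="home"):
    """links: page -> list of linked pages; pr: PageRank of *established* pages;
    known: pages already in the index. New pages (not in `known`) are always
    followed; established pages below the cutoff stop the crawl."""
    seen, stack = set(), [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        if page in known and pr.get(page, 0) < pr_cutoff:
            continue  # established low-PR page: freshbot skips it
        seen.add(page)
        stack.extend(links.get(page, []))
    return seen

# PR5 -> new1 -> new2 is reached; PR5 -> PR3 -> new3 is not,
# because the PR3 page blocks the path to new3:
links = {"home": ["a", "new1"], "a": ["new3"], "new1": ["new2"]}
pr = {"home": 5, "a": 3}
print(sorted(fresh_crawl(links, pr, known={"home", "a"})))
# -> ['home', 'new1', 'new2']
```

This matches both observations: new pages chained off a PR5 get listed, while new pages buried behind a low-PR established page never get found.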
| 8:56 pm on May 22, 2003 (gmt 0)|
Ok, Freshbot is in my site slurping things up, but get this--it's going over pages that were deep crawled in April, not pages it could have found by following links from what it's gotten now.
Strange. It's almost as if fresh is going in and verifying pages that deep previously got.
| 9:04 pm on May 22, 2003 (gmt 0)|
fresh was never deeper
| 9:25 pm on May 22, 2003 (gmt 0)|
Thanks for the info.
What you've described is quite possibly what I'm seeing.
It is likely that if any of these pages have a "theoretical PR" (at the moment it's PR0 according to the toolbar) then it would be the index page.
And inevitably, everything is linked off the index page via one route or another.
Whatever it is I'm not complaining about it. It is not something I've experienced before, but then that's not to say it hasn't always been like this.....
| 9:41 pm on May 22, 2003 (gmt 0)|
All the reviews on my site log visits by the search engine bots. For all the other pages I just use the apache logs when I really need them. It is the reviews that I really care about.
Freshbot rarely gets to the reviews, as they are PR2-PR3, unless they're deep-linked from elsewhere. I do have a recent reviews page that freshbot does use to find the new reviews, though I put it up due to user request (Googlebot liking it was just a bonus).
So far Freshie has not hit any of my reviews that I would not expect it to. It is going after the deep linked pages, and those linked from the recent reviews page. None of the others yet.
I would have to look at the apache logs to see if it is going deeper than normal, and I really don't want to do that right now. If it starts hitting the review pages real heavy, then I will start to believe that there is something else going on.
| 9:41 pm on May 22, 2003 (gmt 0)|
|Ok, Freshbot is in my site slurping things up, but get this--it's going over pages that were deep crawled in April, not pages that could be followed by what it's gotten now. |
I second this observation.
Fresh just arrived here going straight to a page which Google can only know about based on pages retrieved during the April deepbot visit.
[edited by: Gorilla at 9:45 pm (utc) on May 22, 2003]
| 9:43 pm on May 22, 2003 (gmt 0)|
I think fresh is verifying/re-crawling last month's deep crawl...hmmm...need a beer...
| 10:13 pm on May 22, 2003 (gmt 0)|
I don't have any good evidence to back up my feelings, but the crawling patterns I am seeing on my sites certainly looks more like typical Deepbot behaviour.
Maybe it isn't significant, but it is certainly interesting.
| 10:13 pm on May 22, 2003 (gmt 0)|
|I think fresh is verifying/re-crawling last month's deep crawl |
No I don't think so. It's not related - it's doing other stuff and its usual rounds at the same time.
| 10:18 pm on May 22, 2003 (gmt 0)|
Respectfully, that's impossible :)
Freshbot is crawling pages that are new and were *only* crawled during April's deep crawl. Fresh could not have known about the pages it's currently crawling any other way because a) Fresh has never visited these pages before, and b) The pages fresh is currently crawling cannot be linked to from the pages it has already crawled.
Fresh is most definitely re-crawling pages from April's deep crawl, in my case anyway.
PS - She's picking up steam--up to a page every two seconds. With the load she'll have to do I should see 7-10 pages/second pretty soon. (wheee)