I'm being hit very hard by google's freshbot at the moment, and it's going deep too. Given what is currently going on with the little guys, I had to check and double-check that the IPs were 64.... (they are).
Its behaviour, in terms of hard hitting and depth of crawl (it's going through the entire site), is more like the character of the old deepbot.
In fact, it's identical behaviour to deepbot the last time it crawled this site back in April.
I'm interested in hearing from others who are seeing the same.
It would be nice to see freshbot take the place of deepbot and continuously crawl the net for pages, updating as it goes, but it would require more resources on google's side to do a complete job... Here's to hoping :)
You are absolutely right on the bandwidth/retrieval part, but, as you yourself admit, it is an over-simplification. Now go install a copy of htdig (an open source search engine for websites) on your computer. Find a nice little 10,000 page site to run it on. See how long it takes. Compare those numbers to running wget on that same site.
Building the index is not just fetching pages, it is processing those pages. Fitting those pages into an already active index is even tougher than rebuilding the index.
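To make the "processing, not just fetching" point concrete, here's a toy sketch (my own illustration, nothing to do with Google's or htdig's actual index format): adding one page to a live inverted index means tokenizing it and splicing its document id into the posting list of every term on the page. That per-term bookkeeping is work wget never does.

```python
import bisect
import collections

# Toy inverted index: term -> sorted list of doc ids.
# All names here are made up for illustration.
index = collections.defaultdict(list)

def add_page(doc_id, text):
    """Fold one newly fetched page into the live index."""
    for term in set(text.lower().split()):
        # keep each posting list sorted so queries can merge lists cheaply
        bisect.insort(index[term], doc_id)

add_page(1, "fresh crawl of the site")
add_page(2, "deep crawl in april")
print(index["crawl"])   # [1, 2] - both docs contain "crawl"
```

Even this toy version touches every term's posting list per page; doing that against an index that is simultaneously serving queries is the hard part the post is pointing at.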
I'm not saying that it is impossible. I am just saying that it is not as easy as everyone that does not have to actually implement it seems to think that it is.
Just because it looks like google might be changing tactics right now, it means nothing. Right now their top priority is getting caught up, and they will be doing that using whatever means that they consider best. It is best not to assume that anything that is happening right now will have anything to do with how Google will be operating in July. Freshbot going deep is just an interesting datapoint.
Point 1/ On a 10,000 page site my crawler would retrieve one page every 2-3 seconds or so, finishing in under a day. While I was waiting out those 2-3 seconds I'd pull in pages from a couple hundred (or thousand) other, different, domains. A modern workstation could easily keep its (FIFO) schedule for crawling 5K to 10K domains in memory.
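The scheduling idea in Point 1 can be sketched like this (a minimal illustration, with made-up names and an assumed 2.5 second politeness delay): one FIFO queue per domain plus a per-domain "next allowed fetch" timestamp, so while one domain cools down the crawler hands out URLs from other domains.

```python
import collections
import time

CRAWL_DELAY = 2.5  # assumed seconds between hits to the same domain

class PoliteScheduler:
    def __init__(self):
        self.queues = collections.defaultdict(collections.deque)  # domain -> FIFO of URLs
        self.next_ok = {}                                         # domain -> earliest fetch time

    def add(self, domain, url):
        self.queues[domain].append(url)
        self.next_ok.setdefault(domain, 0.0)

    def next_url(self, now=None):
        """Return (domain, url) for some domain whose delay has elapsed, else None."""
        now = time.monotonic() if now is None else now
        for domain, queue in self.queues.items():
            if queue and now >= self.next_ok[domain]:
                self.next_ok[domain] = now + CRAWL_DELAY  # start this domain's cooldown
                return domain, queue.popleft()
        return None

sched = PoliteScheduler()
sched.add("example.com", "http://example.com/a")
sched.add("example.com", "http://example.com/b")
sched.add("example.org", "http://example.org/x")
print(sched.next_url(now=0.0))  # ('example.com', 'http://example.com/a')
print(sched.next_url(now=0.0))  # example.com is cooling down, so example.org gets served
print(sched.next_url(now=0.0))  # nothing eligible yet -> None
print(sched.next_url(now=3.0))  # example.com is ready again
```

The point of the sketch: per-domain state is tiny (a queue and one timestamp), which is why thousands of domains fit comfortably in one workstation's memory.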
Point 2/ For the index I'd make it a distributed index across the backend servers, treating each server as a hash bucket for a machine-synchronized hash function on my crawlers--I'd have one set of backend servers for the forward index, and another for the backward index. By implementing the same hash function in my search machines I can query the appropriate server across the (very fast switched) network to get the information I need.
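The routing idea in Point 2 boils down to the crawler and the search machines sharing one stable hash function, so either side can compute which backend holds a given term without any lookup table. A minimal sketch (server names are hypothetical):

```python
import hashlib

BACKWARD_INDEX_SERVERS = ["idx-00", "idx-01", "idx-02", "idx-03"]  # made-up hosts

def server_for(term, servers=BACKWARD_INDEX_SERVERS):
    """Map a term to a backend server with a stable hash.

    The same code runs on the crawler (to store postings) and on the
    search machines (to query them), so they always agree on the bucket.
    """
    digest = hashlib.md5(term.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[bucket]

# The crawler ships postings for "apple" to server_for("apple");
# a search machine later queries that very same server for "apple".
assert server_for("apple") == server_for("apple")  # deterministic routing
```

A forward index (doc id -> terms) would get its own server pool hashed by doc id; the backward index (term -> doc ids) is hashed by term, as above.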
Freshie is still in there and I can say that it is *deep*-crawling.
At least it is *deeper*-crawling than the old freshie we all know and love. It's freshie on steroids if you like.
It's not just going for high-PR pages - I can't be specific, I'm afraid; this is a new site and currently it's PR0, although based on inbounds I'd hazard a guess it's about a PR5 or so under the old system (remember the old days?! Like maybe March!). But it's digging up some internals.
It's followed, as far as I can tell at the moment (I will try to go through the logs later), pretty much every link, and it's even been completely through our forum. Deepbot has done that before; I've never seen freshie in there unless there was a link from the front page to a particular thread.
Very odd, but I'm quite enjoying a frolic in the long grass with freshie.
What you are describing is what I have seen in the past. PR only seems to make a difference with which "old" pages get hit by freshbot. Once freshbot finds *new* content it will follow those links for a while from the point at which it found them.
If I have
PR5 -> PR4 -> PR3 -> new -> new
freshbot never seems to find it.
On the other hand
PR5 -> new -> new -> new -> new
Could very easily get all the new pages listed. It doesn't always work that way, sometimes it will only go one or two pages from that PR5.
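The pattern described above can be written down as a toy model (this is pure speculation about crawler policy, not anything Google has documented): freshbot revisits only known pages above some PR threshold, but once it lands on a page that is *new* to it, it keeps following that page's links onward.

```python
def fresh_crawl(pages, pr_threshold=4):
    """Toy freshbot model. pages: url -> {"pr": int or None (None = new), "links": [urls]}."""
    crawled = set()
    # seed with known pages whose PR clears the (assumed) threshold
    frontier = [u for u, p in pages.items()
                if p["pr"] is not None and p["pr"] >= pr_threshold]
    while frontier:
        url = frontier.pop()
        if url in crawled:
            continue
        crawled.add(url)
        for nxt in pages[url]["links"]:
            # only keep following onward into pages that are new to the index
            if pages[nxt]["pr"] is None:
                frontier.append(nxt)
    return crawled

# PR5 -> new -> new gets everything; PR5 -> PR3 -> new misses the new page,
# because the crawler never revisits the low-PR hop in between.
site = {
    "pr5":  {"pr": 5,    "links": ["new1", "pr3"]},
    "new1": {"pr": None, "links": ["new2"]},
    "new2": {"pr": None, "links": []},
    "pr3":  {"pr": 3,    "links": ["new3"]},
    "new3": {"pr": None, "links": []},
}
print(sorted(fresh_crawl(site)))  # ['new1', 'new2', 'pr5']
```

The model reproduces both cases above: the chain of new pages hanging off the PR5 gets swept up, while the new page behind the PR3 stays invisible.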
Strange. It's almost as if fresh is going in and verifying pages that deep previously got.
Thanks for the info.
What you've described is quite possibly what I'm seeing.
It is likely that if any of these pages have a "theoretical PR" (at the moment it's PR0 according to the toolbar) then it would be the index page.
And inevitably, everything is linked off the index page via one route or another.
Whatever it is I'm not complaining about it. It is not something I've experienced before, but then that's not to say it hasn't always been like this.....
Freshbot rarely gets to the reviews, as they are PR2-PR3 unless deep-linked from elsewhere. I do have a recent-reviews page that freshbot does use to find the new reviews, though I put it up due to user request (Googlebot liking it was just a bonus).
So far Freshie has not hit any of my reviews that I would not expect it to. It is going after the deep linked pages, and those linked from the recent reviews page. None of the others yet.
I would have to look at the apache logs to see if it is going deeper than normal, and I really don't want to do that right now. If it starts hitting the review pages real heavy, then I will start to believe that there is something else going on.
Ok, Freshbot is in my site slurping things up, but get this--it's going over pages that were deep crawled in April, not pages it could have reached by following links from what it's fetched so far.
I second this observation.
Fresh just arrived here going straight to a page which Google can only know about based on pages retrieved during the April deepbot visit.
[edited by: Gorilla at 9:45 pm (utc) on May 22, 2003]
Respectfully, that's impossible :)
Freshbot is crawling pages that are new and were *only* crawled during April's deep crawl. Fresh could not have known about the pages it's currently crawling any other way because a) Fresh has never visited these pages before, and b) The pages fresh is currently crawling cannot be linked to from the pages it has already crawled.
Fresh is most definitely re-crawling pages from April's deep crawl, in my case anyway.
PS - She's pickin' up steam--up to a page every two seconds. With the load she'll have to handle I should see 7-10 pages/second pretty soon. (wheee)