yep - looks like "panic" based spidering.
What do you mean by "panic" type spidering?
As if an index needed to be rebuilt from the ground up in a short time period (i.e., the old index didn't work).
Maybe it's part of the PR calculation for the next update. The crawl is done for two reasons: A) to check whether the page exists and B) to get all the links on that page. Because no index needs to be built up, the crawl is much faster than usual.
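To illustrate those two jobs - job A is just the HTTP status of the fetch, job B is link harvesting - here's a toy Python sketch (purely illustrative, nothing like Google's actual code; the names are made up):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(page_url, html):
    # Job B: harvest every link on the page, resolved to absolute URLs.
    # (Job A, the existence check, would just be the status code of the fetch.)
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(page_url, link) for link in parser.links]
```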
I dunno - but I've had 3 deep crawls in a week - I mean DEEP -
FYI, I too had deep crawls on Monday, Wednesday, and Thursday. With G downloading a Gig from my site, the spikes are easy to see on my traffic graphs ;-)
|Because no index needs to be built up, the crawl is much faster than usual. |
Indexing is not done in between crawling pages; it's done by other machines running in parallel that accept complete "data barrels" from the crawler.
I knew that. But there must be some degree of coupling between the crawler and the indexer. If the indexer, or in this case the PR-calculator, is fast, it will process data at a higher rate and the crawlers can deliver data at a higher rate.
I'm not saying that my hypothesis is necessarily correct, it's just that what you say does not prove me wrong.
|I'm not saying that my hypothesis is necessarily correct, it's just that what you say does not prove me wrong. |
I am positive that the crawler and indexer are independent systems (as in Google's original design) that, while providing each other with data, are run in parallel, as opposed to a page being crawled, then indexed, then moving on to the next page.
Fundamentally the indexer depends on data crawled by the crawler. On the other hand, the crawler depends on additional URLs that the indexer gets by analysing crawled data. I can see two distinct scenarios:
a) The indexer is not running (as you suggest). In this case the crawler would run out of URLs to do and stop at some point, and I see no reason to run it faster than normal if indexing is not performed - why indeed?
b) The indexer is running and adds new URLs for the crawler to do. In this case the crawler would have to work its proverbial arse off (like what seems to be happening in our case) in order to cope with the indexer's demands for data. With all that computing power they have, the bottleneck is bound to be in the "bandwidth" area, i.e. the crawler, so this scenario is in my view the most likely at the moment.
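The coupling described in scenario b) can be sketched as a toy producer/consumer simulation in Python (a hypothetical model, not Google's code): the crawler ships raw pages to the indexer and never indexes in-line, while the indexer feeds newly discovered URLs back to the frontier.

```python
import queue
import threading

# Toy "web": URL -> (page text, links on that page). Made up for illustration.
WEB = {
    "/a": ("alpha", ["/b", "/c"]),
    "/b": ("beta", ["/c"]),
    "/c": ("gamma", []),
}

frontier = queue.Queue()  # URL server -> crawler
barrels = queue.Queue()   # crawler -> indexer ("data barrels")
index = {}                # what the indexer builds
seen = {"/a"}             # URLs already handed to the crawler

def crawler():
    # Fetch pages and ship them off; note it never indexes in-line.
    while True:
        try:
            url = frontier.get(timeout=0.2)
        except queue.Empty:
            return
        barrels.put((url, WEB[url]))

def indexer():
    # Index pages and feed newly discovered URLs back to the frontier.
    while True:
        try:
            url, (text, links) = barrels.get(timeout=0.4)
        except queue.Empty:
            return
        index[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.put(link)

frontier.put("/a")
threads = [threading.Thread(target=crawler), threading.Thread(target=indexer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# index now holds all three pages even though only "/a" was seeded
```

Both threads run at once; neither waits for the other to finish a page before starting its own work.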
I therefore conclude it is far more likely that the index had to be rebuilt quickly, so they pulled out all the stops and started crawling like mad with the indexer running at full power too - contrary to your original suggestion and in line with what Brett is suggesting.
Perhaps someone with (or without) better knowledge of search engine design would like to correct me ;)
>On the other hand the crawler depends on additional URLs that indexer gets by analysing crawled data.
If that were true, there would not be much left for the crawler to do, would there? No, the crawler filters new URLs out on its own. Also, it is fed existing URLs from the database, not from the indexer.
>and with all that computing power they have the bottleneck is bound to be in "bandwidth" area
Word-based indexers are computationally expensive. That's what G needs the computing power for. You think they buy all these machines and then force them to idle by putting stops into the code? No way! The indexer is the bottleneck, not the crawlers.
|No, the crawler filters new URL's out on its own. Also it is fed existing URLs from the database and not the indexer. |
Only amateur crawlers extract URLs from a crawled page and continue crawling those directly - you need to ensure that you don't crawl the same URLs all over again, otherwise crawlers running in parallel on many machines might totally kill a website because they happen to crawl the same pages. It is the indexer's job to analyse pages and produce nice work units with clean URLs for the crawlers, so that you don't waste bandwidth getting the same stuff over and over again (and pissing off webmasters).
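The "clean URLs" point boils down to canonicalising each URL before deduplicating, so trivially different spellings of the same address are only crawled once. A minimal Python sketch (my own illustration; real systems normalise far more aggressively):

```python
from urllib.parse import urlsplit, urlunsplit

def canonical(url):
    """Normalize a URL so trivially different spellings dedupe to one entry."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    # Drop default ports; fragments are dropped below; empty path becomes "/".
    if (scheme, host.rsplit(":", 1)[-1]) in (("http", "80"), ("https", "443")):
        host = host.rsplit(":", 1)[0]
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, parts.query, ""))

seen = set()

def should_crawl(url):
    # True only the first time a canonical form of this URL is offered.
    c = canonical(url)
    if c in seen:
        return False
    seen.add(c)
    return True
```

With this in place, `http://example.com/page` and `HTTP://Example.com:80/page#top` count as one fetch, not two.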
|Word-based indexers are computationally expensive. That's what the G needs the computing power for. You think they buy all this machines and then force them to idle by putting stops into the code? |
They do have lots of machines, primarily not for the CPUs but for having lots of RAM to keep the whole index in memory. They therefore have a LOT of CPU power - 100k machines by some estimates. If you spread all 4 bln URLs onto these, there will be almost nothing for these boxes to do: 40k pages each is nothing, and a good indexer does these in a matter of minutes.
|The indexer is the bottleneck, not the crawlers. |
I stand by the claim that the crawler is the main bottleneck - not least because websites have a natural limit to how fast they serve pages. You might have a 100Mbit connection, but if a site gives you data at a rate of 10kbit then you will have to wait until you get the data - hence the bottleneck.
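A back-of-the-envelope check on that (the 30 KB page size here is my own assumption):

```python
def fetch_seconds(page_kb, line_kbit_per_s):
    # Seconds to pull one page at a given effective line rate (KB -> kbit via *8).
    return page_kb * 8 / line_kbit_per_s

# A hypothetical 30 KB page from a site that only serves at 10 kbit/s:
slow = fetch_seconds(30, 10)       # 24.0 seconds for a single page
# The same page at the crawler's full 100 Mbit/s line rate:
fast = fetch_seconds(30, 100_000)  # 0.0024 seconds
```

Ten thousand times slower than the line allows - the site, not the crawler's pipe, sets the pace for that connection.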
Everything from my point of view supports the original answer by Brett that Google is for some reason rebuilding index ASAP.
>Only amateur crawlers extract URLs from crawled page to continue crawling these
That's not what I said. I said that the crawlers extract the URLs themselves. Obviously, there needs to be a central queue of URLs that all crawlers feed into and draw from, so that two crawlers do not crawl the same page at the same time.
>They therefore have a LOT of CPU power, 100k machines some estimate
That's not what I heard. It's rather 10k machines.
>lots of RAM to keep all index in memory
>but if site give you data at rate of 10kbit then you will have to wait till you get data - hence bottleneck.
Hence no bottleneck, because one crawler crawls multiple URLs in parallel, or one crawler machine runs multiple crawler processes in parallel. When crawling, my FTP search engine has 100 connections to different sites open at all times on a 2Mbit line, and I can't index data as fast as it comes in.
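The "many connections in parallel" trick is easy to demonstrate: 50 sites that each take 0.1 s to respond still finish in roughly 0.1 s total if the connections are opened concurrently. A modern Python sketch (the delays are simulated sleeps, not real sockets):

```python
import asyncio
import time

async def fetch(url, delay=0.1):
    # Stand-in for a slow site: the await is where a real socket read would block.
    await asyncio.sleep(delay)
    return url

async def crawl(urls, max_connections=100):
    # Cap simultaneous "connections", like 100 open sockets on one line.
    sem = asyncio.Semaphore(max_connections)

    async def bounded(url):
        async with sem:
            return await fetch(url)

    return await asyncio.gather(*(bounded(u) for u in urls))

start = time.monotonic()
pages = asyncio.run(crawl([f"/page{i}" for i in range(50)]))
elapsed = time.monotonic() - start  # ~0.1 s wall clock, not 50 * 0.1 s
```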
|That's not what I said. I said that the crawlers extract the URLs themselfes. |
This is not how Google works. Here is a diagram of their original design, which I doubt has changed that much, as it still makes sense now:
Notice that the Crawler is fed by the URL server, and crawled pages have to go through the Indexer before new URLs are passed to the crawler. If you have a more up-to-date scheme of how Google works, then please share it (as long as it's legal, of course).
|That's not what I heard. It's rather 10k machines. |
That's outdated info - and even if it were 10k, 4 bln URLs would only require processing 400k URLs on each of the boxes. This is nothing in terms of CPU requirements.
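The per-box arithmetic, for both machine-count estimates floating around in this thread:

```python
URLS = 4_000_000_000  # the "4 bln URLs" figure from this thread

# Per-box share of the URL space under the two machine-count estimates:
per_box_100k = URLS // 100_000  # 40_000 pages per machine
per_box_10k = URLS // 10_000    # 400_000 pages per machine
```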
|>lots of RAM to keep all index in memory |
It's not speculation, it's fact - their original (post-Stanford BackRub) design feature was to keep the whole index in memory, which is why they went for 2GB of RAM in each of the boxes. I suggest you check the references on the matter, and if you find something that says otherwise then please share it with me.
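For what it's worth, the memory arithmetic under those figures (taking the lower 10k-machine estimate mentioned earlier; the numbers are rough):

```python
BOXES = 10_000        # the lower machine-count estimate from this thread
RAM_PER_BOX_GB = 2    # "2GB of RAM in each of the boxes"
URLS = 4_000_000_000

total_ram_gb = BOXES * RAM_PER_BOX_GB        # 20_000 GB, i.e. ~20 TB of RAM
bytes_per_url = total_ram_gb * 2**30 / URLS  # ~5.4 KB of index budget per URL
```

A few KB of in-memory index per URL is at least plausible for a word index, so the keep-it-all-in-RAM claim isn't absurd on these numbers.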
|I can't index data as fast as it comes in. |
Well, the reasons why you can't index data as fast as you crawl it are beyond the scope of this conversation. I'd speculate that your indexers are just too slow, hence for you the bottleneck is not bandwidth - unlike in Google's situation ;)
Has the Googlebot IP changed too?
I am getting two new IPs crawling me. A query just says it's proxy.google.com, and the other has no DNS, but doing a whois I get Googlebot.
The IPs are
and more starting with 216.239.
Before, it used to crawl me from the IP 64.68. Why has the IP changed? Any ideas? Anyone?
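One way to check whether a new IP really belongs to Googlebot is the reverse-then-forward DNS test: the PTR record should end in a Google domain, and that hostname should resolve back to the same IP. A rough Python sketch (the exact domain suffixes are my assumption):

```python
import socket

def has_google_suffix(host):
    # Suffixes Google is known to crawl from; treat this list as an assumption.
    return host.endswith((".googlebot.com", ".google.com"))

def is_real_googlebot(ip):
    """Reverse-DNS the IP, check the name, then forward-confirm it.

    A spoofed User-Agent can't pass this: the PTR record must end in a
    Google domain AND that hostname must resolve back to the same IP.
    """
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not has_google_suffix(host):
            return False
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

An IP with no PTR record at all fails the check, which matches what whois-only identification can't tell you.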
I also saw hard crawling early Thursday - hundreds of pages over 30 minutes (2 sessions of 15 minutes). Very unusual pattern for my site and googlebot.
My site has nearly 200,000 pages, yet only 4,109 pages got crawled yesterday... Is it partial crawling? Do you think Googlebot will return to crawl the rest of the pages? My site has PR 6. Does anybody have similar experiences, that is, partial crawling?
I have been getting nothing but partial crawls for the past month. Might be some other problems I have been having though.
As for the IPs of Googlebot, I too noticed the change. I have been getting hit by 66.249.65.x
Do you have any idea why the IPs have changed?
Not that I know of. It's googlebot that has changed I am sure. Nothing I did.
Oh No! Maybe a crawl from this IP is an indication I am going to be dropped!..... Just Kidding :-)
OK, so then for my elementary understanding of this situation, what impact does that have on the SERPs?
IPs that crawled my CLEAN WEBSITE with no links to affiliate programs, no external links:
IPs that crawled my product website:
Same block of IPs on both categories.
So I believe it has nothing to do with removing our pages. Maybe Google is trying out different things.
I expect a lot of cloaking and redirect sites will be dropped soon as a result of these new bot IPs and this crawl. That's what I had in mind in the post about hijacks when I said I think Google is on it. They have been asking for file paths and filenames with extensions I have never used before. I am hopeful, anyway.
|I expect alot of cloaking and redirect sites will be dropped soon from these new bot ip's and this crawl. It's what I had in mind in the post about hijacks when I said I think Google is on it. They have been asking for file paths and filenames with extensions I have never used before. I am hopeful anyway. |
Man Oh Man I Hope So. As you know, I have been just a little involved in that thread.
I saw some heavy spidering earlier this week on new pages. I wouldn't go so far as to say they are rebuilding an old index if they're going after new content. The gbot has significantly increased its activity over the past few days. A new site I released a month ago was hit 50k times within the past few days. I wish I knew why.
johnnyb, when you posted the two series of IP addresses for Googlebot, 66.249.64.X and 66.249.65.X, I looked through my logs and this is what I found.
The 66.249.64.X series was requesting pages that were fully indexed i.e., they have a page title and description.
The 66.249.65.X series was requesting pages that were only partially indexed, i.e., they did not have a title or description when listed by Google using site:
In my case, the 66.249.65.X were pages that exist on my server but I am trying to get Googlebot to stop indexing. I have no links pointing to those pages but Google knows about them and keeps on requesting them.
Not sure what all this means but that is what I am seeing regarding those two IP series.
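For anyone who wants to repeat that log check, bucketing requests by the first three octets is enough to separate the two series. A small Python sketch (the sample log lines are made up):

```python
from collections import Counter

def prefix24(ip):
    """The first three octets, e.g. '66.249.65.12' -> '66.249.65'."""
    return ip.rsplit(".", 1)[0]

def hits_by_prefix(log_lines):
    # Count requests per /24, given common-log-format lines (client IP first).
    return Counter(prefix24(line.split()[0]) for line in log_lines if line.strip())

sample = [
    '66.249.64.7 - - [23/Sep/2004:10:00:01] "GET /indexed.html HTTP/1.0" 200 512',
    '66.249.65.3 - - [23/Sep/2004:10:00:02] "GET /orphan.html HTTP/1.0" 200 128',
    '66.249.65.9 - - [23/Sep/2004:10:00:03] "GET /orphan2.html HTTP/1.0" 200 256',
]
counts = hits_by_prefix(sample)
```

Cross-referencing the URLs in each bucket against a site: listing would then show which series fetches the fully indexed pages and which fetches the title-less ones.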
I see google as having acquired the following blocks...
66.249.64.x - 66.249.79.x
Shows a registration date of 03/05/04
That's quite a hunk of IPs.
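That range works out to a /20 - sixteen consecutive /24s, 4,096 addresses - which Python's ipaddress module confirms:

```python
import ipaddress

# 66.249.64.x through 66.249.79.x: sixteen consecutive /24s = one /20.
block = ipaddress.ip_network("66.249.64.0/20")
count = block.num_addresses        # 4096 addresses
first, last = block[0], block[-1]  # 66.249.64.0 .. 66.249.79.255
```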
uhm... a new datacenter perhaps? pure speculation of course...
And my robot detector had to be set up to recognise these bots. I am almost ready for its once-every-13-days full scan (I don't know why G chose that number for me).
I wonder if it will be affected, as it's already the 14th day.
Just took time to analyse the stats.
What I found was that once all the pages were crawled by the 66.249.65 bot, a few pages were immediately crawled by the 216.239.37 bot, and those few pages started appearing in the SERPs within hours (six to ten).
What's going on?
[edited by: boredguru at 7:43 pm (utc) on Sep. 24, 2004]