Forum Moderators: open
I knew that. But there must be some degree of coupling between the crawler and the indexer. If the indexer, or in this case the PR-calculator, is fast, it will process data at a higher rate and the crawlers can deliver data at a higher rate.
I'm not saying that my hypothesis is necessarily correct, it's just that what you say does not prove me wrong.
I am positive that the crawler and indexer are independent systems (as in Google's original design) that, while providing each other with data, run in parallel, as opposed to a page being crawled, then indexed, before moving on to the next page.
Fundamentally, the indexer depends on data crawled by the crawler. On the other hand, the crawler depends on the additional URLs the indexer extracts by analysing crawled data. I can see two distinct scenarios:
a) The indexer is not running (as you suggest). In this case the crawler would eventually run out of URLs and stop, and I see no reason to run it faster than normal if no indexing is being performed.
b) The indexer is running and adds new URLs for the crawler to fetch. In this case the crawler has to work its proverbial arse off (which seems to be what is happening in our case) to cope with the indexer's demand for data. With all that computing power they have, the bottleneck is bound to be in the "bandwidth" area, i.e. the crawler, so in my view this scenario is the most likely at the moment.
I therefore conclude it is far more likely that the index had to be rebuilt quickly, so they pulled out all the stops and started crawling like mad with the indexer running at full power too, contrary to your original suggestion and in line with what Brett is suggesting.
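Scenario (b) can be sketched as a producer/consumer pipeline. This is purely illustrative (the queue size, delays, and page count are invented for the example, not anything Google actually uses): whichever stage is slower dictates overall throughput, because the faster stage just ends up blocked on the shared queue.

```python
import queue
import threading
import time

# Hypothetical sketch: a crawler stage feeds an indexer stage through a
# bounded queue. The slower stage sets the pipeline's throughput; the
# faster stage simply blocks waiting on the queue.
def run_pipeline(crawl_delay, index_delay, n_pages=20):
    q = queue.Queue(maxsize=5)   # bounded buffer couples the two stages
    indexed = []

    def crawler():
        for i in range(n_pages):
            time.sleep(crawl_delay)   # simulated fetch time per page
            q.put(i)                  # blocks when the indexer falls behind
        q.put(None)                   # sentinel: no more pages

    def indexer():
        while (page := q.get()) is not None:
            time.sleep(index_delay)   # simulated indexing time per page
            indexed.append(page)

    threads = [threading.Thread(target=crawler),
               threading.Thread(target=indexer)]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.monotonic() - start, len(indexed)

# A slow indexer stalls the crawler (the opposing view); a fast indexer
# leaves crawling, i.e. bandwidth, as the bottleneck (scenario b).
elapsed, n_indexed = run_pipeline(crawl_delay=0.001, index_delay=0.002)
```

The bounded queue is what provides the coupling discussed above: neither side can run arbitrarily far ahead of the other.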
Perhaps someone with (or without) better knowledge of search engine design would like to correct me ;)
If that's true, there is not much left to do for the crawler, is there?
No, the crawler filters new URLs out on its own. Also, it is fed existing URLs from the database, not by the indexer.
>and with all that computing power they have the bottleneck is bound to be in "bandwidth" area
Word-based indexers are computationally expensive. That's what the G needs the computing power for. You think they buy all these machines and then force them to idle by putting stops into the code? No way! The indexer is the bottleneck, not the crawlers.
No, the crawler filters new URLs out on its own. Also, it is fed existing URLs from the database, not by the indexer.
Only amateur crawlers extract URLs from a crawled page and go on to crawl them directly. You need to ensure that you don't crawl the same URLs all over again, otherwise crawlers running in parallel on many machines might totally kill a website because they happen to hit the same pages. It is the indexer's job to analyse pages and produce nice work units with clean, deduplicated URLs for the crawlers to pick up, so that you don't waste bandwidth fetching the same stuff over and over again (and pissing off webmasters).
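The work-unit idea above can be sketched as a tiny URL frontier. This is a hypothetical illustration, not Google's actual code: the `Frontier` class, its normalization rules, and the batch size are all invented for the example.

```python
from urllib.parse import urlsplit, urlunsplit

class Frontier:
    """Central queue that hands out only never-seen URLs to crawlers."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def _normalize(self, url):
        # Minimal canonicalization: drop the fragment, lowercase the host,
        # so trivially different spellings of one URL collapse to one key.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", parts.query, ""))

    def add(self, url):
        key = self._normalize(url)
        if key not in self.seen:       # dedup: each URL is queued once, ever
            self.seen.add(key)
            self.queue.append(key)

    def next_work_unit(self, size=2):
        # Hand a batch of unseen URLs to one crawler process.
        unit, self.queue = self.queue[:size], self.queue[size:]
        return unit

f = Frontier()
for u in ["http://Example.com/a#x", "http://example.com/a", "http://example.com/b"]:
    f.add(u)
# Only two distinct URLs survive normalization and dedup.
```

With a shared frontier like this, parallel crawlers on many machines never fetch the same page twice, which is exactly the bandwidth-saving point made above.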
Word-based indexers are computationally expensive. That's what the G needs the computing power for. You think they buy all these machines and then force them to idle by putting stops into the code?
They do have lots of machines, primarily not for the CPUs but for having lots of RAM to keep the whole index in memory. They therefore have a LOT of CPU power, 100k machines by some estimates. If you spread all 4 bln URLs across those boxes, there is almost nothing for each of them to do: 40k pages each is nothing, and a good indexer gets through that in a matter of minutes.
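As a quick sanity check of that arithmetic (both the 100k-machine count and the 4 bln URL total are this thread's estimates, not confirmed figures):

```python
# Back-of-the-envelope: URLs to index per machine, using the thread's
# estimated figures (neither number is an official one).
total_urls  = 4_000_000_000   # ~4 bln URLs in the index
machines    = 100_000         # one estimate floated above
per_machine = total_urls // machines
print(per_machine)            # 40000 pages per box
```

Even at the lower 10k-machine estimate mentioned later, the figure only rises to 400k pages per box.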
The indexer is the bottleneck, not the crawlers.
I stand by the claim that the crawler is the main bottleneck, not least because websites have a natural limit on how fast they serve pages. You might have a 100Mbit connection, but if a site gives you data at a rate of 10kbit then you have to wait until you get the data, hence the bottleneck.
Everything, from my point of view, supports Brett's original answer that Google is for some reason rebuilding the index ASAP.
That's not what I said. I said that the crawlers extract the URLs themselves. Obviously, there needs to be a central queue of URLs that all crawlers feed into and draw from, so that two crawlers do not crawl the same page at the same time.
>They therefore have a LOT of CPU power, 100k machines some estimate
That's not what I heard. It's rather 10k machines.
>lots of RAM to keep all index in memory
Pure speculation.
>but if site give you data at rate of 10kbit then you will have to wait till you get data - hence bottleneck.
Hence no bottleneck, because one crawler crawls multiple URLs in parallel, or one crawler machine runs multiple crawler processes in parallel. When crawling, my FTP search engine has 100 connections to different sites open at all times on a 2Mbit line, and I can't index the data as fast as it comes in.
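Putting rough numbers on that parallelism point (the 10kbit-per-site and 100Mbit figures come from earlier in the thread and are illustrative, not measured):

```python
# Many slow connections in parallel can still fill a fast pipe.
per_site_kbit = 10            # one slow site, per the earlier example
line_kbit     = 100 * 1000    # a 100Mbit crawler connection, in kbit
parallel_fetches = line_kbit // per_site_kbit
print(parallel_fetches)       # 10000 slow sites in parallel saturate the line
```

So a single slow site caps one connection, not the crawler as a whole; the crawler's line only becomes the bottleneck once it is actually saturated.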
That's not what I said. I said that the crawlers extract the URLs themselves.
This is not how Google works. Here is a diagram of their original design, which I doubt has changed that much, as it still makes sense now:
[www-db.stanford.edu...]
Notice that the crawler is fed by the URL server, and crawled pages have to go through the indexer before new URLs are passed back to the crawler. If you have a more up-to-date scheme of how Google works, then please share it (as long as it's legal, of course).
That's not what I heard. It's rather 10k machines.
That's outdated info. Even if it were 10k, then 4 bln URLs only require processing 400k URLs on each box, and that's nothing in terms of CPU requirements.
>lots of RAM to keep all index in memory
>Pure speculation.
It's not speculation, it's a fact. Their original (post-Stanford BackRub) design feature was to keep the whole index in memory, which is why they went for 2GB of RAM in each box. I suggest you check the references on the matter, and if you find something that says otherwise, please share it with me.
I can't index data as fast as it comes in.
Well, the reasons why you can't index data as fast as you crawl it are beyond the scope of this conversation. I'd speculate that your indexers are just too slow, hence for you the bottleneck is not bandwidth, unlike Google's situation ;)
And
66.249.65.12
66.249.65.51
66.249.65.6
Before, it used to crawl me from the 64.68. IP range. Why has the IP changed? Any ideas? Anyone?
IPs crawled on the product website:
66.249.64.195
66.249.64.47
66.249.65.73
Same block of IPs in both categories.
So, I believe it has nothing to do with our pages being removed. Maybe Google is trying out different things.
I expect a lot of cloaking and redirect sites will be dropped soon from these new bot IPs and this crawl. It's what I had in mind in the post about hijacks when I said I think Google is on it. They have been asking for file paths and filenames with extensions I have never used before. I am hopeful, anyway.
Man oh man, I hope so. As you know, I have been just a little involved in that thread.
The 66.249.64.X series was requesting pages that were fully indexed i.e., they have a page title and description.
The 66.249.65.X series was requesting pages that were only partially indexed, i.e., they did not have a title or description when listed by Google using site:
In my case, the 66.249.65.X requests were for pages that exist on my server but that I am trying to get Googlebot to stop indexing. I have no links pointing to those pages, but Google knows about them and keeps requesting them.
Not sure what all this means but that is what I am seeing regarding those two IP series.
[edited by: boredguru at 7:43 pm (utc) on Sep. 24, 2004]