
Google News Archive Forum

This 176-message thread spans 6 pages; this is page 2.
Gbot running hard
ncw164x
msg:167902 - 9:04 am on Sep 23, 2004 (gmt 0)

Googlebot is requesting between 2 and 5 pages a second. I haven't seen this type of spidering for a long time.

 

AthlonInside
msg:167932 - 7:42 pm on Sep 24, 2004 (gmt 0)

66.249.65.239 crawled my index page 26 times in one day. It must be very hungry.

nuclei
msg:167933 - 8:07 pm on Sep 24, 2004 (gmt 0)

> as if an index needs to be rebuilt from the ground up in a short time period (aka: the old index didn't work).

I tend to agree; something is definitely going on. And if an algo change screwed the real index too badly, they would want to avoid reverting to an older one if at all possible. You could well be right.

Liane
msg:167934 - 9:44 pm on Sep 24, 2004 (gmt 0)

I agree with Brett. Something must be causing this feeding frenzy, and it wouldn't surprise me if there was a glitch with the index.

Google went nuts on my site every day this past week, but in the last 24 hours ... only one hit. I've never had that before, not that I can remember anyway. Very unusual behaviour.

I smell a "major" update in the offing ... once they get things sorted.

HayMeadows
msg:167935 - 11:26 pm on Sep 24, 2004 (gmt 0)

Interesting: www2.google.com redirects to google.com. Not sure if this is new or not; apologies if this is old news.

mahlon
msg:167936 - 12:02 am on Sep 25, 2004 (gmt 0)

> Interesting: www2.google.com redirects to google.com. Not sure if this is new or not; apologies if this is old news.

I noticed that a few days ago.

Something is up. I have my guesses, but I'm keeping quiet.

HayMeadows
msg:167937 - 12:04 am on Sep 25, 2004 (gmt 0)

A major algo shift just before Christmas, of course. Let's just hope they aren't missing 75% of the good sites for no reason! I think we'll all be praising Google soon; at least that's my hope!

DslLmi
msg:167938 - 5:11 am on Sep 25, 2004 (gmt 0)

I have been getting a ton of bot hits from the 66.249 IP range... about 20 different IP addresses! Any ideas what's going on?

Hanu
msg:167939 - 10:52 am on Sep 25, 2004 (gmt 0)

Lord Majestic,

>here is diagram of their original design which I doubt changed that much

This design is, ahem, how old? Even if it makes sense to you today, that doesn't tell us whether it's actually still in use. There are a lot of things that make sense but aren't in use anymore.

>>That's not what I heard. It's rather 10k machines.

>Thats outdated info

I'd be interested to see actual evidence for this statement.

>I suggest you to check references on the matter and if you find something that says otherwise then please share it with me.

That's funny. You have made the claim so you back it up, not me.

Regarding your style of argumentation: your original claim was that G has artificial pauses ("breaks") in the indexer code and that for "panic crawls" they take these out. My claim was that there are no artificial pauses in the indexer code, and that "Gbot hitting hard" just means they are doing a quick URL check in preparation for a PR update. Nothing you have said proves me wrong. My experience with indexers is that bandwidth is not a bottleneck. That may be my own experience only, and it may not apply to G; on the other hand, you seem to take a lot of 'mights' for granted too. With both of us shooting in the dark, I suggest we leave it at that. You can't prove someone's hypothesis wrong by coming up with your own.

boxhead
msg:167940 - 11:00 am on Sep 25, 2004 (gmt 0)

Yeah, something crazy is going on. I haven't seen Google hit so many 404s (100+) as it has in the last 2 weeks. It hadn't touched those pages since April, and now those 404 pages are appearing in the SERPs with an April 23 cache date; instead of a 404, it's showing the old content.

Anna.

Lord Majestic
msg:167941 - 11:23 am on Sep 25, 2004 (gmt 0)

> Lord Majestic,
>
> >here is diagram of their original design which I doubt changed that much
>
> This design is, ahem, how old? Even if it makes sense to you today, that doesn't tell us whether it's actually still in use. There are a lot of things that make sense but aren't in use anymore.

Since none of us knows for sure, it is a question of which scenario is more probable. It is more likely that the crawler does not parse HTML for new links, because that is a CPU-intensive operation the indexer performs anyway, and it is more efficient to combine URL extraction with general indexing, since you have to go through all the tokens in the HTML in any case (a toy sketch of this follows below). This is why the original design did not give the crawler a job that is unnecessary at that stage.
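To make that concrete, here is a minimal, purely hypothetical sketch; it has nothing to do with Google's actual parser, it just illustrates that one traversal of the HTML can feed both the indexer (terms) and the crawl frontier (links), so a separate link-extraction pass in the crawler would tokenize every page twice:

```python
# Toy illustration: one pass over the HTML yields both index terms and
# outgoing links. Not Google's parser; just the principle under discussion.
import re
from html.parser import HTMLParser

class OnePassParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.terms = []   # tokens destined for the indexer
        self.links = []   # URLs destined for the crawl frontier

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [value for name, value in attrs if name == "href"]

    def handle_data(self, data):
        self.terms += re.findall(r"\w+", data.lower())

p = OnePassParser()
p.feed('<p>Widget specs <a href="/widgets/2">next page</a></p>')
print(p.terms)   # ['widget', 'specs', 'next', 'page']
print(p.links)   # ['/widgets/2']
```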

I have some evidence; what do you have to disprove my theory?


> >>That's not what I heard. It's rather 10k machines.
>
> >Thats outdated info
>
> I'd be interested to see actual evidence for this statement.

I said that some people estimate they have 100k machines; the 10k figure you heard is ancient history. While I don't know the exact number of machines (which is irrelevant to this argument anyway, since I have shown that 10k machines would be enough for the indexer not to be the bottleneck), it is clear that Google is serving more pages now than ever before, hence they ought to have more machines than the 10k first reported years ago.

Since you are so insistent on having references, here is one: [tnl.net...]

Quote: "Based on quick back of the envelope calculations, it looks like Google is managing between 45,000 and 80,000 servers."

This ends my argument about how many boxes Google is managing; it is irrelevant anyway, since 10k would be enough for the purpose of indexing.

> That's funny. You have made the claim so you back it up, not me.

I believe I have provided enough argumentation to support my point of view. Whether you accept it or not is of no great consequence to me, because I am not exactly being paid for this; it's up to you to decide what you choose to believe :)

> My experience with indexers is that bandwidth is not a bottleneck.

We are talking about Google, not you, OK? Given that Google has at least 10k machines, of which it is reasonable to assume at least 1k are indexing in parallel, then to index the 4 bln URLs in their database each machine needs to handle 4,000,000 URLs. Let's say they can do that in 24 hours; this gives a reasonable speed of ~46 URLs per second per machine. If they are in a rush they can use the whole 10k machines, in which case each machine only needs to index ~5 URLs per second, and I am sure their highly tuned C++ code manages at least that.

So, if they can index the whole database in 24 hours, what kind of bandwidth would they need to keep up with the crawlers? Well, at an average page size of 40 KB, they would need to download at a rate of: 4 bln pages * 40 KB * 8 bits / 86,400 secs in a day ≈ 15 Gigabit/second. Is that a plausible download speed? I don't think so; not even Google is that good ;) (The back-of-envelope sketch below reruns these numbers.)
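Spelled out, the arithmetic in this post looks like this. Every input is the poster's assumption (4 bln pages, 40 KB average page, 1k-10k indexing machines), not a known Google number:

```python
# Back-of-envelope check of the figures in this post; all inputs are
# assumptions from the thread, not published Google specs.
PAGES = 4_000_000_000        # assumed index size (4 bln URLs)
PAGE_BYTES = 40 * 1024       # assumed average page size (40 KB)
DAY = 86_400                 # seconds in a day

for machines in (1_000, 10_000):
    rate = PAGES / machines / DAY
    print(f"{machines} indexing machines: {rate:.1f} URLs/sec each")
# 1000 indexing machines: 46.3 URLs/sec each
# 10000 indexing machines: 4.6 URLs/sec each

# Aggregate bandwidth needed to re-download everything within 24 hours:
gbit_per_sec = PAGES * PAGE_BYTES * 8 / DAY / 1e9
print(f"{gbit_per_sec:.1f} Gbit/s sustained")   # ~15.2 Gbit/s
```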

Finally, another historical quote from Google's original design, which I believe has stood the test of time:

"We ran the indexer and the crawler simultaneously. The indexer ran just faster than the crawlers. This is largely because we spent just enough time optimizing the indexer so that it would not be a bottleneck. These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second."

At this stage I consider that I have provided sufficient argumentation to conclude that your original statement, "Because no index needs to be built up, the crawl is much faster than usual", is highly likely to be incorrect: it is crawling that is the bottleneck, so not running the indexer in parallel would not cut down the overall time.

> You can't prove someone's hypothesis wrong by coming up with your own.

I think you should tell that to all the sciences that are mainly based on theory, like astrophysics ;)

claus
msg:167942 - 12:45 pm on Sep 25, 2004 (gmt 0)

Although this discussion is entertaining, I think the last post (and perhaps other posts as well) mixed up a few words. I think it will be helpful to stick to this terminology:

1) Crawling (or spidering): the process of sending robots out to fetch content
1a) Delivery (or data dump): the process of getting the crawled matter from the crawlers to the indexer(s)
2) Indexing: the process of receiving this content, then parsing, sorting and ordering it
3) Scoring: the process of assigning PR to pages and doing all the other algo work

Stage (1a) might be seen as part of stage (1). Also, some initial parsing (feeding easily identifiable URLs to another sub-system) might be done as part of the crawling process, but I tend to think that Google would want complete data to be delivered to the indexer(s) in any case.

---
Added:
4) Ranking: what happens when a query is sent and results are selected and displayed in the SERPs (a toy sketch of all the stages follows below).

[edited by: claus at 12:51 pm (utc) on Sep. 25, 2004]
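For readers who prefer code to prose, here is a purely illustrative sketch of these five stages. The stage names follow the post above; every function body is a hypothetical stand-in, not a description of Google's real systems:

```python
# Illustrative only: stage names follow claus's terminology; the bodies
# are hypothetical placeholders.
def crawl(urls):
    """(1) Crawling/spidering: robots fetch raw content."""
    return {u: f"<html>content of {u}</html>" for u in urls}

def deliver(fetched):
    """(1a) Delivery / data dump: move crawled matter to the indexer(s)."""
    return dict(fetched)

def build_index(delivered):
    """(2) Indexing: parse, sort and order the received content."""
    return {u: html.split() for u, html in delivered.items()}

def score(idx):
    """(3) Scoring: assign PR and do the rest of the algo work."""
    return {u: 1.0 for u in idx}

def rank(idx, scores, query):
    """(4) Ranking: select and order results for a query."""
    hits = [u for u, tokens in idx.items() if query in " ".join(tokens)]
    return sorted(hits, key=scores.get, reverse=True)

idx = build_index(deliver(crawl(["example.com/a", "example.com/b"])))
print(rank(idx, score(idx), "content"))   # ['example.com/a', 'example.com/b']
```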

Lord Majestic
msg:167943 - 12:51 pm on Sep 25, 2004 (gmt 0)

> I tend to think that Google would want complete data to be delivered to the indexer(s) in any case.

Yes, and this is how it was originally designed, as shown in the following diagram: [www-db.stanford.edu...] Designs at that high a level rarely change; cars today follow much the same high-level principles as they did 50 years ago.

The main argument, however, was about whether the lack of indexing would make crawling faster (and hence explain the current speed of crawling). I think my calculations show that the bottleneck is at the crawling stage, and therefore a lack of indexing would not explain the current situation.

In fact it would be pretty pointless to just crawl and not index when you have the kind of computing power Google has.

claus
msg:167944 - 1:09 pm on Sep 25, 2004 (gmt 0)

>> to just crawl and not index

Yes, why would you want to do that? Assume you are quite satisfied with the overall index quality, and that scoring produces rankings within quality guidelines as well:

In this particular case, the only reason you would want to do that would be to make some very specific changes to the index, e.g. cleaning out 404 pages, sorting out redirects, detecting networks or recently updated pages, or a multitude of other potential specifics that we can only guess about.

Such specific tasks could be carried out without returning the full data dump to the indexers, and hence at a higher spidering frequency. But still, it is an educated guess at best, pure speculation at worst.

---
Added:
This sounds as if I am supporting the theory that crawling without indexing is taking place. In fact, I simply don't know. (Added: some indexing would need to happen, but not necessarily "building a whole index".)

The indexing bottleneck (if there ever was such a thing) could be mitigated by proper methods for data reception and preliminary storage, as well as by imposing speed constraints on the bots. So I would agree that there should be no bottleneck there.

I do see the transfer of vast amounts of data as a potential bottleneck, however (i.e. subprocess (1a) above). This concerns not so much the Google infrastructure as the internet infrastructure as a whole (i.e. the capacity for data transfer, or the size of their "inbound pipes").

That's my only argument for stating that specific crawling (without returning the full data set) can happen at a higher speed than "ordinary crawling", i.e. that making corrections to an existing index could happen faster (edit: in terms of crawling) than building an entirely new index.

Still, this process might as well happen in the background of the ordinary crawl, by reserving some of the thousands of bots for "maintenance".

[edited by: claus at 1:37 pm (utc) on Sep. 25, 2004]

Simon Jester
msg:167945 - 1:16 pm on Sep 25, 2004 (gmt 0)

Google's cache has just reverted to one dated 9/11 for my site, although they've had several more recent ones.

Weird...

BillyS
msg:167946 - 1:18 pm on Sep 25, 2004 (gmt 0)

> So, if they can index the whole database in 24 hours, what kind of bandwidth would they need to keep up with the crawlers? Well, at an average page size of 40 KB, they would need to download at a rate of: 4 bln pages * 40 KB * 8 bits / 86,400 secs in a day ≈ 15 Gigabit/second. Is that a plausible download speed? I don't think so; not even Google is that good ;)

Why aren't they that good? You stated earlier that they have 100,000 machines with 2 gigabytes of memory each. Let's say they dedicate 10,000 to crawling. That is only 1.5 Megabit/second per machine.

I'm just using some of your numbers, and this seems pretty possible to me if you're right. Are you right? (The division is spelled out below.)
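BillyS's counter-arithmetic in one line. Both figures come from the posts above, not from any known Google spec:

```python
# Spread the ~15 Gbit/s aggregate over the 10,000 machines assumed to be
# crawling; both numbers are the thread's assumptions.
aggregate_mbit = 15_000            # ~15 Gbit/s, expressed in Mbit/s
crawl_machines = 10_000
print(aggregate_mbit / crawl_machines, "Mbit/s per machine")   # 1.5
```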

Unless you've designed their system or work with the servers, your speculation is as good as mine. I'm sure the Google engineers will chuckle over this "factless" debate.

dataguy
msg:167947 - 1:30 pm on Sep 25, 2004 (gmt 0)

Pardon my jumping in, I really don't follow Google as closely as I should, but some of these assertions seem flawed from the beginning.

If we're debating the index-processing bottleneck and then arguing about how many servers "Google is managing", what do the two have in common? Google certainly doesn't use all of its servers for processing the index, nor all of them for crawling. A substantial number of those 10,000-100,000 servers must be used for serving search results to end users, serving AdSense ads, and so on. Hardware is cheap; I'm sure Google has as many servers dedicated to each job as needed, as well as plenty of bandwidth dedicated to crawling and serving up its data.

creative craig
msg:167948 - 2:20 pm on Sep 25, 2004 (gmt 0)

I have seen a deep crawl three times in the last week, though I have not seen the other Google IP addresses hit me at all.

On a side note I have seen msnbot hit me hard in the last three days as well.

Lord Majestic
msg:167949 - 3:16 pm on Sep 25, 2004 (gmt 0)

> Why aren't they that good? You stated earlier that they have 100,000 machines with 2 gigabytes of memory each. Let's say they dedicate 10,000 to crawling. That is only 1.5 Megabit/second per machine.

Boxes are a fixed, low cost; bandwidth, however, is still relatively expensive, especially when you need THAT MUCH. 15 Gigabit sustained over a period of time is a very high load, and it is not cheap. Google's normal internal SLA is to crawl everything over a few months, so the cost of doing it that way is far lower: they would only need about 167 Mbit to achieve the same result (15,000/90; see the sketch below).
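The cost argument in numbers. The ~90-day crawl cycle is the poster's assumption, not a published Google SLA:

```python
# Amortizing the crawl: fetching the same 4 bln pages over ~90 days instead
# of 24 hours needs 1/90th of the bandwidth.
burst_mbit = 15_000                # ~15 Gbit/s for a 24-hour full crawl
crawl_days = 90                    # "a few months"
print(round(burst_mbit / crawl_days), "Mbit/s sustained")   # ~167
```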

> Unless you've designed their system or work with the servers, your speculation is as good as mine.

No, I beg to differ: calculations based on fairly likely factors are well above mere speculation; it's at least a guesstimate ;)

Why am I sure that bandwidth is the bottleneck? Because it is typical for apps of this kind to be IO-bound. They solved the hard-drive IO limitation (latency) by loading the index into memory, but connectivity (bandwidth) is still the general limitation for apps of this kind (unless you are on Internet2). In addition, crawling is not CPU-intensive, so it is possible to run the indexer on the same box as the crawler, and given that there is not much to optimise in a crawler (it is IO-limited), I'd imagine the indexers are fast enough (as they were in the original version I quoted above).

> I'm sure the Google engineers will chuckle over this "factless" debate.

Doesn't bother me in the slightest. I am chuckling here myself, knowing that they know the truth but can't tell it for obvious reasons; it's like an itch they can't scratch ;)

Anyway, people seem to have missed the main point: I asserted that, contrary to what the original poster in this topic thought, crawling will not be faster with indexing turned off. Everything I said was in support of that assertion.

ogletree
msg:167950 - 4:58 pm on Sep 25, 2004 (gmt 0)

Well, G does have a lot of extra money now. They started off being very cost-effective, and that is what made them so profitable. Now they have so much money they don't know what to do with it. People tend to spend money when they have it, and there is a lot of pressure to spend it. I think going public will hurt G's bottom line. Just because you have lots of credit does not mean you have to spend it. At the beginning they did not spend money unless it was necessary to make more. Now they are going to spend money just because they have it, and get a much lower return on their investment. This is what all big companies do. There are people who have been given a big budget to do something with, and they feel they have to spend it.

AthlonInside
msg:167951 - 5:01 pm on Sep 25, 2004 (gmt 0)

I changed my title 2 weeks ago and now it is reflected in almost all the datacenters.

metatarsal
msg:167952 - 5:18 pm on Sep 25, 2004 (gmt 0)

Google's just a search engine.

I do my best to ignore it these days ;-)

gatekeeper
msg:167953 - 6:00 pm on Sep 25, 2004 (gmt 0)

I agree.

G-bot is hitting my largest site unlike ever before.

I was still wondering, though ... how long until the cache of a new, or newly spidered, web page shows up in the Google index?

Anybody have any ideas?

scoreman
msg:167954 - 6:25 pm on Sep 25, 2004 (gmt 0)

Come on, enough bickering; let's get back to the topic at hand. I'm seeing Gbot coming from those IPs as well: 29 times this month on one domain, 16 on another... I'm still new at this SEO stuff; has anyone seen this happen before?

Rick_M
msg:167955 - 6:57 pm on Sep 25, 2004 (gmt 0)

Not sure if it's related, but my site that has been ranking oddly since Sept 23rd is only showing cache dates from before Sept 12th. This particular site consistently ranked #1 for many terms for around a year, and since Sept 23rd I don't have a single #1 ranking (not even for the site name), but a lot in the 20-30s and a few in the top 10.

I hope it was part of a spidering gone bad, as I liked my old rankings.

My site is not being spidered all that differently, though, the past few days at least.

andystowell
msg:167956 - 7:26 pm on Sep 25, 2004 (gmt 0)

Rick_M - exactly the same experience here: a site that ranked consistently within the top 3 for many terms has now shifted to pages 2, 3 or lower for the majority of terms, though some are still ranking well.

I too was hoping it was some spidering problem but am starting to think differently now :-(

Rick_M
msg:167957 - 7:38 pm on Sep 25, 2004 (gmt 0)

andystowell,

When did your site drop?

BillyS
msg:167958 - 8:21 pm on Sep 25, 2004 (gmt 0)

Now I know this might also seem unrelated, but Google is AdWords and AdSense too.

Today I am not seeing any data for AdSense. Could this be another sign of a scramble? Dedicating more resources to data gathering or computation?

I'm not alone... [webmasterworld.com...]

More to support Brett's theory! Or an attempt to punch through the 4-billion-page barrier...

[edited by: BillyS at 8:36 pm (utc) on Sep. 25, 2004]

andystowell
msg:167959 - 8:31 pm on Sep 25, 2004 (gmt 0)

Rick_M - My site started dropping Thursday 23rd.

BillyS, I'm unable even to log in to AdSense at this end...

BillyS
msg:167960 - 8:55 pm on Sep 25, 2004 (gmt 0)

I should have checked this too. AdSense and AdWords are both having problems...

[webmasterworld.com...]

[webmasterworld.com...]

Big problems at Google. My bet is now with Brett: a big change to the index, a lost index, or my related theory of breaking past 4 billion pages as a show of strength after the stock offering.

metatarsal
msg:167961 - 9:06 pm on Sep 25, 2004 (gmt 0)

Ooooooh,

So what's Goooooogle dooooooing now?

Who gives a f*ck!

Stop girlying, and get on with your site.

It's a pile of crap - and you'll put yourself into an early grave worrying about an algo that can never be figured, and is now driven by money.

Give up, get off the boards, and get on with your life.

There's no life here.

(admittedly a rather medieval perspective, but there is probably some truth in this analysis etc....)

P.S. And I'm a 'new user', am I? What crap one sees on the inet.

P.P.S. Ooooooh, I couldn't help it: some tech problems with NoSense in the post above, or whatever.

Well who gives a Kate.

Anyone read a good book recently?

Lord Majestic
msg:167962 - 9:17 pm on Sep 25, 2004 (gmt 0)

> Anyone read a good book recently?

Good point - I quit my full-time job to do just that (among other things) :)
