
Google News Archive Forum

This 176 message thread spans 6 pages: < < 176 ( 1 2 [3] 4 5 6 > >     
Gbot running hard
ncw164x

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 25897 posted 9:04 am on Sep 23, 2004 (gmt 0)

Googlebot is requesting between 2 and 5 pages a second. I have not seen this type of spidering for a long time.

 

Lord Majestic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 25897 posted 9:17 pm on Sep 25, 2004 (gmt 0)

Anyone read a good book recently?

Good point - I quit my full time job to do just that (among other things) :)

Hanu

10+ Year Member



 
Msg#: 25897 posted 10:48 pm on Sep 25, 2004 (gmt 0)

Lord Majestic,

although you decided to end the discussion about the number of machines, I'll take the liberty of adding something to it. You are not allowed to reply. ;-)

In 2002 c't magazine (a well-known German computer magazine) published an interview with Google fellow Urs Hölzle. The article is in German and available pay-per-click only but I'll sticky it to you if you like. He said that they had at that point 10k machines. The blog article you mention references an IEEE article from 2003 by three Google employees in which the number of machines is said to be 15k. 10k in 2002, 15k in 2003 - let's say they are now at 20k.

>4 bln pages * 40KB * 8 bits / 86,400 secs in day = 15 GigaBit / second

Yes, 16 Gbit/s is not reasonable. In the above-mentioned interview, Hölzle said that the index turn-around is 28 days, with 3 billion URLs in the index at that time. Feeding 3 billion URLs into your calculation, one gets 12 Gbit/s. Divide this by 28 days and 5 DCs and one gets 90 Mbit/s, a more reasonable number. We also need to take into account that the DCs must replicate the index among each other and that there is query traffic. I admit that your and Brett's theory is more likely than mine.
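
The back-of-envelope arithmetic in this post can be reproduced with a short sketch. It assumes an average page of 40 KB taken as 40,000 bytes, a day of 86,400 seconds, and crawl load split evenly across data centers. The function and parameter names are illustrative only and say nothing about how Google actually schedules its crawl.

```python
SECONDS_PER_DAY = 86_400


def crawl_bandwidth_bits_per_sec(pages, avg_page_bytes=40_000, days=1, data_centers=1):
    """Sustained bandwidth in bits/s, per data center, needed to fetch
    `pages` pages of `avg_page_bytes` each within `days` days."""
    total_bits = pages * avg_page_bytes * 8
    return total_bits / (days * SECONDS_PER_DAY) / data_centers


# 4 billion pages fetched in a single day through one pipe
one_day_4bn = crawl_bandwidth_bits_per_sec(4_000_000_000)

# 3 billion pages fetched in a single day through one pipe
one_day_3bn = crawl_bandwidth_bits_per_sec(3_000_000_000)

# 3 billion pages over a 28-day cycle, shared by 5 data centers
per_dc_28_days = crawl_bandwidth_bits_per_sec(
    3_000_000_000, days=28, data_centers=5
)

print(f"{one_day_4bn / 1e9:.1f} Gbit/s")
print(f"{one_day_3bn / 1e9:.1f} Gbit/s")
print(f"{per_dc_28_days / 1e6:.0f} Mbit/s")
```

Under these assumptions the results come out at roughly 15 Gbit/s, 11 Gbit/s, and 80 Mbit/s per data center, the same order of magnitude as the rounded figures quoted in the post. Note that the results are in bits per second, not bytes.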

Muskie

10+ Year Member



 
Msg#: 25897 posted 5:17 am on Sep 26, 2004 (gmt 0)

This month in my little cul de sac off the internet super highway I have not seen much of the google bot. I have however seen the MSNBot crawl every page in my site every month for quite a while.

Still I get the majority of my offsite referrers from Google either normal or image search.

Just to contradict what y'all are saying...

Muskie

BillyS

WebmasterWorld Senior Member billys us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 25897 posted 11:23 am on Sep 26, 2004 (gmt 0)

Just to contradict what y'all are saying...

If google never visits you, then how are your pages in its index? Obviously it has all your pages and no need to revisit.

Muskie

10+ Year Member



 
Msg#: 25897 posted 3:14 pm on Sep 26, 2004 (gmt 0)

My site has been mostly stagnant since I've gone back to school. However, if Google were re-indexing and deep crawling sites, surely it would crawl mine, given that I even manage to be number 1 for my own minuscule and unimportant keyword?

Yet all I see is MSN Bot and a variety of browsers in my top agents list... Maybe Google Bot visited a little this month, as Web Analysiser only shows the top ten agents, but the fact remains that for my website, a sample size of one, GoogleBot is not "running hard" and in fact, month in, month out, MSN Bot has made it a point to completely crawl my site.

Muskie

BillyS

WebmasterWorld Senior Member billys us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 25897 posted 3:49 pm on Sep 26, 2004 (gmt 0)

My site has been mostly stagnant since I've gone back to school. However, if Google were re-indexing and deep crawling sites, surely it would crawl mine, given that I even manage to be number 1 for my own minuscule and unimportant keyword?

Lots of folks can rank #1 very quickly (nearly overnight) for uncompetitive keywords; I've done it many times. I'm not sure of your point there. Do you think that just because you rank #1 for a keyword your site is important to Google? I've never gotten that impression from what I've read.

BillyS

WebmasterWorld Senior Member billys us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 25897 posted 4:12 pm on Sep 26, 2004 (gmt 0)

Boxes are all fixed low cost; however, bandwidth is still relatively expensive, especially when you require THAT MUCH: 15 Gigabits sustained over a period of time is a very high load and it's not cheap. Google's normal internal SLA is to crawl it all over a few months, so the costs of doing so are far lower: they only need 166 Mbits to achieve the same results (15,000/90).

You mean to tell me that Google would spend nearly a half billion dollars on 100,000 servers but go on the cheap with bandwidth?

OC768 - 39.8 Gbps and all this stuff has burst speeds. They don't even need a pipe that big with multiple centers...

I'm also not sure why you think that hosting 100,000 servers is all "fixed low cost." If you're right, their electric bills alone are in the $5 million per month range (the servers plus cooling), let alone the cost to maintain all those machines.

Lord Majestic

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 25897 posted 4:18 pm on Sep 26, 2004 (gmt 0)

OC768 - 39.8 Gbps and all this stuff has burst speeds. They don't even need a pipe that big with multiple centers...

BillyS - the argument is not about what pipes they have and how much they cost. Empirical and historical evidence suggests that indexers are not the bottleneck, and that it's the crawlers that take longer to execute. The exact speed is irrelevant, as I was trying to answer the original question, which, if you don't mind me quoting, is as follows:

Because no index needs to be built up, the crawl is much faster than usual.

If crawling is the bottleneck (which I think is the case) then this original question is answered as "no, not running indexers won't make any difference to crawling speed, provided there are enough URLs to crawl". This is all I was arguing about. I went into the possible specifics of what infrastructure they use to show what the most likely bottleneck is.

And believe it or not - even with billions of dollars there is always a bottleneck, simply because one subsystem runs faster than the other for whatever reason.

walkman



 
Msg#: 25897 posted 5:47 pm on Sep 26, 2004 (gmt 0)

still running hard...days after a major update. GBot is on crack.

atlrus

10+ Year Member



 
Msg#: 25897 posted 8:37 pm on Sep 26, 2004 (gmt 0)

All this is just speculation.
I have a #1 website whose cache is from August 12th, and another one in the top 15 with a cache from September 6th.

ncw164x

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 25897 posted 8:51 pm on Sep 26, 2004 (gmt 0)

Well now the feeding frenzy is over and the ladies have put their handbags down :)

My site has been mostly stagnant since I've gone back to school

Hey Muskie, when was the last time you updated your site? If you give Googlebot something to chew on she will return time after time. MSNBot and Yahoo are also spidering very hard and have been for quite some time.

GoogleBot is not "running hard" and in fact month in month out, MSN Bot has made it a point to completely crawl my site

Believe me, Googlebot is running very fast indeed. I have not seen this type of spidering since the Google dance ended and the update was spread out over the month as a continuous update instead of the last 7 - 10 days of each month. It is too early to tell what the outcome will be, but I am sure we will all find out in the very near future.

Hanu

10+ Year Member



 
Msg#: 25897 posted 6:56 pm on Sep 27, 2004 (gmt 0)

Well now the feeding frenzy is over and the ladies have put their handbags down

LOL. Sorry 'bout that ... ;)

Filipe

10+ Year Member



 
Msg#: 25897 posted 9:24 pm on Sep 27, 2004 (gmt 0)

For all those people who say that Google ignores the newbies, my new site has had 4,000 of its pages crawled since Gbot started going nuts.

sasha

10+ Year Member



 
Msg#: 25897 posted 4:20 am on Sep 28, 2004 (gmt 0)

Aside from an increase in spidering, has anybody seen any meaningful increase in the number of pages shown on their Google index: ie, site:www.domain.com, especially for partially indexed sites?

In my case Google spidered around 25,000 pages out of a total of 50,000; however, our Google index has only increased from about 7,000 to about 8,000. Roughly...

dvduval

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 25897 posted 4:26 am on Sep 28, 2004 (gmt 0)

I'm seeing small increases like you sasha, but nothing sizeable.

Patient

10+ Year Member



 
Msg#: 25897 posted 6:56 am on Sep 28, 2004 (gmt 0)

Hi Sasha

When you say that Googlebot visited 25000 pages, do you mean it visited your site 25000 times, or do you know that it visited 25000 separate pages?

I also have a site where google "loses" a percentage of indexed pages every month but despite the fact that it visits my site >50000 times a month I am not convinced that it visits every individual page.

I am going to analyse my logs in more detail.

sasha

10+ Year Member



 
Msg#: 25897 posted 7:11 am on Sep 28, 2004 (gmt 0)

> When you say that Googlebot visited 25000 pages, do you mean it visited your site 25000 times, or do you know that it visited 25000 separate pages?

Actually, it visited probably, like, 100 times in the span of 2 weeks. It produced 25K hits to all pages. I would say judging by the logs about 15-20% of the hits are duplicate (ie. hitting robots.txt 200 times per day, or index.html about 20 times), with the rest of the hits being unique.
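
The distinction being asked about here, total hits versus distinct pages requested, can be checked directly against an access log. The following is a minimal sketch that assumes the server writes Apache-style Combined Log Format. The function name and the sample lines are made up for illustration (the browser line uses a documentation-only IP address).

```python
import re
from collections import Counter

# Combined Log Format:
# ip ident user [time] "method path protocol" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize_bot_hits(lines, agent_substring="Googlebot"):
    """Return (total_hits, unique_paths, hits_per_path) for one crawler."""
    per_path = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and agent_substring in match.group("agent"):
            per_path[match.group("path")] += 1
    return sum(per_path.values()), len(per_path), per_path


# Made-up sample data, for illustration only
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.65.236 - - [28/Sep/2004:07:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{BOT}"',
    f'66.249.65.236 - - [28/Sep/2004:07:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "{BOT}"',
    f'66.249.65.236 - - [28/Sep/2004:09:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "{BOT}"',
    f'66.249.65.236 - - [28/Sep/2004:09:00:02 +0000] "GET /a.html HTTP/1.1" 200 2048 "-" "{BOT}"',
    '192.0.2.10 - - [28/Sep/2004:09:05:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/4.0"',
]

total, unique, per_path = summarize_bot_hits(sample)
print(total, unique, per_path.most_common(1))
```

Comparing `unique` against the number of pages on the site shows how much of it the bot really covered, and `per_path` shows which URLs account for the repeat hits.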

PhraSEOlogy

10+ Year Member



 
Msg#: 25897 posted 7:29 am on Sep 28, 2004 (gmt 0)

Bots on one site this month

MSN 88,000+ pages
Jeeves 88,000+ pages
Google 15,000 pages

Google comes in third place.

Strange thing is that another site that lost 90% of its traffic from google gets crawled like crazy - every day - Go figure!

Powdork

WebmasterWorld Senior Member powdork us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 25897 posted 7:45 am on Sep 28, 2004 (gmt 0)

I mentioned this in another thread but it seems to belong here. In a response to an email I sent regarding sandboxing, Google replied:

Our crawler analyzes the content of webpages in our index to determine the search queries for which they're most relevant.

I'm guessing that says something about the relationship betwixt crawler and indexer, although I don't know what.

Patient

10+ Year Member



 
Msg#: 25897 posted 8:01 am on Sep 28, 2004 (gmt 0)

In my case googlebot feels the need to visit each of my pages at least 10 times per month - at this rate it would need to visit my site >150,000 times and not the 50,000 it does at present.

I am beginning to wonder whether there may be a connection between this and the "missing pages" syndrome.

Liane

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 25897 posted 11:23 am on Sep 28, 2004 (gmt 0)

In the last 24 hours Googlebot reindexed my entire site for the second time this month and that doesn't include the daily hits in between. I'm still not seeing any major shifts on much of anything.

I still believe they are having problems and are frantically trying to sort things out. I'm having a hard time believing there is any other reason for two deep crawls without a major update.

DaveN

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 25897 posted 11:28 am on Sep 28, 2004 (gmt 0)

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

don't think that one is your friend!

Patient

10+ Year Member



 
Msg#: 25897 posted 11:38 am on Sep 28, 2004 (gmt 0)

Hi DaveN

What do you think is "unfriendly" about this bot?

macdave

10+ Year Member



 
Msg#: 25897 posted 11:50 am on Sep 28, 2004 (gmt 0)

If this activity is indeed an emergency rebuild of Google's index, I suspect it's related to the meta-refresh page hijacking problem being discussed ([webmasterworld.com ])

From recent posts in that thread it appears that the redirect/hijack problem is being fixed. The root of the problem was page contents being "credited" to redirected URLs: in that case, the index may not have contained the data necessary to re-link hijacked content with its proper URL, necessitating a complete crawl to rebuild the index.

DaveN

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 25897 posted 11:51 am on Sep 28, 2004 (gmt 0)

it's just not the bot you want to see if you do affiliate marketing.... just a feeling and a few tests ;)

BillyS

WebmasterWorld Senior Member billys us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 25897 posted 11:52 am on Sep 28, 2004 (gmt 0)

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

I've seen this in my logs too. The only note I've seen on it was a confirmation that it was Google's. I think they started using it around a month ago.

sri_gan

10+ Year Member



 
Msg#: 25897 posted 1:50 pm on Sep 28, 2004 (gmt 0)

Hi,

I've been monitoring this issue from yesterday evening.

It doesn't look like Google or MSN is crawling this way.

Some other network is trying to flood websites using the names of Google and MSN.

The reason I'm saying this is the pattern in which the hits are generated, which is similar for both user agents.

Google never uses the user agent below:

Google User Agent:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Ip Range with Subnet:

66.249.66.0 255.255.255.0
66.249.65.0 255.255.255.0

Today some of my websites are pounded with hits by

MSN User Agent:
msnbot/0.3 (+http://search.msn.com/msnbot.htm)

207.46.98.0

Have any network admins found the same pattern?

whiterabbit

10+ Year Member



 
Msg#: 25897 posted 1:57 pm on Sep 28, 2004 (gmt 0)

sri_gan

66.249.66.0
66.249.65.0

Belong to Google

207.46.98.0

Belongs to MSN

Tony

sri_gan

10+ Year Member



 
Msg#: 25897 posted 2:09 pm on Sep 28, 2004 (gmt 0)

Yes, they do. But the request pattern for both agents is similar, which suggests that someone could be masquerading as MSN or Google.

jnmconsulting

10+ Year Member



 
Msg#: 25897 posted 2:35 pm on Sep 28, 2004 (gmt 0)

I have seen both Googlebot and MSNBot hitting my site pretty hard. From what I see they are real bots from each company, based on IP address ranges. Is it possible to spoof an IP like that?
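
Checking a logged address against the ranges quoted earlier in the thread can be sketched with Python's standard `ipaddress` module. The two Google ranges and their 255.255.255.0 masks are as posted above. The thread gives no mask for the MSN address, so the /24 used for it here is an assumption, and the function name is illustrative.

```python
import ipaddress

# Ranges as quoted in the thread, written as address/netmask.
# The netmask on the MSN entry is an assumption (none was posted).
CRAWLER_NETWORKS = {
    "Google": [
        ipaddress.ip_network("66.249.66.0/255.255.255.0"),
        ipaddress.ip_network("66.249.65.0/255.255.255.0"),
    ],
    "MSN": [
        ipaddress.ip_network("207.46.98.0/255.255.255.0"),
    ],
}


def crawler_owner(ip_string):
    """Return the label of the first listed range containing the address,
    or None if it falls outside every listed range."""
    address = ipaddress.ip_address(ip_string)
    for owner, networks in CRAWLER_NETWORKS.items():
        if any(address in network for network in networks):
            return owner
    return None


print(crawler_owner("66.249.65.236"))
print(crawler_owner("192.0.2.1"))
```

A match only shows that the source address sits inside a listed block; who operates that block still has to be confirmed with a whois or reverse DNS lookup. A User-Agent string can be set to anything by the client, whereas completing a TCP connection from a forged source address is much harder, which is why the address is the more useful signal of the two.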

idf03

10+ Year Member



 
Msg#: 25897 posted 2:52 pm on Sep 28, 2004 (gmt 0)

I'm also getting pounded by this 'googlebot'

An example line:

66.249.65.236 - - [28/Sep/2004:15:48:24 +0100] "GET /tps_page.html HTTP/1.1" 404 335 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

The IP does appear to belong to Google, but whereas Googlebot normally spreads its load across several bot machines, these are all coming from one IP address.

Also they are requesting docs which do not and have never existed on my site. They do appear to be docs which exist on sites I link to. It's as though the bot has followed links but has not changed the server part of the link.
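
Both observations in this post (everything arriving from one address, and requests for pages that do not exist) can be tallied from the log. A rough sketch, again assuming Combined Log Format; the function name is hypothetical, and the only input is the example line quoted above.

```python
import re
from collections import Counter

# Captures: source IP, HTTP status, user agent
REQUEST = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)


def bot_hits_by_ip_and_status(lines, agent_substring="Googlebot"):
    """Count hits per (source IP, HTTP status) for one crawler."""
    counts = Counter()
    for line in lines:
        match = REQUEST.match(line)
        if match and agent_substring in match.group(3):
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# The example line from the post above
example = (
    '66.249.65.236 - - [28/Sep/2004:15:48:24 +0100] '
    '"GET /tps_page.html HTTP/1.1" 404 335 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

counts = bot_hits_by_ip_and_status([example])
for (ip, status), hits in counts.items():
    print(ip, status, hits)
```

Run over a full log, a single dominant IP with a large share of 404 responses would match the pattern described in the post.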
