| This 54 message thread spans 2 pages: 54 (  2 ) > > || |
|Mozilla Googlebot Crawling Deep and Fast|
G Storm-Surge has presaged SERPs Updates in the past
There is some slight evidence from my site, and also from some others [webmasterworld.com], that G is beginning extensive-, deep- and fast-crawls of websites across the board.
This kind of activity has preceded a SERPs Update in the recent past [webmasterworld.com].
The G Mozilla-bot (the one that mostly carries out this kind of behaviour) is restricted in speed on my site, so I find it difficult to make an accurate appraisal. Any other confirmations or denials?
Not sure myself, but I don't think the PR update we've been looking for in February has happened yet...and I'm impatient for it :)
|Jordo needs a drink|
Probably more likely that they turned the Google M bot up a few notches now that more and more people are talking about page number differences between the normal data servers and the Big Daddy servers.
Matt Cutts has pretty much stated that the PR update will come after the Big Daddy update and Big Daddy isn't supposed to be finished until March....
The last time this connection was documented, the first sightings of fast + deep crawls were in early July 2005, with an increasing number of reports throughout August, whilst the update was 20/21st Sept. That should meet Matt Cutts's timetable nicely.
I'm not seeing any particular change, which is to say that Googlebot is taking approximately the same number of thousands of pages every day.
Back when there was an update every month and Google did a deep crawl a week or so ahead of time, watching the spiders was fun and there was a point to it. But if you're saying that the spidering fits a pattern that may have the update occurring a month or two more from now then I have to ask, Is this predictive at all? Or is our time better spent obsessing over datacenters?
They may just be trying to solidify the BD index.
I have noticed that there are some pages from my sites that were #1 and have now disappeared.
There is no question of a penalty just that the index is incomplete.
They may also be missing a lot of new content as it is taking very much longer than normal to get new pages listed
|... may have the update occurring a month or two more from now then I have to ask, Is this predictive at all? |
Exactly what I am trying to find out!
Early config errors on my site have cost me dearly with Google (>50% drop overnight at G-Update, more than once) with a *very* slow recovery after each fix. That history, plus the "Here be dragons" nature of Google mapping-topography for webmasters, causes me to treat any hard facts on the current G-nature as gold-dust.
The original title to this thread (the mod changed it) was expressed as a question. The thread is intended as a call for evidence. Are other webmasters experiencing fast, or deep, or extensive G-Bot crawls? If the answer is increasingly "Yes", and history repeats itself, then we have another smidgen of truth about G and its nature.
If you know that the storm is coming, you can batten down the hatches, and perhaps save yourself from sinking.
|I'm not seeing any particular change, which is to say that Googlebot is taking approximately the same number of thousands of pages every day. |
Same here, pretty much. Some days it's just under a thousand pages and other days it's over 2,000, but it's been a while since I've noticed anything dramatically outside that range.
Trying to predict future behavior of Google from past experience is going to be futile, in my opinion. For one thing, Big Daddy is a new infrastructure. They are still working on it, but when it's fully rolled out, I'm sure it will mean other changes from any past routines we've observed.
And my best guess on the current Moz Gbot crawl (I see it too) is that it is helping to get Big Daddy ready for prime time, not to introduce a new factor into the algorithm.
I had a very aggressive deep crawl on one of my sites today
|Are other webmasters experiencing fast, or deep, or extensive G-Bot crawls? |
No, not for me.
|They may also be missing a lot of new content as it is taking very much longer than normal to get new pages listed |
Very much agree with this and it would be great to get confirmation on this from others.
I have looked at a handful of sites on BigDaddy and non-BigDaddy. The sites I looked at are not my own sites and were unrelated to each other. The page counts varied greatly from non-BigDaddy to BigDaddy. For some sites, some counts are higher on BigDaddy and others are lower.
Google is all over the place with this move. With so many pages missing, there will be wild fluctuations in the results. A good heavy dose of crawling is needed and welcomed to sort out this mess.
I'm seeing some PageRank updates on one of my newer sites - pages that had no pagerank at all before are now showing up with PR 2 or 3. However, these results seem to come and go, so I think a pagerank update might be under way at the moment?
I'm seeing major flux in my sector. With the deep crawls and PR movement, I think a groundbreaking update could be at hand.
I too had a really deep (over 7 thousand pages) G-crawl on one of my sites on the 10th of Feb.
I work as an independent web designer nowadays and yesterday I got several calls from people who had received the Google spam warning email ... saying they were going to get a 30 day ban. I think it may have been as a result of the deep crawl around the 10th Feb.
I'm not complaining, it seems like it's bringing me a load of work.
All the Best
I have never heard of Google sending such emails.
Please explain in more detail.
This subject cropped up late in January and has been discussed a bit in other forums. Basically Google catch you out and send you an email as per below:
"Dear site owner or webmaster of yoursite.com/,
While we were indexing your webpages, we detected that some of your pages were using techniques that were outside our quality guidelines, which can be found here: [google.com...]
In order to preserve the quality of our search engine, we have temporarily removed some webpages from our search results. Currently pages from yoursite.com/ are scheduled to be removed for at least 30 days.
Specifically, we detected the following practices on your webpages ..."
.... And then it goes on to quote the faults. Very useful, eh?
All the best
Had an extremely deep crawl on one of our sites too. Almost 700,000 pages so far this month.
What were the faults they pointed out?
Repeated phrases, invisible text ... the usual amateur stuff.
All the Best
Starting to look interesting, is it not?
Some more detail would be useful. eg
- Which G-Bot? ('normal' G-Bot, Mozilla-Bot (M-Bot), Adsense-Bot (A-Bot), Image-Bot (I-Bot)) (see also below)
- Dynamic or HTML site?
- How deep? (number of pages does not really help, and can look like bragging (!); so, "an 80% increase in pages taken" would be more helpful in that context)
- Do not forget dates
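Reporting a relative change instead of a raw page count, as suggested in the list above, is a one-line calculation. Here is a minimal sketch in Python; the function name and the sample figures are purely illustrative and not taken from anyone's logs.

```python
def percent_increase(baseline_pages: int, current_pages: int) -> float:
    """Return the change from baseline to current as a percentage."""
    if baseline_pages <= 0:
        raise ValueError("baseline must be a positive page count")
    # Multiply before dividing so whole-number results stay exact.
    return (current_pages - baseline_pages) * 100.0 / baseline_pages


if __name__ == "__main__":
    # Hypothetical figures: 1,000 pages taken last month, 1,800 this month.
    change = percent_increase(1000, 1800)
    print(f"{change:.0f}% increase in pages taken")
```

A negative result simply means the bot took fewer pages than in the baseline period.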
Discriminating between G-Bots: If you use AWStats, making the following changes to robots.pm will give individual stats for the different bots:
It will then be necessary to remove that month's db file (assuming that your raw logfiles contain full stats for that month).
|# 2005-06-25 googlebot changed to ^googlebot\/ + googlebot added to RobotsSearchIDOrder_list1 |
# + to distinguish between HTTP/1.0 (former, old) and HTTP/1.1 (new, Mozilla/5.0)
# + bots (different beasts)
'^googlebot\/', # must be before googlebot
'googlebot\-image', # must be before googlebot
'googlebot\-mobile', # must be before googlebot
'^googlebot\/','Googlebot HTTP/1.0 (google.com/bot.html)',
'googlebot\-mobile','Googlebot-Mobile (Nokia6820 google.com/bot.html)',
'googlebot','Googlebot HTTP/1.1 (Mozilla/5.0 google.com/bot.html)',
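The "must be before googlebot" comments matter because the first pattern that matches wins, so the catch-all has to come last. The sketch below illustrates that idea in Python; it is not AWStats code. The labels are shortened, the sample user-agent strings are illustrative, and lower-casing the user-agent before matching is an assumption.

```python
import re

# Ordered (pattern, label) pairs mirroring the entries quoted above.
# Specific patterns come first; the catch-all 'googlebot' must be last.
GOOGLEBOT_PATTERNS = [
    (re.compile(r"^googlebot/"), "Googlebot HTTP/1.0"),
    (re.compile(r"googlebot-image"), "Googlebot-Image"),
    (re.compile(r"googlebot-mobile"), "Googlebot-Mobile"),
    (re.compile(r"googlebot"), "Googlebot HTTP/1.1 (Mozilla/5.0)"),
]


def classify_googlebot(user_agent: str):
    """Return the label of the first matching pattern, or None."""
    ua = user_agent.lower()  # assumption: matching is case-insensitive
    for pattern, label in GOOGLEBOT_PATTERNS:
        if pattern.search(ua):
            return label
    return None


if __name__ == "__main__":
    samples = [
        "Googlebot/2.1 (+http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Googlebot-Image/1.0",
    ]
    for ua in samples:
        print(classify_googlebot(ua), "<-", ua)
```

If the catch-all were moved to the top of the list, every one of these user-agents would be lumped together under a single label, which is exactly what the ordering is meant to avoid.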
Dynamic sites: It is at times like this that PHP and other dynamic sites miss the Content-Negotiation that webserver software provides as standard for static HTML pages (strictly speaking this is conditional-GET handling: Last-Modified and If-Modified-Since, answered with a 304 when nothing has changed). In practical terms, an unchanged dynamically-produced page appears brand-new, and the bots re-request it. If yours is a PHP site, have a look at this thread for a Content-Negotiation Class [webmasterworld.com]. It is easy to implement, and will reduce bandwidth + server-load. Some support on the Class is also available on this and following pages [webmasterworld.com].
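For anyone wondering what a webserver does for static pages that a dynamic script has to do for itself, here is a rough Python sketch of the Last-Modified / If-Modified-Since check. It is not the Class from the linked thread. The function name and return shape are made up for illustration, and it assumes the script knows when the page content last changed, as a UTC datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def conditional_get(last_changed: datetime,
                    if_modified_since: Optional[str],
                    body: bytes):
    """Return (status, headers, payload) for a dynamically built page.

    last_changed must be a timezone-aware UTC datetime.
    """
    last_changed = last_changed.replace(microsecond=0)  # HTTP dates have 1s resolution
    headers = {"Last-Modified": format_datetime(last_changed, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full 200
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            if last_changed <= since:
                # Unchanged since the client's copy: headers only, no body.
                return 304, headers, b""
    return 200, headers, body


if __name__ == "__main__":
    changed = datetime(2006, 2, 16, 6, 26, 28, tzinfo=timezone.utc)
    status, headers, payload = conditional_get(
        changed, "Thu, 16 Feb 2006 06:26:28 GMT", b"<html>...</html>")
    print(status, headers, len(payload))
```

The saving comes from the 304 branch: the bot still makes the request, but the server skips sending the page body.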
I've just applied for an AdSense account, and the site I've submitted received an ultra-fast deep crawl from Mozilla Googlebot - about 240 pages in 3 minutes. Amazing! :)
At 06:26:28 on 16 Feb the M-Bot began to go mad on my site.
In January it took 60 pages, 69 in December, and very similar numbers in previous months. Then yesterday it took 515 pages up until 21:56:36, when the site Slow-Scraper block [webmasterworld.com] kicked in (set at 1,000 pages), and the bot started getting 503s (it was sharing the IP with the A-Bot--also going mad--and they both got blocked). I recently implemented [webmasterworld.com] the Retry-After header for 503 Responses, but both bots have ignored it.
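For readers who have not seen this kind of block, the idea of a per-IP page limit that answers with a 503 plus a Retry-After header can be sketched roughly as follows. This is not the routine from the linked thread. The 1,000-page limit comes from the post above, while the 24-hour window, the one-hour retry value and the class name are assumptions made for the example.

```python
import time
from collections import defaultdict, deque


class SlowScraperBlock:
    """Count page hits per IP and refuse with 503 once a limit is reached."""

    def __init__(self, max_pages=1000, window_seconds=86400, retry_after=3600):
        self.max_pages = max_pages            # limit quoted in the post
        self.window_seconds = window_seconds  # assumption: rolling 24 hours
        self.retry_after = retry_after        # assumption: ask for a 1 hour pause
        self._hits = defaultdict(deque)       # ip -> timestamps of served pages

    def check(self, ip, now=None):
        """Return (status, extra_headers) for a request from this IP."""
        now = time.time() if now is None else now
        hits = self._hits[ip]
        # Forget hits that have fallen out of the rolling window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_pages:
            return 503, {"Retry-After": str(self.retry_after)}
        hits.append(now)
        return 200, {}


if __name__ == "__main__":
    block = SlowScraperBlock(max_pages=3, window_seconds=60, retry_after=120)
    for second in range(5):
        print(second, block.check("192.0.2.1", now=second))
```

Because the count is keyed on the IP alone, two bots sharing one address use up the same allowance, which matches what happened to the M-Bot and A-Bot here.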
Here is an edited selection of timings, all for the G Mozilla bot only (the last is a 503, all others are 200, 304 or 301):
|220.127.116.11 - - [16/Feb/2006:00:20:47 +0000] "GET /mfcs.php?mid=25 HTTP/1.1" 200 12038 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" In:107461 Out:12038:11pct. |
18.104.22.168 - - [16/Feb/2006:05:23:00 +0000]
22.214.171.124 - - [16/Feb/2006:06:26:28 +0000]
126.96.36.199 - - [16/Feb/2006:06:26:30 +0000]
188.8.131.52 - - [16/Feb/2006:06:26:32 +0000]
184.108.40.206 - - [16/Feb/2006:06:26:33 +0000]
220.127.116.11 - - [16/Feb/2006:06:26:34 +0000]
18.104.22.168 - - [16/Feb/2006:21:56:32 +0000]
22.214.171.124 - - [16/Feb/2006:21:56:34 +0000]
126.96.36.199 - - [16/Feb/2006:21:56:35 +0000]
188.8.131.52 - - [16/Feb/2006:21:56:36 +0000]
184.108.40.206 - - [16/Feb/2006:21:56:38 +0000] "GET /mfcs.php?mid=111&nid=7475 HTTP/1.1" 503 162 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" In:191 Out:144:75pct.
# egrep -c "66\.249\.66\.147 - - \[16/Feb(.*) 0 (.*)Mozilla/5.0" access_log
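If egrep is not to hand, much the same count can be made with a few lines of Python, broken down by status code. This is only a sketch: it assumes log lines shaped like the full sample line quoted above (Apache combined format with extra fields on the end), and the function name is made up.

```python
import re
from collections import Counter

# ip, identd, user, [date:time zone], "request", status, bytes, "referer", "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_mozilla_googlebot(lines, day):
    """Count Mozilla/5.0 Googlebot hits on one day, grouped by status code."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if not match or match.group("day") != day:
            continue
        ua = match.group("ua")
        if ua.startswith("Mozilla/5.0") and "Googlebot/" in ua:
            counts[match.group("status")] += 1
    return counts


if __name__ == "__main__":
    # Illustrative line only; the IP is the one used in the egrep command above.
    sample = [
        '66.249.66.147 - - [16/Feb/2006:00:20:47 +0000] "GET /mfcs.php?mid=25 HTTP/1.1" '
        '200 12038 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
        '+http://www.google.com/bot.html)" In:107461 Out:12038:11pct.',
    ]
    print(count_mozilla_googlebot(sample, "16/Feb/2006"))
```

To run it against a real file, pass the open file object as `lines`; the function reads it line by line, so large logs are not loaded into memory.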
Quick update now that I have yesterday's AWStats:
"Normal" A-bot crawls on my site are ~25,000/month (no change).
"Normal" G-bot crawls on my site are ~1,000/month (no change).
"Normal" M-bot crawls on my site are ~50/month (huge increase).
|Google AdSense: 14,656+53 |
Googlebot HTTP/1.1 Mozilla/5.0: 1532
Googlebot HTTP/1.0: 545+30
Server busy (503): 29318
The Mozilla-Bot is crawling on more than one IP (of course). Including all the 301s (which never hit the bot-block routine) there are 2,147 page-hits for this bot within the 16 hours up to 4am this morning. Google agreed to slow the M-Bot on my site last July (the GoogleBot took up to 30,000 pages/month before that point), but such agreements seem not to apply during "special" events.
Compared to last summer, the current stage seems equivalent to that of last August, which would suggest an update this coming March, or early April at the latest.
|Matt Cutts has pretty much stated that the PR update will come after the Big Daddy update and Big Daddy isn't supposed to be finished until March.... |
MLHmptn, when/where did Matt state that the PR update will come "after the Bigdaddy update"?
Last thing I can remember him saying is:
Update on the Mozzie-bot crawl-fest on my site.
It seems to have passed its peak, although it is still going. So far in Feb it has taken 6,820 pages (200, 301 or 304) (normal rate 50/month). There have been an astonishing 54,018 503 Server-busy [webmasterworld.com] hits in addition, although the majority (79%) of these have been the Adsense-bot.
|# egrep -c "Feb\/2006(.*) 0 (.*)Mozilla\/5.0 \(compatible; Googlebot\/2\.1" access_log* |
# ls -al access_log*
-rw-r--r-- 1 root root 30589220 Feb 21 09:29 access_log
-rw-r--r-- 1 root root 101600276 Feb 19 04:05 access_log.1
-rw-r--r-- 1 root root 101421986 Feb 12 04:05 access_log.2
-rw-r--r-- 1 root root 97912241 Feb 5 04:06 access_log.3
-rw-r--r-- 1 root root 99254483 Jan 29 04:05 access_log.4
MC did in fact say that he expects the PR update to come after the BD rollout, sometime around mid March.
I think there's a BL update going on, but as for PR, I think it's in the imagination of many, as they see PR jumping all over the place, whilst some obvious testing is going on. It will come, but not quite yet.
As for the bots, yes there appears to be activity from GoogleBot, Mozilla 4.0 and 5.0, and Mozilla dead-link checker.
Yet another thread on (essentially) the same topic [webmasterworld.com]:
|Today i got crawled very hard, fast, deep ... I'd say, let Google ravage your site as much as it wants! |
Hmm. Not quite my attitude.
..that was my quote..
I guess for a small site like mine, getting a good, hard fast poke by Google is no big deal.
But I can imagine on a large site, it could become an issue..
I'll shut up and crawl back in my hole now..