Google News Archive Forum

    
Recent Spidering Patterns
What are yours since the end of the update?
taxpod




msg:163793
 5:42 pm on Sep 12, 2003 (gmt 0)

I'm curious to see what folks' experiences have been with spidering since the old Google Dance regime ended.

Used to be I'd see the deepcrawl in the first 10 days of the month with about 50K pages pulled down, then freshy would come around every 4 - 5 days for a few thousand pages.

Since the changes, I see Gbot come every other day for maybe 10K pages, and on the off days she comes for just a few hundred. Very rarely do I have a full 24 hours when she doesn't visit at all. This pattern was steady for a few months, and then this week Gbot stayed on my site for three straight days, pulling 15-20K pages each day. By the time she was gone, she had pulled down more than 50K, similar to the old deepcrawl.

Then yesterday, an off day, she pulled down about 1,000. Today she is back and starting up the heavy stuff again, having already pulled down 10K pages.

I'm not wondering about the IPs of the bots, but rather whether others have seen a similar pattern or perhaps a different one. I apologize if this has already been posted elsewhere in another thread; I didn't see it.

 

plasma




msg:163794
 6:05 pm on Sep 12, 2003 (gmt 0)

Same here.
In addition, I can say that high-PR pages got pulled much earlier and are still being crawled right now.
And although Google traffic has been higher than average, it seems steadier in general.
On the other hand, maybe our SEO techniques just improved :)

pendanticist




msg:163795
 7:21 pm on Sep 12, 2003 (gmt 0)

Can't say as I've noticed this pattern before:

64.68.86.9 - - [12/Sep/2003:11:38:55 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:11:38:55 -0700] "GET /Oceanography.html HTTP/1.0" 200 16664 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:39:45 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:39:45 -0700] "GET /Ecology.html HTTP/1.0" 200 13935 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:40:39 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:40:39 -0700] "GET /Political_Science.html HTTP/1.0" 200 10552 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:41:24 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:41:25 -0700] "GET /Libraries.html HTTP/1.0" 200 14559 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:42:04 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:42:04 -0700] "GET /Criminology_Drug-Awareness.html HTTP/1.0" 200 12065 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:42:42 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:42:43 -0700] "GET /Encyclopedia.html HTTP/1.0" 200 6666 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:43:18 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:43:19 -0700] "GET /Search-Engines_General.html HTTP/1.0" 200 13121 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:44:01 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:44:01 -0700] "GET /Government.html HTTP/1.0" 200 18127 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:44:30 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:44:31 -0700] "GET /Legal_Search.html HTTP/1.0" 200 12243 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:45:01 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:45:02 -0700] "GET /Earth-Space.html HTTP/1.0" 200 11947 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:11:45:38 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:11:45:39 -0700] "GET /Botanical_M-Z.html HTTP/1.0" 200 14585 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

Each machine calls for 'robots.txt' before the file.

Yet, just short of an hour earlier:

64.68.87.43 - - [12/Sep/2003:11:09:22 -0700] "GET /History_A-I.html HTTP/1.0" 200 12744 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:09:27 -0700] "GET /Law_Schools.html HTTP/1.0" 200 11416 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:09:29 -0700] "GET /Environmental_A-E.html HTTP/1.0" 200 18969 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:09:30 -0700] "GET /Maps.html HTTP/1.0" 200 7948 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:11:09:53 -0700] "GET /Human_Resource_Management-Bullying.html HTTP/1.0" 200 4873 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

...and just before that.


64.68.87.43 - - [12/Sep/2003:11:07:55 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:11:07:55 -0700] "GET /Radio.html HTTP/1.0" 200 8228 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:07:59 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.79 - - [12/Sep/2003:11:07:59 -0700] "GET /Recycle.html HTTP/1.0" 200 3653 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

...back to tit-for-tat calls.

(Total this week from Google: Near a thousand.)

Seems a bit excessive.
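For contrast, the behaviour one would expect is for a crawler to fetch robots.txt once per host, cache it, and check each later URL against the cached copy. A rough sketch of that, using Python's standard robotparser (the hostname is a placeholder, not my site):

from urllib.robotparser import RobotFileParser

# A polite crawler typically fetches robots.txt once per host, caches it,
# and tests every candidate URL against the cached rules instead of
# re-requesting robots.txt before each page. Hypothetical host below.
rp = RobotFileParser("http://www.example.com/robots.txt")
rp.read()  # one fetch of robots.txt
for path in ("/Oceanography.html", "/Ecology.html", "/Maps.html"):
    if rp.can_fetch("Googlebot", "http://www.example.com" + path):
        print("would fetch", path)   # no further robots.txt requests needed
    else:
        print("disallowed", path)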

Pendanticist.

jimbeetle




msg:163796
 8:43 pm on Sep 12, 2003 (gmt 0)

I've been trying to understand g-bot and robots.txt for quite a while and still can't wrap my mind around it. I can almost understand different machines each having to fetch robots.txt once per spidering session, but I just can't see why the same machine has to make constant calls for it.

Does this have something to do with G's distributed system and storage capacity? Ain't no engineer, but it seems like in a well-thought-out system this wouldn't happen.

>>Total this week from Google: Near a thousand.

Yep, and that's just counting the calls for robots.txt. On a deep crawl it's at least that many per day. Sure am glad Google has all that extra bandwidth to play with.

jdMorgan




msg:163797
 2:06 am on Sep 13, 2003 (gmt 0)

Glad I checked here. I'm getting deep-crawled and I'm seeing the same robots.txt/file.html/robots.txt/file.html alternating pattern - strange.

jimbeetle,
> but I just can't see why the same machine has to make constant calls for it.

There could be multiple machines behind those IP addresses - lots of port addresses available on each IP besides port 80... (If *I* spidered your site from my house, you'd see multiple requests for robots.txt from the same IP address, because all four computers are proxied through the same connection to my ISP.)

Jim

jimbeetle




msg:163798
 3:20 pm on Sep 13, 2003 (gmt 0)

>>There could be multiple machines behind those IP addresses

Duh! As always, you remind me of the obvious that I've overlooked or not thought through. Thanks JD.

One of the things I'm curious about here is that the constant fetching of robots.txt makes G appear almost paranoid about violating the protocol, i.e., checking robots.txt each time before fetching a file.

On the other hand, G has a somewhat loose interpretation of the robots.txt protocol when it comes to indexing pages that are disallowed -- the "We're really not indexing pages that are disallowed, just pointing out that they exist" reasoning. This forces webmasters to "allow" access to the pages and then use meta robots noindex tags on each to prevent listing in G's index (as JD has previously pointed out).

Makes for somewhat cumbersome site management when faced with this "two-headed" bot: on the one hand, its scrupulously repetitive checking of robots.txt, which over time unnecessarily bleeds bandwidth; on the other, its very narrow interpretation of the robots.txt protocol.

(Got distracted and somewhat lost when writing above, hope it's understandable.)

pendanticist




msg:163799
 3:25 pm on Sep 13, 2003 (gmt 0)

To update:

Overnight, the request pattern continued, with the same IP that called for robots.txt also making the page request.

She rested for a little while, then returned to working even more zealously:

Note: (in one second...)

64.68.87.43 - - [12/Sep/2003:18:16:09 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:16:09 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:16:09 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

Again, all within one second...

64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /Blahblah.html HTTP/1.0" 200 1879 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /Blahblah.html HTTP/1.0" 200 10628 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /Blahblah.html HTTP/1.0" 200 20031 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:18:59:06 -0700] "GET /Blahblah.html HTTP/1.0" 200 4185 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

So as not to be left out, this IP number decides to join the fray... again, all within one second.

64.68.86.9 - - [12/Sep/2003:20:09:33 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:20:09:33 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:20:09:33 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:20:09:33 -0700] "GET /robots.txt HTTP/1.0" 200 1493 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

I'm not sure how much of a server drain this causes, but you'd think Google'd at least ride herd a bit better.

Pendanticist.

truth_speak




msg:163800
 3:52 pm on Sep 13, 2003 (gmt 0)

I am a moron when it comes to geek stuff, but here's my 2 cents. My site is small, just 350 pages, and Googlebot comes every day to fetch 15-30 pages. I don't know which pages. I wish there were a way for it to fetch just the pages that have been updated; 90% of the pages the bot requests haven't been modified since they were last crawled. And sometimes a bona fide new page takes two weeks to get into the index while plenty of old stuff is refreshed during that time.

jdMorgan




msg:163801
 11:46 pm on Sep 13, 2003 (gmt 0)

I think this robots.txt re-fetching is a bug - something's just not right.

truth_speak,

It is possible that your server is not sending Last-Modified headers. Also, if your server is set up to send very short-term Expires headers, that can cause problems. Another troublemaker is the cache-control header.

If any of these are misconfigured, Google may think you WANT them to check for changes often. And since you have high enough PR to get fresh-botted daily, they will.

Check your server headers here [webmasterworld.com]. Look for the headers named above.
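If that tool happens to be down, a quick sketch along these lines shows the same three headers (Python; the host and path are placeholders, swap in your own):

import http.client

# Fetch only the response headers for one page and print the caching-related
# ones. "www.example.com" and "/" are placeholders for your own host and path.
conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)
conn.request("HEAD", "/", headers={"User-Agent": "header-check/0.1"})
resp = conn.getresponse()
for name in ("Last-Modified", "Expires", "Cache-Control"):
    print(name + ":", resp.getheader(name, "(not sent)"))
conn.close()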

Jim

pendanticist




msg:163802
 1:36 am on Sep 14, 2003 (gmt 0)

>>I think this robots.txt re-fetching is a bug - something's just not right.

One might then wonder why this anomaly occurs only sporadically.

...Last-Modified...: Sun, 10 Aug 2003 16:32:52 GMT

...send very short-term Expires headers...:Expires: Sun, 14 Sep 2003 03:16:11 GMT

Another troublemaker is the cache-control header...:Expires: Sun, 14 Sep 2003 03:16:11 GMT

I just took these readings a few minutes ago, and I see that both the 'cache-control' and the 'very short-term Expires' headers expire within 24 hours, while Last-Modified is better than a month ago (<- which seems accurate).

Are these copacetic?

Pendanticist.

jdMorgan




msg:163803
 2:47 am on Sep 14, 2003 (gmt 0)

pendanticist,

Those Expires headers look like a problem. I'd set Expires to half the average interval between your page updates, or one week, whichever is shorter. On the other hand, I set my own images to expire after 90 days, because I almost never, ever change them -- I might switch to a new image, but then I'd use a different filename, so it would be cached separately with its own Expires date.

The cache-control header (if present) might say something like "public" or "must-revalidate". If it says "no-cache", the page won't be cacheable at all; if that's not what you want, it's a problem.

This stuff is controlled by mod_headers and mod_expires in Apache.
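For what it's worth, the net effect of an Expires setup is just a date some fixed time in the future plus a matching Cache-Control max-age. A rough sketch of the arithmetic (Python, purely illustrative of roughly what an "access plus 1 week" mod_expires setting works out to):

import time
from email.utils import formatdate

# Roughly what "ExpiresDefault access plus 1 week" produces: an Expires date
# one week after the request, plus a matching max-age. In practice mod_expires
# and mod_headers generate these for you; this is only an illustration.
ONE_WEEK = 7 * 24 * 60 * 60
now = time.time()
print("Expires:", formatdate(now + ONE_WEEK, usegmt=True))
print("Cache-Control: public, max-age=" + str(ONE_WEEK))
print("Last-Modified:", formatdate(now, usegmt=True))  # normally the file's mtime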

Why does it happen sporadically? Because like Windows crashes, it is a software bug (my opinion).

Jim

truth_speak




msg:163804
 3:09 am on Sep 14, 2003 (gmt 0)

jdMorgan,

Thank you for your response. I will trouble shoot my server codes along the lines that you suggest with a tech friend in the near future.

plasma




msg:163805
 2:20 pm on Sep 14, 2003 (gmt 0)

BTW:

link:www.yourdomain.com
and
link:yourdomain.com

differ dramatically. You should try to have all your backlinks point to the same FQDN.
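One common way to keep everything on a single FQDN is a permanent (301) redirect from the bare domain to the www form (or the other way around). In practice that's a couple of lines of server config; purely as a toy illustration of the idea, here is a Python sketch (the hostname is a placeholder):

from http.server import BaseHTTPRequestHandler, HTTPServer

CANONICAL = "www.example.com"  # placeholder hostname

class RedirectToCanonical(BaseHTTPRequestHandler):
    def do_GET(self):
        # Requests that arrive for any other Host get bounced (301) to the
        # canonical www form, so links accumulate on one FQDN.
        host = self.headers.get("Host", "")
        if host.split(":")[0] != CANONICAL:
            self.send_response(301)
            self.send_header("Location", "http://" + CANONICAL + self.path)
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"canonical host\n")

if __name__ == "__main__":
    HTTPServer(("", 8080), RedirectToCanonical).serve_forever()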

jimbeetle




msg:163806
 3:36 pm on Sep 14, 2003 (gmt 0)

jdMorgan,

Thanks for the under-the-hood insights.

Looks like my server is not configured to return Expires or cache-control headers. Would configuring these either a) alleviate the Google robots.txt fetching, or b) simply be best practice?

Yidaki




msg:163807
 3:46 pm on Sep 14, 2003 (gmt 0)

jimbeetle, sending an appropriate Expires header at least saves you bandwidth. Google wouldn't fetch your pages again until they're changed/updated/expired.
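Part of the saving also comes from conditional requests: once a crawler has seen a Last-Modified date, it can send If-Modified-Since on the next visit, and an unchanged page comes back as a bodyless 304. A rough sketch of that exchange (Python; the host is a placeholder, and the date is just borrowed from the headers quoted earlier in the thread):

import http.client

# Simulate a crawler's revisit: send If-Modified-Since with the date seen last
# time. If the page hasn't changed, the server should answer 304 Not Modified
# with no body. Host, path and date are placeholders.
conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)
conn.request("GET", "/Oceanography.html",
             headers={"If-Modified-Since": "Sun, 10 Aug 2003 16:32:52 GMT"})
resp = conn.getresponse()
print(resp.status, resp.reason)              # 304 means "use your cached copy"
print("bytes transferred:", len(resp.read()))
conn.close()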

Best practice, imho.

plasma




msg:163808
 3:49 pm on Sep 14, 2003 (gmt 0)

It is best practice and could speed up browsing a lot.
I suggest you test your site at:

[ircache.net...]

It'll tell you if you're cool or not :)

jimbeetle




msg:163809
 5:14 pm on Sep 14, 2003 (gmt 0)

Thanks folks. If it's best practice then that's what I'll do.

Jim

pendanticist




msg:163810
 7:17 pm on Sep 14, 2003 (gmt 0)

Allow me to add my Thanks. :)

I find it interesting that none of the machines (IP numbers) that were summarily requesting robots.txt yesterday are crawling today.

Those machines were:
64.68.86.79 - - [12/Sep/2003:21:57:25 -0700] "GET /Blahblah.html HTTP/1.0" 200 5239 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.54 - - [12/Sep/2003:21:57:34 -0700] "GET /Blahblah.html HTTP/1.0" 200 15997 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.86.9 - - [12/Sep/2003:21:57:37 -0700] "GET /Blahblah.html HTTP/1.0" 200 17064 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
64.68.87.43 - - [12/Sep/2003:21:58:19 -0700] "GET /Blahblah.html HTTP/1.0" 200 9960 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"

While I will certainly take steps to improve cacheability, I'm not altogether sold on this being the true/total root cause, given today's flawless activity by other machines.

Perhaps GoogleGuy'll grace us with some additional info on these respective machines' past behaviours?

I should point out that all the machines roving my files today are very well mannered, requesting them at between six (6) and fifteen (15) second intervals.

Pendanticist.

markus007




msg:163811
 7:38 pm on Sep 14, 2003 (gmt 0)

Looks like there is an update on -cw, www2, and www3. I'm seeing major changes in SERPs for my categories, plus 30% more pages indexed. I doubt it's freshbot, cuz -ex has fresh tags from yesterday and the results are different.

twilight47




msg:163812
 3:21 am on Sep 15, 2003 (gmt 0)

FWIW, I see no major changes currently on my end. I did see some different search results totals for www2 & www3.

benc007




msg:163813
 9:37 pm on Sep 15, 2003 (gmt 0)

Taxpod and Pendanticist,

What are you using to track GoogleBot?

pendanticist




msg:163814
 9:47 pm on Sep 15, 2003 (gmt 0)

I manually inspect my access_log files daily.
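When the file gets big, a throwaway script along these lines does the same tally without the eyestrain. Only a sketch; it assumes the standard combined log format and a file named access_log in the current directory:

import re
from collections import Counter

# Tally Googlebot requests per IP, and how many of those were for robots.txt.
# Assumes combined log format and a file named "access_log".
line_re = re.compile(r'^(\S+) .* "GET (\S+) HTTP')
hits, robots = Counter(), Counter()
with open("access_log") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        m = line_re.match(line)
        if not m:
            continue
        ip, path = m.groups()
        hits[ip] += 1
        if path == "/robots.txt":
            robots[ip] += 1
for ip in sorted(hits):
    print(ip + ":", hits[ip], "requests,", robots[ip], "for robots.txt")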

Pendanticist.
