
Home / Forums Index / Google / Google News Archive

Google News Archive Forum

    
Return of Freshbot and Deepbot?
64.68.82.* and 64.68.81.* crawlers acting different
Seattle_SEM




msg:203207
 9:36 pm on Aug 18, 2004 (gmt 0)

I think this is interesting and important. If you review your log files, and look at the crawls from this IP range:

64.68.82.*
This is the "normal" fresh-bot. He comes and gets pages, and shoves them into the index within a day or two.

64.68.81.*
This is the crawler I do not understand. He is acting like the old "deepbot", requesting approximately 2X the pages of the "freshbot". Also, the content he has crawled, at least on a few of my sites, has not shown up in the live index.

Are we going back to regular updates from "deep" crawls?
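For anyone who wants to run the same comparison on their own logs, here is a minimal sketch that counts requests and distinct URLs per /24 prefix. It assumes the default Apache Common/Combined Log Format and IPv4 clients; the function name `hits_per_prefix` and the sample lines are made up for illustration, not taken from anyone's real logs.

```python
import re
from collections import Counter

# Start of a Common/Combined Log Format line: client IP, identity, user,
# timestamp, then the quoted request line. Only GET and HEAD are counted.
LOG_LINE = re.compile(
    r'^(\d+\.\d+\.\d+)\.\d+ \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (\S+)'
)

def hits_per_prefix(lines, prefixes=("64.68.81", "64.68.82")):
    """Return {prefix: (request_count, distinct_url_count)}."""
    requests = Counter()
    urls = {p: set() for p in prefixes}
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that are not in the assumed format
        prefix, path = m.groups()
        if prefix in urls:
            requests[prefix] += 1
            urls[prefix].add(path)
    return {p: (requests[p], len(urls[p])) for p in prefixes}

# Hypothetical sample lines, for illustration only.
sample = [
    '64.68.82.10 - - [18/Aug/2004:21:00:01 +0000] "GET /index.html HTTP/1.0" 200 5120',
    '64.68.81.36 - - [18/Aug/2004:21:05:12 +0000] "GET /deep/page1.html HTTP/1.1" 200 2048',
    '64.68.81.36 - - [18/Aug/2004:21:05:40 +0000] "GET /deep/page2.html HTTP/1.1" 200 2048',
    '10.0.0.5 - - [18/Aug/2004:21:06:00 +0000] "GET /index.html HTTP/1.1" 200 5120',
]
result = hits_per_prefix(sample)
print(result)
```

To use it on a real file, pass an open file object instead of `sample`, since the function only iterates over lines.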

 

tantalus




msg:203208
 10:12 am on Aug 19, 2004 (gmt 0)

I have to say that I find this significant too. I bumped another thread as I was interested in the lag between crawl and appearing in the index.

64.68.81.* appeared for me on the 10.8 and the pages are appearing in the index today, but only with the title.

I have to admit I'm a little all over the place, as I am still trying to recover from a server problem a month or two ago when I gave Gbot a lot of 500 status codes. If I am really honest, I don't know whether these pages were ever in the index in the first place. :0

64.68.82.* came by on the 17.8 and I'm waiting for those to appear. Unlike you, though, 82 did quite an extensive crawl whereas 81 just crawled a small fraction of pages.

Any ideas?

Liane




msg:203209
 11:16 am on Aug 19, 2004 (gmt 0)

I have been wondering the same thing myself ... though if we are going back to the old style update or dance, I see no proof of that.

I added 10 new pages last month. 4 of them got picked up, 6 didn't. The newest of the pages added was one of the first to be added to the index. Go figure. It's anyone's guess how this stuff works. I have no clue!

Seattle_SEM




msg:203210
 2:36 pm on Aug 19, 2004 (gmt 0)

tantalus,
I'm not sure what this means. It could be that the pages which 81* is crawling are "bad" (or "good") pages.

Who knows...either way, it's a very interesting data point.

Kbug44




msg:203211
 4:06 pm on Aug 19, 2004 (gmt 0)

I put up all new pages about a week and a half ago... Google came and visited each page (I have been checking the logs) but I have not seen the new URLs in the index. Normally, I have seen pages come up in the index within a few days to a week.

Has it taken longer with anyone else lately?

GranPops




msg:203212
 4:15 pm on Aug 19, 2004 (gmt 0)

Same here, but indexed within 36 hours

htohlsen




msg:203213
 4:56 pm on Aug 19, 2004 (gmt 0)

They haven't made it to my logs yet. :/

Is there some reasonably complete list of bots and their ip's anywhere?

bakedjake




msg:203214
 4:57 pm on Aug 19, 2004 (gmt 0)

Has it taken longer with anyone else lately?

It's been variable. Just went through a re-design on one site, and I'm seeing it take well over 20 days to get some of the new pages in. Very similar to what Liane is describing.

Some sites are popping in under 48 hours, though. I haven't figured out the pattern yet.

One thing that I've noticed, however, is odd spider behavior. Sometimes a spider will request 20 different pages from 20 different IP addresses, requesting robots.txt after every page request. Sometimes it'll request one random page, then won't come back for 30 hours.

It's very erratic right now.

webdude




msg:203215
 5:54 pm on Aug 19, 2004 (gmt 0)

I think various aspects of the bot are broken. Last month I noticed the bot's inability to follow 301 redirects. It kept crawling the old page, but would not follow the redirect. I have since pulled all redirects off my site and now the bot is starting to crawl normally, albeit rather sporadically and with long gaps between crawls. It has only been crawling one page per day, if that. There had been no deep crawl for over a month before I deleted the redirects. For a couple of days there, a deep crawl would start and, as soon as it hit a page with a redirect, stop and not come back for a day or two, and then try the process all over again.

I think there seems to be a problem, at least from what I am observing.

[edited by: webdude at 5:56 pm (utc) on Aug. 19, 2004]

tantalus




msg:203216
 5:54 pm on Aug 19, 2004 (gmt 0)

"It could be that the pages which 81* are crawling are "bad" (or "good") pages."

Mmm...I'm not sure either.

The pages that 64.68.81.* crawled were very deep (3 levels from the homepage) and there is a reasonable chance they were not crawled before.

There has been some discussion about Google potentially running out of DocIDs. I'm no DB wizard like some on here, but storing them with the page title and little else seems to me a very efficient way to store or allocate them.

I also noticed from the recent upheaval and discussions that some felt that large directory-like sites had been hit by the August update.

It would be great to hear from others who have been visited by 64.68.81.* and the result of the crawl.

I'm still not sure if I want to see 64.68.81.* again. One more point: 64.68.82.* never went anywhere near the pages that 64.68.81.* had crawled seven days earlier.
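The "82 never went near the pages 81 crawled" observation can be checked with a simple set comparison. This is only a sketch: it assumes you have already extracted (ip, path) pairs from your logs, and the function name `crawl_overlap` and the sample data are hypothetical.

```python
def crawl_overlap(hits, prefix_a="64.68.81.", prefix_b="64.68.82."):
    """Split crawled paths into (only_a, both, only_b) sets.

    `hits` is an iterable of (ip, path) pairs taken from an access log.
    """
    hits = list(hits)  # allow generators to be scanned twice
    a = {path for ip, path in hits if ip.startswith(prefix_a)}
    b = {path for ip, path in hits if ip.startswith(prefix_b)}
    return a - b, a & b, b - a

# Hypothetical data, for illustration only.
hits = [
    ("64.68.81.36", "/cat/sub/item1.html"),
    ("64.68.81.36", "/cat/sub/item2.html"),
    ("64.68.82.10", "/index.html"),
    ("64.68.82.11", "/cat/index.html"),
]
only_81, both, only_82 = crawl_overlap(hits)
print(len(only_81), len(both), len(only_82))
```

An empty `both` set over a week or more of logs would match what I described; a large one would suggest the two ranges are working from the same URL list.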

crobb305




msg:203217
 5:33 am on Aug 20, 2004 (gmt 0)

I think various aspects of the bot are broken

I agree various aspects of the bot may be broken. In mid-May, Google started displaying ALL pages of my two websites with old titles (from 6 months ago) despite showing the current cache. In other words, Googlebot seems to have difficulty detecting the current title tag. Furthermore, on my larger site (around 100 pages), the bot is not performing a complete crawl. All of my pages are displayed in Google with URL only, and the index page is no longer showing in the SERPs (after 4 years). I hope the old bot is coming back around so that some of these problems may be resolved (assuming they are being caused by a broken crawler).

C

why2kit




msg:203218
 1:08 am on Aug 21, 2004 (gmt 0)

I'm seeing 81 going for my "bad" pages, 301s etc., whereas 82 seems to be doing his regular crawl. Although, and now this is weird, I put up a new page today at about 15:00 on a site that is just content (and very good content if you're interested in the Norfolk Broads), and after I put a link to it from the root, 82 came by at 17:30 to read the page.

Seattle_SEM




msg:203219
 1:38 am on Aug 21, 2004 (gmt 0)

why2,
Do you mean "bad" or "old"?

'Tis my theory that the .81* crawler is doing a typical "deep" crawl, but working from an old data set.

why2kit




msg:203220
 2:20 am on Aug 21, 2004 (gmt 0)

Hmm, good point, to which I don't have an answer. These are "old" pages for data that no longer exists in the database, which then get 301'd to a db search to try and find a closer match. So they really are "bad" pages!

It could be that pages that have had a 301 in the past are being checked again to see if they will get the same response. I'll check through the logs tomorrow and see if that's the case.

axa504




msg:203221
 5:03 am on Aug 21, 2004 (gmt 0)

Some requests from 64.68.83.* (64.68.83.1, ...) and 64.68.81.* (64.68.81.36, ...) are using HTTP/1.1 and requesting compressed pages. Is this OK?
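For context, a client that sends `Accept-Encoding: gzip` is only advertising that it can accept a compressed response, which is ordinary HTTP content negotiation. Below is a rough sketch of what the server side of that looks like. The function name `encode_body` and the plain-dict headers are assumptions for illustration; a real server (for example Apache with a compression module) handles this for you, and this sketch ignores `q=0` weights and header-name case.

```python
import gzip

def encode_body(body, request_headers):
    """Return (payload, response_headers) for a given request.

    Compresses only when the client lists gzip in Accept-Encoding.
    Simplification: quality values such as "gzip;q=0" are not honoured.
    """
    accept = request_headers.get("Accept-Encoding", "")
    tokens = [t.split(";")[0].strip().lower() for t in accept.split(",")]
    if "gzip" in tokens:
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return gzip.compress(body), headers
    # No gzip support advertised: send the body unchanged.
    return body, {"Vary": "Accept-Encoding"}

page = b"<html><body>" + b"Norfolk Broads " * 200 + b"</body></html>"
packed, packed_headers = encode_body(page, {"Accept-Encoding": "gzip, deflate"})
plain, plain_headers = encode_body(page, {})
print(len(page), len(packed), packed_headers)
```

The crawler gets the same content either way; the compressed response just uses less bandwidth.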

xcomm




msg:203222
 6:02 am on Aug 21, 2004 (gmt 0)

I noted two different bots from 64.68.81.*

1) Cleaner 64.68.81.****
64.68.81.152, 64.68.81.155, 64.68.81.182, 64.68.81.194 and so on (64.68.81.xxx) are going after my 301 (also a 404) and hitting robots.txt a lot.
This naturally looks to me like some kind of cleaner bot from Google checking for this kind of stuff.

2) Mediapartners-Google/2.1 at 64.68.81.28 seems to be some AdSense bot here :-)

Servers for different tasks seem to be mixed up here in the same IP range (I've seen this mix in other very big farms too, sometimes for random reasons, from rack to rack).
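To separate the bots sharing one range, it helps to group the hits by user agent and look at the status codes and robots.txt fetches for each. A minimal sketch, assuming Apache Combined Log Format; the function name `profile_by_agent`, the sample lines, and the "Googlebot/2.1" agent string in them are illustrative assumptions (only Mediapartners-Google/2.1 is what I actually noted above).

```python
import re
from collections import Counter, defaultdict

# Combined Log Format: IP, identity, user, [time], "request", status,
# size, "referer", "user-agent".
COMBINED = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)

def profile_by_agent(lines, prefix="64.68.81."):
    """Return {user_agent: {status_or_'robots.txt': count}} for one range."""
    profile = defaultdict(Counter)
    for line in lines:
        m = COMBINED.match(line)
        if not m:
            continue  # not in the assumed format
        ip, path, status, agent = m.groups()
        if not ip.startswith(prefix):
            continue
        profile[agent][status] += 1
        if path == "/robots.txt":
            profile[agent]["robots.txt"] += 1
    return {agent: dict(counts) for agent, counts in profile.items()}

# Hypothetical sample lines, for illustration only.
sample_log = [
    '64.68.81.152 - - [21/Aug/2004:05:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1"',
    '64.68.81.152 - - [21/Aug/2004:05:00:02 +0000] "GET /old-page.html HTTP/1.1" 301 0 "-" "Googlebot/2.1"',
    '64.68.81.155 - - [21/Aug/2004:05:01:10 +0000] "GET /gone.html HTTP/1.1" 404 0 "-" "Googlebot/2.1"',
    '64.68.81.28 - - [21/Aug/2004:05:02:00 +0000] "GET /article.html HTTP/1.1" 200 4096 "-" "Mediapartners-Google/2.1"',
]
profiles = profile_by_agent(sample_log)
print(profiles)
```

An agent whose counts are mostly 301, 404 and robots.txt would fit the "cleaner" pattern, while one with plain 200s on content pages fits the AdSense bot.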

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved