I have to say that I find this significant too. I bumped another thread as I was interested in the lag between crawl and appearing in the index.
64.68.81.* came by for me on 10.8, and those pages are appearing in the index today, but only with the title.
I have to admit I'm a little all over the place, as I am still trying to recover from a server problem a month or two ago when I gave Gbot a lot of 500 status codes, and if I am really honest I don't know whether these pages were ever in the index in the first place. :0
64.68.82.* came by on 17.8 and I'm waiting for those to appear. Unlike you though, 82 did quite an extensive crawl whereas 81 just crawled a small fraction of pages.
I have been wondering the same thing myself ... though if we are going back to the old style update or dance, I see no proof of that.
I added 10 new pages last month. 4 of them got picked up, 6 didn't. The newest of the pages added was one of the first to be added to the index. Go figure? It's anyone's guess how this stuff works. I have no clue!
I'm not sure what this means. It could be that the pages which 81* are crawling are "bad" (or "good") pages.
Who knows...either way, it's a very interesting data point.
I put up all new pages about a week and a half ago... Google came and visited each page (I have been checking the logs) but I have not seen the new URLs in the index. Normally, I have seen pages come up in the index within a few days to a week.
Has it taken longer with anyone else lately?
Same here, but indexed within 36 hours
They haven't made it to my logs yet. :/
Is there a reasonably complete list of bots and their IPs anywhere?
|Has it taken longer with anyone else lately? |
It's been variable. Just went through a re-design on one site, and seeing it take well over 20 days to get some of the new pages in. Very similar to what Liane is describing.
Some sites are popping in under 48 hours, though. I haven't figured out the pattern yet.
One thing that I've noticed, however, is odd spider behavior. Sometimes a spider will request 20 different pages from 20 different IP addresses, requesting robots.txt after every page request. Sometimes it'll request one random page, then won't come back for 30 hours.
It's very erratic right now.
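If anyone wants to check for the robots.txt-after-every-page pattern in their own logs, here is a rough sketch of how it could be counted. It assumes the Apache/NCSA combined log format; the function name `robots_vs_pages` and the sample lines are made up for illustration and are not from anyone's real logs.

```python
import re
from collections import Counter

# Assumes Apache/NCSA combined log format; adjust the pattern for other servers.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" (\d{3})')


def robots_vs_pages(lines):
    """Return {ip: (robots_txt_hits, other_hits)} for every client IP seen."""
    robots, pages = Counter(), Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that do not look like access-log entries
        ip, path = m.group(1), m.group(2)
        if path == "/robots.txt":
            robots[ip] += 1
        else:
            pages[ip] += 1
    return {ip: (robots[ip], pages[ip]) for ip in set(robots) | set(pages)}


# Hypothetical sample lines (invented IPs within the ranges discussed here).
SAMPLE = [
    '64.68.82.20 - - [17/Aug/2004:10:00:00 +0000] "GET /page1.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1"',
    '64.68.82.20 - - [17/Aug/2004:10:00:02 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "Googlebot/2.1"',
    '64.68.82.21 - - [17/Aug/2004:10:00:05 +0000] "GET /page2.html HTTP/1.0" 200 4800 "-" "Googlebot/2.1"',
    '64.68.82.21 - - [17/Aug/2004:10:00:07 +0000] "GET /robots.txt HTTP/1.0" 200 120 "-" "Googlebot/2.1"',
]

for ip, (robots_hits, page_hits) in sorted(robots_vs_pages(SAMPLE).items()):
    print(ip, "robots.txt:", robots_hits, "pages:", page_hits)
```

An IP whose robots.txt count is about the same as its page count would match the behavior described above; on a real log you would pass the open file instead of `SAMPLE`.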
I think various aspects of the bot are broken. Last month I noticed the bot's inability to follow 301 redirects. It kept crawling the old page, but would not follow the redirect. I have since pulled all redirects off my site and now the bot is starting to crawl normally, albeit rather sporadically and with long gaps between crawls. It has only been crawling one page per day, if that. There had been no deep crawl for over a month before I deleted the redirects. For a couple of days there, a deep crawl would start and, as soon as it hit a page with a redirect, stop and not come back for a day or two, and then try the process all over again.
I think there seems to be a problem, at least from what I am observing.
[edited by: webdude at 5:56 pm (utc) on Aug. 19, 2004]
"It could be that the pages which 81* are crawling are "bad" (or "good") pages."
Mmm...I'm not sure either.
The pages that 64.68.81.* crawled were very deep (3 levels from the homepage) and there is a reasonable chance they were not crawled before.
There has been some discussion about Google potentially running out of DocIDs. I'm no DB wizard like some on here, but it seems to me a very efficient way to store or allocate them with the page title and little else.
I also noticed from the recent upheaval and discussions that some felt that large directory-like sites had been hit by the August update.
It would be great to hear from others who have been visited by 64.68.81.* and the result of the crawl.
I'm still not sure if I want to see 64.68.81.* again. One more point: 64.68.82.* never went anywhere near the pages that 64.68.81.* had crawled seven days earlier.
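For others who want to report back on what 64.68.81.* and 64.68.82.* fetched, a small sketch like the following could list the URLs each block requested and whether they overlap. It assumes the Apache/NCSA combined log format; `paths_by_block` and the sample entries are hypothetical, purely for illustration.

```python
import re
from collections import defaultdict

# Assumes Apache/NCSA combined log format; adjust the pattern for other servers.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+) [^"]*" (\d{3})')


def paths_by_block(lines, prefixes=("64.68.81.", "64.68.82.")):
    """Map each IP prefix to the set of paths requested from that block."""
    seen = defaultdict(set)
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # ignore anything that is not an access-log entry
        ip, path = m.group(1), m.group(2)
        for prefix in prefixes:
            if ip.startswith(prefix):
                seen[prefix].add(path)
    return seen


# Hypothetical sample lines (invented IPs and paths).
SAMPLE = [
    '64.68.81.10 - - [10/Aug/2004:09:00:00 +0000] "GET /a/b/deep1.html HTTP/1.1" 200 900 "-" "Googlebot/2.1"',
    '64.68.81.11 - - [10/Aug/2004:09:00:03 +0000] "GET /a/b/deep2.html HTTP/1.1" 200 950 "-" "Googlebot/2.1"',
    '64.68.82.20 - - [17/Aug/2004:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 5120 "-" "Googlebot/2.1"',
    '10.0.0.5 - - [17/Aug/2004:10:01:00 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "SomeBrowser"',
]

blocks = paths_by_block(SAMPLE)
overlap = blocks["64.68.81."] & blocks["64.68.82."]
print("crawled by both blocks:", sorted(overlap))
```

An empty overlap on a real log would correspond to what is described above, where 82 stayed away from the pages 81 had crawled.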
|I think various aspects of the bot are broken |
I agree various aspects of the bot may be broken. In mid-May, Google started displaying ALL pages of my two websites with old titles (from 6 months ago) despite showing a current cache. In other words, Googlebot seems to have difficulty detecting the current title tag. Furthermore, on my larger site (around 100 pages), the bot is not performing a complete crawl. All of my pages are displayed in Google with URL only, and the index page is no longer showing in the SERPs (after 4 years). I hope the old bot is coming back around so that some of these problems may be resolved (assuming they are being caused by a broken crawler).
I'm seeing 81 going for my "bad" pages, 301s etc., whereas 82 seems to be doing his regular crawl. Although, and now this is weird, I put up a new page today at about 15:00 on a site that is just content (and very good content if you're interested in the Norfolk Broads), and 82 came by at 17:30 to read the page after I put up a link from the root.
Do you mean "bad" or "old"?
'Tis my theory that the .81* crawler is doing a typical "deep" crawl, but working from an old data set.
Hmm, good point, to which I don't have an answer. These are "old" pages for data that no longer exists in the database, which then get 301'd to a db search to try and find a closer match. So they really are "bad" pages!
It could be that pages that have had a 301 in the past are being checked again to see if they will get the same response. I'll check through the logs tomorrow and see if that's the case.
Some from 64.68.83.* (18.104.22.168 , ...) and 64.68.81.* (22.214.171.124 , ...) are using HTTP/1.1 and request compressed pages. Is this OK?
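On the compression question: sending a gzip-encoded body to a client whose request carries `Accept-Encoding: gzip` is ordinary HTTP content negotiation. As a rough sketch of that decision (not how any particular server implements it), with the hypothetical helper `encode_body` and a deliberately simplified header parse that ignores q-values:

```python
import gzip


def encode_body(body: bytes, accept_encoding: str):
    """Return (payload, extra_headers) based on the request's Accept-Encoding.

    Simplified: q-values such as "gzip;q=0" are not honored here.
    """
    offered = {
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
    }
    if "gzip" in offered:
        headers = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return gzip.compress(body), headers
    return body, {}  # client did not offer gzip, so send the body as-is


page = b"<html><body>" + b"hello " * 200 + b"</body></html>"
payload, headers = encode_body(page, "gzip, deflate")
print(len(page), "bytes raw,", len(payload), "bytes sent,", headers)
```

A client that sends no `Accept-Encoding` header at all would get the uncompressed page from this sketch.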
I noted two different bots from 64.68.81.*
1) Cleaner 64.68.81.****
126.96.36.199, 188.8.131.52, 184.108.40.206, 220.127.116.11 and so on 64.68.81.xxx going after my 301 (also a 404) and stressing robots.txt a lot.
This naturally looks to me like some kind of cleaner bot from Google checking for this kind of stuff.
2) Mediapartners-Google/2.1 18.104.22.168 seems to be some AdSense bot here :-)
Servers for different tasks are mixed up here in the same IP range (I've seen this mix in other very big farms too, sometimes for random reasons, from rack to rack).