Forum Moderators: open
I uploaded two new sites at roughly the same time, in early April. One has been indexed well and is being recrawled regularly, with new pages getting indexed fairly often. The other one Google is pretty much ignoring. One difference I have noticed is that the well-indexed one is getting regular visits from both crawlerxx and crawlxx. Please correct me if I'm wrong, but I believe crawlerxx is freshdeepbot.
The one Google is pretty much ignoring is getting visits only from crawlxx (crawl31, specifically), and that may only be recent, since I wasn't paying much attention until now. It's not going any deeper than the index page either. It's showing a TB PR of 0, with 2 backlinks, both PR5, but I doubt it has any sort of penalty. TB PR doesn't seem very accurate or make much sense now anyway, so I'm not concerned about that. It gets absolutely no referrals from Google either, and all internal pages are greyed out.
Does anyone know what crawlxx's function is? Is it maybe just finding new sites/pages and then telling freshdeepbot to come and visit longer and deeper? Or am I reading too much into this? The ignored site doesn't have as many links yet; maybe that's the only reason crawlerxx hasn't visited and why it hasn't been crawled any deeper.
I hope this isn't sounding too much like a 'why isn't google indexing my site' post. I'm really more interested in the differences between the two googlebots.
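In case it helps anyone compare notes: a quick way to see which family is hitting a site is to reverse-resolve the log IPs and sort the hostnames. Here's just a rough sketch in Python; the crawlNN.googlebot.com / crawlerNN.googlebot.com hostname pattern is assumed from what's been posted in this thread, and the sample hostnames are made up.

```python
import re

# Assumed hostname pattern: crawl31.googlebot.com, crawler12.googlebot.com, etc.
BOT_RE = re.compile(r"^(crawler|crawl)\d+\.googlebot\.com$")

def classify(hostname):
    """Return 'crawler', 'crawl', or None for a reverse-resolved hostname."""
    m = BOT_RE.match(hostname)
    return m.group(1) if m else None

def tally(hostnames):
    """Count visits per bot family across a list of resolved hostnames."""
    counts = {"crawler": 0, "crawl": 0}
    for h in hostnames:
        family = classify(h)
        if family:
            counts[family] += 1
    return counts
```

Run a day's worth of resolved hostnames through tally() and you'd see at a glance whether a site is getting only the crawl* bots, as described above.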
With Google apparently moving into a new era of continuous updating, I wouldn't even hazard a guess at what the function of these specific bots is now. The reason you got no answer before may be that nobody is sure they know.
Pages crawled with crawlxx (64.68.82.xx) show up later without a freshtag.
My guess is that the update as we know it has been divided into more independent processes:
- freshbot as usual
- deepbot: pages added/updated more often (weekly?)
- backlink/PR recalculation as usual (every so many weeks)
- algo tweaks and spam filter runs when they feel like it
All these processes seem to occur independently nowadays.
We could still call the recalculation of backlinks and the algo changes 'updates'; I don't think just adding and refreshing pages deserves the name anymore :)
Since the last update was around the 15th, I think, I am expecting a PR recalculation and backlink update any day now.
To back up my authority on the matter: I am claiming the first post regarding continuous updates, way back on April 28: [webmasterworld.com...]
Sometimes even newbies can get it right.
/stupid gloating
Regarding updates, I did notice a little dance yesterday: one of my pages moved position and then moved back again. I have seen no changes in PR, but monthly updates seem a sensible conjecture.
Is there anyone else who has had only crawlerxx or only crawlxx at their site? That might help narrow down what each of them does. Crawlxx doesn't seem to be living up to its name (or former name) of deepbot, as it's not going deep into mine at all.
As HitProf says, crawlerxx still seems to behave like Freshbot, but it goes deeper, living up to its new name of Freshdeepbot.
I guess I'll just have to be patient and see what happens.
What I have noticed is that "crawl" rarely visits me now. It seems "crawler" is replacing "crawl".
I wish I could get crawlerxx to come by that one site of mine; it's still getting just crawlxx, and that's not going any deeper than the index page.
I'm convinced now that it's not been penalized, though. I did a search using a unique phrase from the site, and it did show up in Google. I'm just not getting any regular referrals from Google because the index page is too general; the internal pages would need to be indexed for people to find them in searches.
>>Do the http_accept headers you mentioned not always match up consistently with either crawlerxx or crawlxx?
Nothing is absolute with Googlebot. crawlxx primarily uses the "text/html,text/plain,application/*" accept header (it very rarely uses the "text/html,text/plain" accept header) but crawlerxx is more mixed and may use either header.
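A rough sketch of how you could check this pattern in your own logs, assuming you log the Accept header (Apache can capture it with %{Accept}i in a custom LogFormat) and have already reverse-resolved the hostnames. The sample records below are invented for illustration, shaped like the behaviour described above:

```python
from collections import Counter, defaultdict

def accept_profile(records):
    """Tally Accept headers per bot family from (hostname, accept) pairs."""
    profile = defaultdict(Counter)
    for host, accept in records:
        # crawl31.googlebot.com -> "crawl", crawler12.googlebot.com -> "crawler"
        family = host.split(".")[0].rstrip("0123456789")
        profile[family][accept] += 1
    return profile

# Invented sample records for illustration only.
sample = [
    ("crawl31.googlebot.com", "text/html,text/plain,application/*"),
    ("crawl31.googlebot.com", "text/html,text/plain,application/*"),
    ("crawler12.googlebot.com", "text/html,text/plain"),
    ("crawler12.googlebot.com", "text/html,text/plain,application/*"),
]
```

Running accept_profile() over a real log extract would show whether crawler* really does mix both headers while crawl* mostly sticks to one.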
They spidered every single page on the site, which had around 60 pages, some of which were 3 levels down in the directory structure. We do have a comprehensive site map linked from the home page, though. Oh, and all of our site is spider-friendly (no querystrings or session IDs, just plain old .html files, even though the site is all PHP+MySQL).
I guess this stands against the speculation that crawler* might be freshbot and crawl* deepbot or deepfreshbot, or whatever. I think Google's new generation crawlers are completely interchangeable: freshbot can become deepbot and vice versa, as needed.
In the following weeks I have been only seeing bots in the same IP range (crawler*.googlebot.com), requesting just the URLs linked to from the home page. The pages showed up in the index with a date tag by the next day or two.
Today I got my first hit from a 64.68.85.* bot (crawl*.googlebot.com), which requested /robots.txt and the home page at around 19:20 UTC.
The beginning of a deep crawl? I don't think so, even though I really hope our site gets deep-crawled again before the next update (we added about 450 new pages lately). ;-)