Forum Moderators: Robert Charlton & goodroi


Mozilla Googlebot and the New Index at 64.233.179.104

Moved on from Jagger


Dayo_UK

9:58 am on Dec 13, 2005 (gmt 0)



OK - Jagger is over - long live "Big Daddy" - as named by MC for the test DC.

The index growing on 64.233.179.104 does seem to be largely a Mozilla-Googlebot-generated index - and this new index is being built for the future - so can we say Mozilla Googlebot is now taking over from the normal Googlebot?

OK, ignore supplementals etc. for a moment - as all DCs have this problem - and have a look at the cache dates for pages that are indexed... some of these pages have only been fetched by Mozilla Googlebot (even on the same day as the normal Googlebot visited).

E.g. on the test DC I have a homepage cached 30th November at 5:40 - fetched by Mozilla Googlebot - while on the other DCs it is cached on 30th November at 3:40 - fetched by the normal Googlebot.
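For anyone wanting to check this in their own logs, the two bots are easiest to tell apart by user-agent string. A minimal sketch in Python, assuming the two user-agent strings commonly reported by webmasters at the time (an assumption on my part, not official Google documentation):

```python
def classify_googlebot(user_agent: str) -> str:
    """Label a log hit as mozilla-googlebot, classic-googlebot, or other.

    Assumed user-agent formats (as widely reported, not confirmed by Google):
      classic: Googlebot/2.1 (+http://www.google.com/bot.html)
      mozilla: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
    """
    if "Googlebot" not in user_agent:
        return "other"
    if user_agent.startswith("Mozilla/5.0"):
        return "mozilla-googlebot"
    return "classic-googlebot"

# Hypothetical log entries to illustrate the split:
hits = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
]
for ua in hits:
    print(classify_googlebot(ua))
```

Grepping each bot's fetches separately this way makes it easy to match cache dates against which bot last visited a page.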

So in many ways this does look like building a whole new index parallel to the existing one - largely from Mozilla Googlebot crawl data.

Some pages appear very old - e.g. another page is cached on the test DC on 6th November, but on the other DCs it has a December cache date. Checking the logs, 6th November was the last time Mozilla Googlebot visited this page.

OK - there are pages in the test DC only visited by normal Googlebot - however, pages crawled by Mozilla Googlebot do not appear on other DCs.

The newest pages on the DC crawled by Mozilla Googlebot seem to be from November - e.g. no pages crawled by Mozilla Googlebot in December have made it into the index yet.

Some pages crawled by Mozilla Googlebot in November have not made it into the index - so I don't know if G are working with a sample of the data...

For confirmation that this is a whole new build of the index, MC said on his blog:

"the test data center certainly has some different crawling and indexing characteristics."

OK - folks, remember also that MC said this index will roll out over months and is in a test state, so I guess there's no need for early panic stations or slagging off Google in this thread.

Now 301s, 302s, canonicals - for me, Google has crawled and indexed a lot more 301s correctly. 302s - still lots in the index (mainly supplementals), but I'm not seeing any new 302s that show the URL of the linking site with the content of the destination site (the newest I see are from about August 2005) - no doubt others may find some.

What other observations have people seen with the new crawling and indexing on this test DC?

NoLimits

5:44 pm on Dec 22, 2005 (gmt 0)

10+ Year Member



In response to Dayo -

My site went missing (URL only, then supplemental) at the earliest stages of Jagger.

I really hope I make it through the filter this time around... I've been good. I added meta info to my pages... I removed duplicate words in titles.

Formerly:
"Domain.com - Article Name"

Currently:
"Article Name"

I just don't know what G wants from me in terms of getting these site penalties removed.

I will be crossing my fingers and watching this Test DC far more than I should be. Hoping for the best - and the best of luck to you all as well.

taps

6:27 pm on Dec 22, 2005 (gmt 0)

10+ Year Member



My pagecount went down from 150,000 (two days ago) to 139,000 (yesterday) to 136,000 (today). I consider this a good sign - something under 80,000 pages would be realistic for my site. (My site was hit by a dupe penalty on Sep 22nd.)

However, there is still some old crap in the results with cache dates from Nov 2004.

NoLimits

8:22 pm on Dec 22, 2005 (gmt 0)

10+ Year Member



I wonder what filters are already applied to this test index.

Out of the 60 or so well-paying, high-traffic articles I have (well... pre-Jagger they were, anyways) - 1 of them is showing up supplemental.

I see this as a good thing - as the article is a far more competitive sub-topic of my niche than what I typically go for. I wrote it, threw it up... thought it sounded a bit spammy, but left it anyways. Sure enough... supplemental.

I wonder what filter is causing this. It seems that some filters are already in place in the test DC... but which ones?

speedshopping

9:10 pm on Dec 22, 2005 (gmt 0)

10+ Year Member



I currently have a site with 445,000 pages in the datacenters, but when I look at the test DC it says I have just 650 pages in the index.

Is there something I am missing or should I expect a massive reduction in traffic?

Any help would be appreciated.

Powdork

9:50 pm on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is there something I am missing or should I expect a massive reduction in traffic?

Are you getting traffic from Google to those 444,350 pages at this point? If so, then having those pages out of the index would be a problem. Whether being out of the test DC now will ever translate into being out of the index when it goes live is hard to tell.

speedshopping

9:57 pm on Dec 22, 2005 (gmt 0)

10+ Year Member



Thanks for responding - we are getting traffic from those other pages which causes a little concern when looking at this Test DC.

Could Google really get rid of so many pages, or does the Test DC have a limit on the total number of pages in its index?

RichTC

10:50 pm on Dec 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is it just me, or do many of you get these emails suggesting three-way link exchanges to boost certain sites' positions in Google?

Two sites I know of have directory sites that request links back to the main site. They will offer things like two or three links for one, etc.

The main site cleans up with loads of one-way links to internal pages etc., and the two sites in question doing this are racing to the top of the SERPs for almost every related keyword in their commercial sector.

In effect it's like a link-farm type operation, but Google can't spot it. I read Matt's blog, and they claim they can spot link farms and link-buying structures of this type, but the evidence indicates to me that sites using this method are cleaning up - well, the two currently actively doing this are.

Meanwhile, early indications at 179.104 show the two sites in question at #1 for everything, so clearly crime does pay!

cws3di

10:53 pm on Dec 22, 2005 (gmt 0)

10+ Year Member




IMO, it is possible there are two very different dup content filters/penalties:

1. Similar content within the site
Many here have reported large sites with template pages - although each page may contain something unique, like a product description, the template content and the meta tags are very similar and outweigh the unique content on the page. These sites are reporting that many of their pages are going Supplemental.

2. External websites with "duplicate" content, such as scrapers or content thieves. The more I look around and read reports that this content still exists around the web, hmmm. I think these pages may not be going supplemental, but instead get an overall site penalty applied near the end of the re-indexing. That may be why scrapers show up and last a while: it takes a bit of time for the algo to spider, re-spider and double-check (however it makes the judgement), but eventually it catches up. I haven't been able to figure out whether the site which originally published the content possibly gets penalized in the end too.
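On point 1, one naive way to picture a within-site dup filter is as an overlap measure: when the shared template text outweighs each page's unique content, a simple shingle comparison flags the pages as near-duplicates. A toy sketch - my illustration only, not Google's actual algorithm, with made-up page text:

```python
def shingles(text: str, k: int = 3) -> set:
    """Break text into overlapping k-word shingles (here k=3)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of two pages' shingle sets: 0.0 disjoint, 1.0 identical."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical template-driven pages: big shared boilerplate, tiny unique part.
template = "widgets direct your trusted source for widgets buy now free shipping on all orders"
page_a = template + " red widget model a sturdy steel"
page_b = template + " blue widget model b light alloy"
print(similarity(page_a, page_b))
```

With the short unique descriptions above, the overlap score lands around 0.5 despite the products being different - which is the sort of situation posters here describe sending template pages supplemental.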


Dayo_UK

9:33 am on Dec 31, 2005 (gmt 0)



Ok - Test DC data is still not live.

In the other thread on the test DC update it was mentioned that Mozilla Googlebot is an anti spam bot.

I would have agreed with that statement a few months ago - however the fact remains that Mozilla Googlebot is adding pages to the index on the test DC.

So perhaps the purpose of the bot has changed over the last few months - the crawling activity has changed and the bot seems to be crawling in a more hierarchical manner.

zeus

10:19 am on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Dayo - right on the money. Mozilla is not just a spam bot, and yes, it has been crawling like the real Googlebot for the last month or so.

reseller

4:31 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Dayo_UK

>>I would have agreed with that statement a few months ago - however the fact remains that Mozilla Googlebot is adding pages to the index on the test DC.<<

But the questions are still:

- Mostly it's "unhealthy sites" that Mozilla Googlebot, or what I call the "spam detective", visits.

- I wonder whether those added pages you mentioned are ranking at all!

And of course: wish you Dayo_UK a happy "uncanonical" 2006 :-)

Dayo_UK

4:35 pm on Dec 31, 2005 (gmt 0)



Thanks Reseller :) - Happy New Year too.

They rank if the site is OK (e.g. not suffering from canonical problems).

I agree that Mozilla Googlebot might be looking at sites that have had problems (not necessarily spam - e.g. canonical problems, amongst other things).

BillyS

6:05 pm on Dec 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I tell you, that Bot is trouble... It may be playing multiple roles, but I just took a look at my logs to see if that sneak had been in lately. Sure enough it had requested this URL:

www.foo_foo.com/Help/

Well, I've got nothing on my website with that address and a quick check in Google indicated they haven't ever seen that URL either. So what is that nasty little pest checking for?!?!

Now I must admit that three months ago my website might have returned a 301 that pointed to the closest page on my site - but I've learned that is a Google no-no. Well, my server fed that insect with a 404 and it went away.

Now if it would only go back and give my site the clean bill of health that it deserves...

One thing I know is that 2006 has to be better than 2005 when it comes to me and Google. With 500 referrals a month (<-- no typo there) on average since October 1st, I don't think it can go downhill from there. That being said...

May you all be blessed with a prosperous 2006.

WW_Watcher

8:13 pm on Dec 31, 2005 (gmt 0)

10+ Year Member



A few of my thoughts - feel free to ponder them.

All of this is in my humble opinion (IMHO), not presented as fact but based on 15 years of experience working in network engineering & programming, combined with a bit of common sense.

Is it the Bot, Or The Algo?

What have the search engine bots been created for?

Would that be to crawl webpages, following links, and copying the contents of the pages back to somewhere where the Algo can then slice and dice, count links, anchor text, figure word density, word proximity, the content of tags, make a determination on what is spam, and what is not, and then make decisions on how to rank them for display in the Serps?

Or,

The Bot is written to make these decisions before the Algo even gets to look at it. (There is not enough bandwidth for this to even be a consideration.)

IMHO (notice I use IMHO)
If this is the new bot that Google is working on to crawl the web in the future, it is an attempt to be better, possibly more efficient, at copying the contents of webpages and following links - possibly following JavaScript to find redirects and such (and in that case it might make sense to not index, and even flag a page as no-index, for the redirects). It might even be an attempt to read and follow Flash files.

Back To Watching
Newbies, Do Not Confuse The Number Of Posts, With The Quality Of Posts.

WW_Watcher

anttiv

8:54 pm on Dec 31, 2005 (gmt 0)

10+ Year Member



One of my sites had a lot of Googlebot activity on 26th December. The next day the site was penalized for something and I saw a huge drop in traffic. Now almost all Googlebot activity is from the same IP address and yes, it's a Mozilla bot.

Luckily I have an idea of what the problem might be, because I can see that it's spidering certain parts of the site. These pages may have caused the penalty because they were PHP error pages. What is the best way to remove them? A noindex tag or a 404?
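For what it's worth, both options keep the error pages out of the index; the difference is whether the URL disappears entirely (a real 404 status) or stays reachable but excluded (a robots noindex meta tag). A hedged sketch of the two responses - a hypothetical handler of my own, shown in Python for brevity even though the site in question is PHP:

```python
def error_response(use_404: bool) -> tuple:
    """Return (status, body) for a failed page lookup.

    use_404=True : Option 1 - send a real 404 so bots drop the URL entirely.
    use_404=False: Option 2 - serve the page with a robots noindex meta tag,
                   so it stays reachable but is excluded from the index.
    """
    if use_404:
        return (404, "<html><body>Not found</body></html>")
    body = ('<html><head>'
            '<meta name="robots" content="noindex">'
            '</head><body>Something went wrong</body></html>')
    return (200, body)

status, body = error_response(use_404=True)
```

The key trap either way is an error page that renders with a 200 status and no noindex - that is exactly what gets crawled and indexed.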

moftary

2:32 am on Jan 1, 2006 (gmt 0)

10+ Year Member



I think when there is mozilla-googlebot-from-one-IP-address activity, then and only then, it is the spam detective - and that always results in a site penalty.

Dayo_UK

10:29 am on Jan 1, 2006 (gmt 0)



>>>I think when there is mozilla-googlebot-from-one-IP-address activity, then and only then, it is the spam detective - and that always results in a site penalty.

Even from the one IP it added pages to the test DC (still not showing test data) - although only a sample of pages in my case.

MC said the test DC has different indexing and crawling characteristics - pretty heavy hint there, IMO.

zeus

1:26 pm on Jan 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I thought I'd get back to the office and see NEW GOOGLE UPDATE, but no, still no update.

zeus

2:49 pm on Jan 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The test DC is on google.com now - maybe this is the beginning of a new update with a non-www fix.

taps

4:22 pm on Jan 1, 2006 (gmt 0)

10+ Year Member



I see test results on Google.de too.

Happy new year to everybody!

zeus

5:39 pm on Jan 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Now I see we are back to the old troubles - the non-www index on google.com.

SEOPTI

5:40 pm on Jan 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Back to test data.
It's switching non-stop.

zeus

5:48 pm on Jan 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



SEOPTI - I still see the old data, but yes, it does switch a lot. The question now, for those who had troubles with the non-www and are now seeing the www version again in the SERPs, is: will they get their ranking back?

I think they first place the partly fixed results in the SERPs and then make the real update to give those lost sites their PR and ranking again.

Dayo_UK

6:29 pm on Jan 1, 2006 (gmt 0)



Hi Zeus

Yes, that is the question - G may have fixed some canonical issues - but where has the rank gone for these pages, and when does it return?

PR is supposed to be updated continually behind the scenes (so I heard / it was quoted) - but sites that have had the canonical problem for a while seem to have had the PR soul ripped out of them.

Hopefully you are correct about the next stage.

mgpapas

7:52 pm on Jan 1, 2006 (gmt 0)

10+ Year Member



I think when there is mozilla-googlebot-from-one-IP-address activity, then and only then, it is the spam detective - and that always results in a site penalty.

I had a lot of strange activity which I'd never seen before on the 26th, including multiple page refreshes, a googlebot that was actually referred from some strange truncated URL, and exits to a non-existing page. The next day the penalty my site has had since September was lifted.

moftary

10:52 pm on Jan 1, 2006 (gmt 0)

10+ Year Member



Dayo_UK,

Are you sure that your pages indexed by the mozilla-one-IP-googlebot weren't indexed before by the old Googlebot?

It looks to me like the Mozilla Googlebots take another look at the pages indexed previously by regular Googlebots and determine their quality.

That's the scenario for me..

- Site is published
- Googlebots crawl the site heavily
- Site pages go into google index and serps
- Thousands of google referrals for a few weeks
- Mozilla googlebot crawl the site heavily again
- A penalty

colin_h

7:24 am on Jan 2, 2006 (gmt 0)



Hi Moftary,

I have seen a similar routine. I noticed that incoming links are not calculated until after the initial inclusion. 10 days or so later, when the incoming links are calculated... Smack!... Penalty time.

My main problem is that when sites are old and have been at the top of the SERPs for many years, they have a load of rubbish incoming links from webmasters who just publish Google results to boost their page relevancy. I assume Google thinks that I am in league with these sites and slaps me with the penalty... nothing could be further from the truth. I've never done mutual linking and never paid for hits, etc.

I think this is the link assassination technique that has been discussed in other forums... even if Google thinks it can't happen. I'm fully expecting to lose my PR again in a week or two.

Cheers

Col

Dayo_UK

9:24 am on Jan 2, 2006 (gmt 0)



moftary

I have a site that was launched a month or so ago and has only been deep crawled by Mozilla Googlebot - 100% certain of that.

Those pages are showing in the test DCs with the cache date of the Mozilla Googlebot crawl. Pages are not showing in any other dcs.

moftary

12:27 pm on Jan 2, 2006 (gmt 0)

10+ Year Member



Maybe the Mozilla Googlebots really are the future, as Matt says, but what I know for sure is that the undercover "spam detective" does exist and has the user-agent of the Mozilla Googlebot. That doesn't mean that all Mozilla Googlebots are spam detectives, of course, does it?

webvivre

5:07 pm on Jan 16, 2006 (gmt 0)

10+ Year Member



On Google.com we have a #3 page ranking for a property keyword, plus we have thousands of pages indexed on Big Daddy, but the domain name page is not present, i.e.

www.mydomainname.co.uk does not exist; there is no cache for this page

Has this page been "banned"? If yes, why are the inside pages ranking (and ranking well)?

Any ideas please....?

This 126-message thread spans 5 pages.