Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Pages Dropping Out of Big Daddy Index

         

GoogleGuy

6:11 am on Apr 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Continued from: [webmasterworld.com...]


One thing to bear in mind is that Bigdaddy will have different crawl priorities. That can account for some of it. If you've run into any spam problems in the past, you might also want to do a reinclusion request. Otherwise, please send an email to bostonpubcon2006 at gmail.com with the subject line "crawlpages" (all one word), and I'll ask someone to see if they notice any commonalities.

RichTC

12:37 am on Apr 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Anyone have experience of this:

1. Link to a new page added to the sitemap - Google caches the sitemap but doesn't follow the listed link and cache the new page off it. Why might Google cache a page yet ignore the content on it?

2. A PR5 page with lots of links to it from authority sites (three PR7s link to it, for a start). The page is about blue widgets and should rank high for the blue-widget search term, yet another page about blue widgets on the site (not as detailed, and with zero backlinks to it) features at the top of the SERPs rather than the more specific one that has loads of authority backlinks.

Any ideas? Please post - I'm at a loss trying to second-guess this.

Stefan

1:47 am on Apr 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



... unless Google has raised their unique content filter to "must have at least 90% unique content" I don't have an explanation.

If it has become 90%, is it that unreasonable? It would be a very effective way of cleaning out pages that aren't genuinely unique. I'm not saying it's the problem, of course...

Apart from that, and although I know I really shouldn't refer to an earlier post of mine in the previous instalment of this thread: if my rather callous comment on the absence of a certain member resulted in a very valuable contribution soon afterwards, I don't regret having written it, even though, once again, my sense of discretion was entirely lacking. So it goes. It's good to see him onboard when people are desperate.

pixellion

1:56 am on Apr 27, 2006 (gmt 0)

10+ Year Member



I understand that the new crawling system is in a transition phase right now, but in the future how does Google plan on handling indexing where the Mediapartners bot is allowed but Googlebot is blocked? I posted this message on Matt Cutts' blog, but he removed my post from the comments. I don't understand why he did that.

For instance, my robots.txt looks like:

User-agent: *
Allow: /

User-agent: Mediapartners-Google
Allow: /

User-agent: Googlebot
Disallow: /

pixellion

2:38 am on Apr 27, 2006 (gmt 0)

10+ Year Member



Just a follow up on my previous post:

I posted two messages on MC's blog earlier this week, but he removed them because my questions were very much related to black-hat SEO. I guess black hats aren't supposed to post on his blog, haha. Anyhow, here's my thought on this: if they're going to have a proxy, it had better OBEY the rules specified in my robots.txt. Out of fear (yeah, of Google banning my AdSense account), I don't allow Googlebot to crawl my site, but I definitely allow Mediapartners-Google in so it can display ads. I only allow non-Googlebot bots to enter my site, including Mediapartners. I'm fine with Mediapartners doing its work as long as it STRICTLY follows my robots.txt rules. I have a few concerns over this issue:

If I don't allow Googlebot in, then there's a possibility that my site's content never gets fetched at all - and fetched content is exactly what AdSense needs in order to display relevant ads. Consider the following scenario: Googlebot comes in and my site says, "Googlebot, I won't let you in, but I do allow Mediapartners in," so Googlebot goes back and tells the proxy that the site denied it. Now, when I ask AdSense to display ads, AdSense checks with the proxy to see if the content for that URL is already there. In this case, the proxy says, "No, the site denied entry." If AdSense relies entirely on the proxy, it will tell me the same thing - "Sorry bud, I don't have any ads for your page since you didn't let me (all services/the proxy) in" - and it will display public service ads. But the proxy could instead say, "Hey AdSense, the site denied Googlebot, but its robots.txt allows Mediapartners, so go there again, see if you can fetch anything, and if you can, show relevant ads." In that case, once Mediapartners gets the page content, I EXPECT (and do not merely hope) the proxy to mark that page as "Mediapartners-only" and not show it in Google's search index.

I also asked Matt Cutts whether they are going to consolidate how the user agents are identified while crawling. I read somewhere that Google UAs are now identified as "Mozilla... something". If this is going to be their ultimate setup, they had better give us some documentation on allowing/disallowing Google-service-specific bots, on how they plan to obey robots.txt, AND on whether or not they are going to strictly follow robots.txt for all subdomains/domains across the planet. They will undoubtedly save tons of bandwidth with the new proxy system, but they will need more computational power, which I don't think is a problem for them.

I used the following robots.txt in my scenario above:

#BOF#

User-agent: *
Allow: /

User-agent: Mediapartners-Google
Allow: /

User-agent: Googlebot
Disallow: /

#EOF#

What is your take on this?

Thanks for your input and sorry for the long post.
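
For anyone who wants to sanity-check exactly what that robots.txt allows and blocks, here is a quick sketch using Python's standard urllib.robotparser module. The example.com domain is just a placeholder; only the rules themselves matter.

# Sanity check of the robots.txt above with Python's standard library parser.
# The example.com URL is a placeholder; only the rules themselves matter here.
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Allow: /

User-agent: Mediapartners-Google
Allow: /

User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

for agent in ("Googlebot", "Mediapartners-Google", "SomeOtherBot"):
    allowed = parser.can_fetch(agent, "http://www.example.com/any-page.html")
    print(f"{agent:<22} allowed: {allowed}")

# Expected:
#   Googlebot              allowed: False
#   Mediapartners-Google   allowed: True
#   SomeOtherBot           allowed: True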

Nabeel

5:32 am on Apr 27, 2006 (gmt 0)

10+ Year Member



Dropped from 9,000 pages to 74 pages for my main website.

All of my other website pages also dropped. I write articles & my content is unique.

SteveJohnston

8:33 am on Apr 27, 2006 (gmt 0)

10+ Year Member



> different crawl priorities

This explains what I have been witnessing on a very small, very niche, single product book site a client of mine has.

Just before BigDaddy we uploaded 300 pages of unique images from the book complete with captions. This had the desired effect with all 300 being sucked up by Google and traffic to the site multiplying nicely - bear in mind the site only had 15 pages before that.

BigDaddy then happened and all 300 new pages simply disappeared. Just today, I notice four of the image pages have found their way back in.

This is also consistent with a comment Matt Cutts made recently about crawl depth being a factor of PR; the healthier your link development, the more appetite G has for your pages.

So the crawl behaviour appears to be more discerning now as a consequence of these changes. Which makes sense from Google's perspective, even if it is frustrating in the meantime. So, back we go to making sure our sites are loved by other sites :-)

Oh, and RichTC, I reckon your first point is exactly the issue MC was referring to about crawl depth and PR: it doesn't matter how much content you make available to Google, its appetite to crawl it is independent of knowing it exists, and this includes any data you give it by way of Google Sitemaps. It didn't use to be like this.

Steve

john alphaone

9:18 am on Apr 27, 2006 (gmt 0)

10+ Year Member



Matt's exact words were "One of the classic crawling strategies that Google has used is the amount of PageRank on your pages."

Note "classic" and "has used". My feeling is that BD has changed the rules and other trust-based factors are dictating crawl depth. Still comes back to good quality linking strategies though.

idolw

9:40 am on Apr 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For fearful reasons (Yeah, Google banning my Adsense account), I don't allow Googlebot to crawl my site, but I definitely allow Mediapartners-Google to display ads

pixellion, if I were Google I would definitely stop displaying ads on your site and make it obligatory to let in both Googlebot and the Mediapartners bot.

Why?
Because AdSense is not a never-ending money river. This money comes from somewhere, and this 'somewhere' is most probably other people who pay for advertising. If AdSense quality decreases even more, Google will lose its revenue stream.
So it is natural that they won't display proper ads on your site and similar ones.

In fact, as an advertiser, I am happy about it.

ClintFC

9:47 am on Apr 27, 2006 (gmt 0)

10+ Year Member



The notion that it would be "understandable" from Google's perspective to remove any pages that are 90% non-unique content is ridiculous. People tend to have tunnel vision when it comes to deciding what constitutes Spam. I've said it before, but Google themselves are 99.99999% non-unique, scraped duplicate content. By many definitions I have heard discussed here they are therefore the Web's Uber-Spammers.

Or... if we (and hopefully Google) could just be sensible about this for a second, we could recognise that "unique" is NOT a magic good-versus-evil metric. Any search service, including Google and a growing number of vertical (meta-)search services, will by definition consist largely of duplicate content. Clearly, these services are not spam.

Is a website providing a technical service (such as a sophisticated vertical search engine) really to be considered less valuable than a website comprising a bunch of "original" text?

Anyway. I don't think uniqueness has anything to do with the bug we are seeing. Anyone else started to leak pages again? We are starting to lose most of the new pages that were added in the last week or so.

mattg3

11:05 am on Apr 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is also consistent with a comment Matt Cutts made recently about crawl depth being a factor of PR

Could be true. A PR6 homepage lost 50% of its pages; a PR5, 90%.

gford

12:16 pm on Apr 27, 2006 (gmt 0)

10+ Year Member



Nah. I have a PR6 that lost >90% of its pages.

SteveJohnston

12:28 pm on Apr 27, 2006 (gmt 0)

10+ Year Member



> the healthier your link development, the more appetite G has for your pages

I said this because I believe it to be as important as your actual PR value. If your links are increasing regularly, then Google will be more interested in crawling.

But then I didn't mean to imply that this was going to be that simple to work out, just that if they are ramping up this kind of contingency, then we will see some significant changes on sites which, amongst other things, have stale links.

Steve

cgchris99

12:34 pm on Apr 27, 2006 (gmt 0)

10+ Year Member



I thought I was going to recover last night. I did a site:mydomainhere.com and it showed only 14 results. Granted, there are a lot more pages, but I didn't see any supplementals. I thought maybe it was reindexing.

This morning I see 4,380 results, and from page 7 to the end of the Google listings it's all supplemental results.

prieshach

1:32 pm on Apr 27, 2006 (gmt 0)

10+ Year Member



10,000 plus now down to 162 and still dropping. Doesn't seem to matter whether they have any similarity or not.

dramstore

2:07 pm on Apr 27, 2006 (gmt 0)

10+ Year Member



I seem to have stabilized at around 300 (out of over 100K).
Looking at what's left, I still can't figure out any logic behind this.
It does, on the whole, appear to be level-related and also duplicate-page-related, but I can find examples which disprove both.

Still being crawled furiously, too.

graywolf

2:20 pm on Apr 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm seeing pages drop and 301s being handled poorly, with content being attributed to the wrong site.

tigger

2:25 pm on Apr 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> the healthier your link development, the more appetite G has for your pages

So how can new sites ever get off the ground, with the fear of the sandbox and now poor crawling due to low LP? Plus, this doesn't explain how PR4/5 sites have dropped thousands of pages.

jwc2349

3:12 pm on Apr 27, 2006 (gmt 0)

10+ Year Member



My problem is different from the fluctuating number of pages in BD. My problem is the inability to get back into Google's good graces after apparently triggering a duplicate content penalty. My site hasn't been banned, just moved from the first page of the SERPs to pages 3-4. As a result, traffic is down 99% since mid-December.

Prior to March 9, I had 42,500 pages in BD - all of them with caches from July and August 2005. I contacted Google and received a reply within 3 hours notifying me that they were referring my email to the engineering team. Within a week, the number of pages with fresh caches went to 400,000. But still the same rank.

Then I found that the same programming firm that triggered the dup content penalty had incorrectly used 302s instead of 301s. I cleaned that up a month ago. Still the same rankings.

I have read that Google penalties last from 1-6 months. I certainly hope so, since my pain has been almost unbearable for 4 1/2 months.

Any ideas/suggestions?
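
One quick sanity check for anyone in the same boat: confirm that the cleaned-up redirects really do return a 301 and not a 302. A minimal sketch in Python, with a placeholder URL:

# Report the raw status code of a redirect without following it.
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, *args, **kwargs):
        return None  # don't follow redirects, just report them

opener = urllib.request.build_opener(NoRedirect)
try:
    resp = opener.open("http://example.com/old-page.html", timeout=10)
    print("No redirect; status", resp.status)
except urllib.error.HTTPError as err:
    # 301 = permanent move; 302 = temporary, which is what caused trouble here.
    print(err.code, "->", err.headers.get("Location"))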

lgn1

3:30 pm on Apr 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



When you are talking about 90% unique content to avoid duplicate content, does this exclude HTML tags, scripts, and layout?

I have a common design across my site, and the common elements, tables, etc. would easily account for 20% of each page.

I could understand 90% unique text, but not 90% total content.

ichthyous

3:45 pm on Apr 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have seen my indexed pages go up and down like a rollercoaster over the last month... two days ago the count dropped again, and what pages are left are mostly listed as supplemental. The supplementals are old cached crap from 2004/2005 that Google is dredging up somehow... why? My site is ecommerce, and I have resigned myself to the fact that I will have to ditch the current site and start over with a more search-engine-friendly format. The truth is nobody really knows what Google's new index will end up looking like.

Freedom

3:48 pm on Apr 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thinking out loud:

Hardly a year goes by that Google doesn't have 3 or 4 algo fiascos that obliterate webmasters in a zero-sum game. Over the last 3 years, if I can get 2 to 3 months of good rankings from Google, then that's about all I can expect from them. If I don't get hit by an algo change, then I get hit by a FUBAR they can't fix, and on average my pages are in the Google toilet 9 or 10 months out of the year.

With one exception: I have 1 website that has "authority" status in Google [because of all the .gov, .edu, .org and newspapers, etc. that link to it]. With that site, I can get in the top 10 or 20 for just about any term I want - just because it has "authority" status. But since it's in just one niche area, from a professional standpoint I'm limited on what I can add to it.

Google SEO can be boiled down to that one fact: get a website with "authority" status and you can then milk Google like one used to be able to when all you had to do was get hundreds of non-relevant links - which worked from 1999 through Oct. 2003.

Sadly, that's all you need to know about SEO for Google. In my sarcastic opinion (IMSO)

But with my other sites, 2 or 3 months out of the year of mediocre to good rankings is about all I can expect with Google. Their FUBARs have gotten so bad and so commonplace that my attitude is now: if I write this content or build this new website, can Ask, MSN, and Yahoo give me enough traffic to make it worth it?

jsavvy293

4:16 pm on Apr 27, 2006 (gmt 0)

10+ Year Member



Has anyone noticed a crazy influx of pages in G, or just pages dropping out? I know of one site that has about 3 million actual pages, but site:www.domain.com sees it as having 29.8 million pages indexed.

ulysee

4:22 pm on Apr 27, 2006 (gmt 0)

10+ Year Member



Freedom, that would explain why authority domain redirect/spam techniques have taken over the "adult" sector.

wheelie34

6:58 pm on Apr 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A friend of mine has a site that's 50% .html and 50% .php? URLs. He has worked out that all the pages that gradually disappeared went over a period of time - the same time span, to this day, as I have read here. When he told me about his findings I told him it's the Big Daddy shuffle and fired more questions at him.

I checked his site: he has a 301 to www in place, has no supplemental pages, and has a PR5 (for what it's worth). He has always checked to see when new pages get into G using the site: command - has done for the last 3 years, he says - and he swears that those pages were in G for at least 3 months and had PR.

He now wishes he had known what I told him, so he could have looked for which pages vanished and when, as his downsizing happened in definite stages, not the odd one or two pages. He remembers thinking, when he did site:, that instead of being up or down by 1 or 2 it seemed as though "folders were falling off a shelf" - his words.

So the question is: has Google lost the ability to crawl .php? URLs when it could and did in the past, or is Google forcing the rewrite rule?

tigger

8:09 pm on Apr 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>So the question is, has Google lost the ability to crawl php?

It's not just PHP. My site that has dropped 70% of its pages is all HTML, and I've had pages up for weeks that are still sitting uncrawled. Fortunately MSN & Yahoo have crawled these pages a good few times and they are now ranking well - so come on G, sort your damn act out!

Stefan

3:58 am on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



When you are talking about 90% unique content to avoid duplicate content, does this exclude HTML tags, scripts and layout.

I was thinking of it as just the text content, not all the tags etc. Again though, I don't know if this is a factor in things. Really, I don't know what the heck is doing it, although part of it is canonical, no doubt. But I do know that BD not only didn't affect me, it helped (just like Florida, Jagger, and the rest). I'm not trying to crow about it, truly (I've read tens of thousands of update posts since I joined), and granted my site is niche (although the canonical problems were taken care of 3 years ago), but I rock on some very competitive searches, and some of the less successful of them went from #2-3 to #1 in BD. So it's not a total systemic problem, and not random. It's site-specific. Maybe it's a temporary crawling/indexing glitch triggered by certain factors on a site, which will soon be corrected, and all those missing pages will return - or maybe not (I haven't seen any reports from those who were affected by the sandbox suddenly telling us they're at number one on their intended SERPs, and the Florida casualties never really bounced back). If it's a "not", some folks are going to have to rethink their game plans and move on.
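
Purely to illustrate what "just the text content" might mean - and this is not a claim about how Google actually detects duplicate content - here is a rough sketch that strips a page's markup and compares only the remaining words of two pages using word shingles and Jaccard overlap. The two pages are made up.

# Hypothetical sketch: compare only the visible text of two pages,
# ignoring the shared HTML template, via word shingles and Jaccard overlap.
# This is NOT Google's algorithm, just one common way to measure text overlap.
import re

def text_only(html: str) -> str:
    """Strip tags and scripts, keeping only the words a visitor would read."""
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)  # drop scripts/CSS
    html = re.sub(r"(?s)<[^>]+>", " ", html)                   # drop remaining tags
    return re.sub(r"\s+", " ", html).strip().lower()

def shingles(text: str, k: int = 5) -> set:
    """All runs of k consecutive words."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def similarity(html_a: str, html_b: str) -> float:
    a, b = shingles(text_only(html_a)), shingles(text_only(html_b))
    return len(a & b) / len(a | b) if (a | b) else 0.0

page_a = "<html><body><h1>Blue widgets</h1><p>Our blue widgets are hand made in small batches and ship worldwide.</p></body></html>"
page_b = "<html><body><h1>Blue widgets</h1><p>Our blue widgets are hand made in small batches and ship to Europe only.</p></body></html>"

print(f"text-only similarity: {similarity(page_a, page_b):.0%}")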

texasville

4:13 am on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't think it's site-specific or anything else. I think it is a meltdown in the DCs. I think Google lost huge chunks of data. I think it is one huge earthquake moving through their datacenters, and they don't know how to stop it, and they started this new bot thing as a smokescreen.
I think Google has blown a fuse!

tigger

6:11 am on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I hope you're right, texasville, and so much does seem to point towards the theory that "Google is broken". But as we are all running around in the dark trying to work out why pages have been dropped and what's stopping G doing deep crawls, we have to keep examining different ideas - it's just that I've run out of them now.

kiwiwm

7:11 am on Apr 28, 2006 (gmt 0)

10+ Year Member



I started a site last July and spent about 6 months in "the sandbox, aka the pit of despair". Finally the happy day came when typing my domain into Google actually produced results (this was after Google had used 40 MB or more of bandwidth every month for 6 months crawling the site, but hadn't included a single page). Fairly quickly after that, level 2 and 3 pages started making it into Google and I got a PageRank of 2 (not great, but a big improvement on 0).

About a week or so ago I noticed results dropping; all I had was the main page and 58 "supplementals", which were all pages from July/August 2005 (they had been cached back then but just never included in the index). Anyway, as of today all the supplementals are gone and a site:xyz.com search produces 1 result - my main page - and it's not the text from my main page showing in the description, but the description from DMOZ and the Google Directory. So basically I'm back in "sandbox purgatory", or possibly even worse, plain old "I don't really exist hell".

The thing is, I can't work out why for the life of me. It's a non-profit fansite for an actor who is a household name. The content is original - in fact, so original that I can honestly say there is a lot of material on the site which can't be found anywhere else on the web, because we actually scanned it or wrote it. We have inbound links from PR9s like IMDb and DMOZ, as well as lots of 6s and 7s. We have won awards for being a great site for content. We are on page 1 of both Yahoo and MSN for searches on our main keywords - we are, if not the best, certainly in the top 2 sites for our content type, which is why we have plenty of inbounds from people's blogs, LiveJournals, forums, etc. to our main page and also to deeper pages. I just wanted to post that I concur with the view that Google may be broken in some way.

What may be the problem in our case is that almost all of our pages are noarchive - because we have had problems with plagiarism, we have a few IP ranges banned and we don't want the site cached. It is interesting that after BD came in, the only pages that showed for our site were those from before we made the pages noarchive; that's why we had a bunch of supplementals cached in July/August 2005. I tried adding a sitemap hoping it might help - interestingly, it came up and said pages were "partially indexed", referring to the fact that they are noarchive, and then in a further explanation said "we try to fully index pages". So I wonder: does BD react differently to noarchive? Are we being penalised? Do you no longer have the ability to opt out of the cache and still appear in the index? Does anyone have any idea, or has anyone experienced problems with noarchive pages?

I would appreciate any feedback if anyone has experienced anything similar. Thanks.
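
If it helps anyone in a similar spot, here is a small self-audit sketch that reports which of your own pages carry a noarchive or noindex robots meta tag. The URLs are placeholders - substitute your own - and it assumes the meta tag is written with name before content.

# Hypothetical self-audit: report which of your own pages carry a
# "noarchive" or "noindex" robots meta tag. URLs below are placeholders.
import re
import urllib.request

PAGES = [
    "http://www.example.com/",
    "http://www.example.com/gallery.html",
    "http://www.example.com/articles/interview.html",
]

META_RE = re.compile(
    r'(?is)<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']'
)

for url in PAGES:
    try:
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except Exception as exc:
        print(f"{url}  -- fetch failed: {exc}")
        continue
    match = META_RE.search(html)
    directives = match.group(1).lower() if match else "(no robots meta tag)"
    print(f"{url}  -- robots meta: {directives}")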

tedster

7:39 am on Apr 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



An excellent and detailed post, kiwiwm. If you feel so inclined, pass that detail on to the Google team, via the email address that GoogleGuy gave in the first post of this thread. They need to understand what's happening and find the common factors between the sites that are hurting.

Yes, you do have company -- and no, it doesn't seem to make much sense.
