The site in question is a nationwide business directory, so it's not hard to have a large number of pages with unique content (each listing has a separate name, address, contact details and business type). The new site I mentioned in my previous post has gone from 49K to 77K, and seems to be climbing. This is also a business listings directory, BTW.
Fun huh!?
I've learnt from experience that sending on/posting URLs is not always wise, and I've had issues with scraper sites/click bombing from more immature visitors.
The site itself is of course database driven, and the data has been purchased in the same way you can buy databases of links, scripts etc. You could say that other sites would be using the same data as mine, and you'd probably be correct, which may lead to the assumption that I have a dup content penalty.
If this were the case, however, why does a search for 'blue widget x' bring back 500K results from different sites selling the same product? Why are people finding forums, one of the 'purest' forms of fresh content, being de-indexed? Why do newsgroup scraper sites, which have exactly the same content as Google Groups, still have millions of pages in the index? Why has a site I launched after the BD rollout not been affected by this de-indexing issue, or why is it even getting indexed at all?
I think it all points to a problem at Google's end rather than anything else. Unless someone can find a clear, concise explanation of what all the sites facing these issues have in common, I'm going to assume it's something else.
I've been graphing the Googlebot traffic to these sites, and with the changeover to the Mozilla 5.0 bot, there's a very dramatic drop off about the end of March or so in all of the sites. Where there used to be several hundred to several thousand pages spidered every day per site, now it's down to maybe 20 or 30 - and they are the same 20 or 30 every day.
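If anyone wants to pull the same numbers from their own logs, here's a quick-and-dirty sketch against a standard Apache combined-format log; the file path and user-agent match below are placeholders for whatever your own server records, not what I actually used:

import re
from collections import Counter
from datetime import datetime

LOG_FILE = "access.log"   # placeholder path to your combined-format access log
BOT_MARKER = "Googlebot"  # matches both the old Googlebot UA and the new Mozilla/5.0 (compatible; Googlebot/2.1; ...) UA

# Pull the date out of entries like: 66.249.x.x - - [25/Apr/2006:10:15:32 +0000] "GET / HTTP/1.1" ...
date_re = re.compile(r"\[(\d{2}/\w{3}/\d{4})")

hits_per_day = Counter()
with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
    for line in log:
        if BOT_MARKER in line:
            match = date_re.search(line)
            if match:
                day = datetime.strptime(match.group(1), "%d/%b/%Y").date()
                hits_per_day[day] += 1

for day, hits in sorted(hits_per_day.items()):
    print(f"{day}\t{hits}")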
Meanwhile, I've put up five new sites in the past 30 days, and submitted them; needless to say they are not listed nor has the Googlebot even come to see them. I know that there have been a lot of postings about Google sandboxing new sites, but I have never experienced it myself - every time I add a new site, it seems to get picked up within a week. No more. Two of the sites have multiple inbound links from news articles on the local major newspapers (Detroit Free Press and Detroit News - they've been around a while) and those ARTICLES showed up in Google within hours after they were posted, so you'd think that the sites would make it that way, but nope.
Meanwhile, a personal site I run that has a single database-driven page is just going like gangbusters - if anything, it's getting better SERPs and traffic than ever, and has gained 2 PR in a very short time.
It's a mystery. Hard to come up with explanations for my clients.
Thank you for taking the time to explain in more detail how one can end up with six-figure page counts.
I guess even at the age of 45, with 12 years experience on the Internet (but not in business on it), I have a lot to learn.
I don't even know what a scraper site is. I also didn't know that commercial databases were available for sale to the public.
On that note, in my innocence, with four new sites that HAD good rankings, I have lost everything in Google.
No tricks, just hard work over a six-month period copying data from a book because it wasn't available on the Internet (believe me, I looked day and night).
Someone proposed that commercial sites were getting hit harder than amateur ones... I beg to differ. With my obvious lack of experience with SEO, I know none of the neat tricks that come with such an education - yet I have been bombed by Google, and now I am right back where I was in December 2005, before the robots.txt said c'mon in.
I have 7,500 outbound links spread over four domains, lists of places to visit in the UK and Ireland. I have 5,000 hotel/B&B listings and over 1,200 high-quality photos of the UK, increasing every day. My OBLs have been described by one chap as link spam, yet they were individually hand-typed by me (selected, ironically enough, from searches on Google).
After all the work I have done and continue to do every day, I am receiving no benefit from being honest with unique content.
Thanks again for your time yandos. I am very grateful for this forum, I have learned a lot in the few days I have been here.
Cheerio.
Martin.
"be honest - if you have a million pages, and you're not Wikipedia, most of it was generated to increase SE traffic"site:www.webmasterworld.com
OK, point taken, some of you have a million organically created pages. And you think each and every one of them should be indexed, eh? Personally, as a user, I don't. I have to wade through that stuff every time I do a search.
I have much sympathy for those with sites that have hundreds of pages of real original content missing, but they're likely collateral damage from G trying to clean out the dross.
To echo some other post in one of these threads (I doubt I can find it to quote), this appears again and again - the algo changes (now it's an infrastructure change/whatever), and there's a flood of posts from people whose sites went missing. It's always in competitive fields, commercially and SEO-wise. It's seldom that niche sites go missing. But I'm to believe that Google is broken every time? Why does the line-up never change much for the research-site searches that I do? The best, most pertinent sites might shuffle positions slightly, but that's it. For more popular search terms, G, Y, and MSN are as bad as ever (more so, because of the MFAs these days).
That said, there have been serious problems with canonicalization for the last three years, not only with G, but with Y as well (it first happened to me in Jun 2003 and I fixed it then). A lot of damage has been done by that, people aren't aware of it, and when they do become aware of it, things are so messed up it takes ages to get it straightened out. That's always the first place to look for problems.
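If anyone wants to rule canonicalization in or out for their own site, a minimal check along these lines will show whether the www and non-www hostnames collapse to a single version with a 301 (example.com is just a placeholder domain):

import http.client

def first_response(host: str, path: str = "/") -> None:
    # One HEAD request, reported raw (redirects not followed), so you can see
    # whether and where the server 301s.
    conn = http.client.HTTPConnection(host, timeout=10)
    conn.request("HEAD", path)
    resp = conn.getresponse()
    location = resp.getheader("Location", "")
    print(f"http://{host}{path} -> {resp.status} {resp.reason} {location}")
    conn.close()

# One of these should permanently redirect (301) to the other; if both return
# 200, the site is serving the same content under two hostnames.
for host in ("example.com", "www.example.com"):
    first_response(host)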
Anyway, for those of you who are having problems because of fallout from G's ongoing battle with the spammers - best of luck.
Actually for one of my main keywords, the top three results are all owned by one person. They are different domain names but they are all the same site, so duplicate content. To make matters worse, they all have AdSense on them. I'm extremely annoyed that Google are favoring those sites over my legitimate site. Grr.
They will be re-indexed eventually, right?
EDIT: Could this: [tech.cybernetnews.com...] be a reason why our pages were removed?
Sure as hell hope so - I'm 30% down in traffic due to dropped/removed pages!
I can't see how it's the proposed new layout changes that are causing this; more likely (hopefully) it's a Big Daddy bug that has dropped data, and once a full crawl has been done the dropped pages will return - but like the rest, I'm just guessing.
My other sites that don't use Sitemaps still have the same pages indexed, with none of the old (non-existent) supplemental pages.
So I think Google Sitemaps is the problem. Does anyone else have the same experience?
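For anyone who hasn't tried Sitemaps, this is roughly all that gets submitted - a sketch that writes a bare-bones sitemap.xml (the URLs are made up, and I'm using the plain sitemaps.org namespace rather than anything site-specific):

from xml.sax.saxutils import escape

# Made-up URLs; in practice these would come straight out of the listings database
urls = [
    "http://www.example.com/",
    "http://www.example.com/listings/widget-shops/",
]

with open("sitemap.xml", "w", encoding="utf-8") as out:
    out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url in urls:
        out.write(f"  <url><loc>{escape(url)}</loc></url>\n")
    out.write("</urlset>\n")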
And what about Google's effort to index the whole web!? Why would they de-index thousands of pages? That would only put them behind their competitors Yahoo and MSN.
If I do a site: search, all the third-level pages are gone, yet the page count reports the right number of indexed pages.
If I do a site: search narrowed to a specific third-level page, it is found, but as a supplemental result.
So could it be that all our pages are in the supplemental index, and that the site: search simply doesn't list pages from the supplemental index?
And if so, is it a bug, or are our third-level pages condemned to live out their web lives in supplemental hell forever?
Thank you for your concern, I really was just kidding. One of the downfalls of a text-only forum is the inability to see the poster's expression - in this case, my eyebrows jumping up and down with a wicked grin on my face.
It is a kind warning from you, however, about copyright. Fortunately, Arbitrary, I have been involved in photography since 1972, as a young boy, and fully appreciate the implications of copying someone else's stuff. Akin to running through a pool of petrol with a match, eh?
I do believe that there are many ways to make money from websites that infringe upon the "dodgy" side of life. However I am not the type of soul to flat-out steal from others, or trick the engines. Sooner or later they will figure you out, take down your IP address, note your hosting Company and you are buried forever.
I'd prefer to have original content; let's face it, who wouldn't? Originality, however, takes a long, long time to produce, and this is where the "assistance" of bending the rules comes in.
I have to say that I am eternally grateful for finding this forum, for there has been no trickery in the willingness of folks to offer support and advice.
For that I thank you all, for your interest in setting a newbie to the World of SEO and SERPS on the right path.
My appreciation doesn't stop at the end of your posts. I remember your advice the following days and weeks, but mostly when I am just about to press that "put" button in Dreamweaver.
Thank you.
Martin.
I find it odd that just when Google is losing pages, Mediabot is spidering for the main index. Mediabot crawling is not about saving bandwidth. Yeah, BigDaddy is supposed to be complete, but I think they are having big problems with it. Problems like having pages disappear from the index and having Mediabot take up the slack.
When BigDaddy was complete, we were supposed to get a canonical fix too. Well, that hasn't happened.
Saving bandwidth with Mediabot - now that is a spin worthy of the political landscape.
This provides a useful perspective. Yahoo has managed to include much of my new forum in its index, but MSN and Google haven't yet.
It's been observed on WebmasterWorld that Google started Big Daddy with an old index. Even so, they are level pegging with MSN. Assuming that by changing my sites around I've put myself on a level playing field with the other large sites that have been lost here, I'd say two things: first, I'm not doing badly, and second, useful content will find its way back into Google once they've gotten around to indexing it.
The second thing is speculation, of course. But I'm convinced I'll be back with my new and very useful forum.
All had unique titles, keywords, etc., but clearly not enough.
I now have all pages re-indexed, by placing a whole heap of totally random #*$!e on the pages so they are unique. The content is total shite, and the pages are now less user-friendly and less relevant, but they are all back up there in the index.
Best of all? Most of my competition has been wiped out by Google, so all of a sudden I have the top search results for about 7,000 products. Thank you, Google. You are truly wise, keep on keeping on.
Google, it's stupid. You've wound it up too high. Good for me, sure, but it's wrong, wrong, wrong.
A product listing is not duplicate content just because the page it sits on is only 3-4% different from the other shopping pages on the site. It's NOT right to try and place all these products on one page; they won't fit, they need to go on separate pages.
If so, that would make a little sense, except mine do have unique content on them, apparently not enough.
Google is trying to cull content pages added for the purpose of bulking up sites.
E.g.:
Opening up your forums to search engines creates hundreds of thousands of pages, many of which are very similar.
Product pages: you may have unique titles, keywords, etc., but if your product descriptions are short, Google is questioning why the product deserves its own page with so little information.
Having been in this game for 10 years producing good, clean sites, it does seem the only way to play G's game is mass-produced spam on multiple sites - if that's the game G wants us to play, then I'm just about ready to start throwing up sites.
Tigger, don't do it. Regroup and figure things out, but don't move to the dark side - you know all that crap will be caught eventually, and then all you'll be working with is grab and dash.
Okay, are we talking about those who use templates that are the same throughout 100s or 1,000s of pages? You're saying Google's duplicate content filter is turned up so much that it's interpreting template based pages as too similar to others and is then dropping them?
That certainly looks like a possibility, judging by some of the posts. G prefers pages that are unique. Sites with many thousands of very similar pages might be getting flagged as spam.
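Pure speculation about how such a filter might score pages - nothing to do with Google's actual algorithm - but a crude word-shingle comparison like this sketch shows why two template-driven listing pages with only a line of unique detail each come out looking like near-duplicates:

def shingles(text, size=4):
    # Break the page's visible text into overlapping word shingles
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 0))}

def similarity(a, b):
    # Jaccard overlap of the two shingle sets: 0.0 = nothing shared, 1.0 = identical
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

TEMPLATE = (
    "Welcome to our nationwide directory of widget suppliers. Browse listings by "
    "county, read customer reviews, compare opening hours, and request a free quote "
    "from any supplier. All listings are checked for accuracy every month."
)

page_one = "Acme Widgets, 12 High Street, Anytown, 01234 567890. " + TEMPLATE
page_two = "Bloggs Widgets, 9 Low Road, Otherville, 09876 543210. " + TEMPLATE

# Scores far higher than two unrelated pages would, even though the listings
# describe different businesses
print(f"similarity: {similarity(page_one, page_two):.2f}")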
Product pages: you may have unique titles, keywords, etc., but if your product descriptions are short, Google is questioning why the product deserves its own page with so little information.
Good observation, nippi. The fact is that those pages don't deserve their own URL, they're only there to run up the page total (this approach used to work). I'm regularly amazed by the number of WW members who forget about G's ongoing war with spam/scraper/pseudo-directory sites and never factor this into their SEO methods. If you look like a spammer, even if you're not one, you can expect problems eventually.
Google is questioning why the product deserves its own page with so little information
How would a computer program actually know that FG-2323-FDS 3GB is a product? Maybe some Bayesian filter, but is that feasible across the entire web's content?
More likely, G has very little chance of distinguishing the product FG-2323-FDS 3GB from the random letter junk FH-2323-FDS 3FB.
Even a human would have to actually be told what is what.