|Many Weeks since the Panda Update - Any Improvements?|
It has been 2 weeks now since Google's Farmer update on Feb. 24th. For the sites that were affected, has anyone seen any improvements? For my site, we started removing low-quality content a week ago, but haven't seen any ranking improvements so far.
|Google's programmers think about the next 10 years |
I meant they're all trying to think of the biggest, longest-term projects they can now, regardless of results, so they can all be retired happily by the time the system implodes and only links back to http://www.google.com/ for every result! lol
Thanks helpnow. At least it's everywhere and not just me.
Helpnow, there are about 3200 pages on the site. Google had ~2700 in its index, but after my no-indexing thin pages, that's dropped to ~2400.
So far I've only changed several dozen, as I'm writing them carefully and adding as much additional information as possible.
I have looked for dupe content, and found pages that were scraped. I've re-written many of them, but will need to do a lot of work to find all of them. I did searches in quotes for sentences from my pages.
Many of the widget pages have information that would appear on the manufacturer's site under Specifications, and would also appear on other sites that feature the widgets. I'm trying to make some changes to the specs, but if something is 6" long, it's 6" long.
On the main page I was talking about in my last post, I added some performance charts that users of the widgets would find useful (in fact, some had asked for them). These don't exist anywhere, but I have some fairly pricey software for creating these performance charts for my own widgets. So I created four different performance charts, and put one at the bottom of the page in question, and then put links on that page to another page of performance charts.
@dickbaker, how did you check the number of pages? site:example.com command? I already blocked so many pages, but Google is still showing 10 times more pages using this command.
@dickbaker 10-4. I wasn't sure how you'd respond, I didn't know your numbers or situation, but it sounds like you and I may be in the "same group."
So, you decided 300 pages were thin content, and dumped them. Are there more to go? You said you noindexed some, and you are down to 2400. Anyway, already, 300 on 2700 is just over 10%. That's a lot. (Not being mean, ;), I had more.) And before that, you already had 500 pages that google said no thanks to.
On the dupe content, it can be tricky. I have some awesome pages, that truly beat the crap out of everyone, and I used to be #1/#2/#3. Now, few of them are above #15. That's not quite true, I still have some #1s, but that's getting to the bottom of the barrel 3-word phrases. So, I am able to say "Some of my pages are better than anything else, and I lost ranking!", so, I know this is more than a page-by-page thing. Over the years, I've had numerous issues with dupe content, and it has been the same every time - the whole site suffers, deal with the issues (usually dumping tons of pages that shouldn't have gotten into the index in the first place), and in a couple weeks, ranking is back. Usually takes 1-3 weeks, with 10 days being the average to recovery. A repeated experience, I don't know how many times, over the past 6 years.
Over the past few years, more and more people out there have been gaming the system, scraping to grab content to drop onto MFA (made-for-AdSense) sites. Over time, I have been scraped A LOT, more than I realized. Hard to discover when you have thousands of pages like we do, and everything seems fine otherwise (when all is well, you don't spend much time looking for problems).
What I see now is massive dupe content. I've been massively scraped. And, I must confess, I've done my fair share of scraping, back in the early 2000s when scraping didn't even have a name, it was simply a reasonable way to get manufacturer's info, etc. etc. onto your site. As I describe this, I always envision a scale, with my original content on one side, and my off-site/on-site dupe content on the other side. It gets to a point where the dupe content tips the scale. It's not about your awesome unique pages. It's about your dupe content. That's my operating premise right now. And I compare this to some of my other sites, where this is not an issue, and they are dominating the SERPs right now.
I understand your issue about 6" long is 6" long. We have the same issue. Not too many ways to throw a thesaurus at that and rewrite it. Some thin content, and some dupe content you will simply have to live with. But I am willing - to - bet - anything, that it has to do with the ratio of dupe:original content.
Honestly, I don't think working on your home page will do it. I think you need to look under the rocks on your site, those #*$!ty back room pages you've almost forgotten about. I bet your home page is / was fine. (In fact, screwing around with it too much may send the wrong signal, dunno, that's just a thought, don't put much weight on this remark, but I can imagine it doing harm, and I can't imagine it doing much good, I bet your home page was fine before and after. I may be wrong, but that would be my first bet.)
So, I'm back to your dupe content, and this is where I get excited on my own behalf because I wonder how much we share the same situation... We already know you've had some really thin content (100 pages of thin content are pretty-much-the-same as 100 pages of dupe content, as they all look the same to a bot... ; ) ) In fact, your numbers are probably at least close to 25% thin content (read: dupe content), right? 800/3200. For me, thin content = dupe content.
Soooo... The rest of the pages, the 2400 that remain. Take a small cross-section, as representative as possible: popular pages, pages you forgot about, etc. Maybe take 10-20. Count the # of sentences, check each sentence to see if it is a dupe at google. And, do you come at the top of the list or at the end of the list for each sentence you search? (Top, good; bottom, bad). Anyway, I am curious to know what kind of numbers you end up with. Of 20 representative pages, how many had 0 dupe content, how many had some, how many were all? And if you took a simple one-dimensional stat: Page 1, 10 sentences, 6 dupe content, 4 original -> Page 1 original score 40%, etc. etc. and averaged the 20 pages, where would you sit? I know I am boiling this down to one variable, but I bet the result may be startling.
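The one-variable stat described above is easy to automate once you've tallied the dupe checks by hand. A minimal sketch (the function name and the page tallies are my own, for illustration):

```python
# Sketch of the "originality score" stat described above.
# For each page you record: (sentences checked, sentences found
# duplicated in a quoted Google search).

def originality_score(total_sentences, dupe_sentences):
    """Percentage of a page's checked sentences that appear original."""
    if total_sentences == 0:
        return 0.0
    return 100.0 * (total_sentences - dupe_sentences) / total_sentences

# Example from the post: 10 sentences, 6 duplicated -> 40% original.
pages = [(10, 6), (12, 0), (8, 8)]          # hypothetical tallies
scores = [originality_score(t, d) for t, d in pages]
site_average = sum(scores) / len(scores)
```

Averaging across your 10-20 representative pages gives one number you can compare between an affected and an unaffected site.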
For me, this is the simplistic analysis that I have been able to use so far to easily separate sites into 2 groups, affected and not affected. Notwithstanding all sorts of other issues / variables / whitelisting etc. etc. that I can quickly give an arm-waving case-by-case explanation on why it is extraneous. But when I compare sites that seem to be all at the "same level", the above hypothesis seems to stand.
And, when I am pragmatic about it, I can see why google wants dupe content out ($), and why they may even "penalize" a site with a wake-up shot across the bow to get your attention to help them stamp out dupe content. And if a site doesn't get the message, and they slowly wither away and die, well, sorry but, good riddance, we're cleanin' up the SERPs. ; )
Do any of you guys have affiliate links? If so, do you house them in a redirect file blocked by robots.txt? I see three sites that were hit, and they all mask the aff. links in php redirect files, blocked by robots.txt. In fact, one of the sites that was hit is mine, and another nearly identical site of mine was not hit, but I don't block access to the redirect file on it. I doubt this is an issue, but it did catch my attention.
@crobb305 Yes, my affected site had adsense and amazon. My unaffected site, no. But let me qualify: on my affected site, we only popped up adsense and amazon for products which were discontinued. But still, over the years, we haven't added many new products, and products do get discontinued all the time. And for sure, our adsense and amazon was above the fold. I moved that all down below the fold now. Truth is, we don't display the adsense/amazon to googlebot - the only thing we cloak.
|we don't display the adsense/amazon to googlebot |
I find it plausible that Google is taking issue with any type of ads being blocked to Googlebot. As I said, one of my affiliate sites was unaffected (in fact has gained position), but it doesn't hide the affiliate links from Googlebot, unlike my other site which was heavily penalized (both sites house the affiliate links in a redirect file). In fact, just checking my WMT on the penalized site, I see the "restricted by robots.txt" errors mounting, as Googlebot keeps trying to access those links, as recently as yesterday.
Coincidence or a quality-detection issue? In an article I read recently, someone at Google made a specific reference to robots.txt. Don't block access so they can "know about the content" (approximate quote). It may be worth allowing Google to "see" the ads. Surviving sites in my niche, who also have affiliate links, do not block Googlebot access (as far as I can tell).
Like others here, we don't block, we just use User-Agent to decide what to send back to the browser. We use this to intercept calls to forms, etc. as well. This is a legacy thing, years ago, we had issues with bots clicking on ads, our internal search engine, etc. etc. so it just became part of our MO to cloak ads and forms. Does that concern still apply in 2011? Is it an issue? Dunno. There is a positive correlation there, between adsense+amazon and my affected/not affected sites, but my brain insists it is a red herring.
I see. I misunderstood. Thought you were saying that you denied access via robots.txt, I missed the "cloaking" part.
I still may test removing the restriction. I doubt it will do any good, but I am trying everything like you are. It is an odd coincidence though that 3/4 sites I have looked at are denying access to affiliate links, whereas the 4th site that doesn't block access actually gained position.
[edited by: crobb305 at 9:18 pm (utc) on Mar 20, 2011]
crobb, I've got 2 sites, both do the same things with affiliate links. Both are blocked. One hit, one did not get hit. The one that got hit is 3x bigger than the one that was not affected. Same exact model -- difference is in link structures, and duplication issues for me.
Masking the url or blocking means nothing since Google still spiders them. I did a site: for a competitor that went way up and /visit.php?4444 for example was shown as a link with the target site's title. All blocked by robots.
I deleted all my tags and about 30% of my other pages, yet my traffic went down, even for pages I ranked better. Either Google has turned the knob toward more Panda or the changes haven't yet been calculated. Otherwise it makes no sense.
One thing I noted: noindex means jack to Google even if they access it several times over 2 weeks. I see those pages on my SERPs, so I turned to removing them in Webmaster Central and blocking them with robots.txt. Thanks for wasting my time Google, I have hundreds of pages that must be removed by hand, as they are not in a unique directory.
I also get bursts of traffic from Google, short but when they happen I know there's a new set of Serps out there.
@Walkman, I think you're right (about Google spidering blocked/noindex urls). The statement I was referencing, and the basis for my speculation, is in this article [searchengineland.com...] "She (Google’s Maile Ohye) recommends (noindex) over blocking via robots.txt so that search engines can know the pages exist and start building history for them..." The problem I have with this statement is that it implies that Google doesn't know about the existence of a page if it is blocked by robots.txt, but like you, I see my blocked urls indexed when I do a site: search (albeit with no title/desc).
Anyway, I am just studying the data, and offering up ideas, but ultimately I agree with you about what Googlebot spiders and knows about versus what they are SUPPOSED to spider/know about (i.e., robots.txt).
I still think there is a small chance that blocking affiliate links could affect the quality score, especially since, in that article, they addressed both robots.txt AND ad-to-content ratio. One reason I say this is because over the years, my affiliate links have come and gone, in and out of my redirect file, yet Googlebot still tries to crawl the dead ones. Googlebot doesn't seem to know they are 404, since I have the file blocked. Whereas I only have 5 or 6 active affiliate links in the file, Googlebot reports 28 restricted urls (which it probably "knows" are affiliate links, thereby inflating my ad-to-content ratio).
I'm using webmaster tools to tell me how many pages are in Google's index.
I haven't devoted an entire day to searching for dupe content; rather, I'm just taking the pages that were hit but are still in the top 30 to see if there's dupe content out there. The pages that really tanked need a different type of analysis. The few I checked were lifted verbatim from the mfr's sites, and so I noindexed them.
I find the idea a bit strange that Google may look with suspicion at sites that have new content after being demoted. Shouldn't Google look at sites where nothing is being done, and figure they're throwaway sites?
Lastly, I know there's all sorts of junk taking the places of our wonderful pages, but I found one that should get an award. It's one of the sites that is now on page one for the phrase that my #2 page used to rank for.
There's a couple of small crude graphics on the page. Other than that, it's a list of models of Acme Widgets, with prices. Just a list. Nothing else. Any links for more detailed information go straight to the manufacturer. There's nothing except model number and price. Oh, and a Contact Us link for a phone number for ordering instead of any sort of shopping cart. This site must be a relic from the 14.4 dialup days.
What are the odds that Google is penalizing for outbound links on a page? For instance, on my pages I have outbound links to Twitter, to Facebook, and a couple of links to mobile apps users might want to better view the site.
Could they be dinging on that? How many outbound links is too many?
If the search term is not competitive, you will not fall very much. Heck, you may still own #1. Focusing on the # of positions lost may be misleading. What I mean is, a page that may have only fallen 1 page is not necessarily any better than the page that fell 5 pages. Different SERPs. The page that fell 5 pages may be the better page, but swimming in a bigger pond, so the fall seemed more dramatic. i.e. it may not be true that "The pages that really tanked need a different type of analysis." My experience has been that it is not about the page, it is about the site. That 14.4 baud modem page may be a pile of crap, but the site it is on may be all unique content, albeit useless unique content. And as more of us figure this out and adapt and start to recover, that 14.4 baud page will slowly drift downwards back to where it belongs - but right now, for that SERP, it may simply be among the best pages (sorry, best page/site combos) according to what the algo now values.
That's my 5 cents worth. We have been conditioned for years to focus on the SERPs page by page. I think those days are gone. Now the page matters, and the overall site matters, and one cannot win without a strong showing by the other.
|That 14.4 baud modem page may be a pile of crap, but the site it is on may be all unique content, albeit useless unique content. |
That page is the site. :(
Well, there you have it. They aren't a content farm. :-)
Though seriously, I think there is actually some truth to that.
Just fyi, I dumped every single affiliate link on my site. Completely removed every page that had anything to do with affiliate products. Didn't change a thing (yet).
This looks more and more like a freezing of the SERPs/penalties for at least a month. Or maybe it takes Google time to really treat the pages as noindexed or 404, to avoid reacting to temporary server issues.
Some big sites have come back but not because of changes, it is impossible for such large sites to be re-indexed in a day or two, let's be honest.
So we're back at square one and probably doing 100 things so Google likes us, even if they make no sense for the users.
This algo is closed and frozen, there is nothing to see here. Solution: a Maldives vacation package, and forget about their dumb panda algo.
This algo can be gamed easily, therefore they need to freeze this sh*** so no one will play with it. Poor Google. There is no other reason. They need to protect their new multi-billion-dollar algo because they know a 10 year old can game this nonsense.
I will push a new site, 500K URLs, a few PR6-7 backlinks; we'll see what this childish algo will do with a completely new domain/site.
[edited by: SEOPTI at 2:29 am (utc) on Mar 21, 2011]
I ran a site: search, and discovered over 20 ancient/dead affiliate urls (masked in my redirects) still indexed. Some have been gone for over a year. WMT keeps telling me I have 30 restricted urls (all of which are affiliate links) but only 6 are active, the rest have long since been deleted from the redirect file. So, for the heck of it, I set up a 410 on all the dead ones, submitted through the removal tool, and removed the deny access from robots.txt. This way Googlebot can effectively assess my ad-to-content ratio. I don't have 28+ affiliate links as the site: search would suggest.
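For anyone wanting to do the same, a 410 for retired affiliate-redirect URLs can be set up with mod_rewrite's `G` (Gone) flag. A sketch assuming an Apache server and hypothetical file/ID names (the actual redirect script and affiliate IDs will differ per site):

```apache
# Hypothetical .htaccess rules: answer 410 Gone for retired
# affiliate-redirect URLs such as /visit.php?4444
RewriteEngine On
RewriteCond %{QUERY_STRING} ^(4444|4445|4446)$
RewriteRule ^visit\.php$ - [G]
```

Active affiliate IDs simply aren't listed in the RewriteCond, so they keep redirecting as before.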
walkman - Your use of the word "and" jumped out at me in the following, and you should be advised that meta robots noindex and robots.txt should not be used together....
|One thing I noted: noindex means jack to Google even if they access it several times over 2 weeks.I see those pages on my SERPS so I turned to removing them on Webmaster Central and blocking them with robots.txt. |
See my comments in my second post on this thread, along with various threads referenced about meta robots noindex and robots.txt....
Robots.txt blocking and Google's behavior
Essentially, robots.txt will keep Google from spidering your pages and seeing the meta robots noindex tag... so your pages won't be accumulating and recirculating PageRank as I suspect you've anticipated they will.
Robert, thanks for the explanation. I just saw another thread about robots.txt [webmasterworld.com...] . So, I've decided to remove disallow from robots.txt.
I already removed folder using GWT tools, I think it is OK if they show 404 errors.
I added noindex 3 weeks ago. No blocking of robots back then, and I pinged the pages several times. Google has gotten them at least 3-4 times, yet I still see them on some indexes. Maybe it's a flux, who knows.
So now I want to first delete /page.htm in Webmaster Central and THEN block it on robots.txt
|I just saw another thread about robots.txt |
That's the thread I'm talking about. ;)
Whoops! I meant to give this url [webmasterworld.com...] :)
@walkman: "So now I want to first delete /page.htm in Webmaster Central and THEN block it on robots.txt " If you try to delete in WMT, it will require that you block it first in robots.txt. They do that to ensure the person trying to remove the page actually has low-level access to the server.
1. Block it in robots.
2. Use WMT to remove it.
3. Within 24 hours, it'll be gone, then take it out of robots, of course ensuring that your noindex is in so the next time it gets crawled normally, it hits the noindex and doesn't go back in the SERPs.
This is the quickest way to get a page out of the SERPs and keep it out.
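The sequence above comes down to two small artifacts, sketched here with a hypothetical URL (the Disallow line is temporary; the meta tag is what keeps the page out once the block is lifted):

```
# 1) robots.txt -- temporary block so the WMT removal request is accepted:
User-agent: *
Disallow: /page.htm

# 2) Once the removal goes through, delete the Disallow line above and
#    leave this in the page's <head> so the next normal crawl sees it:
<meta name="robots" content="noindex">
```

The order matters: if the Disallow stays in place, Googlebot never re-crawls the page and never sees the noindex, which is the trap RobertSeviour describes.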
@helpnow How would one go about doing a comprehensive analysis that goes "sentence by sentence" to find duplicated sentence issues, finding sentences on your own site that appear elsewhere on your site or on somebody else's site?
I of course know that I can grab a sentence and plug it into Google with double quotes around it to see if somebody has scraped it. But doing that manually is not a viable solution for a big site.
I can probably write a program to go through every page on my site, grab one or more sentences, feed them into Google, scrape the results, and see if it finds multiple sites using the exact sentence. But in my experience Google quickly flags the IP address and doesn't allow you to continue to do automated searches. So, again, that solution is a dead end too.
So, for a big site with hundreds of thousands of pages, is there any way to do this? Any third-party tool or service?
It's hard to fix the patient when you can't do a thorough diagnosis. Any ideas would be appreciated. Thanks.
I'd just start with the URLs that lost the most search traffic. I'd bet within a short time, patterns will start to be clearer.
|So, for a big site with hundreds of thousands of pages, is there any way to do this? Any third-party tool or service? |
I'm sure a good coder could find a way to make use of one of these:
Of course, to do anything with either you basically have to get into writing a bot and parsing the information you get from other sites, which may be beyond many, but would probably be very enlightening for as many or more, even just to try to detect similar text you know exists on another site. To do it reliably you really have to get into how to extract the main text from the template, and that's definitely a challenge...
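On the template-extraction challenge: a crude first pass is to strip tags and keep only text runs long enough to look like body copy rather than navigation. Real extraction (text-density algorithms, etc.) is considerably more involved; this is just a naive sketch with made-up thresholds:

```python
import re

def extract_main_text(html, min_words=15):
    """Very naive boilerplate stripper: drop script/style and tags,
    then keep only text chunks long enough to look like body copy
    rather than nav links or template fragments."""
    html = re.sub(r'(?is)<(script|style).*?</\1>', ' ', html)
    text = re.sub(r'(?s)<[^>]+>', '\n', html)   # tags become breaks
    chunks = [c.strip() for c in text.split('\n')]
    return [c for c in chunks if len(c.split()) >= min_words]
```

Short strings like menu items fall below the word threshold and drop out, leaving the paragraphs you'd actually want to fingerprint.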