Forum Moderators: open
I added some HTML ads (i.e. text ads), which I serve via my own JS/PHP script, and I wanted to avoid two things:
1. Google indexing the content of the HTML ad and letting it skew what it thinks the page is all about (this happened all the time when I tried to add an HTML ad on AdSense pages, which forced me to take off the other ads and leave only AdSense).
2. Googlebot "clicking" on the PHP link of the ad while trying to find new links/content.
So I created the ads in a separate document, which I include via an iframe, and put that document in a directory that I exclude via robots.txt:
<iframe src="http://www.domain.tld/ads-files/ads.html"></iframe>
and in robots.txt (with the User-agent line that every robots.txt record needs):
User-agent: *
Disallow: /ads-files/
That was a week ago (last Friday).
Meanwhile, Googlebot has downloaded over 1,000 pages with the new structure (i.e. with the ads iframe), but HAS DROPPED ALL OF THEM FROM ITS CACHE!
I've just checked about 30 random pages, all fetched by Googlebot several days ago, e.g. on 19-May-2004 (in my experience Google adds new docs to its cache very quickly), and THEY HAVE ALL BEEN DROPPED from the Google cache.
Referrals from Google have also dropped to a quarter of what they were a week ago. Presumably the rest of the pages will get dropped as soon as Google sees the iframe in them -> zero traffic from Google.
I have no doubt that Google thinks there's something seriously wrong and has penalised those pages. PageRank is not affected so far (the main page has a PR6).
Btw, this is an old site, established in 1995, with tons of "real" backlinks.
I've just un-blocked the HTML-ads iframe doc in robots.txt and added a NOINDEX,NOFOLLOW directive to the doc itself. Maybe this will make Google change its mind...
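For reference, the NOINDEX,NOFOLLOW directive is just the standard robots meta tag placed in the head of the iframe document; a minimal skeleton (the title and comment are filler, not my actual markup):

```html
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<title>Ads</title>
</head>
<body>
<!-- ad markup is written here by the JS -->
</body>
</html>
```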
So my Q is:
Can I exclude certain parts of a page's content (i.e. advertisements) from getting indexed, without the risk of being penalised? How?
It's for Google's OWN BENEFIT as far as I'm concerned, and I'm getting penalised for it? According to the "Web Spam (spamdexing) Taxonomy" paper from Stanford, search engines should appreciate it:
For instance, few sites serve to search engines a version of their pages that is free from navigational links, advertisements, and other visual elements related to the presentation, but not to the content. This kind of activity is welcome by the search engines, as it helps indexing the useful information.
I had thought of several ways to "hide" the HTML ad text from Googlebot:
1. SSI-customised content delivery (aka "cloaking" -- but of the "good" type, imo)
2. Put the HTML ads in an IFRAME excluded via robots.txt
3. Use client-side JavaScript (document.writeln etc.) to compose the ad on the client
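Option #3 would look roughly like this; a minimal sketch, assuming a hypothetical buildAd() helper and placeholder ad text/URL (none of this is from my actual pages):

```javascript
// Compose the ad markup entirely on the client, so a crawler that
// doesn't execute JavaScript never sees the ad text in the page source.
function buildAd(text, href) {
  // Returns the HTML fragment for a simple text ad.
  return '<div class="ad"><a href="' + href + '">' + text + '</a></div>';
}

// In the page, the browser would inject it at load time, e.g.:
// document.writeln(buildAd('Example ad text', '/ads-files/click.php?id=1'));
```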
Since the Google people seem to dislike "cloaking" regardless of motivation, I went with #2, the IFRAME.
As I wrote above, Google dropped ALL pages (1000+) containing the iframe whose target was excluded via robots.txt. Btw, I saw people "recommending" this method in another webmaster forum a few days ago, for excluding PPC links...
I've now allowed G access to the iframe.
After getting burned like this, I'll do gradual testing of different methods, e.g. using "nofollow" in the iframe's HTML. But if Google is so quick to hand out a page-level penalty at the slightest hint of something being hidden (that site had many PR6 pages at the top levels, btw, but the lower-ranked pages, <PR4, were dropped), then I will still have a problem, because the HTML ads are loaded via JS from inside the iframe, and Google might then decide there's something questionable in the JS.
Damn, the SE spammers are making life so much more difficult for us, like email spammers do.
Recap: 1000s of pages dropped from the visible index, i.e. no referrals and not in the cache (checked via the Google Toolbar), ever since including an iframe in the pages. Only higher-ranking (PR>3) internal pages were spared. The reason for the iframe was to hold the HTML ads without having to modify the timestamp of the article page with every new ad.
The site is 100% clean, PR6, and has existed since 1995. It sits on the same IP as other PR6 sites and has over 40,000 pages of 100% unique content (a full-text newspaper archive).
More than 80% of the site's pages have been dropped since adding the iframe. During the first week after adding it, I had excluded the iframe doc via robots.txt (to keep Google from indexing the HTML ad text). After a week, with 50% of the site dropped, I removed the exclusion from robots.txt.
Googlebot has kept visiting the site routinely all this time; it spiders the pages with the iframe and the server logs an HTTP 304, i.e. the page with the iframe is stored somewhere at Google, it just doesn't reach the public index. Once a month, Googlebot spiders the whole site.
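To be clear about what the 304 tells us: Googlebot sends an If-Modified-Since header, and when the page hasn't changed since that date the server answers "304 Not Modified" with no body -- which only makes sense if the crawler still holds a stored copy. A minimal sketch of the server-side decision (the function name and millisecond timestamps are my own illustration):

```javascript
// Decide a conditional GET: compare the resource's last-modified time
// with the If-Modified-Since value the crawler sent (both as ms epochs).
function conditionalStatus(lastModifiedMs, ifModifiedSinceMs) {
  // 304: the crawler's stored copy is still current, so send no body.
  // 200: the page changed, so send the full page again.
  return lastModifiedMs <= ifModifiedSinceMs ? 304 : 200;
}
```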
Although Google has spidered the new (i.e. with-iframe) version of the pages, the pages that were not dropped still show the pre-iframe version. It's been over a month since the "new" version of the site was spidered by Google, but Google still shows the old version in its cache.
It seems Google thinks something is very wrong with iframes (excluded via robots.txt or otherwise).
Hope this helps someone else, as this kind of problem takes weeks/months to resolve, always by trial and error.
Some details that may be relevant:
1. The site's content is in Greek, except for an HTML ad
2. I loaded the iframed doc from a different host in the same domain (e.g. www2.domain.tld/...)
3. I excluded the iframe HTML doc via robots.txt, although my experience so far is that Google doesn't follow iframes
In the meantime, I've changed the iframed doc to load from the same host.domain.tld and allowed full access to it (it's not listed in robots.txt anymore).
My theory is that a combination of things (many pages, "gibberish" Greek text, a blocked link to the iframe doc, etc.) raised a red flag at Google for possible spam, to be inspected by a human.
Pages were removed from the public index (the toolbar reported them as not in the cache) pending human review, yet they remained in Googlebot's cache (when it came back for the same documents in late May, it produced a 304 for pages that had been dropped from the index for ~2 weeks).
Just a theory, but it sounds right.