Forum Moderators: open
Cloaking via http 'Referer' header (snip)
- Requesting a page without a Referer header (e.g. as GoogleBot does) produces a spam page
- Requesting with a Referer header (e.g. the link was clicked from Google search results) produces a redirect to an affiliate site, so it is also a doorway technique.
GoogleBot could request once with a Referer and once without (and act on substantial differences), or always set a random Referer. Does anyone know of a legitimate use of the above technique?
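To make the idea concrete, here is a minimal sketch (purely illustrative, not anything Google is known to run) of that double-fetch check: request the page once without a Referer and once with a Google-style Referer, then compare the landing URL and the body. The URL, user-agent string and 0.5 similarity threshold are invented placeholders.

import difflib
import urllib.request

def fetch(url, referer=None):
    headers = {"User-Agent": "example-bot/0.1"}
    if referer:
        headers["Referer"] = referer
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=10) as resp:
        final_url = resp.geturl()  # urlopen follows redirects, so this is the landing URL
        body = resp.read().decode("utf-8", errors="replace")
    return final_url, body

def looks_cloaked(url):
    plain_url, plain_body = fetch(url)
    ref_url, ref_body = fetch(url, referer="https://www.google.com/search?q=widgets")
    if plain_url != ref_url:
        return True  # a redirect that only happens when a Referer is sent
    similarity = difflib.SequenceMatcher(None, plain_body, ref_body).ratio()
    return similarity < 0.5  # "substantially different" pages

if __name__ == "__main__":
    print(looks_cloaked("http://www.example.com/"))

A real crawler would of course have to tolerate normal dynamic differences (rotating ads, timestamps) before calling anything "substantial".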
This is an example of spamming by a German company, also a type of cloaking in that the pages the user and GoogleBot see are very different. Through the following CSS rule, most of the page is rendered outside of the browser's window so the user cannot see it:
h2 {position: absolute; width:800px; left: -220px; top: -100px;}
I'm not quite sure if GoogleBot crawls CSS, but if it did, it could check that at least 22,000 pixels (220 x 100) of the h2 are off-screen (and of course the tags you don't see are full of keywords and links to other spam). Does anyone know of a legitimate use of this technique?
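For illustration only, here is a rough sketch of how a bot that did fetch CSS might flag rules like the one quoted above: scan for large negative left/top offsets. The regex and the -100px cut-off are my own assumptions, not a known Google check.

import re

OFFSCREEN = re.compile(r"(left|top)\s*:\s*(-\d+)px", re.IGNORECASE)

def suspicious_rules(css_text, threshold=-100):
    flagged = []
    for selector_block in css_text.split("}"):
        if ":" not in selector_block:
            continue
        for prop, value in OFFSCREEN.findall(selector_block):
            if int(value) <= threshold:
                flagged.append(selector_block.strip())
                break
    return flagged

css = "h2 {position: absolute; width:800px; left: -220px; top: -100px;}"
print(suspicious_rules(css))  # flags the h2 rule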
If there is no known legitimate use of these techniques, I think Google could really reduce the amount of spam in its index.
Eddy4711
[edited by: Marcia at 9:34 pm (utc) on Mar. 31, 2003]
[edit reason] No pointing out specifics, please [/edit]
Wouldn't you think it would be relatively simple for G to identify when negative numbers appear (e.g. -220px) and to flag this as spam?
We, too, have a competitor that spams using this method - they have been placing high for about 5 months now... what a crock!
But there are many things that, I think, can make it quite difficult:
- you can disallow robots from 'reading' your CSS files
- you can set the CSS property in the CSS file to be visible or on-screen, and then override it in the page (either in the head tags or in the div properties)
- you can also use JavaScript to hide a layer or move it outside the page
There are many ways for spammers to trick SEs with CSS and/or JavaScript, and I think that asking the robots to 'check' every page of a site is impossible in terms of time, bandwidth and memory. Moreover, if you forbid robots to check your .css file, how can they tell legitimate from spammed divs/layers?
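As a side note on that robots.txt point: a crawler can at least notice when a linked stylesheet is disallowed, even while honouring the block. A small sketch using Python's standard robots.txt parser, with placeholder URLs:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

css_url = "http://www.example.com/styles/main.css"
if not rp.can_fetch("Googlebot", css_url):
    # proves nothing by itself, but could feed a suspicion score
    print("CSS file is blocked from robots - worth a closer (manual) look")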
Leo
The only real way to do this would be to have some form of rendering system within the SE that would render each page on a screen and then only work with the material that was rendered (because this is what we'd see, right?).
Basically this would be highly expensive, BUT you'd then be the first SE to deliver some really high-quality results, maybe.
phil
Good calls there Leo
The only real way to do this would be to have some form of rendering system within the SE
Yes, like .. hummmmm ... a human eye ;)
[edited by: le_gber at 10:04 pm (utc) on Mar. 31, 2003]
However, the "it can't be done" kind of thinking is what allowed Google to leapfrog their competitors to start with. If Google can't figure out how to beat the spammers (and I believe they are committed to doing so), someone else will come along and take over as the leading SE. There are too many genius programmers in the world just dying to show their stuff.
There are too many genius programmers in the world just dying to show their stuff
I'm not one of them :)
It certainly is a challenge for Google to develop and deploy the necessary algos to recognize these kinds of spam
It certainly is if they want to follow W3C guidelines and robots.txt properties. I don't see how they can achieve that if they base their algo on robots spidering the HTML source code ... like MetropolisRobot said, they should render the page first...
leo
They need to bite the bullet and put a few people on as spam coordinators
I'm sure that, as with DMOZ volunteers, Google would find many people willing to do the job for free.
<thought crosses mind>
Why not add a spam report for affiliates or the like ... these people would have to log in and fill in reports more complete than the current spam report, sent directly to a small team who would deal with these reports first and 'normal' reports after. And in case of abuse, remove the 'bad' affiliates. Membership tied to a 'static' email, meaning no free email servers ... what do you think, GG?
Spam is getting worse. Why? Well in the words of the Coors advertisement, because you can can can can can can...get away with it.
And if you are a fly by night outfit you can make a killing in a short space of time.
And if you are legitimate you can get a good ole boost to your business and hopefully flip your illegitimacy into legitimacy before you get caught.
As with all things it is an arms race, and there are more wannabe spammers and quick-buck-making merchants than there are policemen (GoogleGuys).
Spammers tend to increase AdWords purchases by the honest companies that need them to compete. Obviously, there isn't much incentive for Google to throw lots of money at spam fighting if doing so means they will also lose revenue.
That's the way it is, and that's the way it should stay. If you don't like someone's site and think they are cheating, there are places you can tell Google.
Automatic detection of CSS and JavaScript is completely NOT needed. This would keep everyone programming static websites, and the technologies these days can handle a lot more than that. How else could you show tutorials that need animations and written text all viewable on the screen at the same time?
Manual tests are the only way it can be done.
As for the guy with the absolute positioning......
Was it out of the page, or out of the division, or maybe out of an iframe? You can never tell unless you do a manual test. Was it going to appear on an event? People are scared to make interactive sites because of the rumours that Google is going to ban anyone using text in a hidden layer.
I found a nice way to protect my site at the moment that would certainly pass a manual review, but not an automatic one. And that certainly won't be fair.
Moreover, if you forbid robots to check your .css file, how can they tell legitimate from spammed divs/layers?
I don't believe there's much "If X is true then == Spam" kind of logic in Google's algo. I would expect it to be more like "If X is true then %chance_spam += 5%", where it would take several things together to create a penalty.
In that case a CSS file listed in the HTML that is banned by robots.txt could simply increment the probability value, to a lesser degree, whereas that coupled with several other factors may push some pages over into 'spam' and 'penalty' territory...
Besides, why would you, legitimately, block a CSS file from a bot?
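Something like that "%chance_spam += 5%" idea could look roughly like the toy scorer below. The signal names and weights are made up for illustration only; the point is just that one weak signal (a blocked CSS file) doesn't cross the threshold by itself.

def spam_score(page):
    score = 0.0
    if page.get("css_blocked_by_robots"):
        score += 0.05   # mildly suspicious on its own
    if page.get("offscreen_positioned_text"):
        score += 0.30
    if page.get("referer_dependent_redirect"):
        score += 0.40
    if page.get("hidden_text_color_matches_background"):
        score += 0.25
    return score

page = {"css_blocked_by_robots": True, "offscreen_positioned_text": True}
if spam_score(page) > 0.5:
    print("penalty territory")
else:
    print("not enough evidence on its own")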
I don't believe there's much "If X is true then == Spam" kind of logic in Google's algo. I would expect it to be more like "If X is true then %chance_spam += 5%", where it would take several things together to create a penalty.
IMO, Google needs to rethink their penalties. When spam can do well until it's caught through a manual review, the system is broken. If a penalty were immediately applied under an automatic detection system, no one would risk it.
This may explain why some of my competitors' rankings are baffling me.
When spam can do well until it's caught through a manual review, the system is broken.
First, I think it's common knowledge Google is constantly refining its algos. While I agree with your post, the fact you quoted my last remark above gives the impression you're for more draconian penalty conditions....
you can use JavaScript to reference a CSS file
You are correct that you can use JavaScript and GoogleBot will skip it, but there's also the NOSCRIPT tag as well. Either way, my original question of finding a legitimate reason to block your CSS file in robots.txt stands.
There are valid reasons to use JS to select your CSS file (browser weirdness most likely), but you might award some penalty points to it such that, when combined with other activity, it may trigger a ban.
The thing to note is that there are very few instances where you can say that if you see "X" in the HTML (or CSS maybe) you ban them. Even hidden links may not be hidden (a background image, or CSS settings overriding the HTML).
The goal is to tweak the algo such that a combination of common 'issues' may trigger a penalty or ban - blocked CSS plus seemingly hidden text, for example.
At the same time, those outright-wrong and clearly-in-violation issues should be dealt with swiftly. Duplicate content is one that, for exactly duplicate content anyway, is often penalized -- above a threshold.
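On the duplicate-content threshold, one common way to express "exactly or nearly duplicate, above a threshold" is word-shingle overlap. The sketch below is a generic illustration of that idea, not Google's method; the shingle length and the 0.9 threshold are arbitrary stand-ins.

def shingles(text, k=5):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def near_duplicate(a, b, threshold=0.9):
    sa, sb = shingles(a), shingles(b)
    overlap = len(sa & sb) / len(sa | sb)   # Jaccard similarity of the shingle sets
    return overlap >= threshold

print(near_duplicate("the quick brown fox jumps over the lazy dog today",
                     "the quick brown fox jumps over the lazy dog today"))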
See this thread [webmasterworld.com] (msg #50) for why I mention my CSS files in robots.txt.
Basically, there's just plain nothing to index, so the bots might as well skip these files. This is a legitimate reason, isn't it?
Basically, there's just plain nothing to index, so the bots might as well skip these files. This is a legitimate reason, isn't it?
While there is nothing to index, per se, the contents of the CSS file, I think, are quite pertinent.
I think that Googlebot is smart enough not to load the CSS file (for now anyway) because the 'link' to the CSS file is not an <A HREF> tag... I haven't had my CSS file pulled by Googlebot.
It's also interesting which thread you linked to -- the thread is discussing using <H1> tags and CSS files to decrease the size of the actual tag text. The end result would be to increase Googlebot's 'visibility' of the H1 tag text while not actually increasing its size when displayed in a browser.
I would consider that a (minor) form of inappropriate cloaking.
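In the same speculative spirit, a bot that read CSS could flag heading selectors declared at implausibly small sizes. The regex and the 9px cut-off here are invented for illustration, not a documented check.

import re

RULE = re.compile(r"(h[1-6])\s*\{[^}]*font-size\s*:\s*(\d+)px", re.IGNORECASE)

def tiny_headings(css_text, min_px=9):
    # return (tag, declared size) for headings styled smaller than min_px
    return [(tag, int(px)) for tag, px in RULE.findall(css_text) if int(px) < min_px]

print(tiny_headings("h1 { font-size: 6px; color: #fff; }"))  # [('h1', 6)]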
I agree with this. I've blocked my CSS files and numerous pages in my site from being indexed simply because they aren't keyphrase-dense.
As an example, I have 6 pages in my site that have nothing but JPEGs with very little text. I've learned on this forum that it's better to use a NOINDEX, NOFOLLOW tag on these pages, because their lack of content-relevant information would only hurt my site in the long run.
Basically, it's better to have 10 pages of useful content than a 15-page site that might have 5 pages of JPEGs and 10 pages of content. Thus, I only show Google the 10 pages and <NOINDEX> the other 5.
IMO
e.g. I have several divs on a site that default to hidden, but are shown when a user clicks the show/hide link on the same page.
I use this for weblog-type pages to display more of an article, or to hide/show comments.
It's also used on several of my commercial sites for longer news items and some advertisement delivery.
Will Google know this is a legitimate usage because I have a scripted SHOW link? Or will it only see that the property is set to display:none and classify the site as a spammer?
[google.com...]
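Purely as a thought experiment on how the legitimate show/hide case described above might be told apart from hidden spam: check whether each element hidden with display:none is also referenced by a script that can reveal it. The regexes below are crude and illustrative only; a real system would need a proper HTML/DOM parser.

import re

HIDDEN = re.compile(r'id="([^"]+)"[^>]*style="[^"]*display\s*:\s*none', re.IGNORECASE)

def hidden_without_toggle(html):
    suspects = []
    for element_id in HIDDEN.findall(html):
        # is this id ever referenced by a script that could show it?
        if not re.search(r"getElementById\(['\"]%s['\"]\)" % re.escape(element_id), html):
            suspects.append(element_id)
    return suspects

html = '''<div id="comments" style="display:none">...</div>
<a href="#" onclick="document.getElementById('comments').style.display='block'">show comments</a>'''
print(hidden_without_toggle(html))  # [] - the hidden div has a visible toggle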
What I am seeing an awful lot of at the moment are affiliate program spammers - where a 'new' domain goes up every month, ostensibly set up as an affiliate site.
These are all database-generated sites - 100 to 700 pages, where each page targets a phrase, each page is spam, and each page provides a redirection to the 'home' page of the 'affiliated' partner.
Quite a neat system (if you like cheats) - because the setup time would be incidental; the domain name is a throwaway (with faked contact info) - and the spammer expects the site to be chucked out after 4 or 5 weeks. If they aren't banned - and it generates traffic (=revenue) for a longer period - what a bonus!
And the perfect twist is that IF Google acts on the "affiliate spammer" - guess what - the "affiliate partner" (i.e. the guy paying the affiliate spammer for the traffic) gets off scot-free, because it is only the affiliate spammer who gets the penalty.
IMHO you need an 'accessory before the fact' penalty as well. If a site gets more than 3 occurrences of affiliates that pull this stunt, then the 'affiliate partner' should also get his site banned. But first - you need to ban the affiliate spammers.
I have submitted details of this activity on numerous occasions - but all the sites I have submitted are still in the index.
Thanks for the opportunity to vent.
: )
Chris
PS Hey GoogleGuy - BTW were you at the Australian Google launch Party last Thursday night? Did I meet you and not know it was you?