I don't mean to sound rude here but have you thought this through?
I'm afraid you'd need (much) better reasons than the one you cited before discriminating in the way you suggest. The world of CSS/HTML/Web-Dev doesn't revolve around SEO alone and there are many entirely legitimate reasons for utilising their abilities...
What's the point of an SE if it wipes standards-compliant (and legitimate) websites out of its index?
Where does it end? Do we start ignoring W3C standards?
>>have you thought this through?
Pie in the sky :)
OK there are legitimate usages, such as skip content for accessibility reasons. A method I use myself. But I imagine that sort of thing could be filtered quite easily. A simple test would be whether it was stuffed with H1 when all that is needed is the anchor link, etc.
Don't get me wrong, I just intended to stir up the conversation. A what if, so to speak.
>> Can Google reduce it?
well maybe.. but reduce is the operative word and I wouldn't think by much.. there are many ways of "being naughty" that I don't think even a filter could detect so that brings us back to "spam reporting" methods and as this is still in debate .. I'm not going there!
.. they haven't even managed to filter out hidden HTML/text links (non CSS) so I doubt that CSS issues will be any easier after all they'd have to teach the bot how the "cascade" works first ;) it's not just a simple case of reading one line of code...
And as theWhippinPost says there are too many accessibility issues that are going to crop up in the near future that will make them have to be very careful about blanket filters.. if a government site has to use CSS for accessibility/usability by law, and they employ a legitimate technique within that then get caught in a blanket filter of some sort .. that's gonna be a whole new can of worms
Interesting times ahead ;)
The trouble with checking CSS is respecting robots.txt. Many people optimising these techniques may block the CSS file.
Does Google ignore robots.txt in that case? Or does Google drop pages where the CSS cannot be fetched because it is blocked server side or by robots.txt?
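For anyone who wants to see what "the CSS file is blocked" looks like from a well-behaved crawler's side, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the example.com URLs and the helper name `css_is_crawlable` are made up for illustration; nothing here is how Googlebot actually does it.

```python
from urllib.robotparser import RobotFileParser


def css_is_crawlable(robots_txt: str, css_url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if robots.txt lets this user agent fetch the stylesheet."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, css_url)


# Hypothetical robots.txt that hides the stylesheet directory from all bots
ROBOTS = """\
User-agent: *
Disallow: /css/
"""

print(css_is_crawlable(ROBOTS, "http://example.com/css/site.css"))  # False
print(css_is_crawlable(ROBOTS, "http://example.com/index.html"))    # True
```

A crawler that respects robots.txt simply never sees the stylesheet in the first case, which is exactly the dilemma raised above.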
I am with you in some ways; hidden text is an easy thing to get away with using CSS.
But your comment on stuffed H1 tags could be simple enough for Google to spot anyway - the HTML code would probably give most of the clues, with a high percentage of the page covered in H tags but few paragraph text blocks.
It's a complicated issue and I imagine Google have been or are discussing the implementation and whether they consider it reasonable.
Random checking sounds ultra-difficult and inefficient; however, if one site on a particular IP address utilises a bad technique, then other sites on the same IP could be owned by the same person - they could be checked. All the sites that they link to or from could also be put on a list, thus interlinked sites using the same 'cheating' methods would be brought down together.
It would make sense to me that Google see the text for what it is. So, if the font size used was the same for H1 and P tags, then it needs to view the H1 text in that page as equal to P and no higher. Not penalise.
Hidden text, well that could be a penalty - but then it has to consider :hover, :link, and so on....
>>There has been talk of G reading CSS files
Possible? Yes. Likely? No
If you have your CSS code within your HTML document, it would be no challenge for Google to read it, as you stated, but as long as you keep it in a separate doc, I doubt very much that Google will be reading it any time soon.
The reason I say this is because, according to Larry Page's original Google document, Google does not process documents on the fly but stores documents to be indexed in its repository first. Documents are then pulled from the repository when it's time for indexing and sorting.
They would need to make changes to the whole flow of their indexing/sorting process, which IMHO is unlikely. I don't think Google is about to start caching and indexing CSS files (or including them in their SERPs) anytime soon.
Just my 2cents... :)
Unfortunately many people use these techniques without intending to cheat the system.
How could google tell the difference between the following :
1) A hidden link that contains a <H1> tag in order to get better keyword ranking.
2) A hidden link containing a <H1> tag is used as an alternative to an image (e.g. see zeldman.com)
We can tell the difference because we are human, but to a bot they would look almost identical, especially if the cheaters started hiding their <H1> tags behind real images on their site.
You also could not penalise someone using <h1> tags with the same size font as <p> tags. Headers are designed to mark up a semantic difference and not a visual one, so if a designer wants to suppress the visual aspect then so be it.
Admittedly this will be very uncommon for <h1> but is very likely for <h3> down. In fact if you imagine a technical document, it would have a large amount of information that is not actually contained in the document, but rather describes it. Such as :
To my mind these form part of the header information for that document, yet you would not want to display most of them at more than 1em, possibly with some other form of decoration to differentiate them from the actual text.
For example I commonly include the Author and date/time at levels <h4> and <h5>, with the only difference between those and the main text being that they are italicised (is that a word?) and slightly greyed. Now <h4> is already far down the tree for an important piece of information such as document author, and search engines will be less favourable to it as it is. But to remove any sense of it being important because I size it at 1em would be ludicrous.
The whole idea of CSS is to separate formatting from the content; search engines should be concentrating on the content.
I don't know how any engine is going to reduce the CSS cheating, but they can't do it by making judgement calls on the formatting of a document without blanketing out many legitimate ones.
However I can see the possibility of blanket bans for sites with multiple <h1> tags, or even of checking the proportion of header tags to content; perhaps more than 10% of a page in headers would constitute a good cut-off point (plucked from thin air).
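To make the "proportion of header tags to content" idea concrete, here is a rough sketch in Python using only the standard `html.parser`. It measures what share of a page's text characters sit inside <h1> to <h6>. The class and function names are invented for this post, the 10% cut-off is the same thin-air figure as above, and it naively counts every text node (including script or style contents), so treat it as an illustration only.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingRatioParser(HTMLParser):
    """Counts text characters inside heading tags versus all text characters."""

    def __init__(self):
        super().__init__()
        self.heading_depth = 0   # >0 while we are inside any heading element
        self.heading_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.heading_depth += 1

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self.heading_depth > 0:
            self.heading_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.heading_depth > 0:
            self.heading_chars += n


def heading_ratio(html: str) -> float:
    """Fraction of visible-ish text that is wrapped in heading tags (0.0 to 1.0)."""
    parser = HeadingRatioParser()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.heading_chars / parser.total_chars


THRESHOLD = 0.10  # the "plucked from thin air" cut-off, purely an assumption

stuffed = "<h1>cheap widgets</h1><h1>buy widgets</h1><p>hi</p>"
print(heading_ratio(stuffed) > THRESHOLD)
```

Note that this looks only at the markup, not the styling, so a legitimately restyled <h4> byline would not be affected by it.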
> But to remove any sense of it being important because I size it at 1em would be ludicrous.
But to remove any sense of it being important because I size it at 1px would be ludicrous?
> stores documents to be indexed in its repository first. Documents are then pulled from the repository when its time for indexing and sorting.
But surely google could read the CSS and store it with the page in the repository. If they have enough space (of course!). Technically, it would be simple to index it, in reality it may be too much of a load and too much storage space required for the little benefit it may give.
|they haven't even managed to filter out hidden HTML/text links |
Yes but on a large scale, across the index. However this crazy idea of mine involves random sampling, maybe even in certain industries?
Suzy, I agree wholeheartedly about accessibility and the issues that may arise and the fact that a blanket ban for all abuse (whatever that is :)) could seriously cause havoc.
In general I am referring to the real bad boys of hidden stuff. Don't get me wrong, this is not a crusade but more of a what if.
|Many people optimising these techniques may block the CSS file |
Badda bing, that's the first flag. How easy is that.
|Random checking sounds ultra-difficult and inefficient |
I don't really see why? With the amount of power they have at their disposal, just send GoogleSampleBot out there. :)
|They would need to make changes to the whole flow of their indexing |
No real problem there, it was not long ago that we had monthly updates and they managed to pull the rolling cycle routine.
|You also could not penalise someone using <h1> tags with the same size font as <p> tags |
Fair point, which is sort of covered by a later suggestion of yours and some sort of threshold.
|I don't know how any engine is going to reduce the CSS cheating, but they can't do it by making judgement calls on the formatting of a document without blanketing out many legitimate ones. |
They can and have done exactly that. There was a huge thread on the filtering of guestbooks and how it was much easier to bear down on certain page titles and filter them rather than doing major work on the algo. Now I know that Guestbooks are not really the same thing, but the point of G not introducing blanket approaches does not really stand tall IMHO.
1px is not 1em. The earlier reference was :
Enough to know when H1 has been made to look like p text
and in fact you yourself said :
So, if the font size used was the same for H1 and P tags, then it needs to view the H1 text in that page as equal to P and no higher. Not penalise.
Obviously making text 1px is very different from making it the same size as your normal paragraphed text. You cannot say that a header tag is only as important as your normal text just because it is the same size. Neither can you judge by colour, text-decoration, weight or any other visual means.
The header tags are there to mark up headers; how that is displayed should, in an ideal world, have no bearing on its importance in a document. The fact that some people abuse this will have to be solved some other way unless we plan on abandoning the whole idea of semantic markup.
>>But surely google could read the CSS and store it with the page
Absolutely. Storage issues aside, Google's claim to fame, so to speak, was speed and quantity. Google indexes pages at blazing speed sequentially from the repository, and this indexing process is totally independent of other documents.
Introducing an external CSS file into the equation would slow down the indexing process significantly, because each time a CSS file was required, Google's indexer would need to search for it (causing overhead in seek time), decompress it, merge it with the original document, calculate all the extra CSS penalties and then continue with the rest of the sorting function.
To save time, Google doesn't store files in any particular order in the repository; they are just packed in one after the other and processed one by one. Searching through terabytes for one CSS file is costly in time.
>>the possibility of blanket bans for sites with multiple <h1> tags
IMHO blanket bans are just too tricky to do correctly and, as already stated in your post, innocent victims may get penalized.
I think Google will just continue with its existing strategy of "Count-weights increase linearly with counts at first but quickly taper off so that more than a certain count will not help"
They might like to play around with the weight factor given to each of the different tags, but that's about it.
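For what it's worth, the "linear at first, then taper off" behaviour in that quote can be pictured with a simple saturating curve. The sketch below is only a guess at the general shape: the exponential form, the function name `count_weight` and the cap of 8 are my own assumptions, not anything Google has published.

```python
import math


def count_weight(count: int, cap: float = 8.0) -> float:
    """Weight that grows roughly 1:1 for small counts and flattens out near `cap`.

    For small counts, cap * (1 - exp(-count / cap)) is approximately `count`,
    and for large counts it approaches `cap` and stops helping.
    """
    return cap * (1.0 - math.exp(-count / cap))


# Repeating a keyword in headings has sharply diminishing returns
for n in (1, 2, 5, 10, 50, 100):
    print(n, round(count_weight(n), 3))
```

With a curve like this, the 50th stuffed <h1> adds almost nothing over the 10th, which is why stuffing does not need a penalty to be pointless.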
>>innocent victims may get penalized
In a devil's advocate role: When has the possibility of the innocent getting penalised ever made a difference?
It is Google's rules we obey, if they change then so must we.
a few innocent ones here and there...
Google wouldn't flinch.
crippling websites for the blind and other govt sites...
another bag of PR (public relations not PageRank) worms
I started a thread with a similar question this morning but the mods flagged it for review.
I think Google will rely on people reporting websites for spamming the engine, but I hope they will look on hidden text in the way they look on alt text.
Do reported pages get a human review? I assume they do.
I hope they do this with an open mind because I love using hidden text to aid accessibility. You can give graphical browsers pretty pictures, and text browsers properly laid out text.
It's much neater than using alt tags.
|Enough to know when H1 has been made to look like p text and divs are hidden etc etc. |
that wouldn't help much
You can even create code on the fly (with eval) or hide things after an event (timeout, mouse move etc).
Honestly, G had better ignore links in NOSCRIPT and NOFRAMES; that would be much easier, with zilch collateral damage.
>>you want undetectable.
If it is hidden in JS it won't really be part of the page and won't really count towards ranking. Now a redirect is a different beast altogether, as the code of the page is still seen but the bot is not redirected. But that moves away from the thread.
I suppose you could go after redirects as well :) :)
IeuanJ, 1px is not 1em, agreed.
But 1em should not be penalised, but if it is the same size text as P text, why should google give it extra weight? (Unless maybe if it is bold or underline etc...).
1px should be penalised - this is spam.
>If it is hidden in JS.... wont really count towards ranking
I think Plasma was thinking the reverse, not to write in extra H tags for google, but to hide them from a user.
Keep the JS in a separate script and you have the same effect as CSS
|I think Plasma was thinking the reverse, not to write in extra H tags for google, but to hide them from a user. |
I was talking about code that will be generated on the fly triggered by an event like a mouse move.
The code to modify the css isn't there at loading time, only after moving the mouse (or whatever non-foreseeable event) the code will be generated.
E.g. you have the real code as text in small chunks, on event -> eval
Even if G examined the code, it could not possibly detect it.
|Keeping the JS in a separate script and you have the same effect as CSS |
Even worse, because it's not detectable by an algo.
Fair enough Plasma
end of my day, was not thinking well :)
A competitor of mine made EXTENSIVE use of all sorts of css spamming in span and div tags.
After Florida he dropped like a rock for a very hot 2 word term. But most likely it was more a matter of keyword stuffing in the span that triggered the penalty rather than Gbot's CSS literacy...
|But most likely it was more a matter of keyword stuffing in the span |
Now there (IMO) you have it!
NO "spamming" technique will work unless you actually find the SE's achilles heel in the first place so, and as I've said somewhere before, I think that *G* especially are trying to attack from that angle (florida is a good example) as opposed to trying to implement filters whether they are HTML or CSS..
I don't think they want to read/parse CSS files they want to get their model right from another angle, what that is is still open to interpretation.
|It is Google's rules we obey, if they change then so must we. |
UKGimp> I have to disagree. If we structure documents properly and ignore Google then Google will have to find inventive ways to change their rules to find the highest quality websites. We make Google what it is far more than Google makes us what we are. If all Google ends up with - when quality sites refuse to resort to tricks - is spam sites which exploit the algo of the month, it forfeits its position. Google knows this: hence Florida.
Chasing search engines has to be less productive in the long run than forcing the search engines to chase you. If Google had faith in algorithms alone, it wouldn't put so much weight on dmoz.
The extent of abuse at the moment is not dissimilar to the meta-tag keyword stuffing which brought down altavista. Google is cautious enough that it will not willingly befall the same fate.
I think we mustn't forget that a bot reads the source, not what appears on-screen. Its algo therefore will be attuned to keyword-spamming patterns within that source - abuse of document structural elements such as <h1> can still be measured regardless of its (CSS) formatting.
"display: none" probably has no meaning to its algo as it's not important... it's the keyword abuse that is.
|It is Google's rules we obey, if they change then so must we |
In the context of keyword-abuse yeah - The most they'll likely do is to just ignore, not penalise, content within "risky" CSS rules... kind'a like wiping out any hiding place.
Bet ya wish you hadn't started this now eh :D
|NO "spamming" technique will work unless you actually find the SE's achilles heel in the first place so, and as I've said somewhere before, I think that *G* especially are trying to attack from that angle |
If they could add css parsing with the odd check here and there it would certainly help but as you allude to the algo must be flawed in some way if it allows certain things.
|UKGimp> I have to disagree. |
Good on you sir :)
Some good points on chasing algos, algos which I personally only followed to an extent. Markup of things I am involved in has stayed exactly the same and things are going quite well. No drop for anything. I even link to PR0 pages if the content is good, as the TBPR means bugger all to me. Cast your eyes over an anti linking sentiment [webmasterworld.com] thread I started. So hopefully you can see my stance on algo chasing. :)
|The most they'll likely do is to just ignore, not penalise |
In a round about way that will be seen as a penalty. Are sites really penalised when they take a dive as in some recent major changes or are they just falling foul of a filter. My stance is of filter, not penalty.
|Bet ya wish you hadn't started this now eh :D |
It's been fun. :). Mainly a devil's advocate type of post.
IeuanJ, 1px is not 1em, agreed.
But 1em should not be penalised....
....but if it is the same size text as P text, why should google give it extra weight? (Unless maybe if it is bold or underline etc...).
If you had read my comments properly you would see why. It should be given extra weight because it is a HEADER, not a normal paragraph.
Say I make all my H4 tags to be the same size, weight and decoration as my p tag, I could still easily differentiate it by changing the background colour. To the eye it would stand out as a header because it is different, to the machine it will stand out because it is wrapped in a header tag. For an example look at this site, when you are at the posting screen the text "A pie in the sky solution" could easily be considered a piece of header information, yet it is the same size as the main bulk of the text. On the forums it is actually smaller! Does this mean that search engines should not give it more significance?
If you take away the meaning of the header tags for any reason you are in effect breaking the semantic layout of the document.
Besides if you choose to discriminate based on text size then the cheats will just use different measurement units to resize them the same (10pt = x-em)
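To show why the units point matters, here is a small Python sketch that normalises a few CSS font-size units to pixels. The helper name `to_px` and the 16px parent size are assumptions for illustration; the only fixed fact used is that CSS defines 1pt as 1/72 inch and 1 inch as 96px. Relative units (em, %) depend on the inherited size, so any real comparison would have to resolve the whole cascade first.

```python
import re

# Absolute units expressed in CSS pixels (1in = 96px, 1pt = 1/72in)
PX_PER_UNIT = {"px": 1.0, "pt": 96.0 / 72.0}


def to_px(size: str, parent_px: float = 16.0) -> float:
    """Convert a font-size such as '12pt', '1em' or '100%' to CSS pixels.

    `parent_px` is the inherited font size; 16.0 is an assumed default.
    """
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*(px|pt|em|%)\s*", size.lower())
    if not match:
        raise ValueError(f"unsupported font-size: {size!r}")
    value, unit = float(match.group(1)), match.group(2)
    if unit == "em":
        return value * parent_px
    if unit == "%":
        return value * parent_px / 100.0
    return value * PX_PER_UNIT[unit]


# Four ways of writing what is the same size under the assumed 16px parent
for spec in ("16px", "12pt", "1em", "100%"):
    print(spec, "->", round(to_px(spec), 2), "px")
```

So a size check that only compared the literal strings would be trivially dodged, and one that resolved them properly would need to know every inherited size on the page.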
>> If you take away the meaning of the header tags for any reason you are in effect breaking the semantic layout of the document. <<
I guess that google can see the difference in a document where 10% of text is inside heading tags, and 90% inside paragraphs, and lists, as opposed to most of the text being inside heading tags, or having heading tags inside paragraphs. The former is valid well-formed semantic markup. The latter is usually spam.
If this were to be implemented, and it has, as was demonstrated last spring when they ran the hidden text checks on pages that were reported, most of the arguments against are moot.
It cannot and will not be run as part of indexing. That is too processor intensive.
It will not be run with the Googlebot user agent.
The browser would flag when certain things crossed a threshold. It is an aid to manual checks, greatly increasing their speed.
As they improve it, it will be able to catch more forms of cheating. They can also decide on different ways to feed pages to check into it.
It doesn't sound like they are using it much right now, but they certainly proved that it does work. And they can start using it again whenever they want, and they can add whatever changes they want to be able to match the trick du jour.
|brotherhood of LAN|
Is there anything specific you would parse CSS files for, hints of hidden text maybe?
I've been making a keyword prominence/density analyzer, it might be worth parsing CSS files, just for font-sizes for a "prominence score". The original backrub paper says that font-size is taken into account, and I guess to a degree it could apply to any word prominence measurement...
I'd have to read a CSS book or perch myself in the CSS forum to learn more about other CSS attributes et al, but grabbing font-sizes seems easy enough and useful, though finding hidden text sounds a bit (far!) more complex.
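Grabbing font-sizes out of a stylesheet really can be done in a few lines, at least for simple flat rules. A minimal Python sketch follows; the function name `font_sizes` and the regexes are my own, and it deliberately ignores comments, @media blocks, specificity and inheritance, so it is nowhere near a real CSS parser.

```python
import re

# Matches "selectors { declarations }" for simple, non-nested rules
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")
FONT_SIZE_RE = re.compile(r"font-size\s*:\s*([^;]+)", re.IGNORECASE)


def font_sizes(css: str) -> dict:
    """Map each selector to the font-size declared for it, if any."""
    sizes = {}
    for selectors, body in RULE_RE.findall(css):
        match = FONT_SIZE_RE.search(body)
        if not match:
            continue
        # A rule like "h1, h2 { ... }" applies to every listed selector
        for selector in selectors.split(","):
            sizes[selector.strip()] = match.group(1).strip()
    return sizes


css = "h1, h2 { font-size: 1em; color: red } p { font-size: 12px } a { color: blue }"
print(font_sizes(css))
```

Turning those values into a "prominence score" is the harder part, since the same selector can be overridden further down the cascade.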
It shouldn't be complex at all. It's easy to determine the content of a page and even a list of links (like menus). It's also not too hard to determine the properties of text.
text.hidden = false; yay!
Good Text! Include this text in your algo
text.hidden = true; boo!
Bad Text! Don't include this text in your algo
Simple. Problem solved. Where's my google job?
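Tongue in cheek or not, the easy case can be sketched. The Python below splits page text into visible and hidden, looking only at inline style attributes for display:none or visibility:hidden. All the names are hypothetical, and it knows nothing about external stylesheets, the cascade, off-screen positioning or script-driven changes, which are exactly the hard parts raised earlier in the thread.

```python
import re
from html.parser import HTMLParser

HIDDEN_RE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)
# Elements with no closing tag; they must not be pushed on the stack
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HiddenTextParser(HTMLParser):
    """Sorts text nodes into visible and hidden, based on inline styles only."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one bool per open element: hidden (itself or via ancestor)?
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        parent_hidden = self.stack[-1] if self.stack else False
        self.stack.append(parent_hidden or bool(HIDDEN_RE.search(style)))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.stack and self.stack[-1]:
            self.hidden.append(text)
        else:
            self.visible.append(text)


def split_text(html: str):
    """Return (visible_texts, hidden_texts) for the given HTML."""
    parser = HiddenTextParser()
    parser.feed(html)
    parser.close()
    return parser.visible, parser.hidden


page = '<p>Shown</p><div style="display:none"><h1>cheap widgets</h1></div><p>Also <br>shown</p>'
visible, hidden = split_text(page)
print("visible:", visible)
print("hidden:", hidden)
```

Note that it would flag an accessibility skip link hidden the same way, so deciding "bad text" still needs the human judgement discussed above.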