Forum Moderators: open
How can Google Reduce it?
How about this. There has been talk of G reading CSS files and I have no doubt they have the technical ability to interpret the contents. Enough to know when H1 has been made to look like p text and divs are hidden etc etc.
I know that to check every page in the index might well melt their servers, so introduce random sampling. Check the odd page here and there and if the flag is raised, by finding naughtiness, then investigate further. Upon something being found a warning shot can be fired using the method of reducing the ranking for that page or site. No need for a warning email, the ranking will do the job.
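A crude sketch of the kind of check being suggested — purely illustrative, and nothing to do with how Google actually works. It "parses" a stylesheet with regexes (real CSS parsing and the cascade are far messier) and raises a flag when an h1 rule matches the p font-size, or when a rule hides content outright:

```javascript
// Toy "naughtiness" check: flag a stylesheet when h1 is styled down to
// paragraph size, or when a rule hides content outright.
// Regex-based on purpose -- a real check would need a full CSS parser
// and knowledge of the cascade.
function flagSuspiciousCss(css) {
  const flags = [];
  // Grab the first font-size declared for a selector (toy lookup only)
  const getSize = (selector) => {
    const m = css.match(new RegExp(
      selector + "\\s*{[^}]*font-size:\\s*([\\d.]+)(px|pt|em)", "i"));
    return m ? m[1] + m[2] : null;
  };
  const h1 = getSize("h1");
  const p = getSize("p");
  if (h1 && p && h1 === p) flags.push("h1 sized like p text");
  if (/display:\s*none|visibility:\s*hidden/i.test(css)) flags.push("hidden block");
  return flags;
}
```

Even this toy version shows the problem discussed further down: it can't tell a legitimate accessibility rule from a cheat, it just flags anything that looks odd.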
I can almost guarantee the only people who might moan about this would be the people employing such tactics. Of course there will be people who say that unscrupulous SEOs have screwed them over; my opinion on that one is that ignorance is no defence. Rather like acting on dodgy tax or legal advice, quite often you yourself carry the can.
The upshot would be fewer people moaning about being beaten in the SERPs by hidden text etc. :)
Right…..back to work
I'm afraid you'd need (much) better reasons than the one you cited before discriminating in the way you suggest. The world of CSS/HTML/Web-Dev doesn't revolve around SEO alone and there are many entirely legitimate reasons for utilising their abilities...
What's the point of an SE if it wipes out standards-compliantly coded (and legitimate) websites from its index?
Where does it end? Do we start ignoring W3C standards?
Pie in the sky :)
OK, there are legitimate usages, such as skip-to-content links for accessibility reasons. A method I use myself. But I imagine that sort of thing could be filtered quite easily. A simple test would be whether it was stuffed with H1 when all that's needed is the anchor link etc etc.
Don't get me wrong, I just intended to stir up the conversation. A what if, so to speak.
well maybe.. but reduce is the operative word and I wouldn't think by much.. there are many ways of "being naughty" that I don't think even a filter could detect so that brings us back to "spam reporting" methods and as this is still in debate .. I'm not going there!
.. they haven't even managed to filter out hidden HTML/text links (non-CSS), so I doubt that CSS issues will be any easier; after all, they'd have to teach the bot how the "cascade" works first ;) it's not just a simple case of reading one line of code...
And as theWhippinPost says, there are too many accessibility issues going to crop up in the near future that will make them have to be very careful about blanket filters. If a government site has to use CSS for accessibility/usability by law, and they employ a legitimate technique within that, then get caught in a blanket filter of some sort... that's gonna be a whole new can of worms
Interesting times ahead ;)
Suzy
Does Google respect that? Or does Google drop pages where the CSS cannot be indexed because it's blocked server-side or by robots.txt?
I am with you in some ways; hidden text is an easy thing to get away with using CSS.
But your comment on stuffed H1 tags could be simple enough for Google to spot anyway - the html code would probably give most of the clues anyway with a high percentage of the page covered in H tags, but few paragraph text blocks.
It's a complicated issue, and I imagine Google have been or are discussing the implementation and whether they consider it reasonable.
Random checking sounds ultra-difficult and inefficient. However, if one site on a particular IP address utilises a bad technique, then other sites on the same IP could be owned by the same person - they could be checked. All the sites they link to or from could also be put on a list; thus interlinked sites using the same 'cheating' methods would be brought down together.
It would make sense to me that Google see the text for what it is. So, if the font size used was the same for H1 and P tags, then it needs to view the H1 text in that page as equal to P and no higher. Not penalise.
Hidden text, well that could be a penalty - but then it has to consider :hover, :link, and so on....
Possible? Yes. Likely? No
If you have your CSS code within your HTML document, it would be no challenge for google to read it as you stated, but as long as you keep it in a separate doc, I doubt very much that google will be reading it any time soon.
The reason I say this is because, according to Larry Page's original google paper, google does not process documents on the fly but stores documents to be indexed in its repository first. Documents are then pulled from the repository when it's time for indexing and sorting.
They would need to make changes to the whole flow of their indexing/sorting process, which IMHO is unlikely. I don't think google is about to start caching and indexing CSS files (or including them in their SERPs) anytime soon.
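Just to illustrate the extra plumbing involved: before an indexer could merge styles with a document at all, it would first have to discover every external sheet the page references, then go and fetch each one. A rough sketch of that first step (regex-based, and it assumes `rel` comes before `href` in the tag, which real markup doesn't guarantee):

```javascript
// Pull external stylesheet URLs out of an HTML document -- these are
// the extra fetches an indexer would need before applying any CSS.
// Regex extraction is a simplification of real HTML parsing, and this
// pattern assumes rel="stylesheet" appears before href in the tag.
function stylesheetLinks(html) {
  const urls = [];
  const re = /<link\b[^>]*rel=["']stylesheet["'][^>]*href=["']([^"']+)["']/gi;
  let m;
  while ((m = re.exec(html)) !== null) urls.push(m[1]);
  return urls;
}
```

Every URL returned is another document to locate in (or add to) the repository, which is exactly the seek-time overhead described later in the thread.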
Just my 2cents... :)
How could google tell the difference between the following :
1) A hidden link that contains a <H1> tag in order to get better keyword ranking.
2) A hidden link containing a <H1> tag used as an alternative to an image (e.g. see zeldman.com)
We can tell the difference because we are human, but to a bot they would look almost identical, especially if the cheaters started hiding their <H1> tags behind real images on their site.
You also could not penalise someone using <h1> tags with the same size font as <p> tags. Headers are designed to mark up a semantic difference and not a visual one, so if a designer wants to suppress the visual aspect then so be it.
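For anyone who hasn't seen the technique: classic image replacement and hidden-text spam can be near-identical at the stylesheet level. A schematic example (the selector is invented for illustration):

```css
/* Legitimate image replacement: the h1 text is hidden and a logo
   image shown in its place -- the markup stays semantic. */
h1#logo {
  background: url(logo.gif) no-repeat;
  width: 300px;
  height: 80px;
  text-indent: -9999px; /* shove the real text off-screen */
}

/* A keyword stuffer can use the exact same rule -- only the intent,
   and the text inside the h1, differs. Nothing in the CSS tells
   a bot which case it is looking at. */
```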
Admittedly this will be very uncommon for <h1> but is very likely for <h3> down. In fact, if you imagine a technical document, it would have a large amount of information that is not actually contained in the document but rather describes it. Such as:
Title
Category
Applicable to...
Author
Proposer/Designer/Implementer/Tester/Commissioner
Time/Date
Review Cycle
etc.
To my mind these form part of the header information for that document, yet you would not want to display most of them at more than 1em size, possibly with some other form of decoration to differentiate them from the actual text.
For example, I commonly include the Author and date/time at levels <h4> and <h5>, with the only difference between those and the main text being that they are italicised (is that a word?) and slightly greyed. Now <h4> is already far down the tree for an important piece of information such as document author, and search engines will be less favourable with it as it is. But to remove any sense of it being important because I size it at 1em would be ludicrous.
The whole idea of CSS is to separate formatting from content; search engines should be concentrating on the content.
I don't know how any engine is going to reduce the CSS cheating, but they can't do it by making judgement calls on the formatting of a document without blanketing out many legitimate ones.
However, I can see the possibility of blanket bans for sites with multiple <h1> tags, or even of checking the proportion of header tags to content; perhaps more than 10% of a page in headers would constitute a good cut-off point (plucked from thin air).
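That proportion could be computed with something as simple as this. A back-of-the-envelope sketch; the tag handling is naive and the 10% threshold is as arbitrary as the number plucked from thin air:

```javascript
// Fraction of a page's visible text that sits inside h1-h6 tags.
// A crude spam signal: most real documents keep this well under 10%.
function headingRatio(html) {
  // Collect text inside heading elements, tags stripped
  const headingText = (html.match(/<h[1-6][^>]*>([\s\S]*?)<\/h[1-6]>/gi) || [])
    .map(h => h.replace(/<[^>]+>/g, ""))
    .join("");
  // All text on the page, tags stripped
  const allText = html.replace(/<[^>]+>/g, "");
  return allText.length ? headingText.length / allText.length : 0;
}
```

Note it measures characters, not keywords, so it would catch the all-headings page but miss a single stuffed <h1>.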
But to remove any sense of it being important because I size it at 1px would be ludicrous?
> stores documents to be indexed in its repository first. Documents are then pulled from the repository when its time for indexing and sorting.
But surely google could read the CSS and store it with the page in the repository. If they have enough space (of course!). Technically, it would be simple to index it, in reality it may be too much of a load and too much storage space required for the little benefit it may give.
they haven't even managed to filter out hidden HTML/text links
Suzy, I agree wholeheartedly about accessibility and the issues that may arise and the fact that a blanket ban for all abuse (whatever that is :)) could seriously cause havoc.
In general I am referring to the real bad boys of hidden stuff. Don’t get me wrong this is not a crusade but more of a what if.
Many people optimising these techniques may block the CSS file
Random checking sounds ultra-difficult and unefficient
They would need to make changes to the whole flow of their indexing
You also could not penalise someone using <h1> tags with the same size font as <p> tags
I don't know how any engine is going to reduce the CSS cheating but they can't dop it by making judgement calls on the formatting of a document without blanketing out many legitimate ones.
Cheers
1px is not 1em. The earlier reference was :
Enough to know when H1 has been made to look like p text
and in fact you yourself said :
So, if the font size used was the same for H1 and P tags, then it needs to view the H1 text in that page as equal to P and no higher. Not penalise.
Obviously making text 1px is very different from making it the same size as your normal paragraphed text. You cannot say that a header tag is only as important as your normal text just because of the difference in size. Neither can you judge by colour, text-decoration, weight or any other visual means.
The header tags are there to mark up headers; how that is displayed should, in an ideal world, have no bearing on its importance in a document. The fact that some people abuse this will have to be solved some other way, unless we plan on abandoning the whole idea of semantic markup.
Absolutely. Storage issues aside, google's claim to fame, so to speak, was speed and quantity. Google indexes pages at blazing speed sequentially from the repository, and this indexing process is totally independent of other documents.
Introducing an external CSS file into the equation would slow down the indexing process significantly, because each time a CSS file was required google's indexer would need to search for it (causing overhead in seek time), decompress it, merge it with the original document, calculate all the extra CSS penalties and then continue with the rest of the sorting function.
To save time, google doesn't store files in any particular order in the repository; they are just packed in one after the other and processed one by one. Searching through terabytes for one CSS file is time-costly.
>>the possibility of blanket bans for sites with multiple <h1> tags
IMHO blanket bans are just too tricky to do correctly and already stated in your post, innocent victims may get penalized.
I think google will just continue with its existing strategy of "Count-weights increase linearly with counts at first but quickly taper off so that more than a certain count will not help".
They might like to play around with the weight factor given to each of the different tags, but that's about it.
I think Google will rely on people reporting websites for spamming the engine, but I hope they will look on hidden text in the way they look on alt text.
Do reported pages get a human review? I assume they do.
I hope they do this with an open mind, because I love using hidden text to aid accessibility. You can give graphical browsers pretty pictures, and text browsers properly laid out text.
It's much neater than using alt tags.
Enough to know when H1 has been made to look like p text and divs are hidden etc etc.
that wouldn't help much
With javascript you can change whatever you want, undetectably.
You can even create code on the fly (with eval) or hide things after an event (timeout, mouse move etc).
Honestly, G had better ignore links in NOSCRIPT and NOFRAMES; that would be much easier, with zilch collateral damage.
If it is hidden in JS it won't really be part of the page and won't really count towards ranking. Now, a redirect is a different beast altogether, as the code of the page is still seen but the bot is not redirected. But that moves away from the thread.
I suppose you could go after redirects as well :) :)
I think Plasma was thinking the reverse, not to write in extra H tags for google, but to hide them from a user.
exactly
I was talking about code that will be generated on the fly triggered by an event like a mouse move.
The code to modify the css isn't there at loading time, only after moving the mouse (or whatever non-foreseeable event) the code will be generated.
E.g. you have the real code as text in small chunks, on event -> eval
Even if G examined the code, it couldn't possibly detect it.
Keep the JS in a separate script and you have the same effect as with CSS.
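Roughly what's being described, sketched without a real DOM (the element here is a plain object standing in for a div; in a browser you'd attach `reveal` to a mousemove listener). The style-changing code exists only as inert string chunks until the event fires:

```javascript
// The CSS-modifying code exists only as harmless-looking string chunks.
// Nothing in the static source says "display = 'none'" in one piece,
// so a source-scanning bot has nothing obvious to flag.
const chunks = ["el.style.dis", "play = 'no", "ne';"];

const el = { style: {} }; // stand-in for a DOM element (assumption: no browser here)

// In a browser: document.addEventListener('mousemove', reveal)
function reveal() {
  eval(chunks.join("")); // code is assembled and run only on the event
}
```

A crawler that never moves a mouse never sees the hidden state, which is the "impossible to detect" point being made above.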
But most likely it was more a matter of keyword stuffing in the span
Now there (IMO) you have it!
NO "spamming" technique will work unless you actually find the SE's achilles heel in the first place so, and as I've said somewhere before, I think that *G* especially are trying to attack from that angle (florida is a good example) as opposed to trying to implement filters whether they are HTML or CSS..
I don't think they want to read/parse CSS files they want to get their model right from another angle, what that is is still open to interpretation.
Suzy
It is Google's rules we obey, if they change then so must we.
UKGimp> I have to disagree. If we structure documents properly and ignore Google then Google will have to find inventive ways to change their rules to find the highest quality websites. We make Google what it is far more than Google makes us what we are. If all Google ends up with - when quality sites refuse to resort to tricks - is spam sites which exploit the algo of the month, it forfeits its position. Google knows this: hence Florida.
Chasing search engines has to be less productive in the long run than forcing the search engines to chase you. If Google had faith in algorithms alone, it wouldn't put so much weight on dmoz.
If CSS and Javascript abuse evolves to the extent where no tag can be trusted, Google will cease to take tags into account. It's already a matter of debate if Google even uses PageRank anymore - if it was so valuable to them, would they not have struck a financial deal with Stanford to assume responsibility for the patent?
The extent of abuse at the moment is not dissimilar to the meta-tag keyword stuffing which brought down altavista. Google is cautious enough that it will not willingly befall the same fate.
"display: none" probably has no meaning to it's algo as it's not important... it's the keyword abuse that is.
It is Google's rules we obey, if they change then so must we
In the context of keyword-abuse yeah - The most they'll likely do is to just ignore, not penalise, content within "risky" CSS rules... kind'a like wiping out any hiding place.
Bet ya wish you hadn't started this now eh :D
NO "spamming" technique will work unless you actually find the SE's achilles heel in the first place so, and as I've said somewhere before, I think that *G* especially are trying to attack from that angle
If they could add css parsing with the odd check here and there it would certainly help but as you allude to the algo must be flawed in some way if it allows certain things.
UKGimp> I have to disagree.
Some good points on chasing algos - algos which I personally only followed to an extent. The markup of things I am involved in has stayed exactly the same, and things are going quite well. No drop for anything. I even link to PR0 pages if the content is good, as the TBPR means bugger all to me. Cast your eyes over an anti linking sentiment [webmasterworld.com] thread I started. So hopefully you can see my stance on algo chasing. :)
The most they'll likely do is to just ignore, not penalise
Bet ya wish you hadn't started this now eh :D
IeuanJ, 1px is not 1em, agreed.
But 1em should not be penalised....
Also Agreed
....but if it is the same size text as P text, why should google give it extra weight? (Unless maybe it is bold or underlined etc...)
If you had read my comments properly you would see why. It should be given extra weight because it is a HEADER, not a normal paragraph.
Say I make all my H4 tags to be the same size, weight and decoration as my p tag, I could still easily differentiate it by changing the background colour. To the eye it would stand out as a header because it is different, to the machine it will stand out because it is wrapped in a header tag. For an example look at this site, when you are at the posting screen the text "A pie in the sky solution" could easily be considered a piece of header information, yet it is the same size as the main bulk of the text. On the forums it is actually smaller! Does this mean that search engines should not give it more significance?
If you take away the meaning of the header tags for any reason you are in effect breaking the semantic layout of the document.
Besides, if you choose to discriminate based on text size, the cheats will just use different measurement units to resize them the same (10pt = x em)
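The units point is easy to demonstrate: any size-based test would first have to normalise every declaration to a single unit before comparing anything. A minimal converter, assuming the common browser defaults of a 16px em base and 96px per inch (so 72pt = 96px), and ignoring the cascade and inheritance entirely:

```javascript
// Normalise a CSS font-size to pixels so "12pt", "16px" and "1em"
// all compare equal. Assumes a 16px em base and 96dpi defaults,
// and ignores cascading/inheritance entirely.
function toPx(size) {
  const m = /^([\d.]+)(px|pt|em)$/.exec(size.trim());
  if (!m) return null; // unrecognised value
  const n = parseFloat(m[1]);
  switch (m[2]) {
    case "px": return n;
    case "pt": return n * 96 / 72; // 72pt per inch, 96px per inch
    case "em": return n * 16;      // browser-default 16px em
  }
}
```

Even this leaves out %, keywords like "small", and ems that inherit from a resized parent, which is where a real comparison gets hard.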
>> If you take away the meaning of the header tags for any reason you are in effect breaking the semantic layout of the document. <<
I guess that google can see the difference in a document where 10% of text is inside heading tags, and 90% inside paragraphs, and lists, as opposed to most of the text being inside heading tags, or having heading tags inside paragraphs. The former is valid well-formed semantic markup. The latter is usually spam.
It cannot and will not be run as part of indexing. That is too processor intensive.
It will not be run with the Googlebot user agent.
It will be run on systems with specially modified browsers with a human in attendance. All CSS and Javascript will be run. It is not a bot, but a browser, so it does not have to respect the robots.txt. All pages are actually rendered.
The browser would flag when certain things crossed a threshold. It is an aid to manual checks, greatly increasing their speed.
As they improve it, it will be able to catch more forms of cheating. They can also decide on different ways to feed pages into it for checking.
It doesn't sound like they are using it much right now, but they certainly proved that it does work. And they can start using it again whenever they want, and they can add whatever changes they want to be able to match the trick du jour.
Is there anything specific you would parse CSS files for, hints of hidden text maybe?
I've been making a keyword prominence/density analyzer; it might be worth parsing CSS files, just for font-sizes, for a "prominence score". The original backrub paper says that font-size is taken into account, and I guess to a degree it could apply to any word-prominence measurement...
I'd have to read a CSS book or perch myself in the CSS forum to learn more about other CSS attributes et al, but grabbing font-sizes seems easy enough and useful, though finding hidden text sounds a bit (far!) more complex.
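Grabbing font-sizes really is mostly string-matching; the weighting is the simple part. A sketch of the sort of score meant above — every weight and name here is invented for illustration, and it assumes you've already mapped tags to pixel sizes from the CSS:

```javascript
// Weight keyword hits by the font-size of the tag they appear in.
// "sizes" maps tag -> px and would come from whatever CSS grabbing
// is done first; 16px is treated as the neutral body size.
function prominenceScore(wordsByTag, sizes, keyword) {
  let score = 0;
  for (const [tag, text] of Object.entries(wordsByTag)) {
    const px = sizes[tag] || 16;          // default body size
    const hits = text.toLowerCase().split(/\s+/)
      .filter(w => w === keyword).length;
    score += hits * (px / 16);            // bigger text, bigger weight
  }
  return score;
}
```

So a keyword in a 32px h1 counts double a keyword in 16px body text — which is also exactly the knob a 1px-text cheat is trying to turn the other way.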
text.hidden = false; yay!
Good Text! Include this text in your algo
text.hidden = true; boo!
Bad Text! Don't include this text in your algo
Simple. Problem solved. Where's my google job?