Forum Moderators: open
I was happy with it at first, but then I started to wonder: would it end up filtering out innocent web sites?
There is really a lot to consider when filtering out hidden text. For example,
black cell + black text = hidden
but making a cell black can be done through the TD attributes, the table attributes, style sheets, or even a black image set as the background ...
So I think they'll have trouble filtering out even 10% of the spam.
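Just to illustrate how naive the obvious version of that check has to be, here's a rough sketch (Python with BeautifulSoup, assuming for simplicity that colours are only ever set through inline HTML attributes). It catches the classic bgcolor/font case and misses anything done with stylesheets, classes or background images, which is exactly the weakness above:

```python
from bs4 import BeautifulSoup

def effective_bgcolor(tag):
    """Walk up the tree and return the nearest explicit bgcolor attribute."""
    while tag is not None:
        bg = tag.get("bgcolor")
        if bg:
            return bg.lower()
        tag = tag.parent
    return "#ffffff"  # assume the default white page background

def looks_hidden(html):
    """True if any <font> tag's colour matches the background it inherits."""
    soup = BeautifulSoup(html, "html.parser")
    for font in soup.find_all("font", color=True):
        if font["color"].lower() == effective_bgcolor(font.parent):
            return True
    return False

# black text in a black table cell: flagged
print(looks_hidden('<table bgcolor="#000000"><tr><td>'
                   '<font color="#000000">invisible keywords</font>'
                   '</td></tr></table>'))  # True
# black text on the default white background: fine
print(looks_hidden('<p><font color="#000000">normal text</font></p>'))  # False
```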
Second, what if you're an innocent webmaster? Your text defaults to black, but in a table you want some white text over a black background, so you make the table black with a stylesheet or something else the algo isn't aware of. So what does Google think?
white text + white background = spam!
I just hope Google will take a lot of precautions with this. I'd rather have some spam left over than lose some innocent web sites.
But I am of course curious how the algo works, and whether it really works totally automatically or triggers a human review; whether it only kicks in at a certain percentage of hidden text relative to the whole content of the page; whether it will fetch .CSS files (even those disallowed by robots.txt), etc. etc.
But I guess we'll just have to wait and see. Anyway, we'll have lots of stuff to test and discuss... :)
Isn't not being ranked high a bad enough penalty?
And isn't being indexed but not ranked high a million times better than being innocently penalised?
I think it's a good compromise.
SN
Google CAN'T develop the same algorithm, not unless they want to violate Inktomi's patent.
They have to go at it a different way.
So until they figure out how to circumvent the other patent, there is NO way they will automatically detect hidden text, etc.
Inktomi Patented hidden text detection through a variety of means, in 1999.
They may have, at that, but IT DOESN'T WORK, and neither does Google's approach, or lack thereof. It's amazing that perhaps one of the first SE spam techniques still pays dividends!
Simply ignore the text that "offends" the hidden-text recognition algo.
Brilliant idea Killroy, that's more than fair.
I do not expect them to run the hidden text algo on every single page as they index it. That would be incredibly compute-intensive, even for Google. I think they will be running it in a somewhat constant fashion on the top search results.
The comparison with the expired domains is not correct, because they are not through with implementing that filter yet. While you may be taking an unfair hit till they complete the work on the filter, you have been getting an unfair boost since you bought the domain.
1. Visit page
2. Print it out
3. Scan the printed page
4. Use a text interpreter (free with most scanners these days)
5. Set algorithm to work on resulting file.
Of course Google would never do it like that exactly, but you can see that it doesn't actually require massively complex AI.
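For what it's worth, that whole pipeline fits in a handful of lines these days. A rough sketch, assuming Playwright for the rendering/screenshot and Tesseract for the OCR step (stand-ins, obviously, not anything Google would actually run); the interesting part is just diffing the words in the markup against the words that survive the trip through pixels:

```python
from playwright.sync_api import sync_playwright
import pytesseract
from PIL import Image

def word_sets(url):
    """Return (words in the DOM, words visible in a screenshot) for a URL."""
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        page.screenshot(path="page.png", full_page=True)
        dom_text = page.inner_text("body")  # text as the markup exposes it
        browser.close()
    ocr_text = pytesseract.image_to_string(Image.open("page.png"))
    return set(dom_text.lower().split()), set(ocr_text.lower().split())

dom_words, seen_words = word_sets("http://example.com/")
hidden = dom_words - seen_words  # words in the source that never show on screen
print(sorted(hidden))
```

OCR is noisy, so a real check would want thresholds rather than an exact diff, but the principle is the same as the print-and-scan version above.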
Improving Pseudo-Relevance Feedback in Web Information Retrieval [research.microsoft.com]
The fourth page gives a good idea of how a visual approach could uncover hidden text.
And China has the most fuzzy scientists in the world. :) That paper points to Chinese engineering and research more than anything...
Perhaps with different thinking, a better method can be uncovered to handle this 'problem'?
As I mentioned, though - Google may have their own detection method, and it very well could be that the results are more or less the same, hidden text or not.
After all, if the hidden text is relevant to the page, and the page is relevant to the query, and in fact, the "best" page for the query, why would Google penalize that web page?
In this instance, it would make no sense, and might even hurt their search results, if they banned such hidden text...
The more people try to cheat, the worse their results in the SERPs, yet their pages would still be in the index and they would still have the same PR.
It would even make a great defense against google bombing. All Microsoft would have to do to never come up for "go to hell" is hide the word "hell" on their webpage.
Because individuals using text-to-speech to surf the Web are not getting quality results.
I guess it's relevant for them to hear "widget" this, "widget" that repeated 2500 times, but is it the "best" page for them?
Anyway, even though Googleguy called it an automated check, I think the intention must basically be to flag sites for review, rather than to shoot first and ask questions later.
AFAIK Google doesn't currently parse CSS or Javascript or layers, and while they can download background images they don't do any image processing other than to reduce them to thumbnail size. It would be incredibly difficult to add all these capabilities, and in many cases their spider will not be permitted to crawl the necessary files anyway.
Maybe someone can ask about specifics on Saturday, per Googleguy's suggestion.
Google may not care very much if an algo hurts this or that "good" page. They just need to return a SERP that excludes irrelevant pages for their user.
So if this or that "good" page doesn't make the cut because of some algo accident or another - that still doesn't really hurt Google's business purpose. As long as all the pages that DO make it are all good results for the search, Google will thrive. They don't need to return ALL the good pages to serve the public well.
At least if I were creating a filter to catch abuses, I would be thinking this way. But as website owners, we tend to think in terms of how good and valuable our ONE page is.
Google may not be able to afford that kind of thinking. They need to think about presenting a good collection of search results, even though a particularly "good" page may not be there.
It would be a trade off. If the algo eliminates bogus results and returns good ones -- then even though it misses one or two "really good" pages -- that may be a better situation than an algo that returns some really good pages but mixed in with garbage.
But if the new hidden text detection algo has been implemented - from the limited testing I've done so far - I don't think a lot of it. It still allows the 'king of usability' to rank very highly for the misspellings of his name - which are only contained in microtext at the bottom of his page, as previously discussed at
[webmasterworld.com...]
I think an algo that ignores anything which is too small a font size; or is dubious in terms of text colour - rather than wholesale banning - would be a great step forward.
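A toy version of that "ignore it, don't ban for it" idea, assuming for simplicity that microtext only ever arrives via old-style <font size="1"> tags: the tiny text just never makes it into what gets indexed, and nobody gets penalised for it:

```python
from bs4 import BeautifulSoup

MIN_FONT_SIZE = 2  # <font size="1"> is the classic microtext trick

def strip_microtext(html):
    """Return the page text with suspiciously tiny text dropped, not penalised."""
    soup = BeautifulSoup(html, "html.parser")
    for font in soup.find_all("font", size=True):
        size = font["size"]
        if size.isdigit() and int(size) < MIN_FONT_SIZE:
            font.decompose()  # drop it from the index, keep the page
    return soup.get_text(" ", strip=True)

html = '<p>Real content here.</p><font size="1">usabilty useability usablity</font>'
print(strip_microtext(html))  # "Real content here." - the microtext never gets indexed
```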
Of course, the one they may never find a solution for is the web page that is black, has text that needs to be hidden in green, and uses a 20x20 graphic in that same shade of green as a background image... that one is pretty damn hard to beat without a lot more work.
Alex
After all, if the hidden text is relevant to the page, and the page is relevant to the query, and in fact, the "best" page for the query, why would Google penalize that web page?
It doesn't have to be a question of penalising, or even of 'ignoring'; it's just a question of indexing the page as it shows up in a browser, rather than indexing the source code of the page.
This would be a major shift in thinking, but that is what google is good at.
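A rough sketch of what that could look like, assuming a Playwright-driven browser does the rendering: let the engine compute the final colours (from stylesheets, inheritance, wherever they come from) and only keep text whose colour differs from the background it actually sits on. Very much a toy, but the point is that the rendering engine does all the hard work the spider can't:

```python
from playwright.sync_api import sync_playwright

# Runs inside the rendered page: collect text from leaf elements whose computed
# colour differs from the first non-transparent background above them.
JS = """
() => {
  const kept = [];
  document.querySelectorAll('body *').forEach(el => {
    if (el.children.length) return;                 // leaf elements only
    const text = (el.textContent || '').trim();
    if (!text) return;
    const colour = getComputedStyle(el).color;
    let node = el, bg = 'rgba(0, 0, 0, 0)';
    while (node && (bg === 'rgba(0, 0, 0, 0)' || bg === 'transparent')) {
      bg = getComputedStyle(node).backgroundColor;  // climb until a real background
      node = node.parentElement;
    }
    if (colour !== bg) kept.push(text);             // keep only text that contrasts
  });
  return kept.join(' ');
}
"""

def rendered_text(url):
    """Return only the text a browser would actually display for this URL."""
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        text = page.evaluate(JS)
        browser.close()
    return text

print(rendered_text("http://example.com/"))
```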
[webmasterworld.com...]
More importantly, what happens if there are subtle differences between the way IE displays a page and the way google renders a page? A table that is black in IE might be colored in google-rendered world. Then you get people being punished for no good reason.
Alex
Google does not have to run their spam checkers on every page every month.
They can keep track of their search results and feed the results of first page results into their spam check queue, and filter out those that have already been checked.
Let's face it, no one cares about the unsuccessful spammers. If you get rid of the hidden text spammers that make it to the front page of the SERPs, you would get rid of most of the hidden text SPAM complaints.
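Something along these lines, in other words (the two helpers are just stand-ins, since the interesting part is the bookkeeping, not the detector itself):

```python
already_checked = set()

def get_top_results(query):
    # stand-in: in reality this would be Google's own first-page results
    return [f"http://example.com/{query}/{i}" for i in range(10)]

def check_for_hidden_text(url):
    # stand-in for whatever hidden-text detector ends up being used
    return False

def queue_spam_checks(queries, top_n=10):
    """Feed first-page results into the spam check, skipping pages seen before."""
    flagged = []
    for query in queries:
        for url in get_top_results(query)[:top_n]:
            if url in already_checked:
                continue  # already checked the last time it ranked
            already_checked.add(url)
            if check_for_hidden_text(url):
                flagged.append(url)  # hand these to a human review queue
    return flagged

print(queue_spam_checks(["widgets", "blue widgets"]))
```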
And if Google uses Mozilla as its rendering engine (which would be the sensible thing) and something comes up as hidden text in Moz, I think they should take it out of the top results. That page is automatically garbage for 15% of the users.
As soon as you render a page, some people will be happy, some people will be sad, and some will lose rankings without ever knowing why. I will bet you that most people designing sites don't happen to have a copy of NS 4.x lying around to check compatibility of their sites. If they did, they would know that it doesn't handle coloring of cells in a table very well. That alone could create "hidden text" that isn't hidden - just html not being interpreted correctly.
I still think google would do better at finding "tricks" and hidden text by having people actually REVIEW the higher SERPs. Even if you just review the top 20 for 1000 searches, you will look at 20k pages and almost certainly dump some domains... and in that, you will trap them out of other terms.
Remove the economic cycle of "buy domain, spam domain, get good listings a few times until a new algo comes out, then do it again with another domain", which seems to be the case right now.
Alex