
automatic hidden-text detection algorithms

         

AthlonInside

5:31 am on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



GG mentioned this in [webmasterworld.com...]

I was happy with it at first, but later I started to wonder: could it be so aggressive that it filters out innocent web sites?

There is really a lot to consider when filtering out hidden text. For example,

black cell + black text = hidden

but making a cell black can be done from the TD properties, or table properties, or even style sheets, or a black image as the background ...

So I think they would have trouble filtering out even 10% of the spam.

Second, what if you are an innocent webmaster? Your text defaults to black, but in a table you want some white text over a black background, so you make the table black with a stylesheet or something else the algo isn't aware of. So what does Google think?

white text + white background = spam!

I just hope Google will take lots of precautions with this. I would rather have some spam left over than lose some innocent web sites.
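The ambiguity described above can be made concrete with a toy sketch (hypothetical Python, not anything Google has described): a checker that only compares the colors it can see in the markup both misses stylesheet-set backgrounds and flags innocent pages.

```python
# Naive hidden-text check: flag text whose recorded color matches the
# recorded background color. This is a sketch of the idea only -- it
# cannot see colors set via stylesheets or background images, which is
# exactly the ambiguity discussed above.

def is_hidden(text_color, background_color):
    """Flag text as hidden when foreground and background match exactly."""
    return text_color.lower() == background_color.lower()

# Caught: black-on-black set directly in the markup.
print(is_hidden("#000000", "#000000"))   # True

# Missed: the cell is black via a stylesheet the checker never parsed,
# so the recorded background is still the page default of white.
print(is_hidden("#000000", "#ffffff"))   # False -- a false negative

# False-positive risk: white text that a stylesheet puts over a black
# cell looks like white-on-white if only the page background is known.
print(is_hidden("#FFFFFF", "#ffffff"))   # True -- an innocent site flagged
```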

Krapulator

6:17 am on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes, I was intrigued by this comment from GG. I am eagerly anticipating seeing how this is applied. If you're not doing anything wrong, don't worry about it. Google are clever boffins; I'm sure they would have tested this very thoroughly so as not to impact legit sites.

AAnnAArchy

7:18 am on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Krapulator <<Google are clever boffins, I'm sure they would have tested this very thoroughly so as not to impact legit sites.>>

I dunno, ask them about their expired domain penalties. ;)

ruserious

8:08 am on Apr 24, 2003 (gmt 0)

10+ Year Member



Since we don't have any influence on the algo, you have to make sure your pages are as "clear" on this matter as possible. I have white text on a background image in table cells as my main navigation. To reduce ambiguity for a possible algo, I made sure that the navigation works just as well without images (the cell background color is set to something similar to the image). And it even works without CSS: because both colors are defined in CSS, if a browser doesn't apply the green background, it also won't make the text color white.

But I am of course curious how the algo works, and whether it really works totally automatically or triggers human review; whether it only kicks in at a certain percentage of hidden text relative to the whole content on the page; whether it will fetch .css files (even those disallowed by robots.txt); etc.

But I guess we'll just have to wait and see. Anyway, we'll have lots of stuff to test and discuss... :)

killroy

8:45 am on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm, here's a thought: don't auto-trigger penalties. Simply ignore the text that "offends" the hidden-text recognition algo. Legit sites won't be penalised, unless they have the whole text content of the page hidden in such mushy ways that Google won't see it. And spammy sites get no benefit, so they won't be on top. Unless they manage to get to the top with the non-spammy parts... but then the page would deserve to be on top :)

Isn't not being ranked high a bad enough penalty?

And isn't being indexed but not ranked high a million times better than being innocently penalised?

I think it's a good compromise.

SN

AthlonInside

5:48 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It would be better to have the algo create a list of 'suspected' sites to help Google reviewers manually dump pages.

jeremy goodrich

5:54 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Inktomi patented hidden-text detection through a variety of means in 1999.

Google CAN'T develop the same algorithm -> unless they want to violate Inktomi's patent.

they have to go at it a different way.

So until they figure out how to circumvent the other patent, there is NO way they will automatically detect hidden text, etc.

lukasz

6:04 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



What about legitimate hidden text? For example, text made invisible with CSS that appears on the page when CSS is not supported, to tell viewers that their browser does not support CSS.

skipfactor

6:12 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Inktomi patented hidden-text detection through a variety of means in 1999.

They may have at that, but IT DOESN'T WORK, and neither does Google's approach, or lack thereof. It's amazing that perhaps one of the first SE spam techniques still pays dividends!

Simply ignore the text that "offends" the hidden-text recognition algo.

Brilliant idea Killroy, that's more than fair.

jeremy goodrich

6:15 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I didn't say whether they used their patent or not, only that they have it. Though Yahoo has it now, I guess.

So if Google wants to do the same thing, they'll have to figure out another algorithm... it could be that they have noticed the quality of their results is largely the same, hidden text or not.

BigDave

6:35 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think the reason it has taken so long is that Google is being very careful to get it right. I would not be too concerned about any problem that anyone here can come up with given a couple of minutes of thought; Google has hundreds of brilliant people who think about this sort of thing all day.

I do not expect them to run the hidden-text algo on every single page as they index it. That would be incredibly compute-intensive, even for Google. I think they will be running it in a somewhat constant fashion on top search results.

The comparison with the expired domains is not correct, because they are not through implementing that filter yet. While you may be taking an unfair hit until they complete work on the filter, you have been getting an unfair boost since you bought the domain.

rogerd

6:38 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Google are clever boffins, I'm sure they would have tested this very thoroughly

Ask the unlucky webmasters who happened to choose the name "themeindex" for one of their files...

madweb

7:43 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



If a browser can understand HTML/CSS and decide what is displayed on the screen, Google can understand HTML/CSS and decide what is taken into account in the SERPs. I know it's not simple, but it's not rocket science either. IMHO.

madweb

7:47 pm on Apr 24, 2003 (gmt 0)

10+ Year Member



In fact I could do it myself, in a roundabout fashion:

1. Visit the page
2. Print it out
3. Scan the printed page
4. Use a text interpreter (free with most scanners these days)
5. Set the algorithm to work on the resulting file.

Of course Google would never do it exactly like that, but you can see that it doesn't actually require massively complex AI.
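The interesting part of that pipeline is the final comparison: diff the words in the HTML source against the words recoverable from the rendered page. A minimal sketch, assuming the render/OCR stages (steps 1-4) have already produced a word list, with made-up example words:

```python
# Sketch of the comparison step in the render-and-read pipeline: words
# that appear in the HTML source but not in what a reader (or OCR pass)
# can see on the rendered page are the suspect, possibly hidden ones.

def suspect_words(source_words, rendered_words):
    """Return words present in the source but invisible when rendered."""
    return sorted(set(w.lower() for w in source_words)
                  - set(w.lower() for w in rendered_words))

source = ["widgets", "cheap", "widgets", "blue", "widgets"]
visible = ["Cheap", "blue"]          # what the OCR pass actually saw

print(suspect_words(source, visible))   # ['widgets'] -- hidden in the render
```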

brotherhood of LAN

7:57 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Similar to madweb's suggestion, and another way than the Inktomi way.....a M$ paper

Improving Pseudo-Relevance Feedback in Web Information Retrieval [research.microsoft.com]

The fourth page gives a good idea of how a visual approach could uncover hidden text.

jeremy goodrich

8:04 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



:) It takes different math to do different things. Microsoft is one of the biggest companies in the US to use fuzzy logic and that type of thinking in their engineering...

and China has the most fuzzy-logic scientists in the world. :) That paper points to Chinese engineering and research more than anything...

Perhaps with different thinking, a better method can be uncovered to handle this 'problem'?

As I mentioned, though - Google may have their own detection method, and it very well could be that the results are more or less the same, hidden text or not.

After all, if the hidden text is relevant to the page, and the page is relevant to the query, and in fact, the "best" page for the query, why would Google penalize that web page?

In this instance, it would make no sense, and might even hurt their search results, if they banned such hidden text...

BigDave

8:24 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I still think that they should make the hidden word cause the site to never show up for that word.

The more people try to cheat, the worse their results in the SERPs, yet their pages would still be in the index and they would still have the same PR.

It would even make a great defense against google bombing. All Microsoft would have to do to never come up for "go to hell" is hide the word "hell" on their webpage.
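BigDave's suggestion is easy to express: when building a page's index entry, drop exactly the terms detected as hidden, so the page stays indexed (and keeps its PR) but can never rank for those words. A toy sketch (hypothetical, with made-up terms):

```python
# Sketch of the "never rank for hidden words" idea: hidden words are not
# penalised, they simply never count toward ranking for that page.

def indexable_terms(page_terms, hidden_terms):
    """Keep the page in the index, minus the terms found to be hidden."""
    hidden = {t.lower() for t in hidden_terms}
    return [t for t in page_terms if t.lower() not in hidden]

# The Google-bombing defence described above: hide the word and the
# page can never be returned for it.
terms = ["go", "to", "hell", "software", "windows"]
print(indexable_terms(terms, hidden_terms=["hell"]))
# ['go', 'to', 'software', 'windows']
```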

skipfactor

8:55 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>After all, if the hidden text is relevant to the page, and the page is relevant to the query, and in fact, the "best" page for the query, why would Google penalize that web page?

Because individuals using text-to-speech to surf the Web are not getting quality results.

I guess it's relevant for them to hear "widget" this, "widget" that repeated 2500 times, but is it the "best" page for them?

jomaxx

9:09 pm on Apr 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I doubt very much that banning hidden-text sites would harm SERP quality (if done right).

Anyway, even though Googleguy called it an automated check, I think the intention must basically be to flag sites for review, rather than to shoot first and ask questions later.

AFAIK Google doesn't currently parse CSS or Javascript or layers, and while they can download background images they don't do any image processing other than to reduce them to thumbnail size. It would be incredibly difficult to add all these capabilities, and in many cases their spider will not be permitted to crawl the necessary files anyway.

Maybe someone can ask about specifics on Saturday, per Googleguy's suggestion.

AthlonInside

5:41 am on Apr 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



For those who get to the pub conference and talk about this new hidden-text algo, please post about it here.

Thank You.

tedster

7:23 am on Apr 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's some food for thought.

Google may not care very much if an algo hurts this or that "good" page. They just need to return a SERP that excludes irrelevant pages for their user.

So if this or that "good" page doesn't make the cut because of some algo accident or another - that still doesn't really hurt Google's business purpose. As long as all the pages that DO make it are all good results for the search, Google will thrive. They don't need to return ALL the good pages to serve the public well.

At least if I were creating a filter to catch abuses, I would be thinking this way. But as website owners, we tend to think in terms of how good and valuable our ONE page is.

Google may not be able to afford that kind of thinking. They need to think about presenting a good collection of search results, even though a particularly "good" page may not be there.

It would be a trade off. If the algo eliminates bogus results and returns good ones -- then even though it misses one or two "really good" pages -- that may be a better situation than an algo that returns some really good pages but mixed in with garbage.

Chris_D

7:54 am on Apr 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree Tedster.

But if the new hidden-text detection algo has been implemented - from the limited testing I've done so far - I don't think much of it. It still allows the 'king of usability' to rank very highly for misspellings of his name - which are only contained in microtext at the bottom of his page, as previously discussed at

[webmasterworld.com...]

I think an algo that ignores anything in too small a font size, or anything dubious in terms of text colour - rather than wholesale banning - would be a great step forward.

RawAlex

8:04 am on Apr 25, 2003 (gmt 0)

10+ Year Member



When you talk about hidden text, don't forget such classics as text and links hidden between style tags, and text in colors not exactly like the background but not that far off either.

Of course, the one they may never find a solution for is the web page that is black, has text that needs to be hidden in green, and uses a 20x20 graphic in the same green as a background image... that one is pretty damn hard to beat without a lot more work.

Alex

madweb

8:36 am on Apr 25, 2003 (gmt 0)

10+ Year Member



RawAlex,
That wouldn't be a problem with a visual approach. Once you've got a page rendered into a graphic format, you can do simple calculations on hue/saturation etc. to decide what has enough contrast to count as 'real' text.

After all, if the hidden text is relevant to the page, and the page is relevant to the query, and in fact, the "best" page for the query, why would Google penalize that web page?

It doesn't have to be a question of penalising, or even of 'ignoring'; it's just a question of indexing the page as it shows up in a browser, rather than indexing the source code of the page.

This would be a major shift in thinking, but that is what google is good at.
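One standard way to put numbers on that contrast idea is the relative-luminance contrast ratio used in accessibility work. A sketch, where the 2.0 cutoff is an arbitrary illustrative threshold, not anything Google has published:

```python
# Sketch of the visual contrast test: once foreground and background
# colors are known from the rendered page, a luminance-contrast ratio
# decides whether text is readably "real". The luminance formula is the
# standard sRGB relative-luminance used in accessibility guidelines.

def relative_luminance(rgb):
    """Relative luminance of an sRGB color given as (r, g, b) in 0-255."""
    def channel(c):
        c /= 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0."""
    lighter = max(relative_luminance(fg), relative_luminance(bg))
    darker = min(relative_luminance(fg), relative_luminance(bg))
    return (lighter + 0.05) / (darker + 0.05)

def is_real_text(fg, bg, threshold=2.0):
    """Count text as visible only above an illustrative contrast cutoff."""
    return contrast_ratio(fg, bg) >= threshold

print(is_real_text((255, 255, 255), (0, 0, 0)))   # True: maximum contrast
print(is_real_text((0, 120, 0), (0, 128, 0)))     # False: green-on-green
```

This also handles RawAlex's near-miss colors: text merely close to the background scores a ratio near 1 and fails the cutoff, with no exact color match required.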

jady

11:27 am on Apr 25, 2003 (gmt 0)

10+ Year Member



Would also love to see a filter that can pick up obscene numbers of fragmented keywords. This would take care of the folks who just throw 200 keywords (not in any meaningful sentence) at the bottom of their page.

madweb

12:08 pm on Apr 25, 2003 (gmt 0)

10+ Year Member



I think Google already has some understanding of context/grammar. Perhaps the acquisition of Applied Semantics will improve on this....

[webmasterworld.com...]

RawAlex

4:12 pm on Apr 25, 2003 (gmt 0)

10+ Year Member



Madweb, you would be talking about another major increase in the processing power required to do this to every page in the land of Google. That would be a lot more work.

More importantly, what happens if there are subtle differences between the way IE displays a page and the way Google renders a page? A table that is black in IE might be colored in the Google-rendered world. Then you get people being punished for no good reason.

Alex

BigDave

4:51 pm on Apr 25, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Alex,

Google does not have to run their spam checkers on every page every month.

They can keep track of their search results and feed the results of first page results into their spam check queue, and filter out those that have already been checked.

Let's face it, no one cares about the unsuccessful spammers. If you get rid of the hidden text spammers that make it to the front page of the SERPs, you would get rid of most of the hidden text SPAM complaints.

And if Google uses Mozilla as its rendering engine (which would be the sensible thing) and something comes up as hidden text in Moz, I think they should take it out of the top results. That page is automatically garbage for 15% of the users.
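The queueing scheme described above - check only pages that reach the top results, and check each one once - is cheap to sketch. A hypothetical illustration with made-up URLs:

```python
# Sketch of a top-results spam-check queue: rather than scanning every
# indexed page, feed only first-page search results into the hidden-text
# checker, skipping pages that have already been seen.

checked = set()        # pages already queued for (or through) the checker
queue = []             # pages waiting to be checked

def enqueue_top_results(serp_urls, top_n=10):
    """Queue the first page of results, once each."""
    for url in serp_urls[:top_n]:
        if url not in checked:
            checked.add(url)
            queue.append(url)

enqueue_top_results(["a.com", "b.com", "c.com"])
enqueue_top_results(["b.com", "d.com"])   # b.com is skipped this time

print(queue)   # ['a.com', 'b.com', 'c.com', 'd.com']
```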

RawAlex

5:03 pm on Apr 25, 2003 (gmt 0)

10+ Year Member



Big Dave, what would your cutoff be? 10% of users? 5%? 1%? Considering that Mozilla doesn't treat tables properly, adds borders even when border="0" is used, etc., that would be a pretty poor choice. That would be like limiting page size to 20k because too many people are on dialup. Not practical.

As soon as you render a page, some people will be happy, some will be sad, and some will lose rankings without ever knowing why. I will bet you that most people designing sites don't have a copy of NS 4.x lying around to check the compatibility of their sites. If they did, they would know that it doesn't handle the coloring of cells in a table very well. That alone could create "hidden text" that isn't hidden - just HTML not being interpreted correctly.

I still think Google would do better at finding "tricks" and hidden text by having people actually REVIEW the higher SERPs. Even if you just review the top 20 results for 1000 searches, you will look at 20k pages and almost certainly dump some domains... and in doing that, you will trap them out of other terms as well.

Remove the economic cycle of "buy a domain, spam the domain, get good listings for a while until a new algo comes out, then do it again with another domain" that seems to be the case right now.

Alex