
When is Google going to act on problem results?

Algorithmically recognizable problem results.


eddy4711

9:26 pm on Mar 31, 2003 (gmt 0)

10+ Year Member



I've been using Google since the very first days and have always appreciated the crisp search results. But the recent high amount of SE spam in Google's index is quite annoying. This is not a specific spam report, but examples of what I think are algorithmically recognizable spam.

Cloaking via the HTTP 'Referer' header (snip)

- Requesting a page without a Referer header (e.g. like GoogleBot does) produces a spam page

- Requesting with a Referer header (e.g. the link was clicked from Google search results) produces a redirect to an affiliate site, so it also acts as a doorway.

GoogleBot could request once with a Referer and once without (and act on substantial differences), or always set a random Referer. Does anyone know of a legitimate use of the above technique?
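Something along these lines -- purely illustrative, not GoogleBot's actual behaviour; the URL, Referer value and similarity threshold below are all made up:

# Illustrative sketch: fetch the same page twice, once without a Referer and
# once as if the visitor clicked through from a search results page, then
# compare the two responses. A big difference suggests Referer cloaking.
import urllib.request
import difflib

def fetch(url, referer=None):
    req = urllib.request.Request(url)
    if referer:
        req.add_header("Referer", referer)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def looks_like_referer_cloaking(url):
    plain = fetch(url)  # crawler-style request, no Referer
    clicked = fetch(url, referer="http://www.google.com/search?q=widgets")
    similarity = difflib.SequenceMatcher(None, plain, clicked).ratio()
    return similarity < 0.5  # arbitrary threshold: the two responses barely overlap

# e.g. looks_like_referer_cloaking("http://www.example.com/page.html")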

This is an example of spamming by a German company, also a type of cloaking in that the pages the user and GoogleBot see are very different. Through the following CSS statement most of the page is rendered outside of the browser's window, so the user cannot see it:

h2 {position: absolute; width:800px; left: -220px; top: -100px;}

I'm not quite sure if GoogleBot crawls CSS, but if it did, it could check that at least 22000 pixels of the h2 are off-screen (and of course the tags you don't see are full of keywords and links to other spam). Does anyone know of a legitimate use of this technique?
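A rough sketch of such a check, assuming the crawler has the CSS text -- the threshold is an arbitrary guess, and negative offsets do have legitimate uses (image replacement, for instance), so at best this would be one signal among several:

# Illustrative sketch: flag CSS rules that absolutely position an element far
# off-screen with large negative offsets. OFFSCREEN_PX is an invented threshold.
import re

OFFSCREEN_PX = 100

def suspicious_rules(css_text):
    flagged = []
    for selector, body in re.findall(r'([^{}]+)\{([^}]*)\}', css_text):
        if 'absolute' not in body:
            continue
        offsets = re.findall(r'(?:left|top)\s*:\s*(-\d+)px', body)
        if any(int(v) <= -OFFSCREEN_PX for v in offsets):
            flagged.append(selector.strip())
    return flagged

print(suspicious_rules("h2 {position: absolute; width:800px; left: -220px; top: -100px;}"))
# ['h2']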

If there is no known legitimate use of these techniques, I think Google could really reduce the amount of spam in its index.

Eddy4711

[edited by: Marcia at 9:34 pm (utc) on Mar. 31, 2003]
[edit reason] No pointing out specifics, please [/edit]

netguy

9:36 pm on Mar 31, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



eddy4711... Nice laundry list for Google. Unfortunately, I haven't seen any of this taken into account in Google's algo. I have had to deal with competitors stuffing dozens of URLs into invisible image maps (with error-ridden coords) for 6 months - and month after month they are all still there, taking positions #1, #3, and #4 out of 2 million results...

apfinlaw

9:45 pm on Mar 31, 2003 (gmt 0)

10+ Year Member



Your comment h2 {position: absolute; width:800px; left: -220px; top: -100px;}

Wouldn't you think it would be relatively simple for G to identify when negative numbers appear (e.g. -220px)
and to flag this as spam?

We, too, have a competitor that spams using this method - it has been placing high for about 5 months now... what a crock!

le_gber

9:46 pm on Mar 31, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I thought the same about CSS tricks, but the thing is that Google has to have access to the CSS file to check whether a layer or div is hidden, off the screen, etc...

But there are many things that, I think, can make it quite difficult:

- you can disallow robots from 'reading' your CSS files
- you can set the CSS property in the CSS file to be visible or on the screen and then override it in the page (either in the head tags or in the div properties)
- you can also use JavaScript to hide or move a layer outside of a page

There are many ways for spammers to trick SEs with CSS and/or JavaScript, and I think that asking the robots to 'check' every page of a site is impossible in terms of time, bandwidth and memory. Moreover, if you forbid robots from checking your .css file, how can it tell legitimate from spammed divs/layers?
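For what it's worth, spotting the first case (a linked stylesheet the crawler is not allowed to fetch) would be cheap. A rough sketch, with placeholder URLs -- note it only tells you the file is blocked, not whether the blocking is legitimate:

# Sketch: list the stylesheets linked from a page that robots.txt forbids a
# given crawler from fetching. Inline styles and script-attached stylesheets
# are ignored; the example URL below is a placeholder.
import re
import urllib.robotparser
from urllib.parse import urljoin

def blocked_stylesheets(page_url, html, user_agent="Googlebot"):
    rp = urllib.robotparser.RobotFileParser(urljoin(page_url, "/robots.txt"))
    rp.read()
    hrefs = re.findall(r'''<link[^>]+href=["']([^"']+\.css)["']''', html, re.I)
    return [urljoin(page_url, h) for h in hrefs
            if not rp.can_fetch(user_agent, urljoin(page_url, h))]

# e.g. blocked_stylesheets("http://www.example.com/index.html",
#                          '<link rel="stylesheet" href="/styles/main.css">')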

Leo

Macguru

9:58 pm on Mar 31, 2003 (gmt 0)

WebmasterWorld Senior Member macguru is a WebmasterWorld Top Contributor of All Time 10+ Year Member



>>When is Google going to act on SE Spam?

Probably after the IPO. It seems that, at this point, banning linkage spammers makes them enough money in new AdWords accounts. Even nowadays, content spammers still get away with 1996-style hidden text.

MetropolisRobot

9:59 pm on Mar 31, 2003 (gmt 0)

10+ Year Member



Good calls there Leo.

The only real way to do this would be to have some form of rendering system within the SE that would render each page on a screen and then only work with the material that was actually rendered (because this is what we'd see, right?)

Basically this would be highly expensive, BUT you'd then be the first SE to deliver some really high-quality results, maybe.

phil

le_gber

10:01 pm on Mar 31, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Good calls there Leo

Thanks but it took me a while to realize that... and I did report one hidden layer trick before that also ... duh

The only real way to do this would be to have some form of rendering system within the SE

Yes, like .. hummmmm ... a human eye ;)

[edited by: le_gber at 10:04 pm (utc) on Mar. 31, 2003]

felix

10:04 pm on Mar 31, 2003 (gmt 0)

10+ Year Member



Leo. It certainly is a challenge for Google to develop and deploy the necessary algos to recognize these kinds of spam from the standpoints of both code complexity and perhaps even hardware resources (MIPS).

However, the "it can't be done" kind of thinking is what allowed Google to leapfrog their competitors to start with. If Google can't figure out how to beat the spammers (and I believe they are committed to doing so), someone else will come along and take over as the leading SE. There are too many genius programmers in the world just dying to show their stuff.

felix

10:08 pm on Mar 31, 2003 (gmt 0)

10+ Year Member



Yes, like .. hummmmm ... a human eye

no...more like AI ;)

mrguy

10:08 pm on Mar 31, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



They need to bite the bullet and put a few people on as spam coordinators who do nothing but investigate the spam and take action as appropriate.

The algo cannot catch everything, and lately the spam seems to have gotten worse.

le_gber

10:09 pm on Mar 31, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are too many genius programmers in the world just dying to show their stuff

I'm not one of them :)

It certainly is a challenge for Google to develop and deploy the necessary algos to recognize these kinds of spam

It certainly is if they want to follow W3C guidelines and robots.txt properties. I don't see how they can achieve that if they base their algo on robots spidering the source code of HTML files... like MetropolisRobot said, they should render the page first...

leo

le_gber

10:15 pm on Mar 31, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



They need to bite the bullet and put a few people on as spam coordinators

I'm sure that, as with DMOZ volunteers, Google would find many people willing to do the job for free.

<thought crosses mind>

Why not add a spam report for affiliates or some such... these people would have to log in and fill out reports more complete than the current spam report, which would go directly to a small team who would deal with these reports first and 'normal' reports after. And in case of abuse, remove the 'bad' affiliates. Membership tied to a 'static' email, meaning no free email providers... what do you think, GG?

MetropolisRobot

10:21 pm on Mar 31, 2003 (gmt 0)

10+ Year Member



mrguy

Spam is getting worse. Why? Well in the words of the Coors advertisement, because you can can can can can can...get away with it.

And if you are a fly by night outfit you can make a killing in a short space of time.

And if you are semi-legitimate you can get a good ole boost to your business and hopefully flip your illegitimacy into legitimacy before you get caught.

As with all things it is an arms race, and there are more wannabe spammers and quick-buck merchants than there are policemen (GoogleGuys).

rfgdxm1

10:25 pm on Mar 31, 2003 (gmt 0)

WebmasterWorld Senior Member rfgdxm1 is a WebmasterWorld Top Contributor of All Time 10+ Year Member



>Probably after the IPO. It seems that they make enough money, at this point, with new accounts to adwords, by banning linkage spammers.

Spammers tend to increase AdWords sales to the honest companies that need them to compete. Obviously, there isn't much incentive for Google to throw lots of money at spam fighting if doing so means they will also lose revenue.

markdidj

10:47 pm on Mar 31, 2003 (gmt 0)

10+ Year Member



I use hidden layers not to cheat Google but to make my website interactive and interesting for people to see. If they ban one method, another will always be found. Some web designers use hidden layers legitimately, others don't. Those that don't will be 'grassed' by their competitors, and the URL will be submitted for a manual review. If it doesn't pass this then they are likely to get a lifetime ban.

That's the way it is, and that's the way it should stay. If you don't like someone's site and think they are cheating, there are places you can tell Google.

Automatic detection of CSS and JavaScript is completely NOT needed. This would keep everyone programming static websites, and the technologies these days can handle a lot more than that. How else could you show tutorials that need animations and written text all viewable on the screen at the same time?

Manual Tests are the only way it can be done.

As for the guy with the negative absolute positioning...
was it out of the page, out of the division, or maybe out of an iframe? You can never tell unless you do a manual test. Was it going to appear on an event? People are scared to make interactive sites because of the rumours that Google is going to ban anyone using text in a hidden layer.
I found a nice way to protect my site at the moment that would certainly pass a manual review, but not an automatic one. And that certainly won't be fair.

daroz

1:20 am on Apr 1, 2003 (gmt 0)

10+ Year Member



Morever if you forbid robots to check your .css file how can it tell legitimate to spammed div/layers.

I don't believe there's much "If X is true then == Spam" kind of logic in Google's algo. I would expect it to be more like "If X is true then %chance_spam += 5%"... where it would take several things together to create a penalty.

In that case a CSS file listed in the HTML that is banned by robots.txt could simply increment the probability value, to a lesser degree, whereas that coupled with several other factors may push some pages over into 'spam' and 'penalty' areas...
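In code terms, the idea might look something like this -- the signal names, weights and threshold are all invented for illustration:

# Sketch of the additive "penalty points" idea: no single signal is damning on
# its own, but several weak signals together push a page over a threshold.
SIGNAL_WEIGHTS = {
    "css_blocked_in_robots_txt": 5,
    "text_positioned_offscreen": 30,
    "text_colour_matches_background": 30,
    "meta_refresh_to_other_domain": 20,
    "keyword_density_extreme": 15,
}
PENALTY_THRESHOLD = 50

def spam_score(signals):
    """signals: set of signal names observed on a page."""
    return sum(SIGNAL_WEIGHTS.get(s, 0) for s in signals)

page_signals = {"css_blocked_in_robots_txt", "text_positioned_offscreen", "keyword_density_extreme"}
score = spam_score(page_signals)
print(score, "penalty" if score >= PENALTY_THRESHOLD else "ok")
# 50 penalty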

Besides, why would you, legitimately, block a CSS file from a bot?

Dolemite

1:44 am on Apr 1, 2003 (gmt 0)

10+ Year Member



I don't believe there's much "If X is true then == Spam" kind of logic in Google's algo. I would expect it to be more like "If X is true then %chance_spam += 5%"... Where it would take several things to create a penalty.

IMO, Google needs to rethink their penalties. When spam can do well until it's caught through a manual review, the system is broken. If a penalty were immediately applied under an automatic detection system, no one would risk it.

bokesch

1:44 am on Apr 1, 2003 (gmt 0)



You can use JavaScript to reference a CSS file... in which case Google wouldn't find it (the CSS) since it doesn't read JavaScript.

This may explain why some of my competitors' rankings are baffling me.

daroz

6:38 am on Apr 1, 2003 (gmt 0)

10+ Year Member



When spam can do well until its caught through a manual review, the system is broken.

First, I think it's common knowledge that Google is constantly refining its algos. While I agree with your post, the fact that you quoted my last remark above gives the impression you're for more draconian penalty conditions...

you can use java to reference a css file

You are correct that you can use JavaScript and GoogleBot will skip it, but there's also the NOSCRIPT tag. Either way, my original question of finding a legitimate reason to block your CSS file in robots.txt stands.

There are valid reasons to use JS to select your CSS file (browser weirdness, most likely), but you might award some penalty points to it such that when combined with other activity it may trigger a ban.

The thing to note is that there are very few instances where you can say "if you see X in the HTML (or CSS maybe), ban them". Even hidden links may not really be hidden (a background image, or CSS settings overriding the HTML).

The goal is to tweak the algo such that a combination of common 'issues' may trigger a penalty or ban: blocked CSS plus seemingly hidden text, for example.

At the same time, those outright-wrong and clearly-in-violation issues should be dealt with swiftly. Duplicate content is one that, for exactly duplicate content anyway, is often penalized -- above a threshold.
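For exact duplicates the check itself is cheap -- hash the normalized page text and group the pages that collide. A minimal sketch (real near-duplicate detection, shingling and so on, is far more involved; the URLs and text are placeholders):

# Sketch: fingerprint each page's normalized text and report groups of pages
# that are exact copies of each other.
import hashlib
import re
from collections import defaultdict

def fingerprint(text):
    normalized = re.sub(r'\s+', ' ', text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def duplicate_groups(pages):
    """pages: dict of url -> extracted page text."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]

print(duplicate_groups({
    "http://example.com/a": "Buy widgets here",
    "http://example.com/b": "  Buy   widgets here ",
    "http://example.com/c": "Something else entirely",
}))
# [['http://example.com/a', 'http://example.com/b']]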

GoogleGuy

6:51 am on Apr 1, 2003 (gmt 0)

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member



We've been looking for some test data like this. eddy4711, would you mind sending the same information via a spam report form?

DrOliver

8:19 am on Apr 1, 2003 (gmt 0)

10+ Year Member



@daroz

See this thread [webmasterworld.com] (msg #50) for why I mention my CSS files in the robots.txt.

Basically, there's just plain nothing to index, so the bots might as well skip these files. This is a legitimate reason, isn't it?

Dolemite

8:30 am on Apr 1, 2003 (gmt 0)

10+ Year Member



We've been looking for some test data like this. eddy4711, would mind sending the same information via a spam report form?

Not to point out the obvious, but I think all of us either look forward to or fear when this test data is put to use in some kind of spam detection algorithm.

eddy4711

8:51 am on Apr 1, 2003 (gmt 0)

10+ Year Member



GoogleGuy: I have entered a spam report with URLs and referred to: webmasterworld, googleguy, eddy4711

daroz

8:21 pm on Apr 1, 2003 (gmt 0)

10+ Year Member



DrOliver:
Basically, there's just plain nothing to index, so the bots might as well skip these files. This is a legitimate reason, isn't it?

While there is nothing to index, per se, the contents of the CSS file, I think, are quite pertinent.

I think that the Googlebot is smart enough not to load the CSS file (for now anyway) because the 'link' to the CSS file is not an <A HREF> tag... I haven't had my CSS file pulled by Googlebot.

It's also interesting which thread you linked to -- the thread is discussing using <H1> tags and CSS to decrease the size of the actual tag text. The end result would be to increase Googlebot's 'visibility' of the H1 tag text, while not actually increasing its size when displayed in a browser.

I would consider that a (minor) form of inappropriate cloaking.

bokesch

11:50 pm on Apr 1, 2003 (gmt 0)



"Basically, there's just plain nothing to index, so the bots might as well skip these files?"

I agree with this. I've blocked my css files and numerous pages in my site from being indexed simply because they aren't keyphrase dense.

As an example, I have 6 pages in my site that have nothing but jpegs with very little text. I've learned on this forum that it's better to use a NOINDEX, NOFOLLOW tag on these pages, because their lack of content-relevant information would only hurt my site in the long run.

Basically, it's better to have 10 pages of useful content than a 15-page site that has 5 pages of jpegs and 10 pages of content. Thus, I only show Google the 10 pages and <NOINDEX> the other 5.

IMO

GoogleGuy

1:08 am on Apr 2, 2003 (gmt 0)

WebmasterWorld Senior Member googleguy is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Got it, eddy4711. This is a perfect test case--thank you.

pixelkitty

1:44 am on Apr 2, 2003 (gmt 0)

10+ Year Member



I'm wondering how Google will be able to tell the difference between legitimate hide attributes and spamming ones.

E.g. I have several divs on a site with a default of hidden, which are shown when a user clicks on the show/hide link on the same page.

I use this for weblog type pages to display more of an article, or hide/show comments.

It's also used on several of my commercial sites for longer news items and some advertisement delivery.

Will Google know this is a legitimate usage, because I have a scripted SHOW link? Or will it only see that the property is set to display:none and classify the site as a spammer?
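One guess at how a crawler might draw that line is to flag only the hidden blocks that nothing on the page appears able to reveal. A crude sketch -- the id/onclick matching is invented for illustration and would miss plenty of legitimate setups:

# Sketch: collect ids of divs styled display:none, then drop the ones that are
# mentioned in an onclick handler or inline script (i.e. something can show them).
import re

def unreachable_hidden_ids(html):
    hidden_ids = re.findall(
        r'''<div[^>]*id=["'](\w+)["'][^>]*display\s*:\s*none''', html, re.I)
    revealers = " ".join(re.findall(r'onclick="([^"]*)"', html, re.I))
    revealers += " " + " ".join(re.findall(r'<script[^>]*>(.*?)</script>', html, re.I | re.S))
    return [i for i in hidden_ids if i not in revealers]

page = ('<div id="comments" style="display:none">...</div>'
        '<a href="#" onclick="toggle(\'comments\')">show comments</a>'
        '<div id="stuffing" style="display:none">widgets widgets widgets</div>')
print(unreachable_hidden_ids(page))
# ['stuffing']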

futureX

1:59 am on Apr 2, 2003 (gmt 0)

10+ Year Member



>>there are places you can tell Google.

Where, pray tell? :) It's just that I know of too many sites with hidden words :(

Marcia

2:59 am on Apr 2, 2003 (gmt 0)

WebmasterWorld Senior Member marcia is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Right here is their form:

[google.com...]

Chris_D

3:26 am on Apr 2, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi GoogleGuy

What I am seeing an awful lot of at the moment are affiliate program spammers - where a 'new' domain goes up every month, ostensibly set up as an affiliate site.

These are all database generated sites - 100 to 700 pages, where each page targets a phrase, each page is spam, and each page provides a redirection to the 'home' page of the 'affiliated' partner.

Quite a neat system (if you like cheats) - because the setup time would be incidental; the domain name is a throwaway (with faked contact info) - and the spammer expects the site to be chucked out after 4 or 5 weeks. If they aren't banned - and it generates traffic (=revenue) for a longer period - what a bonus!
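The pattern itself looks detectable: sample a handful of pages from the suspect domain and see whether most of them end up on the same external host. A rough sketch -- it only handles HTTP redirects (meta-refresh and JavaScript redirects would need HTML parsing), and the URLs are placeholders:

# Sketch: follow each sampled page and compute the share of them that redirect
# to a single external host. A high share suggests a throwaway affiliate feeder site.
import urllib.request
from urllib.parse import urlparse
from collections import Counter

def redirect_farm_score(page_urls):
    destinations = []
    for url in page_urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                final_host = urlparse(resp.geturl()).netloc
        except OSError:
            continue
        if final_host and final_host != urlparse(url).netloc:
            destinations.append(final_host)
    if not destinations:
        return 0.0
    _, count = Counter(destinations).most_common(1)[0]
    return count / len(page_urls)

# e.g. redirect_farm_score(["http://spam-example.com/blue-widgets.html", ...]) close to 1.0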

And the perfect twist is that IF Google acts on the "affiliate spammer" - guess what - the "affiliate partner" (i.e. the guy paying the affiliate spammer for the traffic) gets off scot-free, because it is only the affiliate spammer who gets the penalty.

IMHO you need an 'accessory before the fact' penalty as well. If a site gets more than 3 occurrences of affiliates pulling this stunt, then the 'affiliate partner' should also get his site banned. But first, you need to ban the affiliate spammers.

I have submitted details of this activity on numerous occasions - but all the sites I have submitted are still in the index.

Thanks for the opportunity to vent.

: )

Chris
PS Hey GoogleGuy - BTW were you at the Australian Google launch Party last Thursday night? Did I meet you and not know it was you?

