Google News Archive Forum

When is Google going to act on problem results?
Algorithmically recognizable problem results.
eddy4711




msg:47936
 9:26 pm on Mar 31, 2003 (gmt 0)

I've been using Google since the very first days and have always appreciated the crisp search results. But the recent high volume of SE spam in Google's index is quite annoying. This is not a specific spam report, but examples of what I think is algorithmically recognizable spam.

Cloaking via the HTTP 'Referer' header (snip)

- Requesting a page without a Referer header (e.g. like GoogleBot does) produces a spam page

- Requesting with a Referer header (e.g. the link was clicked from Google search results) produces a redirect to an affiliate site, so it is also a doorway page.

GoogleBot could request once with a Referer and once without (and act on substantial differences), or always set a random Referer. Does anyone know of a legitimate use of the above technique?
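A rough sketch of that double-fetch idea, assuming Python with the requests library; the URL, the User-Agent string and the Google referrer value are made-up examples, and the similarity threshold is arbitrary:

import difflib
import requests

URL = "http://example.com/page.html"  # hypothetical page suspected of Referer cloaking

def fetch(referer=None):
    # Fetch without following redirects so a Referer-triggered redirect shows up as a 3xx.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; example-bot)"}
    if referer:
        headers["Referer"] = referer
    return requests.get(URL, headers=headers, allow_redirects=False, timeout=10)

no_referer = fetch()                                            # like a crawler request
with_referer = fetch("http://www.google.com/search?q=example")  # like a click from the SERPs

if no_referer.status_code != with_referer.status_code:
    print("Suspicious: status codes differ", no_referer.status_code, with_referer.status_code)
else:
    similarity = difflib.SequenceMatcher(None, no_referer.text, with_referer.text).ratio()
    if similarity < 0.9:  # arbitrary "substantial difference" cut-off
        print("Suspicious: content differs substantially (similarity %.2f)" % similarity)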

This is an example of spamming by a German company, also a type of cloaking in that the pages the user and GoogleBot see are very different. Through the following CSS statement most of the page is rendered outside of the browser's window, so the user cannot see it:

h2 {position: absolute; width:800px; left: -220px; top: -100px;}

I'm not quite sure if GoogleBot crawls CSS, but if it did, it could check that at least 22000 pixels of the h2 are off screen (and of course the tags you don't see are full of keywords and links to other spam). Does anyone know of a legitimate use of this technique?
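For what it's worth, a crude version of that check is easy to sketch in Python; the -100px cut-off is chosen arbitrarily for the example:

import re

css = "h2 {position: absolute; width:800px; left: -220px; top: -100px;}"

OFFSCREEN_THRESHOLD = -100  # px; arbitrary cut-off for this sketch

def offscreen_rules(css_text):
    # Flag absolutely positioned selectors whose left/top offsets push content well off screen.
    findings = []
    for selector, body in re.findall(r"([^{}]+)\{([^}]*)\}", css_text):
        if "absolute" in body:
            for prop, value in re.findall(r"(left|top)\s*:\s*(-?\d+)px", body):
                if int(value) < OFFSCREEN_THRESHOLD:
                    findings.append((selector.strip(), prop, int(value)))
    return findings

print(offscreen_rules(css))  # [('h2', 'left', -220)]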

If there is no known legitimate use of these techniques, I think Google could really reduce the amount of spam in its index.

Eddy4711

[edited by: Marcia at 9:34 pm (utc) on Mar. 31, 2003]
[edit reason] No pointing out specifics, please [/edit]

 

netguy




msg:47937
 9:36 pm on Mar 31, 2003 (gmt 0)

eddy4711... Nice laundry list for Google. Unfortunately, I haven't seen any of this taken into account in Google's algo. I have had to deal with competitors stuffing dozens of URLs into invisible image maps (with error-ridden coords) for six months - and month after month they are all still there, taking positions #1, #3, and #4 out of 2 million results...
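The degenerate image-map areas netguy describes are at least statically detectable; a rough Python sketch over a made-up snippet (a rectangle with zero width or height can never be clicked, so any link stuffed into it is invisible to users):

import re

html = '''<map name="nav">
<area shape="rect" coords="0,0,0,0" href="http://example.com/page1.html">
<area shape="rect" coords="5,5,5,120" href="http://example.com/page2.html">
<area shape="rect" coords="10,10,90,40" href="http://example.com/contact.html">
</map>'''

# Collect links whose clickable rectangle has zero area.
suspicious = []
for coords, href in re.findall(r'coords="([\d,\s]+)"[^>]*href="([^"]+)"', html):
    x1, y1, x2, y2 = (int(c) for c in coords.split(","))
    if x1 == x2 or y1 == y2:
        suspicious.append(href)

print(suspicious)  # the first two links, which sit in zero-area rectangles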

apfinlaw




msg:47938
 9:45 pm on Mar 31, 2003 (gmt 0)

Your comment h2 {position: absolute; width:800px; left: -220px; top: -100px;}

Wouldn't you think it would be relatively simple for G to identify when negative numbers appear (e.g. -220px) and to flag this as spam?

We, too, have a competitor that spams using this method; it has been placing high for about 5 months now... what a crock!

le_gber




msg:47939
 9:46 pm on Mar 31, 2003 (gmt 0)

I thought the same about CSS tricks, but the thing is that Google has to have access to the CSS file to check whether a layer or div is hidden, off the screen, etc.

But there are many things that, I think, can make it quite difficult:

- you can disallow robots from 'reading' your CSS files
- you can set the CSS property in the CSS file to be visible or on screen, and then override it in the page (either in the head tags or in the div properties)
- you can also use JavaScript to hide or move a layer outside of the page

There are many ways for spammers to trick SEs with CSS and/or JavaScript, and I think that asking the robots to 'check' every page of a site is impossible in terms of time, bandwidth and memory. Moreover, if you forbid robots from checking your .css file, how can they tell legitimate divs/layers from spammed ones?
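At least the first point in that list (a stylesheet the site refuses to let robots read) is cheap to detect on its own. A sketch using only the Python standard library; the URLs are hypothetical and the regex is a simplification:

import re
import urllib.robotparser
from urllib.parse import urljoin

def blocked_stylesheets(page_url, html, user_agent="Googlebot"):
    # Return stylesheet URLs linked from the page that the site's robots.txt disallows.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(page_url, "/robots.txt"))
    rp.read()
    # Simplified: assumes rel="stylesheet" appears before href in the <link> tag.
    hrefs = re.findall(r'<link[^>]+rel=["\']stylesheet["\'][^>]+href=["\']([^"\']+)["\']', html, re.I)
    return [urljoin(page_url, h) for h in hrefs
            if not rp.can_fetch(user_agent, urljoin(page_url, h))]

# Hypothetical usage:
# html = requests.get("http://example.com/").text
# print(blocked_stylesheets("http://example.com/", html))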

Leo

Macguru




msg:47940
 9:58 pm on Mar 31, 2003 (gmt 0)

>>When is Google going to act on SE Spam?

Probably after the IPO. It seems that, at this point, they make enough money from new AdWords accounts by banning linkage spammers. Even nowadays, content spammers still get away with 1996-style hidden text.

MetropolisRobot




msg:47941
 9:59 pm on Mar 31, 2003 (gmt 0)

Good calls there Leo.

The only real way to do this would be to have some form of rendering system within the SE that would render each page on a screen and then only work with the material that was rendered (because this is what we'd see, right?).

Basically this would be highly expensive, BUT you'd then be the first SE to deliver some really high quality results, maybe.
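Something along those lines, sketched with a headless browser (Selenium here, purely as an assumed example of a renderer; the URL is hypothetical):

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.FirefoxOptions()
options.add_argument("-headless")
driver = webdriver.Firefox(options=options)
try:
    driver.get("http://example.com/suspect-page.html")
    # .text returns only rendered, visible text, so off-screen or display:none content is dropped.
    visible_text = driver.find_element(By.TAG_NAME, "body").text
finally:
    driver.quit()

# Index or score only the text a user would actually see.
print(visible_text[:200])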

phil

le_gber




msg:47942
 10:01 pm on Mar 31, 2003 (gmt 0)

Good calls there Leo

Thanks but it took me a while to realize that... and I did report one hidden layer trick before that also ... duh

The only real way to do this would be to have some form of rendering system within the SE

Yes, like .. hummmmm ... a human eye ;)

[edited by: le_gber at 10:04 pm (utc) on Mar. 31, 2003]

felix




msg:47943
 10:04 pm on Mar 31, 2003 (gmt 0)

Leo. It certainly is a challenge for Google to develop and deploy the necessary algos to recognize these kinds of spam from the standpoints of both code complexity and perhaps even hardware resources (MIPS).

However, the "it can't be done" kind of thinking is what allowed Google to leapfrog their competitors to start with. If Google can't figure out how to beat the spammers (and I believe they are committed to doing so), someone else will come along and take over as the leading SE. There are too many genius programmers in the world just dying to show their stuff.

felix




msg:47944
 10:08 pm on Mar 31, 2003 (gmt 0)

Yes, like .. hummmmm ... a human eye

no...more like AI ;)

mrguy




msg:47945
 10:08 pm on Mar 31, 2003 (gmt 0)

They need to bite the bullet and put a few people on as spam coordinators who do nothing but investigate the spam and take action as appropriate.

The algo cannot catch everything, and lately the spam seems to have gotten worse.

le_gber




msg:47946
 10:09 pm on Mar 31, 2003 (gmt 0)

There are too many genius programmers in the world just dying to show their stuff

I'm not one of them :)

It certainly is a challenge for Google to develop and deploy the necessary algos to recognize these kinds of spam

It certainly is if they want to follow W3C guidelines and robots.txt rules. I don't see how they can achieve that if they base their algo on robots spidering the source code of HTML files ... like MetropolisRobot said, they should render the page first...

leo

le_gber




msg:47947
 10:15 pm on Mar 31, 2003 (gmt 0)

They need to bite the bullet and put a few people on as spam coordinators

I'm sure that, as with DMOZ volunteers, Google would find many people willing to do the job for free.

<thought crosses mind>

Why not add a spam report form for affiliates or the like ... these people would have to log in and fill out reports more complete than the current spam report, which would go directly to a small team that would deal with these reports first and 'normal' other reports after. And in case of abuse, remove the 'bad' affiliates. Membership tied to a 'static' email address, meaning no free email providers ... what do you think, GG?

MetropolisRobot




msg:47948
 10:21 pm on Mar 31, 2003 (gmt 0)

mrguy

Spam is getting worse. Why? Well in the words of the Coors advertisement, because you can can can can can can...get away with it.

And if you are a fly by night outfit you can make a killing in a short space of time.

And if you are legitimate you can get a good ole boost to your business and hopefully flip your illegitimacy into legitimacy before you get caught.

As with all things, it is an arms race, and there are more wannabe spammers and quick-buck merchants than there are policemen (GoogleGuys).

rfgdxm1




msg:47949
 10:25 pm on Mar 31, 2003 (gmt 0)

>Probably after the IPO. It seems that they make enough money, at this point, with new accounts to adwords, by banning linkage spammers.

Spammers tend to increase AdWords sales from the honest companies that need them to compete. Obviously, there isn't much incentive for Google to throw lots of money at spam fighting if doing so means they will also lose revenue.

markdidj




msg:47950
 10:47 pm on Mar 31, 2003 (gmt 0)

I use hidden layers not to cheat Google but to make my website interactive and interesting for people to see. If they ban one method, another will always be found. Some web designers use hidden layers legitimately, others don't. Those that don't will be 'grassed' by their competitors, and the URL will be entered for a manual search. If it doesn't pass, they are likely to get a lifetime ban.

That's the way it is, and that's the way it should stay. If you don't like someone's site and think they are cheating, there are places you can tell Google.

Automatic detection of CSS and JavaScript is completely NOT needed. This would keep everyone programming static websites, and the technologies these days can handle a lot more than that. How else could you show tutorials that need animations and written text all viewable on the screen at the same time?

Manual Tests are the only way it can be done.

As for the guy with the absolute positioning... was it out of the page, out of the division, or maybe out of an iframe? You can never tell unless you do a manual test. Was it going to appear on an event? People are scared to make interactive sites because of the rumours that Google is going to ban anyone using text in a hidden layer.
I found a nice way to protect my site at the moment that would certainly pass a manual search, but not an automatic one. And that certainly won't be fair.

daroz




msg:47951
 1:20 am on Apr 1, 2003 (gmt 0)

Moreover, if you forbid robots from checking your .css file, how can they tell legitimate divs/layers from spammed ones?

I don't believe there's much "If X is true then == Spam" kind of logic in Google's algo. I would expect it to be more like "If X is true then %chance_spam += 5%"... Where it would take several things to create a penalty.

In that case a CSS file listed in the HTML that is banned by robots.txt could simply increment the probability value, to a lesser degree, whereas that coupled with several other factors may push some pages over into 'spam' and 'penalty' territory...
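A toy version of that additive scoring in Python; the signal names, weights and threshold are invented for illustration, not anything Google has published:

# Each signal alone is weak evidence; several together cross the penalty threshold.
SIGNAL_WEIGHTS = {
    "css_blocked_by_robots_txt": 0.05,
    "offscreen_absolute_positioning": 0.30,
    "referer_dependent_content": 0.40,
    "hidden_text_matches_background": 0.35,
}
PENALTY_THRESHOLD = 0.60

def spam_score(signals):
    # Sum the weights of whichever signals fired for a page.
    return sum(SIGNAL_WEIGHTS[s] for s in signals)

page_signals = ["css_blocked_by_robots_txt", "offscreen_absolute_positioning"]
score = spam_score(page_signals)
print(score >= PENALTY_THRESHOLD)  # False: blocked CSS alone only nudges the score
# adding "hidden_text_matches_background" would push the page past the threshold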

Besides, why would you, legitimately, block a CSS file from a bot?

Dolemite




msg:47952
 1:44 am on Apr 1, 2003 (gmt 0)

I don't believe there's much "If X is true then == Spam" kind of logic in Google's algo. I would expect it to be more like "If X is true then %chance_spam += 5%"... Where it would take several things to create a penalty.

IMO, Google needs to rethink their penalties. When spam can do well until it's caught through a manual review, the system is broken. If a penalty were immediately applied by an automatic detection system, no one would risk it.

bokesch




msg:47953
 1:44 am on Apr 1, 2003 (gmt 0)

You can use JavaScript to reference a CSS file... in which case Google wouldn't find it (the CSS), since it doesn't read JavaScript.

This may explain why some of my competitors' rankings are baffling me.

daroz




msg:47954
 6:38 am on Apr 1, 2003 (gmt 0)

When spam can do well until it's caught through a manual review, the system is broken.

First, I think it's common knowledge that Google is constantly refining its algos. While I agree with your post, the fact that you quoted my last remark above gives the impression you're for more draconian penalty conditions...

you can use java to reference a css file

You are correct in that you can use JavaScript and GoogleBot will skip it, but there's also the NOSCRIPT tag. Either way, my original question of finding a legitimate reason to block your CSS file in robots.txt stands.

There are valid reasons to use JS to select your CSS file (browser weirdness, most likely), but you could award some penalty points to it such that, when combined with other activity, it may trigger a ban.

The thing to note is that there are very few instances where you can say that if you see "X" in the HTML (or CSS maybe) you ban them. Even hidden links may not be hidden (a background image, or CSS settings overriding the HTML).

The goal is to tweak the algo such that a combination of common 'issues' may trigger a penalty or ban: blocked CSS plus seemingly hidden text, for example.

At the same time, those outright-wrong and clearly-in-violation issues should be dealt with swiftly. Duplicate content is one that, for exactly duplicated content anyway, is often penalized above a threshold.

GoogleGuy




msg:47955
 6:51 am on Apr 1, 2003 (gmt 0)

We've been looking for some test data like this. eddy4711, would you mind sending the same information via a spam report form?

DrOliver




msg:47956
 8:19 am on Apr 1, 2003 (gmt 0)

@daroz

See this thread [webmasterworld.com] (msg #50) for why I mention my CSS files in robots.txt.

Basically, there's just plain nothing to index, so the bots might as well skip these files. This is a legitimate reason, isn't it?

Dolemite




msg:47957
 8:30 am on Apr 1, 2003 (gmt 0)

We've been looking for some test data like this. eddy4711, would you mind sending the same information via a spam report form?

Not to point out the obvious, but I think all of us either look forward to, or fear, the moment this test data is put to use in some kind of spam detection algorithm.

eddy4711




msg:47958
 8:51 am on Apr 1, 2003 (gmt 0)

GoogleGuy: I have entered a spam report with URLs and referred to: webmasterworld, googleguy, eddy4711

daroz




msg:47959
 8:21 pm on Apr 1, 2003 (gmt 0)

DrOliver:
Basically, there's just plain nothing to index, so the bots might as well skip these files. This is a legitimate reason, isn't it?

While there is nothing to index, per se, the contents of the CSS file, I think, are quite pertinent.

I think that Googlebot is smart enough not to load the CSS file (for now, anyway) because the 'link' to the CSS file is not an <A HREF> tag... I haven't had my CSS file pulled by Googlebot.

It's also interesting which thread you linked to: it discusses using <H1> tags and CSS files to decrease the size of the actual tag text. The end result would be to increase Googlebot's 'visibility' of the H1 tag text while not actually increasing its size when displayed in a browser.

I would consider that a (minor) form of inappropriate cloaking.
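That particular trick (keyword-heavy headings shrunk to near-invisible sizes) is also the sort of thing a static check could flag; a rough Python sketch, with an arbitrary 6px cut-off and a made-up stylesheet:

import re

css = "h1 { font-size: 4px; } h2 { font-size: 120%; }"

TINY_FONT_PX = 6  # arbitrary threshold for this sketch

# Flag heading selectors whose pixel font-size is implausibly small for a real heading.
tiny_headings = [
    (selector, int(px))
    for selector, px in re.findall(r"(h[1-6])\s*\{[^}]*font-size\s*:\s*(\d+)px", css, re.I)
    if int(px) < TINY_FONT_PX
]
print(tiny_headings)  # [('h1', 4)]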

bokesch




msg:47960
 11:50 pm on Apr 1, 2003 (gmt 0)

"Basically, there's just plain nothing to index, so the bots might as well skip these files?"

I agree with this. I've blocked my css files and numerous pages in my site from being indexed simply because they aren't keyphrase dense.

As an example, I have 6 pages on my site that have nothing but JPEGs with very little text. I've learned on this forum that it's better to use a NOINDEX, NOFOLLOW tag on these pages, because their lack of content-relevant information would only hurt my site in the long run.

Basically, it's better to have 10 pages of useful content than a 15-page site that has 5 pages of JPEGs and 10 pages of content. Thus, I only show Google the 10 pages and NOINDEX the other 5.

IMO

GoogleGuy




msg:47961
 1:08 am on Apr 2, 2003 (gmt 0)

Got it, eddy4711. This is a perfect test case--thank you.

pixelkitty




msg:47962
 1:44 am on Apr 2, 2003 (gmt 0)

I'm wondering how Google will be able to determine the difference between legitimate hide attributes and spamming ones.

E.g. I have several divs on a site that default to hidden, but are shown when a user clicks the show/hide link on the same page.

I use this for weblog type pages to display more of an article, or hide/show comments.

It's also used on several of my commercial sites for longer news items and some advertisement delivery.

Will Google know this is legitimate usage because I have a scripted SHOW link? Or will it only see that the property is set to display:none and classify the site as a spammer?
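One heuristic that could separate those two cases, sketched in Python over a made-up snippet: hidden blocks that some script on the page can reveal look different from hidden blocks nothing ever shows.

import re

html = '''
<a href="#" onclick="document.getElementById('comments').style.display='block'">show comments</a>
<div id="comments" style="display:none">Reader comments go here...</div>
<div id="stuffing" style="display:none">widgets cheap widgets best widgets buy widgets</div>
'''

# IDs of elements hidden inline with display:none (rough regex, illustration only).
hidden_ids = re.findall(r'id=["\'](\w+)["\'][^>]*style=["\'][^"\']*display:\s*none', html, re.I)

for element_id in hidden_ids:
    # Is there any script on the page that sets this element's display back to something visible?
    revealed = re.search(re.escape(element_id) + r"['\"]\)\.style\.display\s*=\s*['\"](block|inline)", html)
    print(element_id, "has a show/hide toggle" if revealed else "is hidden with nothing to show it")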

futureX




msg:47963
 1:59 am on Apr 2, 2003 (gmt 0)

>>there are places you can tell Google.

Where, pray tell? :) It's just that I know of too many sites with hidden words :(

Marcia




msg:47964
 2:59 am on Apr 2, 2003 (gmt 0)

Right here is their form:

[google.com...]

Chris_D




msg:47965
 3:26 am on Apr 2, 2003 (gmt 0)

Hi GoogleGuy

What I am seeing an awful lot of at the moment are affiliate program spammers - where a 'new' domain goes up every month, ostensibly set up as an affiliate site.

These are all database generated sites - 100 to 700 pages, where each page targets a phrase, each page is spam, and each page provides a redirection to the 'home' page of the 'affiliated' partner.

Quite a neat system (if you like cheats) - because the setup time would be incidental; the domain name is a throwaway (with faked contact info) - and the spammer expects the site to be chucked out after 4 or 5 weeks. If they aren't banned - and it generates traffic (=revenue) for a longer period - what a bonus!

And the perfect twist is that IF Google acts on the "affiliate spammer" - guess what - the "affiliate partner" (i.e. the guy paying the affiliate spammer for the traffic) gets off scot-free, because it is only the affiliate spammer who gets the penalty.

IMHO you need an 'accessory before the fact' penalty as well. If a site has more than 3 occurrences of affiliates pulling this stunt, then the 'affiliate partner' should also get his site banned. But first, you need to ban the affiliate spammers.

I have submitted details of this activity on numerous occasions - but all the sites I have submitted are still in the index.

Thanks for the opportunity to vent.

: )

Chris
PS Hey GoogleGuy - BTW were you at the Australian Google launch Party last Thursday night? Did I meet you and not know it was you?

