Forum Moderators: mack

Message Too Old, No Replies

Hand Editing for Spam?

Another offshoot of http://www.webmasterworld.com/forum97/336.htm

         

Tigrou

9:32 pm on Jan 26, 2005 (gmt 0)

10+ Year Member



In terms of hand-editing the SERPs, I thought Yahoo already did this. In some hyper-competitive areas I thought they occasionally placed 3 or 5 sites at the top by hand.

It is a good point about monitoring the people who remove sites, re: corruption. Most online newsrooms have a tag-team setup... one writes, another edits before anything can be posted, then it goes to a final check before going live. Even with all that, you still see typos on CNN etc.

To REMOVE data you'd probably want something simpler than that... maybe just two levels, i.e. Level 1 surfs for problems and sends them on to their manager. The manager (Level 2) confirms the problem, removes THAT site, and classifies the type of problem (with comments) before sending it off to the anti-spam programmers to look at the underlying code for footprints etc.

And once a site is taken out by Level 2, it'll be gone in all related SERPs as well, of course.
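
A minimal sketch of what such a two-level review queue might look like, in Python; the class names, fields and categories are purely illustrative, not any search engine's actual workflow:

    from dataclasses import dataclass
    from enum import Enum

    class Status(Enum):
        FLAGGED = "flagged"        # Level 1 surfer found a problem
        CONFIRMED = "confirmed"    # Level 2 manager agreed and removed the site
        REJECTED = "rejected"      # Level 2 manager disagreed; site stays

    @dataclass
    class SpamReport:
        url: str
        reporter: str              # the Level 1 reviewer
        category: str = ""         # filled in by Level 2 (cloaking, doorway, etc.)
        comments: str = ""
        status: Status = Status.FLAGGED

    class ReviewQueue:
        def __init__(self):
            self.reports = []      # pending and resolved SpamReport objects
            self.removed = set()   # sites pulled from all related SERPs

        def flag(self, url, reporter):
            """Level 1: surf for problems and send them on to the manager."""
            self.reports.append(SpamReport(url=url, reporter=reporter))

        def confirm(self, url, category, comments=""):
            """Level 2: confirm, classify with comments, remove the site,
            and leave a record for the anti-spam programmers to study."""
            for r in self.reports:
                if r.url == url and r.status is Status.FLAGGED:
                    r.status, r.category, r.comments = Status.CONFIRMED, category, comments
                    self.removed.add(url)

        def reject(self, url):
            """Level 2: the Level 1 flag was a mistake; leave the site alone."""
            for r in self.reports:
                if r.url == url and r.status is Status.FLAGGED:
                    r.status = Status.REJECTED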

gmiller "Humans are horribly expensive, and they burn out quickly in repetitive jobs like that." - you'd think so, but in reality they don't. At least the offshore link submitters working for me aren't.

While by Western standards this may not sound like a great job (although many do it here), there are plenty of people in developing countries who would be very happy with steady work of this nature for $500/month or even $300/month. Assume 50% cost overhead, plus computers, and an SE could have 100 people doing this all year for about $1 million.
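
The staffing arithmetic, spelled out (all figures are the assumptions above, nothing more):

    reviewers = 100
    monthly_wage = 500                         # USD per reviewer, per the estimate above
    overhead = 0.5                             # assume 50% on top of wages
    wages = reviewers * monthly_wage * 12      # 600,000 per year
    total = wages * (1 + overhead)             # 900,000 before hardware
    print(f"annual cost is about ${total:,.0f} plus computers, i.e. roughly $1 million")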

That would make a massive difference to SERPs but it'd be a drop in the bucket compared to how much $$ spam sites drag in. (And how much extra SEs could make if they had less spam.)

webhound

9:39 pm on Jan 26, 2005 (gmt 0)

10+ Year Member



Well, let's hope that one of the 3 is bold enough and smart enough to figure out that algos alone are simply not enough for a truly good search engine. As for the cost being unreasonable, that's simply not true. That's why I was suggesting interns or the like. Students, etc... train them to spot the spam and flag the sites, which would then be removed by a superior to make sure clean sites don't get removed by mistake. This is a very doable system, and it just boggles my brain that none of them have this in place.
Guess none of them actually care about being the "best". Shame on them. :-)

2by4

2:46 am on Jan 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Well, since this was partly my fault, have to chime in...

What interests me is the complete refusal to accept that humans can do this job better than an algo. But I've had a lot of programmer friends, and I'm pretty familiar with the mindset. That doesn't make it right, but it is a real issue with these companies.

Another thing struck me after posting in the other thread: not only are humans MUCH better at detecting spam out of the box, as it were, but their ability to learn to detect new spam methods is essentially built in. Humans learn. Once you've built up a few weeks' [I'd say days, but I'll try to be as conservative as possible] experience, not only will you have seen it all, you'll be building on that experience and can easily detect new methods as they come up. In other words, your employees learn.

I see most of the systems being automated, much like the email spam filters I mentioned, and really only requiring human judgement at the point where the selected sites come up for judgement.

As usual, when something like this is suggested, most people spend about zero effort doing any thinking and just spit out verbatim what the search companies have said in various public statements, as if that's an answer, or even relevant. I know google wants to fully automate the process; that's the whole point, and that is the problem. So hearing somebody repeat that is kind of annoying. I also know microsoft wants to create a secure windows, but that doesn't mean I'm going to take their efforts or products as secure.

The problem is that the search engines believe they can emulate human judgement algorithmically, and they can't. They can automate large parts of the process, and they can use the results from the human filtering to further tighten and tweak the spam filters automatically, to the point where certain types of pages / sites / link schemes would be found to always be spam, without exception. Such sites would require no further human checking in the future.

This is not, contrary to what some posters will invariably say like clockwork, a question of hand-editing 8.5 billion webpages. Not even remotely close.

It's a twofold training process: the spam filters are trained, and give their results to humans, who judge them and give them back to the automated system, which uses that input to refine its spam detection algo.
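
A toy sketch of that twofold loop (the classifier's score() and fit() methods are hypothetical, included just to show the shape of the feedback, not anyone's real system):

    def training_loop(classifier, crawl_batch, human_review, rounds=3):
        """Filters flag candidates, humans judge them, and the judgements
        feed back into the next round of automated filtering."""
        labelled = []                                    # accumulated human judgements
        for _ in range(rounds):
            suspects = [site for site in crawl_batch
                        if classifier.score(site) > 0.5] # machine pre-filter
            for site in suspects:
                is_spam = human_review(site)             # the human makes the final call
                labelled.append((site, is_spam))
            classifier.fit(labelled)                     # refine the spam detection algo
        return classifier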

A system like this would actually make a "this site is spam" link meaningful. When that link was clicked, you'd have that many more people's eyes on the problem, but you wouldn't let them decide: the sites would simply be entered for review, and if they passed, they would not be reviewed again unless certain criteria were met [to avoid spam-report spamming].

As the spam detection systems grew more refined, and as the human reviewers got better at their jobs, the system would automatically detect and eliminate unquestionable spam techniques with no human intervention. Obviously there would be a curve to this, starting out steep, but then flattening out as all new spam techniques are discovered and registered.

This is exactly how real email spam filtering works, and it works very very well. And this is not a new technique.

Why a company feels that employing humans is a bad thing is beyond me; humans can be very effective at this kind of thing. Obviously google needs to learn a little more, I think, about what exists outside the box they live in.

In many ways, a system like this would be an ideal open source project, or at least an open community, but it would need to be run as a business to enable control of quality.

Reading some of the other reactions, you'd think that spammy sites are some huge mystery, like you need a PhD to ever be able to detect spammy techniques. Do you have any problem identifying spam email? I don't, and neither do most users, except for that 1% that insists on clicking the links and keeping the whole thing going in the first place.

Tigrou

12:48 pm on Jan 27, 2005 (gmt 0)

10+ Year Member



webhound, 2x4: congrats. We're in a unique thread on WebmasterWorld where the issue is actually clear and everyone actually agrees on a path forward. :-)

Except - it seems - the SEs.

webhound

6:34 pm on Jan 27, 2005 (gmt 0)

10+ Year Member



Thanks Tigrou and great post as always 2X4.

Yeah, it's very frustrating to see the search engines fumbling around with inadequate algos when all they need to do is have human eyes scanning the top 5 pages of the more competitive SERPs. How the engines can continue to let spam crap rule their SERPs is beyond me. (And I don't have a PhD.)

I just hope one of them is smart enough to see how this is the only solution to the problem.

2by4

9:01 pm on Jan 27, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think the main reason they can't realize this is that they are trying to program in the judgement, which means that they are starting with the solution they think will work rather than looking at the problem clearly and then working down from the problem to the solution.

If the problem is spam, and if humans are better at identifying spam than algos, then you build humans into the system from the initial design. Otherwise you're just adding hack after hack, which ends up like Windows 98: junk that you have to throw away.

Compare this to serious spam filtering techniques, which have been around for years, and work, and which were built from the ground up to include human judgment.

gmiller

12:53 am on Jan 28, 2005 (gmt 0)

10+ Year Member



I don't know about your email spam filters, but around 99% of what gets past the ones used by my ISP and web host is spam, so that's probably not a good example. :)

I just don't see what good it is to only examine the top five or ten or whatever. I don't see much spam in the top 10-20 results on any major search engine.

2by4

2:04 am on Jan 28, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, that's the problem: people are thinking of automated antispam systems like SpamAssassin etc., which are not very effective, because by design they can't be. That's an exact duplicate of the problem with using that type of technology in an automated search engine spam filter; at best it catches 80%. Your ISP's obviously isn't any good, which is common.

The real spam filters use a completely different logic. They construct a picture of what spam is based on what you tell them spam is, and they keep adapting based on what you tell them. To initialize one common version, you create a spam folder, manually add all the emails from your inbox that you know are spam, then install the program and tell it to initialize from the contents of that folder. That's its first training. Then you correct it as more spam comes in, false positives are detected, etc. After a few weeks you do not see any more spam. Until the spammers change techniques, at which point you tell it that email is spam, and it adds that method to itself.
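
A bare-bones sketch of that workflow (the class and its methods are illustrative, not any particular email product): initialize from a hand-built spam folder, then keep correcting it as mistakes show up.

    from collections import Counter
    import re

    class SpamFolderFilter:
        def __init__(self):
            self.spam_words = Counter()
            self.ham_words = Counter()

        @staticmethod
        def _tokens(text):
            return re.findall(r"[a-z0-9$!']+", text.lower())

        def initialize(self, spam_folder, inbox):
            """First training: everything in the spam folder is spam, the inbox is not."""
            for msg in spam_folder:
                self.spam_words.update(self._tokens(msg))
            for msg in inbox:
                self.ham_words.update(self._tokens(msg))

        def mark_as_spam(self, msg):
            """A new technique slipped through: tell the filter, and it adds the method to itself."""
            self.spam_words.update(self._tokens(msg))

        def mark_as_ham(self, msg):
            """Correct a false positive."""
            self.ham_words.update(self._tokens(msg))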

This system works extremely well. Tests have shown almost no false positives in thousands of emails.

gmiller

3:03 am on Jan 28, 2005 (gmt 0)

10+ Year Member



You're talking about Bayesian spam filters of the sort built into Mozilla Mail and Thunderbird. Yes, that catches most of the spam, but still leaves more spam in my email than Google or MSN do in my search results with their current systems.

It's not at all the same as having humans edit search results on a site-by-site basis, and a learning system for search results would be far more complicated, because Bayesian analysis of the words in the content wouldn't help at all with many of the tactics used by SE spammers.

In short, I'm not sure what your point is here or how emulating spam filter companies would help Google, Yahoo, and MSN.

If the idea is to have humans look at search results and then adjust the algorithms to catch new types of spam, then that's not a useful idea because it's exactly what the SEs have done all along. In fact, SEOs and webmasters are doing the reviewing for free as we speak and emailing the SEs to rat out their competitors' latest tricks.

If the idea is to have humans manually look at only high-ranking pages and ban or approve them individually, then the results won't look much different from what we get from the search engines right now, because current SE results just aren't all that bad. Manually tweaking a site's ranking here and there doesn't seem like a revolutionary wave of the future to me.

When you deal with search results and SEO issues all day, it's easy to lose perspective and turn every little detail of who's ranked where into a crisis, but the reality is that we're mostly talking about fine-tuning something that works pretty well for Joe Searcher. We're a long way from the old days, when every search on every SE turned up a couple of porn sites on the first page.

Tigrou

4:38 pm on Jan 28, 2005 (gmt 0)

10+ Year Member



gmiller
current SE results just aren't all that bad.

We are talking Google, Yahoo and MSN, right? At least 40% of the results I see in the first two in competitive areas are either spam or marginal content that obviously games the system (e.g. new pages on an old, unrelated domain).

MSN has less spam for now, but that's because it isn't so dependent upon links, and purely cloaked/content-focused spam has waned. Give it a few weeks though.

cleanup

5:15 pm on Jan 28, 2005 (gmt 0)

10+ Year Member



Tigrou, excuse my ignorance, but can you please define what

"offshore link submitters working for me"

means?

webhound

5:27 pm on Jan 28, 2005 (gmt 0)

10+ Year Member



GMILLER:

What categories are you in? lol!

No spam? Man I wish I could say the same for the ones we work in. :-)

Tigrou

5:41 pm on Jan 28, 2005 (gmt 0)

10+ Year Member



Tigrou, excuse my ignorance, but can you please define what "offshore link submitters working for me"

No worries. I have a small team of people that work offshore performing many tasks but mainly submitting links to "free submit" directories.

Go onto Rentacoder and Elance and you can find many people willing to do similar tasks. RAC/elance are normally for short-term, specified job situations.

You can still get people long term at reasonable rates. In that case you need a major investment in finding the right people and training them, but once you sort that out your productivity takes quantum leaps at Model-T prices. Sticky me if you want some outsourcing tips.

CF

2by4

10:47 pm on Jan 28, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"You're talking about Bayesian spam filters of the sort built into Mozilla Mail and Thunderbird."

No, that's a very simplified version of the system I'm talking about.

I'm going to go out on a limb here and guess that google, or any other search engine, can do just a little bit better than a local small application you run on your email client.

If you're still seeing spam, you made a mistake in training your filter. It's very hard for those filters to figure out what isn't spam if you misidentify an email at some point. I made that kind of mistake once, and it took a few weeks of counter-examples to get it to see all the spam again; it's actually better to just start fresh and wipe out the spam database.

I disagree that this is what google has been doing all along. Google says explicitly that this is not what they are doing and not their goal, so I'm not clear where you get the idea that it is.

What they're doing is more like what the major ISP spam email filters do: a fully automated system with some human review built on top of it, exactly like the system search engines currently appear to be running. I know that system doesn't work; it doesn't work for email spam, and it doesn't work for autogenerated AdSense directory sites. Any filter that can't catch an autogenerated spam directory site isn't worth s#$t.

All of Google's efforts are focused on fully automating this process; they say so themselves, and there has never been anything else said. Obviously they use people to look at sites and give engineers more information on how to do the automation, but that's as far as it goes. They are committed to an automated process. There was an excellent blog posting by a google engineer, referencing spam specifically, where he openly admitted that a: they are committed to a fully automated process, and b: humans always do a better job.

Which is exactly what I'm saying. However, trust me, it doesn't matter whether anyone agrees or disagrees: the first company that can beat spammers by design, not as an afterthought, will win the long-term game.

Technically, given that an email is in fact a simple webpage, it's also not clear to me why you think something that works on single email pages, with almost no resources behind it in terms of program complexity, size and so on, wouldn't work when run through the huge systems google has access to, with algos at least 10 times more sophisticated and custom-designed for that one system.

From reading these forums for a while, I have never seen any algo tweak that created a situation where different people in different industries all agreed that they were not seeing any spam any more. So there's no success I can see that would support the idea that search engines are doing a good job in this regard.

gmiller

12:06 am on Jan 29, 2005 (gmt 0)

10+ Year Member



webhound: I think you kind of demonstrated one of my points there by asking what categories I'm in. Yes, I can come up with some results that could be a little better in categories where I hyperanalyze every detail. But for *real* searching as a *user*, what I get is quite satisfactory on all the major engines. The problems I have as a *user* don't come from spam, they come from the limitations of searches based solely on keywords with no context.

2by4:

Trained spam filters make a substantial number of mistakes, whether you mistakenly mark a spam mail as legitimate mail or not. That's just a limitation of probability-based keyword filtering.

The reason training systems are so limited is that they can't spot new techniques. They can't change the algorithm, only tune the knobs. For all we know, SEs may already be using similar techniques to tune their algorithms. It's just not the magic fix you seem to think it is.

I'll try to explain my comment about what the major engines are doing now again, in hopes that you'll understand it if I word it differently: Humans take a look at the spam techniques that are in use. Either the SEs do it themselves, or they review reports coming in from outsiders. They then adjust their algorithms to try and catch the technique.

I'm also not clear how you can equate "good job" to "not seeing any spam"... Perfection is more than good, and it won't come from a training-based knob-tuning system. Beware of silver bullets.

2by4

12:50 am on Jan 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



hmm. I think we're just destined to completely miss each other's points here.

From what I can see, techniques like Bayesian spam filters are not in fact using keyword detection; they are doing a full analysis of the entire object. That's why they can fully block emails that contain only binary image data, for example.

I guess I could read the various open source programs that run this to see what they really do, but since I can already see what they really do, it's not a major concern to me.

I think one reason you don't quite see the power of these things is that you've done something wrong with them. I'm not the only person on this planet, by the way, who is able to make these things work; one study I read showed only, I think, a single false positive out of something like 10,000 emails. My dad has used this for years, an older version, and he too gets almost zero spam. But he has worked with computers etc. for many, many years, so he understands how to do processes like that, and that was with a much older product, Eudora. Again, the people who tested and published these results are not stupid people, and they aren't making them up. Try to see what you are doing wrong is my advice, and then fix that. But don't confuse your error with a failure of the basic method, unless you really like getting spam. By the way, I hope you're not confusing things like Outlook's pseudo spam filter, which is also very much like the system google uses and has results about as bad or worse, with an actual spam filter. Likewise for various commercial products that use updating patterns etc.

I use these systems and I don't get spam, and I use the simplest version currently available. If that were inadequate, which it isn't, I'd move to a more advanced version. I'm sorry if the methods don't work for you; what can I say, they are user-trained. Of course they need you to tell them when a new method breaks through; that's the entire point I'm making, that's the essence of my point. The algo is built around this training. It's the polar opposite of the automated approach, which creates continuous new algos and tweaks to try to catch stuff, as opposed to starting out with a self-creating detection method.

It took me only about a month of using real spam detection to see the difference in power. I've used automated techniques like SpamAssassin for years, and they've never been better than about 80% accurate. I'm currently running at about 99%, give or take 1%; some weeks I see zero spam, some weeks I see one.

AmericanBulldog

1:41 am on Jan 29, 2005 (gmt 0)

10+ Year Member



hhhmmmmm

A couple of assumptions were made at the beginning of this thread...

And how much extra SEs could make if they had less spam

Well, the search engines are themselves the biggest spammers/whores, whatever you wish to call them, and they make a fortune off the sites filling up their indexes. Would they really be better off with 50% (or whatever the % is) fewer listings?

As far as human editing goes, bring it on; it is doomed before it starts. The site-building process can be automated with some degree of success and used to flood the engines with sites, far more than any team of humans could ever keep up with.

Algorithmically, there are just certain things that cannot be achieved with 100% accuracy 100% of the time.

2by4

1:49 am on Jan 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"the site building process can be automated with some degree of success, "

LOL, I see I'm onto something here. And an automated process, you think that doesn't leave recognizable signatures? They're the very easiest things in the world to detect. And you only need to detect a few instances of this manually before all new instances get automatically detected and removed. It's exactly because spammers use automated techniques, and because the search engines are apparently unable to detect such clearly signatured sites with their clever little algos, that I started realizing it's those clever little algos that are at fault. Which means the algos were written with certain assumptions, and those assumptions were inadequate to the task.

This is exactly what happened with Windows 95: it was inadequate as a framework for future Windows growth, was dumped completely after the last ME release of that line, and was replaced completely by the NT line.

Google is still running the equivalent of Windows 95 when it comes to search technology.

gmiller

8:59 pm on Jan 29, 2005 (gmt 0)

10+ Year Member



The way Bayesian spam filtering works is that the software takes a full message with headers and splits it into individual words.

You start with a corpus of previously classified messages and calculate the percentage of messages containing each word that are spam. You then take the words that most often indicate spam and the ones that most often indicate non-spam and build a list of probabilities.

When a new message comes in, you calculate the combined probability for all the words from the list that occur in the message. If the score exceeds a threshold, you call it spam. If not, you let it pass. The threshold is generally set very conservatively, as false positives are considered far more important than false negatives.
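
Roughly, in code (a simplified sketch of the scoring step described above; the exact smoothing and cutoffs vary between implementations):

    import math

    def word_spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
        """Estimate P(spam | word) from how often the word shows up in each corpus."""
        p_in_spam = spam_counts.get(word, 0) / max(n_spam, 1)
        p_in_ham = ham_counts.get(word, 0) / max(n_ham, 1)
        if p_in_spam + p_in_ham == 0:
            return 0.5                         # unseen word carries no evidence
        return p_in_spam / (p_in_spam + p_in_ham)

    def is_spam(words, spam_counts, ham_counts, n_spam, n_ham, top_n=15, threshold=0.9):
        """Combine the most telling words; a conservative threshold keeps false positives rare."""
        probs = [word_spam_probability(w, spam_counts, ham_counts, n_spam, n_ham)
                 for w in set(words)]
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)   # most decisive words first
        probs = probs[:top_n] or [0.5]
        spam_evidence = math.prod(probs)
        ham_evidence = math.prod(1 - p for p in probs)
        if spam_evidence + ham_evidence == 0:
            return False
        return spam_evidence / (spam_evidence + ham_evidence) > threshold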

And I never said anything about large numbers of false positives in my email; I said they generate a lot of false negatives. In any case, I follow the discussions of the technique with considerable interest, and your results are unusually good. If even the developers of spam filtering software don't get the near-perfection you achieve *on the corpus from which the statistics were compiled*, then your results are certainly atypical.

The paper describing the technique (search for "A Plan for Spam") came out in August 2002. After that, spammers adapted and the filters became much less effective. The technique was refined and it still catches the bulk of the spam that I receive. However, spam that gets through still outnumbers the legitimate mail people send me. Why? Because I get many hundreds of spam emails per day.

In any case, SE spam is not the biggest problem with SERPs these days, and neither training systems nor pure human review will radically improve the results. Let me give you an example. Last night, I did an experiment. I chose the keyword "mortage" for a test. It's pretty competitive by all accounts, but I deliberately chose one I've never tried to optimize for in order to avoid the "forest for the trees" syndrome.

I did a search on Google. What did I get? Lots and lots of sites that looked like they were related to the keyword "mortgage". Just to be sure these weren't just clever spammers, I clicked on each of the top ten. All of them were relevant in some way or another.

The only problem here is that a real user would have been searching for mortgage-related financial advice, mortgage calculators, lenders, lender comparison sites, or something else more specific, but Google can't tell what the user wants. Learning to interpret context is far more important than throwing the kitchen sink at the spam problem.

2by4

7:53 am on Jan 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thoughtful answer gmiller:

About the only "general rule" that kept showing up was Stupid beats smart. That is, a fiddly and delicate piece of tokenization magic [such as the current google algo] would often produce worse results than something that just took a brute force "just grab everything" approach. read more [spambayes.sourceforge.net]

That particular one has options to use many more components than just words; it was simply a matter of them deciding which page elements to tokenize. In this case they decided to use only some components, but the other parts are present and easily usable.

Re: spam. In a sense I'm playing devil's advocate here. I don't believe google's current issues are an effort to control spam. But I do believe that as long as there is money in the web, there will be spammers trying to take some of it, and they will use search engines to do that. However, I do think the big influx of spam-type pages pushed google, and maybe also yahoo, to a point they did not anticipate arriving at, and which their current algorithms were not written to cope with. Just like Windows 9x was not written with any security in mind whatsoever; it was a very naive system, and it paid the price, and continues to pay that price.

Google built their system around a core belief that the web was filled with information, and that it was just a matter of getting the right information to you, the searcher. Spammers, however, see the web as a way to spend almost no money and make money. This gives the spammers an edge in the current version of this game, forcing search engines to do ridiculous things like blocking new sites from their indexes [google] or dumping 90% of a site's content pages to make room for more new sites [yahoo].

When I look at this scenario, I see one and only one thing: the search engines are against a wall and are doing desperate things to keep their systems running.

Deleting spam in real time in a real way would be an aggressive move. One reason you didn't see spam in that particular search was this aggressive blocking, which is like firing a shotgun at a guy holding a hostage: you hit everyone in your attempt to get the bad guy. This isn't a very smart way to do it.

What the search engines need to do is find a better way to cope with their spam issues, the way they are using now does not work. Except in the very short term, this past year specifically.

Tigrou

8:38 pm on Jan 30, 2005 (gmt 0)

10+ Year Member



2x4, normally I'd take your POV on this as it seems to be right, but AmericanBulldog is fairly knowledgeable about said matters.

If he says it, we may not like it, but we gotta believe it. (Hope that doesn't sound too defeatist for spam-free listings.)

2by4

9:50 pm on Jan 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Well, the search engines are themselves the biggest spammers/whores whatever you wish to call them, and they make a fortune off the sites filling up their indexes, would they really be better off with 50%"

This is something I have a hard time disagreeing with, since it gets closer to the more hardcore money side of things.

But for autogenerated sites, which by definition have to leave clear signatures all over themselves, I can think of nothing that should, with a functioning algo, be easier to autodetect and remove from the main results. Failure to do such a basic thing is what made me wonder about the whole spam-fighting effort: if you can't catch an autogenerated site, with autogenerated html, architecture, etc., what exactly is the point of even pushing that method?

I think one practical reason to get rid of that x percent of sites is simply to enable a simpler system, one that deals with as much real, non-spam data as possible on a consistent basis. It's sort of like Hotmail: it costs them money to store, process, etc. all the spam that runs through the system; it consumes resources, a lot of resources. And it's the same with google et al.: they have to spider all the pages they hit, and they have to process them. If they could simply remove those domains from this primary function, that would free up a tremendous amount of resources long term.

But over the last year I just haven't seen them manage to do this consistently. There's no need for 100% success, 95-99% would be more than adequate.

I think I'm doing different searches than gmiller. There have been points over the last year where certain types of technical google searches consistently gave me doorway pages, spam pages, all designed to trap my technical search and drop it straight into a doorway-type page. Those pages all had exactly the same structure.

Clearly Google has no incentive to get rid of AdSense sites, since it makes money off them. But if it goes too far in that direction, it's in danger of jeopardizing its primary search portal's AdWords income; if MSN can grab even 10% of Google's marketshare, that could have very serious and immediate consequences, especially on their stock price.

gmiller

10:21 pm on Jan 30, 2005 (gmt 0)

10+ Year Member



The trick is that from a competitive standpoint, it doesn't matter how many innocent sites get knocked out of the top of the results. Joe Searcher can't really tell. It's a question of whether or not they get quality, relevant sites at the top, regardless of how many quality, relevant sites get buried. Collateral damage is a big issue for me as a webmaster, but it means nothing to users, and therefore little to Google, MSN, and Yahoo.

As for autogenerated sites, you're right that it's easy to catch most of them. But while a 99.5% (I'm just picking a big number here) success rate may sound really good, it's not so hot when you consider the volume of junk being generated that way. If you autogenerate one million pages and 99.5% get detected, that's still 5,000 pages of spam that just got through, presumably highly optimized spam that will rank well. I'm always stunned that we don't see more irrelevant crap than we do.

2by4

12:31 am on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Joe Searcher can't really tell."

I think this is true in the sense that joe searcher isn't aware of this problem, but when you start talking about a year's worth of collateral damage, something starts happening. It's not a conscious thing in joe searcher's mind, I suspect, but I think he does notice that something is different. This creates a weakness that can be exploited by competitors, i.e. msn.

It doesn't have to be exploited overnight, and I'm not very convinced that msn will have what it takes to really go after google in terms of providing high quality results, at least not for the coming 6 months, but after that, we're talking about an increasingly large percentage of the total web.

If Google and Yahoo were operating in a complete vacuum I'd agree that there would be little that would make joe start looking elsewhere at some point, but that's not the case any more. The final nail in the coffin for netscape 4x was IE 4, which had radically superior CSS and javascript handling, and simply wiped out Netscape as a meaningful competitor.

It's a risky strategy. It's working now: Google got their 55 billion, and the main shareholders will get their blocks of cash when they sell off their shares, but then G is going to have to get back to work on its core mission of providing the best, most up-to-date results around. From this year's web too. And last year's.

It will be interesting to see what msn does once they start getting it working, that should be a while, and I don't see myself using them much anyway.

What will really be interesting is to see how msn deals with spam, and how good a job they are doing. They are not holding back last year's new sites currently, will that continue? I suspect so. Will yahoo begin doing full indexing of sites again? Doesn't seem like they're making any moves to do that, quite the contrary, they don't seem very interested in really competing, but that could change with management changes.

AmericanBulldog

2:22 am on Jan 31, 2005 (gmt 0)

10+ Year Member



2x4,

Perhaps they are achieving 95% success... My guess is you are seeing the tip of the iceberg...

clever little algos that are at fault
exactly!

If you autogenerate one million pages and 99.5% gets detected, that's still 5000 pages of spam that just got through, presumably highly optimized spam that will rank well. I'm always stunned that we don't see more irrelevant crap than we do.

YES, you nailed it: 5,000 still get through, and yes, it's highly optimized. People who build automated pages aren't going to try and sell you red widgets if it's the green widgets you're after. Automated does not equal irrelevant.

webhound

7:18 pm on Jan 31, 2005 (gmt 0)

10+ Year Member



Well, in exploring the idea of using humans in combination with algos, which I still think is the only way to go: this could be used to add additional factors into the algo that simply aren't there now. Techniques such as backlink blogging and other backlinking methods used to manipulate SERPs could be caught and added into the equation. Looking back a year in G or Y shows that SERPs have gotten worse, not better, despite all the "updates" we've witnessed in that time frame. I simply don't believe any algo, however cleverly crafted, will be effective without human intervention. Partnering humans with the algo would be the best way to catch as much spam as possible and would eventually build an algo that works. Just my opinion of course, but I would love to see one of the engines get on this.

2by4

9:16 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No matter what, the algo has to be written from the ground up to detect and delete spam. Neither google's nor yahoo's was, so they are forced into increasingly convoluted schemes to handle such issues. That's known as spaghetti code; Windows 98 was a legendary example of the problem, and I'd say the google algo is fast approaching 'full rewrite required' status. And I'd go one further: given that there has been no sign of this happening, despite frequent comments to the effect that 'google is all powerful, they would have already fixed it', my guess is that Google is in fact doing a full rewrite of their system, and it's taking a long time. Hopefully they are, anyway.

I'm still slightly confused by one thing, however: the claim that spammers can generate far more sites automatically than google can ever catch. What kinds of numbers are we talking about here? In the email spam world, I've read estimates that if you shut down something like 100 spammers, most of whom apparently live in Florida, at least half the world's spam would be shut off. And creating spam emails takes almost no skill: just download some software, get some SMTP servers, buy a CD-ROM of the latest email addresses, and away you go. Generating new websites is much more complex than this, with many more steps, and you actually have to spend some money in the process.

So how many of these seo spammers are there? My guess is that there are not very many. And the number of sites they can and will make is quite finite, well within the ability of human editors to keep up with, especially once the initial curve is broken through.

Given that they all use a few basic applications to do this job, those applications will leave clear signatures. A fully automated system is going to be looking more at individual pages, and has no judgement. But an automated system tuned to detect what humans have judged to be spam could easily construct very precise page hashes that would be very difficult to avoid. Obviously the best thing you can do as a search engine company is to hire some real spammers, who know and are known in the network, and keep up on the latest automated techniques those spammers use.
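
A rough sketch of the "page hash" idea: fingerprint the tag skeleton of a page rather than its words, so pages stamped out by the same generator collapse to one signature that a reviewer only has to judge once (everything here is illustrative, not any engine's actual method):

    import hashlib
    from html.parser import HTMLParser

    class TagSkeleton(HTMLParser):
        """Collect tag names and attribute names, ignoring text and attribute values."""
        def __init__(self):
            super().__init__()
            self.tags = []

        def handle_starttag(self, tag, attrs):
            self.tags.append(tag + ":" + ",".join(sorted(name for name, _ in attrs)))

    def structural_fingerprint(html):
        parser = TagSkeleton()
        parser.feed(html)
        return hashlib.sha1("|".join(parser.tags).encode("utf-8")).hexdigest()

    # Once human reviewers mark a few pages carrying a fingerprint as spam,
    # every later page that shares it can be flagged with no further human work.
    known_spam_fingerprints = set()

    def looks_autogenerated(html):
        return structural_fingerprint(html) in known_spam_fingerprints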

Anyway, it's nice to dream. Google's dual indexes are now confirmed by google itself, not that it wasn't transparently obvious in the first place. We'll see what happens. My guess is that we'll see an equivalent of Windows ME, or maybe that's what we're seeing now: a temporary hack designed to keep certain systems running, and to guarantee AdWords income, until the real new system is installed.

BeeDeeDubbleU

10:25 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I simply don't believe any algo, however cleverly crafted, will be effective without human intervention.

Webhound, I have been saying this for months in various parts of this forum. It's clever people who create the programs that beat the algorithms. Perhaps a bit unethical at times, but clever nonetheless. To even suggest that an algorithm can beat them is just plain stupid.

When humans are eventually used to combat this, it will be stopped virtually overnight. 2by4 is correct in saying that a combination of both could be used, but humans will always be required to screen the results. Common methods are used by these people, and they do leave signatures. These signatures would not be hard for trained humans to find. Once found, they could be swiftly eliminated, and I don't think this would take an army of people.

The real problem is that the spammers generate a large proportion of Google's Adwords revenue so currently it is not in their best interests to do anything about it.

2by4

10:47 pm on Jan 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"To even suggest that an algorithm can beat them is just plain stupid... The real problem is that the spammers generate a large proportion of Google's Adwords revenue so currently it is not in their best interests to do anything about it."

BeeDee, as always there are different reasons google is not doing much currently; your two points suggest the main two:
1. A stubborn refusal, born out of a corporate culture that is set on maintaining a fully automated system, in the mistaken belief that great programmers can in fact beat the problem. But the problem with this belief is, as you noted, that it's just plain stupid. There is nothing in the history of AI to suggest this is possible. This was a dream in the 70's, 80's, 90's, and now the 2000's, and it has failed decade after decade. Why? Because it's a stupid idea. Computers are fast adding machines. That's it. Period, end of story. IBM's chess systems are simply super-fast adding machines that run through a stunning array of options at stunning speeds. They don't have judgement, and trying to emulate human judgement is stupid and arrogant, and is something I would expect only from somebody who has spent so much of their life programming that they no longer understand what it means to be alive and human, and has come to confuse the computer with life. Because they work so much with computers, they start to think that our minds work like computers, and then go on to think, well, it's just a matter of emulating that computer-like behavior with a computer. But only people who work with computers all the time think our minds work that way; nobody else does. Thus the correctness of 'it is stupid'.

2. Follow the money. Happily no complex explanations are needed for number 2, although some people seem to have difficulty with this simple element of business 101.

gmiller

12:00 am on Feb 1, 2005 (gmt 0)

10+ Year Member



Frankly, I don't see how you can say that Google's algorithms weren't designed from the start to deal with spam. One of the biggest advantages of global PageRank was that it severely weakened traditional spamming techniques and made it harder to manipulate SERPs. It wasn't a complete solution, but they were clearly thinking about spam from day one. That leap forward is probably the biggest reason why there's so much less obvious spam getting through than in the InfoSeek/AltaVista days.

I agree that the presence of fresh sites is an advantage for MSN. I use MSN myself when I need something that's likely to be mostly on recently added sites. But if they started letting exceptionally high amounts of spam through, the false negatives will do them more harm than Google's false positives ever could. It'll be interesting to see what happens over the next few months as people begin to target MSN to a higher degree.
