| This 51 message thread spans 2 pages: 51 (  2 ) > > || |
|Flavors of Spam|
Some Kinds Are Worse Than Others
Everyone complains about Spam, but the single term hides a multitude of different problems, and different people often seem to mean different things when they use it. I know how we use the term here at Microsoft, but I would be interested to hear your ideas about it; there seems to be enough difference of opinion to make for an interesting discussion.
Here are a few questions to get us started:
Does it make sense to talk about a hierarchy of spam? For example, at the bottom we could put pages that are so bad they’re completely useless. (E.g. a page of gibberish surrounded by ads.) At the top would be quality or authority pages that look great until you view the source or look at the inbound links.
CAN a quality or authority result ever really be spam?
Are affiliate sites “spam by definition?”
Is spam “worse than useless?” Is it worth losing a quality or authority result to get rid of a spam result?
I think we had a very productive discussion about quality and authority last week, so I’m hoping we can repeat that.
|CAN a quality or authority result ever really be spam? |
It is possible, if the means by which spam is detected is statistical, and the site is constructed similarly to other sites that have been marked as spam.
|Is spam “worse than useless?” Is it worth losing a quality or authority result to get rid of a spam result? |
IMO it depends on the individual. Some people are so irked by spam that they'll be willing to lose (access to) a few nuggets just to eliminate all the trash they see.
|I know how we use the term here at Microsoft,... |
Well then, why not enlighten us.
After all, what the SEs think is Spam is what matters, right?
Google's definition of spam is actually perfect: "Trying to deceive (spam) our web crawler..."
Trying to deceive = spam
The number of authority sites that spam significantly is amazingly puny. Given the status quo, losing such sites if all spam disappeared would be an outstanding improvement.
But spamming should not necessarily be a death sentence. An authority that spams should just have its score lowered, a little or a lot. (Most spam sites have zero authority, so any scoring demerit wipes them out to zero.)
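To make the "lower the score, a little or a lot" idea concrete, here is a minimal sketch. The 0-100 scale, the function name, and the simple subtraction are all assumptions made up for illustration; nothing in this thread says any engine scores sites this way.

```python
def apply_spam_demerit(authority: int, demerit: int) -> int:
    """Lower an authority score by a spam demerit, never going below zero.

    Both values sit on a hypothetical 0-100 scale invented for this sketch.
    """
    if authority < 0 or demerit < 0:
        raise ValueError("authority and demerit must be non-negative")
    # Flooring at zero is what wipes out a site that had no authority anyway.
    return max(0, authority - demerit)


if __name__ == "__main__":
    print(apply_spam_demerit(80, 15))  # an authority site takes a hit but survives
    print(apply_spam_demerit(0, 15))   # a zero-authority spam site stays at zero
```

The point of the floor is that the same demerit is a slap on the wrist for a real authority and a knockout for a site with nothing else going for it.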
Using the above definition, an affiliate site is not by definition spam. There isn't any relationship between the words affiliate and spam. It becomes spam, for example, if an affiliate pretends to be the official site, but a properly labeled affiliate site is not spam in any way. If it says what it is, and doesn't try to trick anybody, it isn't spam (even if it may be a poor quality result).
Low quality does not equal spam. Almost all spam is poor quality, but poor quality sites are not necessarily spam.
A page of random text is spam because it is trying to pretend to be about a topic, not because it is useless.
Spam is spam, but some spam is worse than others... two words of hidden text is not the same as 200,000 blog links. Spam has a hierarchy but there is also a quality hierarchy that exists in parallel. They aren't one and the same.
But we want to hear YOUR thoughts first. :-) I promise I'll share our thoughts once there are enough comments from others, but I fear that if I post that first, we won't have as much variety; too many people will focus on arguing for or against whatever I post on the topic.
Does that make sense?
Well, first I'd say that Google's definition of 'spam' is far from perfect. Adding 600 or 600,000 pages of mediocre content to a site in an attempt to rank every known keyphrase variation isn't deception. But I call that 'content spam'. In addition, it's not the 'crawler' that's deceived, it's an algo attack.
Then there's 'comment spam', which is again, designed to attack an algo.
Some people call cloaking 'spam' but if the cloaked page that is presented is almost the same with regard to actual content presented to the end-user, why should that be considered spam?
Why should affiliate sites be considered 'spam'? If a Ford dealer opens multiple locations is he 'spamming' a town, a county or a state? Are franchises merely 'spam'? Of course not.
Paid link spam? Couldn't that just be advertising?
Keyword stuffing? That's not spam, that's just stupid and it's an algo defect if that technique works.
So MSNDude, are we talking about 'spam', or are we talking about defective algos. You see, if a technique works, it doesn't matter what you label it, people will use it. Rather than try to combat 'spam' why not promote quality? Ahh yes, but quality is subjective too eh? Much like 'spam'.
If someone 'spams' their way to the top of the SERPs for red widgets, but the end user finds and can order those red widgets, do they even notice spam? No. It's SEOs that look for 'spam'. ; ) It's when the user can't find what they're looking for that they tend to get a little bent.
Isn't it really about relevance? Isn't the goal to provide relevant results? If it isn't, shouldn't it be? Because if the goal is to eliminate 'spam' it will never be achieved.
I haven't seen any definition yet that comes close to achieving consensus on what 'spam' is and I doubt that I ever will. I do know that a majority can agree on what is relevant to a particular query.
>>Is it worth losing a quality or authority result to get rid of a spam result
Ditch quality to remove spam? Nah.
I pick one area to comment on:
|Are affiliate sites “spam by definition?” |
No, not at all. If that were true, then an SE would have to ignore the many, many quality affiliate sites that offer unique perspectives on the products and services they promote. Folks write quality product reviews and travel essays, give personal insights and experiences.
Good affiliates are able to take the sponsor's product and service and speak to many different niches, something that is just about impossible for one company to do. Good affiliates bring value to both the program sponsor and site visitor, make an honest buck doing so, and add a lot of quality content to the 'Net.
Here's a real-world example: Let's say on the 'Net we have a motherboard manufacturer's site that offers direct sales, a major computer retailer's site, and an affiliate of that retailer. The manufacturer lists the boards, benefits, features and specs. The retailer lists the same board with, of course, exactly the same laundry list of benefits, features and specs. Along comes the gung-ho geek, an affiliate of the retailer, who writes a detailed review of the motherboard and provides side-by-side comparisons with boards of the same type from other manufacturers. Which site actually offers more value to the visitor?
Now, how you algorithmically differentiate between "spammy" affiliates and value-added affiliates is a problem I'm glad I'm not faced with, just as long as it doesn't affect any of my affiliate sites ;-)
For me, at the top of the list, are serps where all of the listings have the exact same type of title, same wording, with all caps, exclamation marks, etc.
For one keyphrase I monitor, I see the top 10 to 20 pages have titles like
"Best Widgets ONLINE. Get THEM NOW!"
Furthermore, many of them are all subs on the same domain.
I know you guys are working on the subdomain issue, but I think spammy titles designed to increase CTR (with symbols, excessive punctuation, all caps, etc) should also be at the top of the list.
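For what it's worth, the "shouty title" pattern is easy to sketch. This is only a rough illustration: the function names and both thresholds are assumptions picked for the example, not anything a search engine is known to use.

```python
import re


def title_shout_signals(title: str) -> dict:
    """Count two crude signals: share of ALL-CAPS words and exclamation marks."""
    words = re.findall(r"[A-Za-z]+", title)
    # Single letters such as "A" or "I" are ignored so normal titles are not penalised.
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "caps_ratio": len(caps) / len(words) if words else 0.0,
        "exclamations": title.count("!"),
    }


def looks_shouty(title: str, max_caps_ratio: float = 0.3, max_exclamations: int = 1) -> bool:
    """Flag a title when either signal exceeds its (assumed) threshold."""
    signals = title_shout_signals(title)
    return (signals["caps_ratio"] > max_caps_ratio
            or signals["exclamations"] > max_exclamations)


if __name__ == "__main__":
    print(looks_shouty("Best Widgets ONLINE. Get THEM NOW!"))
    print(looks_shouty("Detailed review of the XYZ motherboard"))
```

A real filter would have to allow for acronyms and brand names, which is why the sketch uses a ratio rather than flagging any capitalised word.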
I stickied you an example of a search result where the first 100 results are subs on the same domain.
Whether it makes sense is debatable I suppose, but at least it's predictable. Too bad really.
"Bully Spam" the stuff that intentionally forces destructive or negative digital behavior, (unwanted forced downloads, etc) on surfers, is what bothers me the most.
|Are affiliate sites “spam by definition?” |
If a site with affiliate program advertising suddenly switches to a CPM advertising model, it's no longer spam.
SPAM = Sites Positioned Above Me
Sorry, but webmasters all have their own ideas on what constitutes spam and will moan over say a site listed above them because its perhaps better optimised for the keywords than their own.
It makes good business sense for a site about blue widgets to carry some affiliate adverts for "Blue Widgets" if it has the space - i dont have any issue with that. If the site holds good quality unique original content and some affiliate advertising, it provides value to the net imo.
In my mind a site is spamming if it has more than a dozen sub-domains, maybe less depending on the size of the site. I see zillions of sites on the net that run area1-subdomain, area2-subdomain and so on up to area3000-subdomain; those sorts of arrangements are always spam imo.
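The "more than a dozen sub-domains" rule of thumb can be sketched like this. It is an illustration only: the function names are made up, the limit of 12 is just the poster's "dozen", and treating the last two host labels as the base domain is a naive assumption that breaks on suffixes like co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def subdomain_counts(urls):
    """Map each base domain to the number of distinct subdomains seen under it."""
    seen = defaultdict(set)
    for url in urls:
        host = urlsplit(url).hostname
        if not host:
            continue
        labels = host.lower().split(".")
        if len(labels) < 2:
            continue
        base = ".".join(labels[-2:])   # naive: assumes a one-label suffix such as .com
        sub = ".".join(labels[:-2])
        if sub and sub != "www":       # a bare "www" is not counted as a subdomain
            seen[base].add(sub)
    return {base: len(subs) for base, subs in seen.items()}


def domains_over_limit(urls, limit=12):
    """Return base domains with more than `limit` distinct subdomains."""
    return sorted(d for d, n in subdomain_counts(urls).items() if n > limit)


if __name__ == "__main__":
    sample = [f"http://area{i}.example.com/" for i in range(1, 14)]
    sample.append("http://www.example.org/")
    print(domains_over_limit(sample))
```

A count like this only says where to look; a big site that uses subdomains for honest organisation would trip the same wire.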
Likewise sites that have no original content or static pages that are just plain autogenerated dynamic pages designed just to carry adwords and nothing else.
In my experience of working on sites spammers generally dont put the time into building any meaningful content - its nearly always dynamic autogenerated pages they use because thats easier to do than building static pages with original content, and its always with copied content.
You dont find many large content sites with original content static pages with good inbound links that are spamming. Some may be well optimised, but that doesnt make them spam.
The acid test imo is if the site brings useful content to the net. If none of the site's content is of use to the end user then its spam.
I thought it was "Someone's Page Above Mine." :-)
It is actually useful to make a distinction between pages that are completely worthless vs. pages that do have some value. If a page is a useless result for any imaginable query, we call it "junk" not spam. A junk page could be "under construction" or it could be gibberish surrounded by ads, or even a page full of fake links. The general idea, though, is that no customer would ever want to see this page on purpose.
Are you arguing that there is no spam other than junk pages?
A few days ago, MSNDude and I had a short sticky discussion about a particular page/site and he indicated that MSN considers a particular linking method to be spam.
I disagreed and gave some examples of topic areas where it made a lot of sense to a human being to categorize and link that way. He just stopped talking about it.
I then talked in more detail about that to my husband and he laughed.
"'Taxonomy' and 'Ontology' are considered spam by MSN?" he said. "What will the librarians think of that when they find out?"
He says to tell you that he uses ontology as used by INCOSE rather than by metaphysicians and that he uses taxonomy in the sense of the definition below rather than restricting it to organisms:
"a set of controlled vocabulary terms, usually hierarchical. Once created, it can help inform navigation and search systems. An example of a simple or "enumerative" taxonomy: United States New York State New York City Manhattan"
If the search engines think that taxonomy and ontology are spam then that explains a lot about the lack of real authoritative sites in several key areas of knowledge on the Internet.
If you are familiar with the reference above, try searching for the authoritative work on the subject. It takes a five word search phrase to bring it to the top of MSN search results.
To quote my husband again ... "If they don't understand taxonomy and ontology then we certainly should not be allowing them to re-define and corrupt our epistemology."
Knowledge is not spam and knowledge organized into categories is definitely not spam.
From a customer/searcher perspective, I think of spammy results as being those that have a combination of low-quality and hard-sell. Spammy results include pages that seem to be pretending to be higher-quality than they are, AND they don't answer my query. From a customer perspective, I don't care how the top result got there, as long as it answers my query.
|Are affiliate sites “spam by definition?” |
Of course not, if an affiliate can give me good product information I can't find elsewhere, they may be a high-quality authority site.
Spam is just an unwanted or unexpected result, one that has no use to the searcher.
You cannot define spam by how it got into a result set.
The origin of the term "Spam" comes from the Monty Python song and refers to the repetition that drives most (or all?) Spam. A fair definition would be a bad search engine result caused by someone doing something over and over again where once should have been enough, but I'm not sure it'll work to just call ALL bad results "spam."
|I haven't seen any definition yet that comes close to achieving consensus on what 'spam' is and I doubt that I ever will. I do know that a majority can agree on what is relevant to a particular query. |
Certainly, in the email world, there is no consensus on what spam is. The solutions proposed have all drawn their share of contention from parties that feel aggrieved by them.
Spam in a search engine context is just a rule. It's not something that needs consensus. It's a pronouncement. Google's definition covers everything perfectly. MSN could make a similar definition, or something else. If Joe Smith says "hidden text isn't spam", that doesn't matter. Joe can do what he wants. MSN can do what they want.
As we can see, some people would love to teleport to a planet where weak content is called spam, where you get shot for writing a mediocre movie review. That's just not helpful or sensible. "Spam" does not equal sucky quality. It's not exactly just a coincidence that this is the case, but the two aren't joined at the hip. (Other aspects of the algo deal with quality; "spam" does not need to be the ONE WORD that covers every reason for not ranking something.)
What I wish msndude would have asked is: "People use a lot of tactics to try and trick us into ranking their websites higher than they deserve. Which of these tactics are worse than others?"
Top of my list: any search result for site1.com that when you click the result you go to site2.com/?trackinglink1234
These results never have any merit.
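As a rough sketch of that check: compare the site in the listing with the site you actually land on, and see whether the landing URL carries a query string. The function names are hypothetical, and treating any query string as a tracking code is an assumption made purely for the example.

```python
from urllib.parse import urlsplit


def _site(url: str) -> str:
    """Return the lowercased host with a leading 'www.' removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_offsite_tracking_redirect(listed_url: str, landing_url: str) -> bool:
    """True when the landing page is on a different host and has a query string."""
    has_query = bool(urlsplit(landing_url).query)
    return _site(listed_url) != _site(landing_url) and has_query


if __name__ == "__main__":
    print(is_offsite_tracking_redirect("http://site1.com/",
                                       "http://site2.com/?trackinglink1234"))
    print(is_offsite_tracking_redirect("http://site1.com/",
                                       "http://www.site1.com/page?x=1"))
```

It needs the landing URL as input, so something still has to follow the redirect first; the sketch only covers the comparison.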
People use a lot of tactics to try and trick us into ranking their websites higher than they deserve. Which of these tactics are worse than others?
I gave two really good specific spam examples in private messages to msndude around June 14th or 15th.
The first one which is obvious spam has moved from being number 1 to being number 2 for the search term while the "real site" has moved from number 2 to number 22.
This is the case of the hacker from Poland who acquired several hundred Yahoo related properties and converted them to "nefarious purposes."
The other specific example has moved up from the third page of msn results to number 20. This is better than the number 1 ranking it had for the last two years but it still astounds me that the topic in question could be considered on topic for the kind of sites which are being promoted in the hidden links.
That site is still ranked number one on Yahoo.
Hidden links are always spam, whether by using css or text/background color or whatever.
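A minimal sketch of spotting the simplest cases, using only Python's standard html.parser. It is built on loud assumptions: it looks only at inline style attributes on the link itself, ignores external stylesheets and parent elements, and compares colour values as plain strings (so #fff and #ffffff would not match). The class and function names are made up for the example.

```python
from html.parser import HTMLParser


def _parse_style(style: str) -> dict:
    """Turn an inline style string into a lowercase {property: value} dict."""
    decls = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            decls[name.strip().lower()] = value.strip().lower()
    return decls


class HiddenLinkFinder(HTMLParser):
    """Collect hrefs of <a> tags whose own inline style hides them."""

    def __init__(self):
        super().__init__()
        self.hidden_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        style = _parse_style(attr.get("style") or "")
        # Text colour identical to background colour is the classic hidden-text trick.
        same_color = "color" in style and style["color"] == style.get("background-color")
        if (style.get("display") == "none"
                or style.get("visibility") == "hidden"
                or same_color):
            self.hidden_hrefs.append(attr.get("href"))


def find_hidden_links(html: str) -> list:
    finder = HiddenLinkFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_hrefs


if __name__ == "__main__":
    page = ('<a href="/a" style="display:none">x</a>'
            '<a href="/b">y</a>'
            '<a href="/c" style="color:#FFF; background-color:#fff">z</a>')
    print(find_hidden_links(page))
```

Anything hidden through a stylesheet, a wrapper element, or off-screen positioning slips straight past this, which is part of why "or whatever" is the hard bit.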
Yahoo, MSN and Google all failed on this one and Google continues to be the most colossal failure since they own the site and can turn it off any time they want.
i like subdomains and use them but just for better organisation of my sites.
but MSN is full of single pages located on subdomains. and that sucks.
pages waiting for content
the 2nd largest travel site (that is what they call themselves) uses that technique to get more pages indexed and gets its pages ranked for search terms in google. try searching for some not very popular travel destinations in europe. the domain itself has a lot of trust so pages go up automatically.
BTW. MSNDude, do they force you to use MSN Search at work? ;-)
msndude, you're shifting gears here, you start broad with: what is spam?
|Does it make sense to talk about a hierarchy of spam? For example, at the bottom we could put pages that are so bad they’re completely useless. (E.g. a page of gibberish surrounded by ads.) At the top would be quality or authority pages that look great until you view the source or look at the inbound links. |
now you are fishing for something else: our spam techniques :0
|People use a lot of tactics to try and trick us into ranking their websites higher than they deserve. Which of these tactics are worse than others? |
Spam techniques will evolve with your ranking algo; on your end its a moving target. You could just cave in and do a sandbox kind of thing like some other McEngine and return less than stellar results, or you could just watch crawl logs go by (tail), see it live, and make calls from that. It requires human intervention, but at the end of the day its about humans vs. humans, not algos (which can be beat).
Ivory tower IR won't work with whole web crawls because there is no reliable classification system.
But I will give one tip:
go light on PhDs ;)
"Are you arguing that there is no spam other than junk pages?"
Nope, what im saying is that the majority of so called spam is in fact junk. Pages of no real use to the end user - like the one i sent you by sticky mail (junk rather than spam).
IMO its a small amount of sites on the net that are spam sites designed to get traffic to sell on or just for harvesting email addresses whilst offering no genuine original or useful site content to the end user.
I dont think msn need to worry about spam as such but more about quality control - You have far more junk sites and sites of little content, blogs, sub-sub-sub domain junk, non authority sites ranking over authority sites and general dross listed in your serps than you do have spam sites. I just think the word "Spam" gets associated with "Junk".
If a serps result takes me to site a) and i get Java re-directed to site b) then fair enough, take the relevant action.
If a site has nothing but location-subdomains listed on it - its spam, take relevant action.
If a site is thin content, coming soon, a blog, low quality or of no use to the end user then its junk - again take the relevant action.
If a site has lots of genuine content to it, links to it, but perhaps has high keyword density on a page and has relevant outbounds then its unlikely to be spam imo. But this kind of situation can be easily mistaken - remember spam teams can check density levels easily of autogenerated pages whilst a webmaster adding content to their site is not as likely to imo.
What you dont want to have is good content deep sites being held back in your serps because they trip a couple of your filters you put in to stop spam whilst thin content junk sites slip through which is currently what i see in a number of search results.
Quality control is key imo - perhaps striking a deal with Yahoo to use their directory data (which is the only large unbiased directory on the net imo, rather than outdated dmoz) OR better still start working on your own directory would be a way to introduce some additional quality control to your serps.
Any automated system you introduce will struggle imo to weed out all junk and spam because you dont have the history to your search data or link data that Google and Yahoo have.
>OR better still start working on your own
>directory would be a way to introduce some
>additional quality control to your serps.
Now there is an interesting comment.
MSN does have a directory for listing small business sites but it does not appear to be valued very much by Y, G, or M.
IdolW: Nope. Microsofties are free to use whatever Search Engine they please. We do encourage people to try ours first and only use a competing one if ours fails them, but Microsoft doesn’t compel anyone to do this.
TypicalSurfer: I shifted gears in response to a request from steveb. Really, though, I don’t mean to dominate the discussion. And I certainly didn’t expect people to tell me their secret SEO tricks. :-)
RichTC: You are correct that there is far more junk than spam. And it is very hard to keep from throwing the baby out with the bathwater.
FWIW, I tend to think of SERPs as being composed of five elements:
1) Great Stuff: I searched for "buy blue widgets" and I get a search result with a link to a site where I can "buy" "widgets" that are "blue". The selection of "widgets" is broad and the content is recent.
2) Good Stuff: I searched for "buy blue widgets" and I get a search result with a link to a site where I can "buy" a "widget" that is "blue".
3) Junk: I searched for "buy blue widgets" and I get a search result with a link to a site with an article about a company that made "widgets" declaring bankruptcy, that was written in 2004.
4) Crap: I searched for "buy blue widgets" and I get a search result with a link to a site where I can "learn about" "vaguely widget-like thingamajiggers" that are "green".
5) Spam: I searched for "buy blue widgets" and I get search results with a link to a site where I can pick from hundreds of links to hundreds of one-page sites or sub-domains, each of which channels me into whatever affiliate program the "Spammer" is promoting. (Not all affiliate programs are spam.)
(Ends rant, climbs down from soapbox)
The worst kind of spamming is not gibberish, but when someone steals content and then uses some SEO tricks to (try to) drown out the original content.
Gibberish pages are reprehensible, but not as bad as really stealing content, because stealing content does more harm to the parties who create the content that makes the web useful.
At first blush, a page of "borrowed" content might not seem terrible (from the user's perspective at least) but who will make tomorrow's content if the fruits of labor get stolen? (Less new original content is bad for the search engine, too.)
An interesting issue brought up in another forum (#*$!) was that France's laws on, and treatment of, spam are much different. They allow certain businesses to send unsolicited emails to businesses who may be in the same niche and/or be linked to their product. The wording was somewhat unclear since my French is hazy, but this did cause some headaches for web hosts. Some ultimately chose to abide by the laws where their server is located, whereas others cited their AUP/TOS, wherein such disparities were discussed.
Generally, when I think of Spam in a search engine context, I think about fake content which has been assembled in great volume. But, that isn't the only type of SE Spam.
SE Spam usually involves deception (such as fake content) for the purpose of tricking the search engines into giving a site more visibility than it would otherwise have.
Spam and Quality are two concepts that tend to go hand in hand, but they are not the same thing.
A site may have high quality but it may rely on Spam techniques to get more traffic (e.g. creating thousands of fake web pages to inflate the apparent size of the site, or to create the appearance of more link popularity than actually exists).
Spam is detrimental to the search engines and their customers because it creates "noise" or pollution, which makes it harder to find the best documents matching any given query.
Spam is also detrimental to society for roughly the same reasons that vandalism, theft, and mislabeling of merchandise are detrimental – resources are poured into activities that do not contribute any real value. These unproductive activities are profitable for the entity engaging in the activity. Furthermore, if the Spam is successful, it becomes harder, or impossible, for competing sites that don't engage in deceptive practices to survive or prosper.