Forum Moderators: mack
The link below is to the white paper on this at MSN Research.
"Automatic and Systematic Discovery of Search Spammers through Non-Content Analysis"
"A common approach to detecting spam web pages is through content analysis based on classification heuristics [2,3]. In this report, we propose an orthogonal context-based approach that uses URL-redirection analysis. Our work was primarily motivated by two key observations:"
And according to News.com it is now in use.
[news.com.com...]
Agree, but on the other hand, doesn’t an engine have more to gain vs. that type of collateral damage? MSN could take the view that Pepsi would not and could not take out Coke … sort of thing … In other words, trusted and important web sites could have a free pass. Engines are going to take the gloves off, they have little choice.
Now you also know why I've been persistently asking if anyone thought their site had been wrongly blacklisted, and why I've been checking those out personally. I'm happy to report, by the way, that all the mistakes I found during this period were generated by the old system; so far I haven't seen a single false positive from the new system -- at least, not affecting anyone on Webmaster World.
Finally, it's certainly true that someone could do this to try to blacklist a competitor (or even the New York Times), but, as you say, we have a plan for that too -- a bit more sophisticated than "give big sites a free pass." :-)
While I'm sure MSN will try to come up with something to prevent that, I can't really see them implementing something that is 100% foolproof - after all, this is Microsoft.
As compared to who exactly? The almighty Google? Kings of the Perpetual Beta and Bad Data Push.
A great point from the whitepaper:
Similarly, advertisement syndicators can detect potential spammers by monitoring those customers who serve ads on a huge number of different URLs through a single account because it is highly unlikely that anyone can generate quality content at that scale.
TAILOR MADE FOR COMBATING MFA (MADE FOR ADSENSE) PUBLISHERS.
And get this:
The ranked Top Domain list is then used to prioritize manual investigation.
Being actually willing to use human processes instead of depending on automated algos. Kudos to MS for being willing to spend money to make money. Unlike those other skinflints.
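For anyone wondering what a "ranked Top Domain list" might look like in practice, here is a rough sketch. The thread doesn't say how the paper does its ranking, so this assumes the simplest reading: count how many distinct doorway pages redirect to each destination domain and sort by that count. The function name, the sample URLs, and the rule of ignoring same-host redirects are all made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

def rank_redirect_targets(redirects):
    """Rank destination hosts by how many distinct doorway URLs redirect to them.

    `redirects` is an iterable of (doorway_url, final_url) pairs, assumed to
    come from a crawler that follows each redirect to its final page.
    """
    doorways = defaultdict(set)
    for doorway_url, final_url in redirects:
        source = urlparse(doorway_url).hostname
        target = urlparse(final_url).hostname
        # Assumption: a redirect within the same host is not interesting here.
        if source and target and source != target:
            doorways[target].add(doorway_url)
    ranked = [(host, len(urls)) for host, urls in doorways.items()]
    # Most doorway pages first; ties broken alphabetically for stable output.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))

# Hypothetical crawl data, not taken from the paper.
SAMPLE_REDIRECTS = [
    ("http://a.blog.example/1", "http://target-one.example/land"),
    ("http://b.blog.example/1", "http://target-one.example/land"),
    ("http://c.blog.example/1", "http://target-two.example/"),
    ("http://normal.example/old", "http://normal.example/new"),
]

for host, count in rank_redirect_targets(SAMPLE_REDIRECTS):
    print(host, count)
```

The top of a list like this is where a human reviewer would presumably start looking, which is the "prioritize manual investigation" part of the quote.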
[edited by: cabbie at 12:39 am (utc) on July 14, 2006]
I noticed it MSNdude. I run a research site that showed MSN.com beating the snot out of the other search engines over the last couple of months (in terms of relevance). We knew something was up. It's amazing how fast this worked. In a matter of months the SERPs changed drastically for local search. Well done.
You said
Following my rule that "I won't talk about it if you don't notice it" was very frustrating this time!
Trust me, I noticed the enormous improvement in my sector. I imagine there are some folks out there who took advantage of the old system that are very frustrated now that their spam pages are gone.
Furthermore, it is amazing to me that you have been so responsive to comments, concerns, and some very harsh criticism (at times). But, I think that receptiveness/attitude is consistent with the business philosophy that helped make Microsoft what it is today. You guys are not only extremely determined and intelligent, but very humble; it is the humbleness that will continue to make MSN search a big player in my opinion. Thanks for listening, and taking such quick action.
It is very smart of you guys to take advantage of all this free labor (i.e., the hundreds, maybe thousands, of eyes out here watching every sector for you and sending in reports). I can't say as much for some of your competitors who won't so much as return an email.
What more can I say? I know this has been off topic, but the expeditious improvements I have seen deserve recognition. Good work.
Chris
I was kind of disappointed that no one noticed that most of them went away fairly suddenly. Following my rule that "I won't talk about it if you don't notice it" was very frustrating this time!
I do see improved results in MSN Search lately when it comes to redirects. I agree that people do not praise MSN Search enough for the accomplishments they have made. However, I do still see a lot of subdomain sites in the SERPs (blogspot.com). I am glad that the Strider Search Defender Team is focused on improving these. We could end up with far better results in MSN than we see with Google and Yahoo.
I agree that MSNDude has been very humble and helpful to all of us here in the Webmaster World forums and we are very glad you are here. MSNDude has been here to get feedback like I have never seen before. Maybe this feedback will help MSN become the #1 search engine based on quality.
Yahoo is also having difficulty with these blogsite -> redirect problems. I guess we could open up a whole new forum just on these splog sites alone.
Similarly, advertisement syndicators can detect potential spammers by monitoring those customers who serve ads on a huge number of different URLs
I thought that "side-by-side" that Microsoft pulled at the Vegas Pubcon last year didn't go over so well with this crowd... I guess it was considered a success by the Microsoft team.
Added: Ahh, I see now that MSN is also referenced in the paper as having the same issue. Good on ya.
Sure, tracking down common site ownership and control can be done using the Adsense account number so it does have something to do with Adsense. There are people who crank out a boat-load of domains (URLs) - MFA sites - and run Adsense on them, with all having the same Adsense account number. How hard could it be for a search engine to track down a ton of domains/URLs all using the same account number?
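To make the "same account number" idea concrete, here is a minimal sketch. It assumes a crawler has already pulled an ad account ID out of each page and handed us (URL, account ID) pairs; the account IDs, the sample URLs, and the threshold of 3 domains are invented for illustration and are not anything MSN or Google has described.

```python
from collections import defaultdict
from urllib.parse import urlparse

def domains_per_account(pages):
    """Map each ad account ID to the set of distinct hostnames serving its ads."""
    hosts = defaultdict(set)
    for url, account_id in pages:
        host = urlparse(url).hostname
        if host:
            hosts[account_id].add(host)
    return hosts

def flag_accounts(pages, min_domains=3):
    """Return (account_id, domain_count) pairs at or above the threshold, largest first."""
    flagged = [
        (account_id, len(hosts))
        for account_id, hosts in domains_per_account(pages).items()
        if len(hosts) >= min_domains
    ]
    return sorted(flagged, key=lambda item: (-item[1], item[0]))

# Hypothetical crawl output: (page URL, account ID found in the ad snippet).
SAMPLE_PAGES = [
    ("http://cheap-widgets.example/a.html", "pub-111"),
    ("http://best-widgets.example/b.html", "pub-111"),
    ("http://widget-deals.example/c.html", "pub-111"),
    ("http://widget-deals.example/d.html", "pub-111"),
    ("http://joes-blog.example/post1", "pub-222"),
]

print(flag_accounts(SAMPLE_PAGES))
```

A raw domain count is only a starting signal. A legitimate network of sites could share one account too, so a list like this would more likely feed a review queue than an automatic ban.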
I've got two things coming to my mind, the first just a trifle:
The way I see it, your report strengthens the said spam domains by linking to them and not even using "rel=nofollow" - ok, most SEs should have banned the said domains by now, but searching for fendi handbags on several SEs is still bringing up the spam domains - with more linkpower by Microsoft...
Second:
I'm using framebusters written in javascript to prevent my domain from being framed by anyone - therefore a user may see a different site (just my domain) than MSN seeing my site framed in another domain ... is your technique able to differentiate between those framebusters and cloaking?
Please excuse any language mistakes - this is not my mother tongue ;-)
Greetings,
Chris
Check out our instructions here:
[search.msn.com...]
Spammers use them as free domain names, so even if they get banned - they lose nothing, and this right now hurts you more than anything else I've seen.
The reason should be easy to see: if you choose to discard entire categories, your errors are cumulative; that is, every good site you lose is lost for good. Do this too many times and you won't have anything left.
It's so simple, even Google was able to "get it", and I have to congratulate them on their "blog search" feature. And they have almost no blogs in the search results I monitor, definitely none in the top 100 results.
And if we have to be real about it - "Joe's blog" would never be able to compete against a normal website, as the website would get far more links than the blog. So all the blogs that rank on top spots on MSN are ONLY spam blogs.
And I am not asking to "discriminate" against blogs, just ask your engineers to spend a day and create a separate category just for blogs. I don't think it will be all that hard to do, and it will greatly clean your results.
*Number of blogspot SPAM subdomains: 23 out of 250 results
*Number of blogspot SPAM subdomains in the first 50 returned results: 12
*Number of blogspot subdomains made by a human: 0/250
*Number of blogspot subdomains which DO NOT redirect: 0/250
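Put as percentages, the counts above work out as in this quick sketch. It assumes the 12 spam results in the first 50 are part of the 23 found in all 250, which is how the list reads but is not stated outright; the function and variable names are just for illustration.

```python
def spam_shares(total_results, spam_total, top_n, spam_in_top):
    """Return the spam share overall, within the top N, and in the remainder."""
    overall = spam_total / total_results
    top = spam_in_top / top_n
    # Assumption: spam_in_top is a subset of spam_total.
    rest = (spam_total - spam_in_top) / (total_results - top_n)
    return overall, top, rest

overall, top, rest = spam_shares(total_results=250, spam_total=23, top_n=50, spam_in_top=12)
print(f"overall: {overall:.1%}")   # 23 of 250
print(f"top 50:  {top:.1%}")       # 12 of 50
print(f"51-250:  {rest:.1%}")      # 11 of 200
```

So the blogspot spam is about 9% of the whole result set but roughly a quarter of the first 50 results, which is where it does the most damage.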
I see no quality being lost from removing blog subdomains from the "Web Search" result, on the contrary...
P.S. Forgot to mention that my website ranks in the top 10 on MSN, so I am not just b*chin' ;)
just ask your engineers to spend a day and create a separate category just for blogs
I'm sure it would take weeks of work by a few engineers (and the interference of dozens of management types) to add a "blog search" (or any "niche" search) to a major search engine.
Toss in the personality of the top management people and you might be into months. Not digging on Bill here... Larry, Sergey, and Terry also bring in overhead (along with their unique positive influences).
The scale of the effort is belied by the apparent simplicity of the final product.
The first problem that would have to be solved to accomplish what's proposed here would be to reliably identify blogs. After all, if you offer a specialized "blog search," people will be unhappy if it finds things that aren't blogs or misses things that are. This could end up just introducing a new source of error which would be added to any current errors.
The second problem is "who is going to use this feature?" If the whole point of it was to avoid dealing with blog spam, that suggests that this new blog search is going to be pretty poor -- or that we'd STILL have to fix the blog spam problem while maintaining a more complex system.
The last problem is "exactly how are we going to expose the UI for this?" If you have a great feature but no one can figure out how to access it, then it didn't really help the customers very much. The UI can be just as important as the AI. In this case, I don't quite see how the UI is supposed to work.
If someone seriously proposed the feature described here, all three of those arguments would have to be answered inside the two feature teams who would own it. Given agreement there, it'd be presented to the management of MSN Search itself. If the teams had good answers to the three questions above, I expect it would be rubber-stamped and would ship within a month of the time it was ready. (Probably not on a Friday or a weekend.) :-)
No one higher up the organization than the General Manager for MSN Search would be involved at all, although if we were really proud of the feature, we'd definitely show it off to them.
So when it takes time to fix things, it is definitely not due to bureaucracy; we have very little of that. Instead, things usually take longer than you'd think because a) the solutions aren't obvious b) it takes time to test the solutions c) the tests show that the "solution" creates more problems than it solves or d) no one is available to work on it -- they're working on higher-priority things.
You need to understand that most people find no use in conventional blogs - that is the blog Joe created to describe what he did the day before - completely useless information, nobody would ever search for it, let his friends go and read it.
The blogs which are worth reading - they have their own domain names (and I don't look at them as "blogs" anyways).
Google figured it out, and instead of completely taking off blogs from their results (i.e. the subdomains from blogspot etc.) they just moved them away from what's more important - the web search results.
You don't have to worry about sacrificing the quality of your search results, as if you look at the numbers I gave you - none of the blogspot results were created by a human, thus - no quality to be lost to begin with. And I have never seen a blogspot subdomain with real sentences ranking on MSN EVER. Well, all you have to do is go to your search engine and type in a competitive word and see for yourself.
Or at least, for the love of God, make the links from blog subdomains count for nothing when your algo is at work.
If you guys clear all the blog subdomains from your results you will have probably one of the best results. And this is what I want, as my website ranks great on MSN, but MSN just doesn't generate enough traffic - maybe if you show some clean results more people will move from Google to you, and the monopoly will disappear...I have a dream...