This 48 message thread spans 2 pages.
|MSN Search Claims to Freeze Out Web Spam|
|In a sample of one billion web pages, Microsoft claims that eight per cent are spam. |
In one case, the Microsoft researchers claim to have found a web server in Germany that was constantly generating pages filled with pieces of text copied from random web pages, all linking to a porn site.
It's also interesting to note that SEO has probably become more profitable and economically worthwhile as search engines have become more relevant.
Although, at the same time, those talented Ph.D.s really need to publish something good if they don't want to perish as scholars. That paper was poor, and should not be accepted for publication in an IEEE or ACM journal IMHO. And if they can't do better scholarly work than that, they won't be staying in the M$ environment. It's a catch-22 with Ph.D.s in a for-profit enterprise like that.
This paper makes them look stinky. If they are actually stronger than that, let's see the papers in ACM or IEEE. I know for a fact we have qualified people in WebmasterWorld who can critically review papers of that level on this topic.
Certain journals are hard to get into, some are not. Some are hard to get into, but not if you are addressing a subject which is pretty leading edge ... like Web Spam.
After all, web spam isn't exactly a well-known subject which has attracted the most elite the world over. It's still pretty new, and if you have the foresight it's still pretty easy to get published in this area.
After all, you got to get the discussion rolling somehow. Sometimes it's just best to take the plunge and throw something out there for people to discuss.
[edited by: blaze at 3:06 am (utc) on June 12, 2004]
They may have tricks up their sleeves but the big question is what they have their heads up. :) I don't think that any self respecting search engine manager would let people like these loose on his SE with such dodgy theories.
|That research may have been 'naive' but you've got to realise that it's a baseline for them. They're not going to publish trade secrets on how to fight spam like that. The tricks they have up their sleeves, are just that - they are up their sleeves. |
I don't think that Microsoft is stupid enough to let such technologically unaware people be the cutting edge of its search programme. What may actually be happening is that there are two sides to this: the active search engine side and the academic side. The academic side produces papers of such dubious value that everyone underestimates what Microsoft is up to with its main search engine programme. The scary thing about the word density diagram is that it approximated a bell curve.
The paper itself does not even look like a proper, refereed academic paper in that the methodology and formulae are glossed over or are completely missing. Maybe these guys were great in college or something but in the real world they seem to be a liability for Microsoft.
The main emphasis, even according to Gates, is the proper classification and contextualisation of search results. SERP spam almost always can be solved with coarse filtering for the main part. The remaining spam can then be handled using filters with a finer granularity.
I read a bit of the paper .. it's my intention to read the whole thing more closely at a later point.
However, from what I saw, I didn't see anything "dodgy".
Perhaps if you want to throw up a dodgy straw man and make some vague assumptions about what they are concluding rather than poking holes in their research, yes, it'll be pretty easy to say it's dodgy.
However, I assume nobody wants to do that because it's a waste of time.
So let me ask - what specifically did you think was dodgy? A quote from the paper and a reference to evidence which contradicts their research, or a clear list of logical flaws seems in order if we're going to do a "critical review".
However, I don't think anyone here really has the time or the wherewithal to do that.
I'm sure we can all provide anecdotal evidence or hearsay as to why their research is "dodgy", but for anyone here to actually present some data analysis and experimentation on a data set of a size which only very few companies have would be very, very surprising to me.
The fact is, that it's research and they have observed some statistical data points which provide insights over a large data set where they did not exist before.
So let me turn this conversation around. Can someone send me a link to a more credible paper on web spam? I have only seen one other, on terminology, and it certainly didn't compare to this.
While the datasets are large, the way in which they were handled is, to my mind at least, dodgy.
|So let me ask - what specifically did you think was dodgy? A quote from the paper and a reference to evidence which contradicts their research, or a clear list of logical flaws seems in order if we're going to do a "critical review". |
"In addition, we retained the full text of 0.1% of all downloaded pages, chosed [sic] based on a hash of the URL."
Then later on in section 2:
"Unfortunately, we did not retain the full-text of any downloaded pages when the crawl was performed."
So it would not be possible to carry out any verification or alternative analysis of the data. This is a problem with the research, and therefore the paper, from what I can see.
The statistical analysis is not really a problem here. The problems are that the datasets no longer exist and that there is no description of the parsing process that was applied to generate the fulltext from each webpage sampled.
|The fact is, that it's research and they have observed some statistical data points which provide insights over a large data set where they did not exist before. |
Yes, but because they effectively destroyed their datasets, the research is essentially a one-off work. The statistical methods are probably sound, but without knowing more about their methods of data collection, there remains the background question of whether they found what they were looking for because the data collected fitted their hypothesis.
Actually this is a good indication as to how competitive the whole search engine business is at the moment. Most SE operators do not comment on the methods and tactics they use to protect the integrity of their indices. As a result, what appears is not necessarily state of the art or indeed anything to do with the art. These papers may have some good ideas but for the average SE operator, an elegant general solution is of more importance. Perhaps it is this kind of view from the frontline that makes me a bit cynical about these papers from these armchair generals. :)
|So let me turn this conversation around. Can someone send me a link to a more credible paper on web spam? I have only seen one other on terminology, and it certainly didn't compare to this. |
I for one do have the time to sift through those papers... I get good clues, even from stinky ones like this one. I don't buy it as a front... it is very much like a paper presented at an International conference (readily accepted, little peer review).
There are journals publishing in this area... check out the IEEE Internet Transactions and the ACM journals. (Sorry, but fetching the actual refs is too much effort for this post).
Anyone who publishes things like "under visual inspection they appeared to be spam" in a paper about spam detection is being ridiculous.
This is way over my head, dudes. And I can't say I'm exercised about paybacksa losing his independent wealth. But I do feel sorry for the cat.
|Anyone who publishes things like "under visual inspection they appeared to be spam" in a paper about spam detection is being ridiculous. |
Why's that? Heuristics are a very well known and accepted technique. How do you think we deal with Email Spam?
In fact, the whole foundation of AI is based on pattern recognition. Spam recognition will never be deterministic.
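For what it's worth, the email analogy holds up: a SpamAssassin-style scorer is nothing but a pile of weighted heuristics. A toy sketch in Python (the rules and weights here are entirely invented for illustration, not taken from any real filter):

```python
import re

# Toy rule set: each (pattern, weight) pair is one heuristic.
RULES = [
    (re.compile(r"viagra", re.I), 2.5),
    (re.compile(r"100% free", re.I), 1.5),
    (re.compile(r"click here", re.I), 1.0),
]

def spam_score(text):
    """Sum the weights of every rule that matches the message."""
    return sum(weight for pattern, weight in RULES if pattern.search(text))

def is_spam(text, threshold=3.0):
    """No single rule is decisive; the combined score is."""
    return spam_score(text) >= threshold
```

No one heuristic is "deterministic proof" of spam; it's the aggregate evidence that makes the call, which is exactly the point being argued here.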
As an aside, the ODP has an interesting (and for all their childishness, one I agree with) definition of spam: spam is a website not fit for inclusion in their directory.
And by the way, if our arguments are going to be credible when we refer to evidence we can't simply say it exists. 75% of the value of arguing is providing the evidence, not simply saying that it exists somewhere.
For example, the comment earlier about how the MSFT researchers didn't provide their evidence was a very good one and one I wholeheartedly agree with. It's pretty easy going around saying 95% of the time this showed up in the dataset without actually having to provide the dataset.
While one could argue the importance of AI in general (I will not do that here), it is not fair to say visual inspection == pattern recognition == heuristics.
Heuristics is by definition empirical decision making. You make judgments based on observations or evidence at hand, usually from data or experimentation. Heuristics is not one-off visual pattern "recognition" by a human, which seemed to be what was implied here.
Any paper that is about automated spam detection should not offer one-off visual observations as evidence that their algorithm is effective. That is nonsense.
The real reason I called it rubbish was that they didn't define spam clearly, and then went on to these lame observations. Clearly, what looks like spam to one person may look like luncheon meat to another. And I won't even go near the bias arguments that could be made (an author trying to prove his point will see luncheon meat as spam more often than the average M$ millionaire will).
Ahh, a fabulous paper from Microsoft Search saying that Microsoft Search is a 100% spam-free diet.
Of course, it's like opening a can of real spam: where's the meat? Microsoft's index is currently 100% spam-free, because it consists of exactly zero pages. All the rest is the same old marketing drivel, building expectations rather than real substance.
When they ever get round to launching something, we'll be able to judge - but don't hold your breath about the zero-spam stuff. MSN Search will probably be just another algo ready to be beaten, twisted and manipulated.
Again, I can only ask (hey, I'll even beg) that you provide something more credible. I am sure someone would have posted it on these forums if there was something better out there.
Anyways, it's not perfect, obviously. However, they do define spam: "'spam', that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites."
The heuristics are based on the stats. Obviously, if upon manual inspection (what other kind can we have) something is spam a certain percentage of time when it has a certain feature then we can develop reasonable heuristics which can help us find spam by looking for those features.
Again, though, it's not perfect. However, it's a tough, new field and I haven't seen anything better, and until something better is presented, the only valid, time-efficient conclusion I can reach is that this is the most credible paper out there so far and provides the best baseline we've got.
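To make that concrete, here is a toy sketch of how manual-inspection results turn into an automatic heuristic. This is entirely my own illustration, not the paper's procedure; the feature names and precision figures are hypothetical:

```python
# Hypothetical precisions: the fraction of manually inspected pages
# exhibiting each feature that turned out to be spam. All invented.
FEATURE_PRECISION = {
    "excessive_keyword_repetition": 0.92,
    "near_duplicate_cluster": 0.85,
    "very_high_out_degree": 0.40,
}

def likely_spam(page_features, cutoff=0.8):
    """Flag a page if any of its features predicted spam more than
    `cutoff` of the time in the manually inspected sample."""
    return any(FEATURE_PRECISION.get(f, 0.0) > cutoff for f in page_features)
```

The manual inspection isn't the detector; it's how you calibrate which features are worth trusting.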
But, prove me wrong. That would be great :)
In fact, other than some papers on link spam, these guys don't know of anything more credible either:
Henzinger et al. identified web spam as one of the most important challenges to web search engines. Davison investigated techniques for discovering nepotistic links, i.e. link spam. More recently, Amitay et al. identified feature-space based techniques for identifying link spam. Our paper, in contrast, presents techniques for detecting not only link spam, but more generally spam web pages.
All of our techniques are based on detecting anomalies in statistics gathered through web crawls. A number of papers have presented such statistics, but focused on the trend rather than the outliers.
Broder et al. investigated the link structure of the web graph. They observed that the in-degree and out-degree distributions are Zipfian, and mentioned that outliers in the distribution were attributable to web spam. Bharat et al. have expanded on this work by examining not only the link structure between individual pages, but also the higher-level connectivity between sites and between top-level domains.
Cho and Garcia-Molina studied the fraction of pages on 270 web servers that changed day over day. Fetterly et al. expanded on this work by studying the amount of week-over-week change of 150 million pages (parts of the results described in this paper are based on the data set collected during that study). They observed that the much-higher-than-expected change rate of the German web was due to web spam.
Earlier, we used that same data set to examine the evolution of clusters of near-duplicate content. In the course of that study, we observed that the largest clusters were attributable to spam sites, each of which served a very large number of near-identical variations of the same page.
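The "anomalies in statistics" idea in that excerpt is easy to sketch. This is my own toy illustration, not the paper's actual measure: flag pages whose word count sits far outside the crawl-wide distribution.

```python
from statistics import mean, stdev

def outlier_pages(word_counts, threshold=3.0):
    """Return the indices of pages whose word count lies more than
    `threshold` standard deviations from the corpus mean."""
    mu = mean(word_counts)
    sigma = stdev(word_counts)
    return [i for i, wc in enumerate(word_counts)
            if abs(wc - mu) > threshold * sigma]
```

Real web distributions are heavy-tailed (Zipfian, as the excerpt notes), so in practice you would work in log space or use robust statistics; the point is only that the outliers, not the trend, are what betray spam.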
I will shut out MSN bot on all sites, as a means of declining to participate in their research into what is/is not spam.
Can't see any other reason to allow that bot unless it is cloaked.
Yeah paybacksa, it seems like a good idea now that I look at my logs. If Microsoft has not got the basic cluefulness to observe proper HTTP result codes and insists on ripping off webmasters' bandwidth (bandwidth that has to be paid for), then a simple ban is necessary. Its msnbot64134 started downloading from a directory that has always been banned by robots.txt here. That bot has now been banned.
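For anyone wanting to do the same: MSN's crawler announces itself as msnbot, so (assuming it honours robots.txt at all, which is exactly what's in dispute here) the block is just:

```
User-agent: msnbot
Disallow: /
```

If the bot ignores robots.txt, the ban has to happen at the server level instead, e.g. by matching the User-Agent header in the web server config.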
If Microsoft is running a proper search engine then I will allow them access. If these people who put out such dodgy research papers think that they can use my site in their dubious 'research' then I have a very simple solution from a great Irish writer - each can individually go procreate with himself. :) Actually given the quality of their research paper, they have a great future ahead - in something like astrology. How would Microsoft like it if people started downloading its OS and other products, for their own research naturally, and not paying licence fees to Microsoft?
Msndude - if you guys are running a real search engine/bot there, then observance of HTTP result codes is essential. If it is those people in the Microsoft Astrology Department running their experiments, then tell them to come on here and apologise to all the webmasters for wasting their bandwidth.
|If these people who put out such dodgy research papers think that they can use my site in their dubious 'research' then I have a very simple solution from a great Irish writer - each can individually go procreate with himself. :) |
hahahaha... I am reminded of a phrase uttered too often in my house. It begins with pug and it ends with the first part of a well known Irish name Mahoney. ;-)
|I don't think that Microsoft is stupid enough to let such technologically unaware people be the cutting edge of its search programme. What may actually be happening is that there are two sides to this: the active search engine side and the academic side. |
I'd have to disagree with this; my guess is that they are in fact doing this. That paper reads like most flaky sociology papers I've read: just enough scientific method to make it sound convincing, but since there is no actual SE discipline per se, that's where it ends. The theory will drive the development at the beginning; the program architects planning the stuff have to know what to tell the programmers to write.
Microsoft is, after all, the same company that couldn't write the CD-burning driver for Windows XP and hired the Adaptec guy to do it for them, and that's relatively simple compared to creating a search engine.
Pseudo-science is an interesting feature of our science-worshipping culture: even when there is no real science or method to back it up, it often gets accepted as real science (economics, sociology, etc.), even when the stakes are extremely high, such as the functioning of the US economy - and those stakes are far higher than creating a new SE algo.
I could see them trying to implement this, my conclusion is that SEO is an excellent field to enter now, much better than web design.
|MSN Search will probably be just another algo ready to be beaten, twisted and manipulated. |
I think you hit the nail on the head here encyclo, don't give these pseudo discipline pseudo intellectuals more credit than they deserve, my guess, again, is that a paper like that is in fact as good as they can do...