|Bing Scrapes and Repackages Wiki Content|
...and gets Google to index it.
| 9:29 am on Dec 6, 2009 (gmt 0)|
There were a few earlier threads about Powerset & MSFT, including this one when MSFT bought Powerset for $100 Million.
Tonight I was on Bing and clicked one of the links that appear when you mouseover a "hint" (about what the Bing home page image is).
What I found was a bit disturbing.
The link took me to a Wikipedia article that had been scraped and reformatted as a native Bing page.
The /semhtml/ part of the URL was the only thing I didn't understand, so I did some research.
If you Google: site:bing.com semhtml
you'll find Bing has gotten Google to index 13,000+ pages of scraped Wiki content! Go Bing Go!
I also took a look at the source of the Petra page, to see if Bing was framing the content or rewriting it.
They have it in a div with the following comment:
<!-- republished semhtmlized article -->
It's clearly identified, and appears to be "legal"... but what's next...
I was wondering if they'd also go as far as hotlinking Wiki images. They didn't -- copies of the Wikipedia images are scraped and hosted on the Powerset servers.
Another disturbing "detail" was that on Bing, when you search in the "reference section", they have the [nerve?, stupidity?, cojones?] to put a link below each article abstract which reads: Continue reading Wikipedia article »
..but clicking the link takes you to another Bing "reference" page.
And all the links within the article take you to Bing pages.
And some people thought Google books was a big deal.
Does anyone else see anything wrong with this type of behavior from a search engine?
| 9:56 am on Dec 6, 2009 (gmt 0)|
In the search results page for the wikipedia content is a link to "enhanced content" which shows content from Wikipedia. At the bottom of the above referenced page is this note:
|All Wikipedia content is licensed under the GNU Free Document License or the Creative Commons CC-BY-SA license or is otherwise used here in compliance with the Copyright Act |
In the search results page, if you click on the link to Wikipedia it takes you to the actual Wikipedia page.
Bing is publishing creative commons content. So what?
|If you Google: site:bing.com semhtml |
Again, so what? This means Google knows about it. Is Google showing bing reference pages? I haven't seen that. So again I must ask, so what? This is not scraping. Google is not showing Bing content. So what?
[edited by: martinibuster at 9:57 am (utc) on Dec. 6, 2009]
| 9:56 am on Dec 6, 2009 (gmt 0)|
Wikipedia content is generally released under Creative Commons. Authors want the information to get about, so having it on Bing is one way of doing this.
I am surprised that Bing didn't try to resurrect some of their Encarta content for this purpose.
| 7:05 pm on Dec 6, 2009 (gmt 0)|
Google: mike mansfield reference
#1 a Bing / Wikipedia article --- not the original Wikipedia article...
Google: nuclear disarmament reference
again, Bing #1, (without the word "reference" in the search Wikipedia is #1).. but like you said, so what.
| 7:32 pm on Dec 6, 2009 (gmt 0)|
lexipixel, that only means that Bing has the content most likely to be relevant for the words, mike mansfield reference. Google ranks it there because the word "reference" is in the Bing page's title tag and URL. The word "reference" is not in wikipedia's title tag, nor does it appear on the page. Although the word "references" does appear on the wikipedia page, it is not assigned a prominent place on the page as an H1 or as a title tag.
Some might say this confirms how ineffective Google is at referencing duplicate content. But others might say Google is accurately answering the query that is seeking a reference for Mike Mansfield because Bing is it, it has the elements on the page to rank for that. If you are looking for info about Mike Mansfield, then Wikipedia ranks #1 for that query in Google. That's all.
You disagree with how Google is ranking a web page that republishes Wikipedia content. Fine. Whatever. Most people can find something to quibble about with Google's results. This discussion belongs in the Google Search forum because it's more about how Google ranks a page that you do not like. It's not about something nefarious that Microsoft is doing.
Wikipedia's content is free to be republished and that's what Microsoft did. Microsoft's content is open to bots and Googlebot scraped it. Microsoft is not the scraper here, Google is. ;)
| 2:14 am on Dec 7, 2009 (gmt 0)|
|This discussion belongs in the Google Search forum because it's more about how Google ranks a page that you do not like. It's not about something nefarious that Microsoft is doing. |
"..How google ranks a page you (I) don't like ?"
The pages were picked at random, I could care less about Mike whoever he is -- or Bing, (or Google or Wiki's) editorial text about nuclear disarmament.
If nobody else cares that Bing is willing to copy tens of thousands of pages rather than just index them, (like a "search engine"), and then obfuscates the links that lead back to the original source -- I guess I am worried about nothing.
Go Bing Go!
Note to Wikipedia contributors: Consider yourself unpaid Microsoft employees.
| 3:35 am on Dec 7, 2009 (gmt 0)|
|If nobody else cares that Bing is willing to copy tens of thousands of pages rather than just index them... |
Oh, you mean like Google's DMOZ clone [directory.google.com]?
Or like this [bing.com]? I copied the DMOZ snippet, "Help build the largest human-edited directory on the web." and added the word "directory" and ran it through Bing, and Google is the first result, not DMOZ. So does that mean Google is scraping DMOZ and feeding it to BING? Did I cheat by adding the word "directory" to the BING search query? If yes then surely that means adding the word, "reference" to a query to force a BING page to pop up in the Google SERPs is also a bit of a cheat to prove a point?
It's not that nobody cares. It's that there is nothing here worth caring about. Facts speak for themselves. Facts do not need exaggerations, embellishment or deceptive rhetoric. The use of deceptive rhetoric does not cover up the fact that there is nothing substantial here.
Here are examples of misleading phrases. The use of the word "scrape" to describe the lawful use of content is an instance of deceptive rhetoric. Your accusation that BING is "getting Google to index" its content is another baseless accusation.
There is nothing here. Nothing to care about.
| 12:13 pm on Dec 7, 2009 (gmt 0)|
|There is nothing here. Nothing to care about. |
If you say so.
I'm glad you mentioned DMOZ being cloned on Google, (and countless smaller sub-sets of the data republished on all those sites your "bit of a cheat" query produced).
As a former DMOZ editor (for 5 years), I can tell you when the clones started showing up, I quit editing.
This is the point of my post -- when editors get the idea that they are writing for Microsoft, (Bing is one of the top four "benefactors" of Wikipedia), even more will quit than the news outlets, (WSJ, BBC, etc.), were reporting last week.
This wasn't a "Bing is Bad", "Google is Good" post -- just something I noticed -- that Bing decided it was better for them to publish Wikipedia articles themselves than send traffic to them.
| 12:19 pm on Dec 7, 2009 (gmt 0)|
Major Benefactors ($50,000+) Bing
The Hellman Family Foundation
| 12:28 pm on Dec 7, 2009 (gmt 0)|
If we take Wikipedia and Dmoz as examples, they were never intended to be unique. They were intended to be sources of data that people could reproduce. As for Wikipedia editors now being Microsoft employees, I don't understand why you say that. Microsoft are simply using the content as it was intended.
I don't understand why you left the ODP when the clones started appearing. All the clones do is give your category more exposure.
| 1:36 pm on Dec 7, 2009 (gmt 0)|
|As for Wikipedia editors now being Microsoft employees, I don't understand why you say that, Microsoft are simply using the content as it was intended. |
This main page: [bing.com...] says "Wikipedia Articles"... but there is no Creative Commons or GNU license attribution or links. And if one were to interpret the text linked to the "Legal" notices, they state, "In using the service, you may not: ... resell or redistribute the service, or any part of the service"
Bing's "Legal" also says "(In using the service, you may not:) use any automated process or service to access and/or use the service (such as a BOT, a spider, periodic caching of information stored by Microsoft, or "meta-searching");"
Which appears to violate the standard CC license -- if you republish, it's supposed to be under the same terms as the original content was licensed for use.
|I don't understand why you left the ODP when the clones started appearing. all the clones do is give your category more exposure. |
Simple: I'll help out a good cause, but if someone is making a buck and I am doing the work, I want my share of the buck. So, I figured my time was better spent editing links for my own directory sites and monetizing them myself.
| 12:30 am on Dec 8, 2009 (gmt 0)|
"There is nothing here. Nothing to care about."
Well of course there obviously is.
The point is Bing is using free content to build a parallel Wikipedia that is ranking in the Google search results, and should rank better over time due to the respect Google will give the Bing domain.
These Bing pages have duplicate content, but are ranking in the results. In some cases now they are outranking the Wikipedia, and even if it is "hardly ever" now, potentially it could be much more often in the future.
These Bing pages include the references and external links, and while nofollowed, they will still provide traffic to the sites, all the more so the more prominently Bing features these pages in its results.
Significant implications include that it has become slightly more attractive to spam the Wikipedia, but more importantly, if these pages are not filtered out as duplicate content they could start polluting the results.
| 10:52 pm on Dec 9, 2009 (gmt 0)|
It seems to me that it's more ethical for a search engine to link to the original article rather than a copy. Especially if it's their own copy and they are putting ads on it.