|brotherhood of LAN|
| 10:37 pm on Dec 20, 2013 (gmt 0)|
I remember a well-known media tycoon also taking exception to Google gathering and regurgitating its content.
The boilerplate answer is that robots.txt would be respected if you wanted to prevent content getting indexed. The fact that a scraper could end up getting it listed (and perhaps considered the originator of the content) makes that argument a bit murkier.
>It does not have access to every fact or bit of information
Indeed. All of the additional 'enhanced' SERPs have basically been 'manually' bolted onto the search results, but in the near future I think their knowledge graph and the increasingly structured/semantic web will have a bigger role.
It truly is a great topic to think about. Unfortunately it is personal to a lot of members because livelihoods are involved here too. What I'm finding most interesting is breaking down the idea of a fact, and how someone can come to the conclusion that they're the owner of that piece of information.
FWIW I think Google pushes the boundaries, sometimes a little too far but ultimately someone else would be pushing the envelope if it wasn't them.
| 10:47 pm on Dec 20, 2013 (gmt 0)|
|...but the idea that one party can take, repurpose and profit from the endeavours and intellectual property of another, with total impunity, does not fit within my concept of "fair use". |
Google asserts that the SERPs are its legally protected free speech, a product of editorial opinion.
| 11:55 pm on Dec 20, 2013 (gmt 0)|
|What I'm finding most interesting is breaking down the idea of a fact, and how someone can come to the conclusion that they're the owner of that piece of information. |
I don't think webmasters think they own the facts. To me it seems the complaints are that the presentation of these facts was taken straight from their websites, word for word. To take the example from the screenshot previously published in the New look "Google Knowledge" replaces results with content [webmasterworld.com] thread:
If Google had "learned" about these facts and then compiled/constructed the text in their own words and published this text as their Knowledge Graph, then the complaints of Google using someone else's hard work would not stand.
On the other hand, Google could easily address Knowledge Graph text complaints by introducing a new robots meta tag (e.g. something like "noknowledgegraph") and ask webmasters to put it on their site if they do not want their text to appear in Knowledge Graph.
Whilst webmasters would then have a choice to "opt out" from the Google Knowledge Graph, in reality this new tag would not change anything with regard to Knowledge Graph SERPs. There would be enough sites that would either not be aware of the new meta tag or would see it as an opportunity to appear in Knowledge Graph because other sites may be blocking it.
But brotherhood_of_LAN's comment about a scraper potentially being included in Knowledge Graph would then also be a very valid one.
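To make the meta tag idea above concrete, here is a minimal sketch of how a crawler might check a page for such an opt-out. The "noknowledgegraph" directive is the hypothetical tag proposed in this post, not something Google supports, and the function names are made up for illustration. It uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and treated as case-insensitive here.
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def allows_knowledge_graph(html):
    # "noknowledgegraph" is the HYPOTHETICAL opt-out suggested in the thread.
    return "noknowledgegraph" not in robots_directives(html)


page = '<html><head><meta name="robots" content="index, noknowledgegraph"></head></html>'
print(allows_knowledge_graph(page))
```

As noted above, a check like this only helps sites that know about the tag and choose to set it. It does nothing about a scraper that republishes the same text without it.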
| 12:10 am on Dec 21, 2013 (gmt 0)|
I think people worry about this stuff too much. Google can't possibly provide a better level of info than other sites, because there will always be a limit as to how much of the scraped info they can show.
They can get away with doing short answers, stuff like "when was Elvis born" etc, but there is another thread on WebmasterWorld at the moment which shows screenshots of medical queries. No patient is ever going to be satisfied with Google's five lines about their ailment. Would you be, if you were ill? So they are bound to click on a result. Google will never be able to provide a complete answer for those users, because they can't print a whole page of scraped info without getting into legal trouble.
And let's be serious about it... people do not regard Google as an expert on medicine. They know full well that Google does not employ doctors. So they are unlikely to accept "Google's" diagnosis. They are much more likely to be satisfied when they read the same thing on a medical site. That is not something that Google can ever fix.
| 4:03 am on Dec 21, 2013 (gmt 0)|
|Google asserts that the SERPs are its legally protected free speech, a product of editorial opinion |
As a search engine Google will obviously generate SERPs and it is solely their business how they decide the order of ranking for the sites that appear in those SERPs. No argument.
It's when they go beyond that function and start siphoning data from the indexed websites for their own vested self interest that, for most people I suspect, a line has been crossed. That has nothing to do with generating SERPs.
| 8:20 am on Dec 21, 2013 (gmt 0)|
Sorry…. but I have a problem with the argument that if a robots.txt file allows Google to index a site, then by default, it is an indicator they can do whatever they want with the indexed data.
Robots.txt is simply a set of instructions that defines which robots can/cannot access which parts of the website. It is a statement that defines what can be indexed…. that's it. Period.
|Google's answer to "What Is Robots.txt" |
The Robot Exclusion Standard, also known as the Robots Exclusion Protocol or robots.txt protocol, is a convention for advising cooperating web crawlers and other web robots about accessing all or part of a website which is otherwise publicly viewable.
Robots.txt is not a permission statement which says "once you have indexed the data you can take whatever you want and use that data for your own purposes"
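To illustrate how narrow that question is, here is a small sketch using Python's standard urllib.robotparser. The robots.txt rules, bot names and example.com URLs are invented for the example. Note that the only thing the parser can answer is whether a given robot may fetch a given URL. There is nothing in the format for saying what may be done with the content afterwards.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration: Googlebot is kept out of /private/,
# every other robot is kept out of the whole site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def can_crawl(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent, url in [
    ("Googlebot", "https://example.com/articles/page.html"),
    ("Googlebot", "https://example.com/private/notes.html"),
    ("SomeOtherBot", "https://example.com/articles/page.html"),
]:
    print(agent, url, can_crawl(EXAMPLE_ROBOTS_TXT, agent, url))
```

The answer is a plain yes or no per robot and per URL, and it only binds robots that choose to cooperate.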
| 12:22 pm on Dec 21, 2013 (gmt 0)|
@austtr - I was about to make essentially the same reply but you put it far better than I would.
| 1:43 pm on Dec 21, 2013 (gmt 0)|
Yes, austtr and piatkow.
Also, if we're speaking of Google's Search Engine here, it's a fact that the engine is a database - it indexes content for search but doesn't own it.
It's 'fair use' on Google's part if they just order and categorize the content the way they think suits their users' needs better, or get anonymous statistics based on the most clicked or linked content, but selling the content without a webmaster's permission doesn't fall under 'fair use'.
|brotherhood of LAN|
| 2:02 pm on Dec 21, 2013 (gmt 0)|
>then by default, it is an indicator they can do whatever they want with the indexed data.
I meant robots.txt allows you to opt out, to circumvent any issues you have with what Google does with data from your site. Indeed it doesn't say what people can and can't do with your data. I don't think there's much middle ground for "you can download the data but are limited to doing X, Y and Z with it".
| 9:08 pm on Dec 21, 2013 (gmt 0)|
|Robots.txt is simply a set of instructions that defines which robots can/cannot access which parts of the website. It is a statement that defines what can be indexed…. that's it. Period. |
Crawl. Not index.
Crawl. Not index.
Crawl. Not index.
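A rough sketch of the distinction, for anyone unsure why it matters: crawling is controlled by robots.txt, while indexing is controlled by a directive such as "noindex" in the page's robots meta tag, which a crawler can only see if it is allowed to fetch the page. The rules, URLs and the PageStatus / page_status names below are assumptions made up for illustration, and the logic is deliberately simplified (real search engines may, for instance, still list a blocked URL based on links pointing to it).

```python
from dataclasses import dataclass
from typing import Optional
from urllib.robotparser import RobotFileParser

# Invented robots.txt for illustration only.
SITE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


@dataclass
class PageStatus:
    may_crawl: bool
    # None means "unknown": the page was never fetched, so any
    # noindex directive on it could not have been read.
    may_index: Optional[bool]


def page_status(robots_txt, user_agent, url, meta_robots_content):
    """Separate the crawl decision (robots.txt) from the index decision (meta)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(user_agent, url):
        return PageStatus(may_crawl=False, may_index=None)
    directives = {d.strip().lower() for d in meta_robots_content.split(",")}
    return PageStatus(may_crawl=True, may_index="noindex" not in directives)


print(page_status(SITE_ROBOTS_TXT, "Googlebot", "https://example.com/public/a.html", "noindex, follow"))
print(page_status(SITE_ROBOTS_TXT, "Googlebot", "https://example.com/private/a.html", "noindex"))
```

In the second call the noindex is never read, because the crawl is refused first. That is the practical difference between the two words.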