viggen, the guideline document is framed so that any rater using it as a checklist will do exactly what it says. I am sure most raters would follow it as written.
To recognize copied content (4.1.7), the document tells raters to search for the exact text by putting quotation marks around it. If raters find multiple sources for the text, they will either assume that the first listed page is the originator, which might not always be true, or simply go on to step 2 and confirm the page being rated as not useful or "Spam" if it has PPC ads (and most content-based sites will have PPC ads).
This also holds true for the guideline on recognizing thin affiliates under 5.1.1.