|WATCH: latest scam with content writing|
Disguised stolen content - don't be an accomplice!
I ordered a few original articles using an online service. They were exceptionally cheap, and I was thrilled when I received them: they were 100% original (copying and pasting several 10-word strings, in quotes, into Google produced no results whatsoever).
I uploaded one of the articles on my website using CSE HTML Validator. All looked perfect.
Then I needed to find a word on that newly created web page and I quickly used Internet Explorer's built-in search. And surprise: the word was not found!
However I was able to see it with my own eyes: there were many occurrences of that particular word on my page. So I had to investigate what was happening. And the "writer" was exposed! He was no writer, just a thief!
Here is how they do it: they copy existing articles that have already been duplicated many times. Then they run a search & replace operation and globally replace several letters with identical-looking Cyrillic or other foreign characters.
So a string that appears as
|"beautiful river Danube. The city Park is also an area" |
on my site, when copied from my site and pasted into Google, gives no results whatsoever (don't try it with the above example; it won't work).
However if I copy the text by hand, I find many occurrences of that particular string.
Using a UTF-8 converter, I found out that the actual code was
beаutiful rіvеr Dаnubе. Thе Cіty Park іѕ аlsо an аrea
In other words, this is pure theft! Don't be an accomplice. There are more or less clever variants of this scheme. The example above is pretty useless: since Google does not see most words, it has no SEO value. A more elaborate scheme modifies only stop words, so the article does have some SEO value and still appears original... until you are exposed and (most likely) banned.
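For anyone who wants to see what a "UTF-8 converter" would reveal, here is a minimal Python sketch. The function name `describe_non_ascii` and the sample string are made up for illustration; it assumes the article is supposed to be plain English, so any character above code point 127 is worth a look.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (position, character, code point, Unicode name) for each non-ASCII character."""
    report = []
    for index, char in enumerate(text):
        if ord(char) > 127:
            name = unicodedata.name(char, "UNKNOWN")
            report.append((index, char, f"U+{ord(char):04X}", name))
    return report

# Hypothetical sample: "river" with two Cyrillic look-alikes swapped in.
sample = "r\u0456v\u0435r Danube"
for entry in describe_non_ascii(sample):
    print(entry)
```

Each reported name starts with the script (here "CYRILLIC"), which is usually enough to spot the substitution.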
I think this scam has been around a while. I got burned by a guy doing it over a year ago. Luckily, I got him to refund my money!
|I uploaded one of the articles on my website using CSE HTML Validator. |
That is strange. If you publish articles, I would expect you to use some web application (blog, CMS, etc.), which would include storing content in a database and a content editor. And of course the moment you edit the article (even to set up the published date) you would see the problem, or at least once you save the content the editor will convert the text and a search will reveal what happened.
I don't know of the particular package you used and whether or not the thief knew in advance how you will publish documents and how you will search/verify them. It's possible they got a hint on how you were going to verify them.
If your whole package works in UTF-8, you would never see anything reverting to numerical entities, so it's all down to your eyeballs.
But after reading the original post, I went back and checked a related issue that has worried me periodically. Somewhere along the line g### has learned to recognize "soft" hyphens-- at least to the extent that it will offer up "recognize" as an alternative if you search for "recog­nize", and vice versa. Last time I tried (probably a few years ago) they had not yet learned this.
Very few 'content writers' actually write anything, and never have.
They rewrite, plagiarise, copy, and bowdlerize, and always have.
Let's face it, no-one, not even a poor student, could afford to supply quality text at the so-called market rate.
But I have to say, shouldn't you have at least checked the code before you FTP'd it?
If nothing else, you can expect a ton of spamlinks hidden in most 'articles'.
Added to which, Google's ability to spot a duplicate at 25,000 miles has improved much - hence the recent wailing and gnashing of teeth as 2m spammers and 10 innocents got wiped out.
Dunno about Bing, but they must be pretty down on duplicates by now, surely?
I guess this all goes back to the ole saying you get what you pay for.
Bells should be going off: warning, warning, why is this so cheap? I don't know when we will ever learn that good content isn't cheap; if it is, there's going to be a problem. Sit down, write a 500-word article, then pay yourself what it's worth. I think we all would change the way we try to buy content from then on.
|They were exceptionally cheap |
Thanks for posting this. It's a reminder that while global freelance services offer conveniences, there are downsides to it as well.
As quadrille pointed out and I can attest to from personal experience, outsourced content writers, whether they are local or globally sourced, carry risks. Even hiring a local college student will not assure freedom from the risk of plagiarism, not to mention a lack of authority.
It may be old hat to some but it's news to me. I have always written my own articles and just a few days ago decided to hire. I rec'd a few proposals that looked good but was surprised to see the discrepancy between the highest and lowest bid, making me wonder about spinning, etc. So your post is timely for me.
Anyone else care to spill on any other tricks to watch out for?
Copyscape is a good tool (I call it a tool, but it is a website) to check for ripped content in a phrase etc. We have used it for years, bought into the commercial plan, and it works like a charm. I never ever post an article without recoding the article; as enigma1 said, this would have been discovered. Most of the articles we get are written in Word. Have you ever seen the code in a Word document? Publishing any article without going through the source code is a disaster waiting to happen.
I'm confused. I get cheap content to pad out some smaller sites. $7 an article.
So, you don't see this when you receive the articles from them? You don't see this when copy/pasting them? You don't see the errors when uploading? You only see them when you alter the file?
Please explain this. What format do you receive the articles in whereby you don't see the problems, and what do you do to the articles to be able to see the issues?
"they were exceptionally cheap"
serves you right
Correct? The single page I uploaded to my website is still live, and the text appears perfectly to the human eye. However Google does not recognize any words in it, since every word is actually a mix of Western European and Cyrillic characters. Cyrillic is made of unique characters, as well as characters that look exactly like ours. E.g. the letters "M" "o" "c" "b" etc. display the same in Russian and in English, but they actually have a different code. To complicate matters further, the con artist sent me a mix of various Cyrillic alphabets, mixed with Western European.
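One way to trace a disguised article back to its source is to map the look-alikes back to Latin letters and search for the result. The Python sketch below is only an illustration: the `LOOKALIKES` table and the `restore_latin` name are hypothetical, and the table is deliberately tiny and incomplete (a real check would need a much larger list of confusable characters).

```python
# Hypothetical, incomplete table of Cyrillic letters that resemble Latin ones.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
    "\u0443": "y",  # CYRILLIC SMALL LETTER U
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
}

def restore_latin(text):
    """Replace known look-alike characters with their Latin counterparts."""
    return text.translate(str.maketrans(LOOKALIKES))

# The restored string can then be pasted, in quotes, into a search engine.
disguised = "be\u0430utiful r\u0456v\u0435r"
print(restore_latin(disguised))
```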
|So, you don't see this when you receive the articles from them? You don't see this when copy/pasting them? You don't see the errors when uploading? |
|I guess this all goes back to the ole saying you get what you pay for. |
My experience is that you CAN get quality articles at low price. Some people just have the ability to write (or talk) endlessly; some don't. I don't. You also get low quality articles - that is normal. But I had never seen deceit before.
Character substitution to avoid Copyscape-type programs has been around for many years. When WYSIWYG editors became available, the incidence of this content theft/infringement increased... those who work with hands-on HTML editors could spot this scam very quickly. With the rise of cut-and-paste webmastering this type of deceitful product worked more successfully. It was up to those purchasing content to make sure they get what they pay for, and most did not/do not make that effort.
Example of one of those "if you let me get away with it I'll keep on doing it" and "you deserve what you get if you don't make the effort to assure the product is suitable" kind of things.
As this is a Content forum, the best advice is to write your own content. Second best advice is to obtain content from known to be good sources. There is no third best advice that I know of.
Oops, just remembered. Been using this for years when pre-proofing ebooks. Same purpose, though the context is less sinister.
The exact content of the braces will depend on your editor's RegEx dialect, and possibly also the p/P toggle (meaning "is/isn't"). For those who don't speak RegEx, it simply means: find any adjacent occurrences of Latin and non-Latin script.
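For those without a RegEx-capable editor, the same idea (find adjacent Latin and non-Latin letters) can be sketched in Python. This is not the expression referred to above, just an approximation: Python's built-in `re` module has no script properties, so the hypothetical helper `script_of` guesses the script from the first word of each letter's Unicode name, which is rough but good enough for Latin, Cyrillic and Greek.

```python
import unicodedata

def script_of(char):
    """Rough script label (first word of the Unicode name); None for non-letters."""
    if not char.isalpha():
        return None
    name = unicodedata.name(char, "")
    return name.split(" ")[0] if name else None

def mixed_script_boundaries(text):
    """Return (position, two-character pair) wherever a Latin letter touches a non-Latin letter."""
    hits = []
    for i in range(len(text) - 1):
        first, second = script_of(text[i]), script_of(text[i + 1])
        if first and second and first != second and "LATIN" in (first, second):
            hits.append((i, text[i:i + 2]))
    return hits

# Hypothetical sample: "river" with a Cyrillic "i" in second position.
print(mixed_script_boundaries("r\u0456ver"))
```

A word written entirely in Cyrillic or Greek is left alone, so legitimate foreign words do not trigger it; only words that mix scripts do.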
|E.g. the letters "M" "o" "c" "b" etc. display the same in Russian and in English, but they actually have a different code. |
Ok I see now, it's visual spoofing, so even if you copy/paste the strings you may not see the problem, subject to the editor in use.
Since you would expect the articles to contain only ASCII characters (0-127 decimal, in other words the 7-bit character set), set up a filter; you should in this type of business. Or do what lucy said above. An editor macro or a browser add-on will be more useful in this case.
With the lower ASCII range: say the original text is converted to UTF-8 and stored in the database. You run the filter against it and see if any differences occur. Or just apply the filter first and use its result for the content search. For instance:
$str = preg_replace('/[^(\x00-\x7F)]*/','',$str);
That's in PHP. In your case above it will strip all the foreign characters, and you will be aware of the fraud right at the beginning. Kinda late, but next time you will have something to counter it.
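The "run the filter and see if any differences occur" step can be sketched in Python as well. The names `strip_non_ascii` and `looks_tampered` are invented for this example, and it assumes the article should be pure 7-bit ASCII, so legitimate curly quotes or accented letters would also be flagged.

```python
import re

# One or more characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def strip_non_ascii(text):
    """Remove every character above code point 127."""
    return NON_ASCII.sub("", text)

def looks_tampered(text):
    """True if filtering changed the text, i.e. it contained non-ASCII characters."""
    return strip_non_ascii(text) != text

# Hypothetical sample: "river" with two Cyrillic look-alikes.
sample = "r\u0456v\u0435r"
print(strip_non_ascii(sample), looks_tampered(sample))
```

The stripped text comes out with visible holes in the words, which is what gives the substitution away.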
I don't dispute the ability to use these tools if you are posting low-quality copy on an industrial scale (though I'd question the ROI).
But for most honest webmasters,
1. Read the text; if it makes perfect sense, then a quick Google search of a text string will find out if there are duplicates.
2. If it doesn't make sense, then someone has almost certainly mangled the text. Either they used a proprietary 'uniquifier', which ALWAYS makes a poor enough job to be seen instantly as the scam that it is, or they've translated it to a random language, then translated it back; only a dribbling idiot would be taken in.
3. You can't beat a simple glance at the code. That would reveal all known character scams.
And, when paying for product or service; as stated several times above, 'if it looks too good to be true, it probably is'.
If you care about your website; your visitors, your bounce rate, your return rate, your ROI, then you need to set a quality minimum or be beaten by better sites.
|$str = preg_replace('/[^(\x00-\x7F)]*/','',$str); |
I find it simpler to copy a text snippet or two between quotes BY HAND into Google's search box...
A reason we don't buy content. We write every word ourselves. It is laborious and tedious, but google has rewarded us for it. The cliche "content is king" really is true. And it's not so much quantity as quality.
QUALITY NEVER COMES CHEAP. What you PAY is what you get. Developing content requires regular involvement of both the content owner and the content writer.
I always use Dreamweaver and create a test page and upload it to the server first to check for plagiarism through Copyscape. It gives me Design view as well as HTML view of the page. I replace any special characters with their equivalent HTML name like &amp; etc. for cross-browser compatibility.
Just curious, before I put anything on a html page I paste it into Notepad, thus stripping text of any crazy characters. Wouldn't this work in his instance?
|Just curious, before I put anything on a html page I paste it into Notepad, thus stripping text of any crazy characters. Wouldn't this work in his instance? |
Yes. You would need to save the notepad file, and reopen it in order to see the cheat, though.
What does "crazy characters" mean?
We are now hitting the linguistic Great Divide: People whose primary language is English-- or, at most, a western European language-- and the rest of the world.
There is no inherent difference between a Roman letter and a Greek or Cyrillic letter. If your text's primary language is English, you could test for non-ASCII or non-Latin-1 characters. But it gets trickier if you're working in a UTF-8 environment, even if it's just so you can use curly quotes without resorting to illegible and space-guzzling entities. Or if you need to bring in the occasional word in an Eastern European language. Then the non-ASCII characters won't jump out and hit you in the face.
(That's dialect-specific. I use it to check for things that can't be auto-converted from UTF-8 to Latin-1.)
Incidentally, curly quotes are in Windows-Latin-1, so they will be de facto recognized by most systems that profess to use ISO-Latin-1. There's an Official Pronouncement about it somewhere. But if I insert something like ο (Greek) or о (Cyrillic) the Forums will go haywire.
Oops sorry. ascii
|Anyone else care to spill on any other tricks to watch out for? |
I hired folks on elance some years ago based on the exceptional feedback and number of projects they had behind them.
It turned out they would copy an article and replace words with synonyms. The article sounded stupid as some of the synonyms did not fit the context, which prompted me to do a careful search for various strings, which helped me find the originals.
On the other side, I hired another company through elance which had a similar feedback and I'm still using them. Nothing super duper but good enough for the price I get and purpose those articles serve.
The only original content I could think of is when a writer is involved in what she/he is writing about. I.e. if the article is about some car, then the writer should sit in the car and drive it around, I guess. That would make it really original.
boblord666, will not work with Unicode Notepad (any modern Windows version), unless you save the file, explicitly ask to save ASCII (or ANSI? anyway, not Unicode), and then reopen the file. May not work even then, if you have Cyrillic code page on your system.
There's an easier way to detect this. I always have translate.google opened in one window, copy paste a section of the article there and then click on "listen". If it reads funny, then something is wrong.
Quick detour to System Preferences confirms that the old-fashioned Speech utility is still there, alongside the heavy-duty Voice Over which I do not understand (but note with interest that lower-case "content" is pronounced "conTENT", like the verb/adjective, while upper-case "Content" is pronounced "CONtent" like the noun). Computer-generated speech turns out to be much better than it was 10-20 years ago when it all came out sounding like someone with a strong Swedish accent. Let alone more decades ago, when it sounded like, well, computer-generated speech.
I pasted your post (Danwebsol) into a text editor and globally changed all the o's to omicrons. The Speech utility treated them like spaces, making it pretty obvious that something was amiss :)
But a check for non-Latin characters is faster and more thorough. Or tell your editor to change the encoding to ASCII or Latin-1 and see if it raises a fuss.
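The "change the encoding and see if it raises a fuss" test can also be scripted. A minimal Python sketch, assuming Windows-1252 (`cp1252`) as the target so that curly quotes pass while Cyrillic and Greek letters do not; the function name `unencodable_characters` is made up for the example.

```python
def unencodable_characters(text, encoding="cp1252"):
    """Return (position, character) for every character the target encoding cannot represent."""
    bad = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            bad.append((index, char))
    return bad

# Curly quotes are fine in cp1252; the Cyrillic "a" in "Danube" is not.
sample = "\u201cD\u0430nube\u201d"
print(unencodable_characters(sample))
```

Passing `encoding="ascii"` instead gives the stricter check, which also flags the curly quotes.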
|Publishing any article without going through the source code is a disaster waiting to happen |
Words to live by. Not only can you be scammed in a manner similar to the original poster, but you could have god awful hidden divs, character encoding issues, or just crap, kludged markup.