
A new scraping trend surfacing.

reluctant to bring this up - but you need to know

         

NoLimits

8:00 pm on Sep 12, 2005 (gmt 0)

10+ Year Member



I have noticed that many of my pages are being "scraped"... but not in the same way that I have seen in the past.

Whilst trying to summarize some of my rather long-winded articles for other sections of the site that give a brief overview of the story, I pasted some content into Microsoft Word. I proceeded to run the AutoSummarize tool to see if it would come up with anything worthwhile.

I then went to check the competition for the starting phrase, and was very displeased to find that it already exists on another site. The content is, without a doubt, a summary of my own content done through Word. The site still looks like an autogenerated 10,000,000 page POS... but now their content is a summary of everyone else's content.

I'm not sure what to make of it... or if there is anything that can be done about this.
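
For anyone who wants to run the same check without searching phrase by phrase, here is a rough Python sketch. It assumes the summary reuses whole sentences from the source article, which is what the matching starting phrase above suggests; `split_sentences` and `copied_fraction` are made-up helper names, and the sample texts are invented for illustration.

```python
import re


def split_sentences(text):
    """Split text on sentence-ending punctuation and normalize case/whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [" ".join(p.lower().split()) for p in parts if p.strip()]


def copied_fraction(original, suspect):
    """Return the share of the suspect's sentences found verbatim in the original."""
    original_sentences = set(split_sentences(original))
    suspect_sentences = split_sentences(suspect)
    if not suspect_sentences:
        return 0.0
    copied = sum(1 for s in suspect_sentences if s in original_sentences)
    return copied / len(suspect_sentences)


# Invented sample texts, purely for illustration.
ORIGINAL = (
    "Widgets are small. They come in blue and red. "
    "Most people prefer blue widgets. Red ones are rare."
)
SUMMARY = "Widgets are small. Most people prefer blue widgets."
UNRELATED = "Widgets are small. Buy cheap pills now."

print(copied_fraction(ORIGINAL, SUMMARY))
print(copied_fraction(ORIGINAL, UNRELATED))
```

A score near 1.0 on a page you did not write would be a strong hint that it was lifted sentence by sentence. A paraphrasing summarizer would slip past this check entirely.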

Your opinions are welcome, I'd love to hear some thoughts on this.

creepychris

8:08 pm on Sep 12, 2005 (gmt 0)

10+ Year Member



Another scary scraper trend. In the end though, these sites are only so effective at stealing traffic. They don't get repeat visitors, they don't get recommendations or word-of-mouth, they don't get important links pointed towards them. Yes, they do steal traffic from search engines, but how can it really be called 'stealing' if we don't actually own that traffic? The search engines direct people where they will and they make or break their reputation on sending visitors to quality sites. IOW, it's the search engines' problem. Yes, I hate the scrapers. But I don't lose any sleep over them.

If I were Google, on the other hand, I would be losing a lot of sleep.

Alioc

9:52 pm on Sep 12, 2005 (gmt 0)

10+ Year Member



Unfortunately, you've just increased the potential number of people who will find the idea neat and act accordingly.

I think we can't do so much against these people as long as the content is pure text and so easy to copy & paste. There should be a new standard where only browsers and search engines can easily grab content. Something like image to text conversion. But then again, there will always be ways to overcome anything.

JoeT321

10:30 pm on Sep 12, 2005 (gmt 0)

10+ Year Member



I assume the way that you think browsers and search engines could be identified to grab content would be based on the User-Agent, which is extremely easily faked.
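
To show how little effort the faking takes, here is a small Python sketch. The User-Agent header is just a string the client chooses to send. The URL is a placeholder, `looks_like_search_engine` is a made-up example of a naive server-side check, and no request is actually sent.

```python
from urllib.request import Request

# A widely published Googlebot User-Agent string, used here only as an example value.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"


def build_spoofed_request(url, user_agent):
    """Build (but do not send) a request carrying whatever User-Agent we like."""
    return Request(url, headers={"User-Agent": user_agent})


def looks_like_search_engine(user_agent):
    """Hypothetical naive check that simply trusts the header."""
    return "googlebot" in user_agent.lower()


req = build_spoofed_request("http://example.com/article.html", GOOGLEBOT_UA)
# urllib stores header names capitalized as "User-agent".
claimed = req.get_header("User-agent")
print(looks_like_search_engine(claimed))
```

Any scraper can do the same in one line, so a standard that trusts the User-Agent alone would not keep anyone out.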

larryhatch

10:39 pm on Sep 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just went Googling for snippets of my own original text.
Heck, there were lots of scrapes, large and small, thankfully most were small.

I don't fuss if somebody quotes a few words and gives a valid (non-php etc.) link back.

What pleased me the most was that the overwhelming majority were listed as 'supplemental results',
many of those only showing if I click on "see excluded results".

In my goofy niche at least, I see definite improvement. -Larry

Iguana

10:42 pm on Sep 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just did an Auto-Summarize on this page in Word and all I got was 'Westway host' and an email box.

KimmoA

10:00 am on Sep 13, 2005 (gmt 0)



"Unfortunately, you've just increased the potential number of people who will find the idea neat and act accordingly."

OMG! OMG! Must... scrape... contents! *drools*

"I think we can't do so much against these people as long as the content is pure text and so easy to copy & paste. There should be a new standard where only browsers and search engines can easily grab content."

Um... no. Worst idea ever. (BTW... it's called Flash, but not even the SEs can find that content!)

bbd2000

3:07 pm on Sep 13, 2005 (gmt 0)

10+ Year Member



Wow, thanks for the idea!

Before everyone piles on, this has some valid uses.

For example, I use wiki from time to time as filler for areas I don’t have the time or inclination to write about. If I mention “funny looking widget” and need to provide a brief description of “funny looking widget”, I normally throw a wiki page in there and move on. I know people disapprove (mostly webmasters; I receive few complaints from viewers), but I only have so much time and not every piece of content on my site can be an original masterpiece.

Now I can copy, paste, summarize, edit quickly and post. The best part is that now my page does not look like the hundred other wiki clones out there.

Does this make me evil or just practical?

YesMom

6:12 am on Sep 14, 2005 (gmt 0)

10+ Year Member



Does this make me evil or just practical?

Evil! Har har...

(just kidding... the question was begging to be answered)

;-)

larryhatch

6:16 am on Sep 14, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



bbd: "I normally throw a wiki page in there and move on. "

Are you saying you LINK to the wiki page, or that you COPY the wiki page or what?

If you just copy content, do you provide credits and a link back? -Larry

bbd2000

11:32 am on Sep 14, 2005 (gmt 0)

10+ Year Member



YesMom,

My wife has always said I was evil. I guess she was right, as usual.

Larry,

I’m a big fan of wiki!

I know some people claim it is not authoritative, but in general I find the information as accurate as most other sources. Anyway, a lot of people claiming wiki is not authoritative are trying to sell some other version of the same data.

To your question:

In the past, if I used the entire article, I usually put an “information from wiki – the free encyclopedia” note at the end of the article and linked to the article.

If I only used part of the info, I put “for more information visit wiki – the free encyclopedia.” and linked to the article.

I will probably be using more of the second version from now on.

This is more than most sites I stumble across do.

I hope that doesn’t ruffle too many webmasters’ feathers.

In case you are wondering, wiki content is less than 2% of my site. I have no need or desire to take wholesale from the site. I just view it as a good source for answering readers’ questions quickly.

trialofmiles

2:18 pm on Sep 14, 2005 (gmt 0)

10+ Year Member



bbd2000, when you say wiki, I'm guessing you mean Wikipedia. The term "wiki" is generic for any site that has wiki-style editing, and licensing is on a site-by-site basis.

As for your use of Wikipedia articles, make sure you understand the GNU Free Documentation License, because that's the license you need to adhere to when using Wikipedia articles on your site.

Read the following link:
[en.wikipedia.org...]

The important thing your notice is lacking is a statement on your page that the content is licensed under the GFDL. Look at the section "Example notice".

Edit: Also read the actual GFDL at [gnu.org...]

Pertinent text:

You may copy and distribute the Document in any medium, either commercially or noncommercially, provided that this License, the copyright notices, and the license notice saying this License applies to the Document are reproduced in all copies, and that you add no other conditions whatsoever to those of this License.

larryhatch

6:37 am on Sep 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hello bbd: If you do as you say, proper credits and a genuine link back,
it sounds perfectly fair to me. Wikipedia may have other requirements,
but if somebody clips a paragraph of mine, credits it to my site with a
non-phony link back, I don't mind at all. -Larry

larryhatch

6:47 am on Sep 16, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Back ( I hope ) to the original topic:

Checking my access logs, I found a few hits from a new referrer, just a hot linked image.
I called up their page, and it was so long that it crashed my computer!

I'm on a dial-up connection. I disabled javascript, called the page up again,
stopped it from loading, then I sucked in the source code. THAT took a long time.

It was rather amazing. Somebody has scraped maybe 2 paragraphs of text from scores
of sites/pages, each with a hot-linked image.

NO credits are given, not even a hint or a name, and of course no links back
to the original authors / sites / pages.

I will try and 'whois' the perpetrator. That should be easy.
The tedious part is contacting all those other victimized sites. - Larry
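
For anyone wanting to run the same access-log check, here is a rough Python sketch that counts outside referrers on image requests. It assumes the server writes Apache's combined log format; the sample lines, the `example.com` domain and the `hotlink_referrers` helper are all made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Apache "combined" format: host ident user [time] "request" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def hotlink_referrers(lines, own_domain):
    """Count image requests whose referrer is a host outside own_domain."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if not match:
            continue  # skip lines that are not in the assumed format
        path = match.group("path").lower()
        referer = match.group("referer")
        if not path.endswith(IMAGE_EXTENSIONS) or referer in ("", "-"):
            continue
        ref_host = urlparse(referer).hostname or ""
        if not ref_host:
            continue
        if ref_host != own_domain and not ref_host.endswith("." + own_domain):
            counts[ref_host] += 1
    return counts


# Invented log lines, purely for illustration.
SAMPLE_LINES = [
    '203.0.113.5 - - [16/Sep/2005:06:40:01 +0000] "GET /images/widget.jpg HTTP/1.1" '
    '200 5120 "http://scraper.example.net/big-page.html" "Mozilla/4.0"',
    '198.51.100.7 - - [16/Sep/2005:06:41:10 +0000] "GET /images/widget.jpg HTTP/1.1" '
    '200 5120 "http://www.example.com/article.html" "Mozilla/4.0"',
    '198.51.100.7 - - [16/Sep/2005:06:41:12 +0000] "GET /article.html HTTP/1.1" '
    '200 20480 "-" "Mozilla/4.0"',
]

for host, hits in hotlink_referrers(SAMPLE_LINES, "example.com").most_common():
    print(host, hits)
```

Visitors who hide or strip the referrer will not show up, so treat the counts as a lower bound.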