| 4:14 pm on Jun 2, 2005 (gmt 0)|
Google is a scraper site. A scraper site is any site that does not make its own content and uses a bot to crawl the web and publish snippets of other sites. When the term is used here, though, people mean that some guy bought scraper software and put up several thousand pages with AdSense on them.
| 4:43 pm on Jun 2, 2005 (gmt 0)|
"Scraper site" is an abbreviation of "screen scraper site".
Screen scraping is a technique where automated tools are used to download a web page and extract (scrape) some of the information on that page in order to place it on another web page.
Sample uses of screen scraping include:
Obtaining stock quotes from another site and displaying the data on your own site.
Grabbing a page from dmoz.org, and reformatting it on your own site to create your own web directory.
Creating a search engine and showing snippets in your SERPs (as Google does).
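To make "download a page and extract some of the information" concrete, here is a minimal sketch in Python. It assumes the page has already been downloaded, so `SAMPLE_PAGE` is a made-up stand-in for the fetched HTML, and the `SnippetScraper` and `scrape_snippet` names are hypothetical. It only shows the extraction half of the technique, using the standard library's `html.parser`.

```python
from html.parser import HTMLParser

# Hypothetical page source; a real scraper would download this over HTTP.
SAMPLE_PAGE = """
<html><head><title>Green Widgets Review</title></head>
<body>
<h1>Green Widgets</h1>
<p>Green widgets are the most popular widgets this year.</p>
<p>They cost less than blue widgets.</p>
</body></html>
"""


class SnippetScraper(HTMLParser):
    """Collects the page title and the text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current_tag = None  # "title", "p", or None

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current_tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            self._current_tag = None

    def handle_data(self, data):
        if self._current_tag == "title":
            self.title += data
        elif self._current_tag == "p":
            self.paragraphs[-1] += data


def scrape_snippet(html):
    """Return (title, first paragraph) extracted from an HTML string."""
    parser = SnippetScraper()
    parser.feed(html)
    parser.close()
    first = parser.paragraphs[0].strip() if parser.paragraphs else ""
    return parser.title.strip(), first


title, snippet = scrape_snippet(SAMPLE_PAGE)
print(title)
print(snippet)
```

The same extraction step sits behind all three sample uses above; what differs is where the page comes from and what is done with the result.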
A scraper site on this forum usually refers to a site something like this...
Someone sets up a web site to show AdSense ads for a particular high-performing keyword.
They then take a look at sites that perform well in the search engines, extract some of the text from those sites (maybe a paragraph from each site) and display it on a web page.
Note: frequently, rather than grabbing the text from the web site directly, they scrape the content from Yahoo SERP snippets.
They then have a page with a load of related snippets from various sites, which is highly targeted to a specific (usually high-paying AdSense) keyword but does not contain any useful information.
The search engines have a habit of ranking these pages high in the SERPs. A user types in a search term, visits the scraper site and finds that it's useless. Quite often the user will see the AdSense ads, which are likely to contain useful information relevant to the search term they entered, so they click on an ad, and the owner of the scraper site makes some money.
| 4:49 pm on Jun 2, 2005 (gmt 0)|
I would like to point out that there is a large gray area as well. Many legitimate sites begin with scraped data and add value by adding to it, editing it and rearranging it.
As was pointed out, Google itself is a scraper site. Scraper abuse is the issue here, not scraping itself.
| 4:53 pm on Jun 2, 2005 (gmt 0)|
Let's say I did a bunch of research on a particular topic and am presenting the research I feel is "the best" in a manner similar to Google search results. Even though I have not used a bot, I am not using my own content, but I link directly to the original source. Is that considered bad practice or scrapping?
| 4:56 pm on Jun 2, 2005 (gmt 0)|
It's scraping, not scrapping. Automation is considered a key element in the scraper game.
| 5:36 pm on Jun 2, 2005 (gmt 0)|
Even more than automation, UTILITY is key. When Google's SERPs become useless, Google falls into the same category as the abusers; while not an abuser itself, it is a scraper victim.
A broad definition could be utility itself, which is similar to the classic definition of spam as "anything the recipient doesn't want."
| 6:14 pm on Jun 2, 2005 (gmt 0)|
As used in case law to date, a scraper is a site which makes unauthorized use of another site's copyrighted content. By that operating definition Google is not a scraper, as nobody has to be listed in Google, and Google will respect anybody's wish not to be indexed or cached.
Scrapers offer no such respect for the intellectual property of others, or couch their "respect" in terms of "if you want us to delist you, you have to block all spiders (and be delisted from every legitimate search engine along with our garbage sites)."
| 6:17 pm on Jun 2, 2005 (gmt 0)|
And please, never, ever refer to them as "scrapper" sites like half the folks do!
Edit: Oops, I just noticed oddsod's entry above. Sorry.
| 6:23 pm on Jun 2, 2005 (gmt 0)|
Just a typo, I assure you :P
| 7:43 pm on Jun 2, 2005 (gmt 0)|
|MediaSpree wrote: |
Let's say I did a bunch of research on a particular topic and am presenting the research I feel is "the best" in a manner similar to Google search results. Even though I have not used a bot, I am not using my own content, but I link directly to the original source. Is that considered bad practice or scrapping?
That's not a scraper, that's a hub.
| 7:49 pm on Jun 2, 2005 (gmt 0)|
Does Google disable that publisher if I report the websites? I know many websites that are doing the same to my website.
| 8:00 pm on Jun 2, 2005 (gmt 0)|
About being a hub...
If most of your info is from other places already on the net, even if it is correctly edited for typos and has pages inserted for long articles, won't that demolish your ability to get good page ranking?
| 8:05 pm on Jun 2, 2005 (gmt 0)|
|Does Google disable that publisher if I report the websites? I know many websites that are doing the same to my website. |
I have seen some that I reported lose AdSense. It doesn't mean that my report had them removed, but a little over a month after I reported them they no longer had AdSense.
I investigated and found that this publisher had many sites using the same layout for each one. All scraper sites. It may take a while but Google will get to them.
| 8:29 pm on Jun 2, 2005 (gmt 0)|
Sunzfan, are you from Dumbarton?
| 9:11 pm on Jun 2, 2005 (gmt 0)|
I wish I could put a link to one of the scrapers so you could see it yourself :), but what they do is use software, kind of like an add-on to a link sql, which searches Google for a keyword, then extracts all the top links related to that keyword and stores them in the database. Now they have a site full of links to other sites. lol, man, I hope Google does something about it.
You guys must hear this funny story I would like to share.
This funny thing happened to me: I emailed Google about a site which is a scraper site. This site is always #1 in my niche, and if anyone wants to see what site I am talking about, just PM me so you can see it's a full-blown scraper site. Now this is the funny part: I checked 4 days later, lol, and not only was the scraper site still on top, but my site, which used to always be in 2nd place, was out completely. I mean it wasn't even in the top 50 :)
Now, I don't want to say it happened because I emailed Google, because I want people to email Google about this type of site until Google does something about it. But it's just a funny story, I think :)
| 9:27 pm on Jun 2, 2005 (gmt 0)|
The definitive sign of a 'scraper' site is simply that it scrapes (copies) content from other sources.
Scraping is often automated, especially with larger sites where manual copying is too much work. All those phony DMOZ directories are a fine example.
BUT automation is not necessary for the definition. There are manual scrape jobs, tailor-made to specific situations.
Scrapers most often do so for AdSense or other ad revenue, but even this is not strictly necessary.
You will see silly accusations that Google etc. are scrapers, as if this justifies scraping in general.
Draw your own conclusions why people would do so. -Larry
| 9:31 pm on Jun 2, 2005 (gmt 0)|
I've seen some terrible scraper sites. One I saw took a single article and then, whatever keyword was searched for, plugged that word/phrase into the article, replacing a certain noun used throughout the article, like an ad-lib.
| 9:43 pm on Jun 2, 2005 (gmt 0)|
I have seen those many times. Funny thing is, they are probably rolling in it (the dough). Ugh!
| 9:43 pm on Jun 2, 2005 (gmt 0)|
I really don't see how Google can ever stop these "horrible scraper sites"....simply saying they are useless is not really looking at what they do....for the most part there are "good" scrapers and "bad" scrapers....there are some people that have no idea what they are doing with these scraper tools and put out whatever the default settings are....on the other hand I've seen a few good examples of what you can build with these sites...what is really bugging me is the blog and ping sites that use RSS feeds....try going to Blogger and go through 10 sites....you will see at least 2/10 scraper blogs....they are doing this because blog pages are getting amazing rankings....reallllllly fast....(sometimes hours)
I think a great idea for getting these sites would be to get users to rank a site relative to their search term...so the user types in green widgets....and then clicks on a SERP....if it's a site THEY deem is crap then they grade it accordingly...and this somehow ties into the SERPs algorithm....I am certain this type of system is around the corner....as "awesome" as Google engineers think their algorithm is working...it may need a human element....dmoz is way tooooooo slow...
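One way the "users grade the result and it ties into the algorithm" idea could look is sketched below in Python. Everything here is an assumption for illustration: the `UserGradedRanker` class, the 0-to-1 grade scale, the `user_weight` of 0.3 and the example hostnames are all made up, and nothing in it reflects how any real search engine works.

```python
from collections import defaultdict


class UserGradedRanker:
    """Blends an engine's own score with user grades for a (query, url) pair."""

    def __init__(self, user_weight=0.3):
        # Hypothetical weight: how much the average user grade counts.
        self.user_weight = user_weight
        self._grades = defaultdict(list)  # (query, url) -> grades in [0, 1]

    def grade(self, query, url, grade):
        """Record one user's grade: 0.0 means crap, 1.0 means exactly right."""
        if not 0.0 <= grade <= 1.0:
            raise ValueError("grade must be between 0.0 and 1.0")
        self._grades[(query, url)].append(grade)

    def rerank(self, query, results):
        """results is a list of (url, algo_score) with algo_score in [0, 1]."""

        def blended(item):
            url, algo_score = item
            grades = self._grades.get((query, url))
            if not grades:
                return algo_score  # no votes yet: keep the engine's score
            average = sum(grades) / len(grades)
            return (1 - self.user_weight) * algo_score + self.user_weight * average

        return sorted(results, key=blended, reverse=True)


ranker = UserGradedRanker()
ranker.grade("green widgets", "scraper.example", 0.0)
ranker.grade("green widgets", "scraper.example", 0.0)
ranker.grade("green widgets", "useful.example", 1.0)

results = [("scraper.example", 0.9), ("useful.example", 0.8)]
reranked = ranker.rerank("green widgets", results)
print(reranked)
```

With these made-up numbers the scraper's blended score drops to 0.63 while the useful site rises to 0.86, so the order flips. A real system would also have to deal with people gaming the votes, which this sketch ignores.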
| 9:48 pm on Jun 2, 2005 (gmt 0)|
I'd love to see an example of a "good" scraper.
Anyone that takes verbatim text off a site without at the very least linking back isn't "good" in my book.
| 9:52 pm on Jun 2, 2005 (gmt 0)|
I think visitors do vote, indirectly. This is the question I want to ask the Google Guys in New Orleans. I want to know how much of their algo is based on the human element... where people click, how long they stay... I think this should be given greater priority when determining SERPs.
| 10:08 pm on Jun 2, 2005 (gmt 0)|
There are so many examples....
Google, Yahoo, CNN, etc....
don't get caught up in the hype of "scraper sites are bad"
Those "bad" ones will disappear with time....just give it time...if they are screwing up your business, report them to the search engines whose SERPs you are competing in....
I thought I read something somewhere about Yahoo doing something like what I suggested, but it's a very faint memory...
In the end Content is King....if you have a great site full of great information you will get long visits, lots of incoming links, and with SEO lots of traffic...usually "bad" scraper sites can't compete with that....
| 10:18 pm on Jun 2, 2005 (gmt 0)|
Hey, I suggested to Google, when I first started, that they should allow 3 ad positions on each page and I attached a sample page for them to look at... with 3 ad positions. A month later, they announced that 3 ad units were permitted. Coincidence? Maybe. But.... you never know!
Yes, content is king. Visitors won't stay on a scraper site; if this were figured more into the algos, scraper sites could be history... or at least it would be a step in the right direction.
| 10:24 pm on Jun 2, 2005 (gmt 0)|
P.S. Google, if you are listening, set a max time into that algo too, otherwise forum/member sites will take over the results.
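A quick sketch of why a max time matters, in Python. The `engagement_score` function, the 300-second cap and the visit durations are all made-up assumptions, not anything Google is known to use.

```python
def engagement_score(dwell_seconds, max_seconds=300.0):
    """Average time per visit, with each visit capped, scaled to 0..1.

    dwell_seconds: list of visit lengths in seconds.
    max_seconds: hypothetical cap so very long sessions cannot dominate.
    """
    if not dwell_seconds:
        return 0.0
    capped = [min(max(s, 0.0), max_seconds) for s in dwell_seconds]
    return sum(capped) / (len(capped) * max_seconds)


# Made-up visit lengths for three kinds of site.
scraper_visits = [5, 8, 3]          # visitors leave almost at once
article_visits = [120, 240, 90]     # visitors read the page
forum_visits = [3600, 7200, 1800]   # members stay logged in for hours

for name, visits in [("scraper", scraper_visits),
                     ("article", article_visits),
                     ("forum", forum_visits)]:
    print(name, round(engagement_score(visits), 3))
```

Without the cap the forum's average visit would be 28 times the article's; with it, the forum scores 1.0 against the article's 0.5, and the scraper still ends up near zero.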
| 4:52 am on Jun 3, 2005 (gmt 0)|
Look what I found on Slashdot:
Philipp Lenssen writes "Google registered a trademark for the word "TrustRank", as Search Engine Watch reveals. Is this a sign we can expect a follow-up to Google's PageRank? An earlier, possibly related paper on TrustRank is available; it proposes techniques to semi-automatically separate good pages from spam by the use of a small selection of reputable seed pages."
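For anyone wondering how a handful of seed pages could help separate good pages from spam, here is a rough Python sketch of one possibility: trust starts at hand-picked seeds and flows along outgoing links, PageRank-style. This is an assumption about the general approach, not Google's implementation; the `trust_scores` function, the damping value and the tiny link graph are all invented for illustration.

```python
def trust_scores(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outgoing links.

    links: dict mapping page -> list of pages it links to.
    seeds: set of hand-picked reputable pages (the manual part).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pages |= set(seeds)

    # Only seeds receive the "teleport" share; everyone else starts at zero.
    seed_share = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_share)

    for _ in range(iterations):
        nxt = {p: (1 - damping) * seed_share[p] for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * trust[page] / len(targets)
            for target in targets:
                nxt[target] += share
        trust = nxt
    return trust


# Made-up link graph: the scrapers link out, but no trusted page links to them.
links = {
    "university.example": ["hobby.example"],
    "hobby.example": ["university.example"],
    "scraper-a.example": ["scraper-b.example", "hobby.example"],
    "scraper-b.example": ["scraper-a.example"],
}
scores = trust_scores(links, seeds={"university.example"})
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph the two scraper pages end up with zero trust, because trust only arrives through links from pages that already have some, and linking out to a good site earns nothing.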
| 6:24 am on Jun 3, 2005 (gmt 0)|
|I really don't see how Google can ever stop these "horrible scraper sites"....simply saying they are useless is not really looking at what they do....for the most part there are "good" scrapers and "bad" scrapers.... |
First, I share the opinion that scrapers cannot be good. They are all bad, just making a living off stolen content, messing up the web, cluttering SERPs, annoying web users, and finally making the web a whole lot less important. Just think of it - how cool it would be to enter whatever term into G and see a meaningful, relevant, high-quality site show up as #1. Followed by other equally relevant sites on #2 to #10. Now, that would be fan-tas-tique!
As to how to stop scrapers: It's simple - stop the money flow towards scrapers by tightening the quality guidelines for publishers:
1) Manually check new domains/sites rather than new publishers.
2) Introduce a reporting system for each G user (including advertisers and publishers - their reports should have higher priority).
3) Manually check reported sites: if a certain number of reports has been reached for one publisher, *immediately* check that publisher's pages/sites.
4) Whenever there are no higher priorities (see #3) manually check each page/site that has been reported, beginning with those having the highest number of reports.
5) Whenever there are no higher priorities (see #4) manually check each site again, beginning with sites from the highest earning publishers.
6) If there is no useful content up there ('made for AS') ban the respective publisher - forever!
7) On Google Search, penalize *all* sites run by publishers who were thrown out of AS. No matter whether they actually carry AS or not. No matter whether they are 'useful' or not. This could be done by correlating the info to the domain owners on WHOIS.
8) Make all these measures and their consequences very clear to publishers. They should understand that by running scrapers THEY can damage their whole relationship with G.
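Points 2 to 4 amount to a review queue ordered by report count. A minimal Python sketch of that could look like the following; the `ReviewQueue` class, the double weight for advertiser/publisher reports and the threshold of 10 are assumptions chosen only to show the idea.

```python
import heapq
from collections import Counter


class ReviewQueue:
    """Orders reported sites for manual review by weighted report count."""

    def __init__(self, urgent_threshold=10):
        # Hypothetical threshold for "check immediately" (point 3).
        self.urgent_threshold = urgent_threshold
        self._reports = Counter()

    def report(self, site, from_advertiser_or_publisher=False):
        # Point 2: reports from advertisers and publishers count for more.
        # The weight of 2 is an arbitrary choice for this sketch.
        self._reports[site] += 2 if from_advertiser_or_publisher else 1

    def urgent(self):
        """Sites that reached the threshold and should be checked at once."""
        return sorted(site for site, count in self._reports.items()
                      if count >= self.urgent_threshold)

    def next_batch(self, size):
        """Point 4: the most-reported sites first, as (site, count) pairs."""
        return heapq.nlargest(size, self._reports.items(),
                              key=lambda item: item[1])


queue = ReviewQueue(urgent_threshold=10)
for _ in range(3):
    queue.report("a.example")
queue.report("b.example", from_advertiser_or_publisher=True)
for _ in range(12):
    queue.report("c.example")

print(queue.urgent())
print(queue.next_batch(2))
```

The manual checking itself is the expensive part, of course; the queue only decides who gets looked at first.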
The consequences -
1) AS quickly becomes very unattractive (financially) for scrapers who rely on AS.
2) AS quickly becomes very unattractive for webmasters who want to do anything with their web skills in the future. (Again, once out you'll never get back into the SERPs again, not even with 'useful' sites.)
3) Google SERPs will automatically clean up once the scrapers/useless sites are gone.
Of course, as mentioned many times before, we have to ask whether this is in the best interest of the AS team (who have high revenue/profit targets, I believe). Removing scrapers will remove a good share of the revenue as well.
I just hope that GG or ASA are listening here as well.
| 6:28 am on Jun 3, 2005 (gmt 0)|
geez Mark, shouldn't they be taken out back and shot just to be sure?
| 6:36 am on Jun 3, 2005 (gmt 0)|
Well, well, the answer is - no, obviously. ;-)
But I am convinced that by raising the stakes, G can easily increase content quality almost overnight. Just think - if your future as a webmaster running your own sites (listed on Google SERPs) depends on whether you run a scraper site or not, would you do it? Would you *really* do it?
| 10:07 am on Jun 3, 2005 (gmt 0)|
|As to how to stop scrapers: It's simple - stop the money flow |
What if the scrapers evolve and find an alternate way to monetise? How about cloaking to redirect visitors to pr0n?
People getting to scrapers via SERPs is a SERPs issue, not an AdSense one. And if Google is working on a solution it will be a SERPs solution (despite GG's noises about feedback on the "Ads by Google" button).
But, this is all off topic.
| This 223 message thread spans 8 pages |