Content scrapers on the rise

Massive scraping and hack attacks


grandma genie

11:13 pm on Apr 10, 2011 (gmt 0)



I know this topic has been discussed here on Webmasterworld for years. I have read the posts and am even more discouraged than ever. My site is hosted, so I am limited in what I can do. I use htaccess to block by IP, user agent and query string. But the attacks are coming in greater numbers. Today the content of my whole site was stolen. The scraper came in from one referer, IP address kept changing as well as user agent. There was nothing other than the referer that could be blocked, but the various IPs went from one directory to the next, downloading everything including images and text. I am assuming this was an attack from a Zombie botnet. These content scrapers probably have millions of zombie computers to choose from. They set up their software to rotate from one to the next, stealing as they go. The IPs were not from server farms. I know about speed traps and volume traps, but I don't think that would work in this case, because each IP came and just took one section, then the next IP came in and took another one. What is a webmaster to do who can't write scripting languages and is feeling overwhelmed?


11:36 pm on Apr 10, 2011 (gmt 0)



Do you see a lot of visits from Amazonaws.com in your logs?


12:13 am on Apr 11, 2011 (gmt 0)



"Close the shutters and bar the doors,
They have their own and they want what's yours ;-)"*

Seriously though ( and I say this as someone who reads the spiders and the robots and the apache threads here whenever there is a post ) how many times have you actually been scraped since you began ..and how many times have you actually been hacked , or had real determined attempts made ..to be saying that scraping and hacking are "on the rise" ?

Most of us see no more scraper ( or attempted scraper activity ) than we ever did ..and no more hack attempts than there ever were either..

Jumping at shadows or thinking everyone is after your goodies or waiting to write pwned on your front page or serve malware from your contact page is going to wear anyone out ..take precautions ..but don't panic or lose any sleep ..

No-one runs million machine botnets at a single site ..and certainly not to scrape it ..big nets are used to send spam, or phish ..much smaller nets of a few thousand ( or tens of thousands at most ) are used to DDOS and disrupt ..and their targets are the big dogs MS , Banks, or government sites ..not small sites.

Scrapers can get an unprotected site with just one program and rotate through proxies for each directory jump ..only needs one person and some "off the shelf" scraper app to do that ( there are even freeware ones )..doesn't need the Russian Mafia ( who have more lucrative things to do anyway ;-) best protection is to white list the good bots ( those that you want ) ..kick all others to the kerb ..and from time to time revise the lists ..
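A minimal sketch of the whitelist idea, assuming you can run a script against each request's user agent string. The names `ALLOWED_BOTS`, `BOT_HINTS` and `should_block` are made up for illustration, and the lists are examples only. User agents are easily faked, so treat this as a first filter, not as proof of identity.

```python
# Example whitelist of crawler tokens you want to let in (an assumption;
# revise it from time to time, as suggested above).
ALLOWED_BOTS = ("googlebot", "bingbot", "slurp")

# Substrings that usually mark automated clients (also only examples).
BOT_HINTS = ("bot", "crawl", "spider", "slurp", "httrack", "wget", "curl", "libwww")


def should_block(user_agent: str) -> bool:
    """Return True for self-declared bots that are not on the whitelist."""
    ua = user_agent.lower().strip()
    if not ua:
        return True  # an empty user agent is treated as a non-browser here
    if any(name in ua for name in ALLOWED_BOTS):
        return False  # whitelisted crawler
    return any(hint in ua for hint in BOT_HINTS)


if __name__ == "__main__":
    for agent in (
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0) Gecko/20100101 Firefox/4.0",
    ):
        print(should_block(agent), agent)
```

Anything that claims to be an ordinary browser passes this check, which is why it only helps against the lazier scrapers.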

But if you worry all the time about who might be visiting and what their hidden "evil intent" is ..it's like letting your Grandchildren's friends visit with them ..but following them around all the time with a can of mace in your hand in case they steal the spoons or try to molest you..or eat the cat :-)

Worrying, and treating all with suspicion, spoils the experience of having a website(s) and having it visited ( unless it stores masses of personal data and or credit card details or state secrets ) it won't be on the "we got to hack and scrape this list" that Igor and Manuel exchange notes upon, when they aren't counting the profits from their arms , drugs and people trafficking business that is ..IMHO ;-)

*doggerel mine ..made up on the spot ..;-)


12:19 am on Apr 11, 2011 (gmt 0)

Put some phone home stuff on your page and then keep an eye on your logs. You can often get on top of it before the SEs even know it's there.

Most scrapers are stupid. They'll leave an absolute URL in there just because they're too lazy to look for it.

And then break out the DMCA guns.

It's always worked for me.


12:21 am on Apr 11, 2011 (gmt 0)

My site is hosted, so I am limited in what I can do

My sites are hosted too grandma. Everybody's sites are hosted. It doesn't limit you in any sense.

Can you explain that further? We all need a host.

Unless you run your own server and know a little bit about DNS I mean.

grandma genie

2:44 am on Apr 11, 2011 (gmt 0)



I've blocked all the amazonaws IP ranges that I know of. I review my server logs daily and have done so since my site was hacked about two years ago. (That's when I started reading the logs. I didn't know what a log was before then.) My site has been online for about 10 years. My site is hosted by a typical hosting company. I have used htaccess to ban some hosting companies, like The Planet and others that have shown up in the logs trying to gain access to the admin section. But for the last two weeks, I have been inundated by a slew of IPs trying to gain access to the admin. I've taken a variety of steps to stop hack attempts. Those work quite well. But the content scraping is another story. Today was the first time I saw the same referer (asian) initiating a large number (25 to 50) of different IPs, all with a variety of user agents. Each one grabbed a different directory. This went on all day today. The only thing I could block was the referer. But that didn't stop them from downloading each directory, one by one. I have a large site with lots of original content and photos. As with all sites, stolen content is happening all the time. This is the first time they grabbed the whole site and there wasn't anything I could do to stop them. The IPs are not bots. They are from Comcast, Road Runner, Qwest, and other phone line owners. That is why I think they are zombie computers. Just regular folks whose computers have been compromised. How do you stop that? I can't find the stolen content on any particular site, so how do you complain? I'll find snippets here and there, usually on those horrible junk sites leading to nowhere. Nobody would go to all the trouble of grabbing my content unless it was lucrative. These are not teenage hackers. The referer site is a foreign p*rn site. If they are not doing it for money, then it is just plain harassment. The referer is involved with the products I sell. What they do with them is, in my opinion, hideously evil.
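The pattern described here (one referer, many IPs) can be pulled out of an access log with a short script. This is a rough sketch only. It assumes the Apache "combined" log format, and `referer_fanout`, the sample lines and the example referer are invented for illustration. A high count is not proof of scraping, since a popular link on another site produces exactly the same pattern.

```python
import re
from collections import defaultdict

# Apache "combined" log format (an assumption; adjust to what your host writes).
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def referer_fanout(lines, min_ips=3):
    """Return {referer: sorted distinct IPs} for referers used by >= min_ips IPs."""
    ips_by_referer = defaultdict(set)
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("referer") not in ("", "-"):
            ips_by_referer[match.group("referer")].add(match.group("ip"))
    return {
        referer: sorted(ips)
        for referer, ips in ips_by_referer.items()
        if len(ips) >= min_ips
    }


# Made-up sample lines using documentation-only IP ranges.
sample = [
    '192.0.2.1 - - [10/Apr/2011:10:00:00 +0000] "GET /a/ HTTP/1.1" 200 100 "http://referer.example/" "UA-1"',
    '198.51.100.2 - - [10/Apr/2011:10:05:00 +0000] "GET /b/ HTTP/1.1" 200 100 "http://referer.example/" "UA-2"',
    '203.0.113.3 - - [10/Apr/2011:10:09:00 +0000] "GET /c/ HTTP/1.1" 200 100 "http://referer.example/" "UA-3"',
    '192.0.2.9 - - [10/Apr/2011:10:10:00 +0000] "GET /a/ HTTP/1.1" 200 100 "-" "UA-4"',
]

if __name__ == "__main__":
    for referer, ips in referer_fanout(sample).items():
        print(referer, len(ips), ips)
```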


3:27 am on Apr 11, 2011 (gmt 0)



if you want to find out where your articles go and who took them put some poison pills in your text.

You can hide some stuff in plain sight, bots see it but humans don't, fun CSS tricks. I cloak that stuff out only when I know it's not Google, Yahoo, Bing, etc. so it's supposedly humans doing the crawling. However, if you don't have those skills, just throw in a gibberish word per page like "aardvarkapalooza" as a single word hidden from viewers in CSS but it'll show up all over the place on scrapers pages.

Now if you got any coding skills, add an integer version of an IP address to the end as well, so you put a code in your text like "zzxxyyqqzz-2130706433" which kind of looks like a product part number or something. I convert the IP to a single # because the scripts that grind your code up will use the periods in an IP as a break and spin the IP into 4 parts. I want to track these idiots down, so I make sure it survives in 1 part.

Once they scrape and it gets indexed you simply search for "zzxxyyqqzz", or whatever your unique code is that didn't exist in Google before, and VOILA! they pop up like radioactive tagged rats in a sewer.

The integer IP 2130706433 decodes to 127.0.0.1, simple math really, and PHP provides ip2long() and long2ip() functions [php.net] to speed you on your way.
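For anyone not using PHP, here is a small sketch of the same conversion in Python, built on the standard `ipaddress` module. The helper names `make_pill` and `decode_pill` are invented for this example, and the marker is just the sample string from the post. The arithmetic is 127 x 2^24 + 0 + 0 + 1 = 2130706433, which is why that number maps back to 127.0.0.1.

```python
import ipaddress

MARKER = "zzxxyyqqzz"  # sample marker from the post; pick your own unique string


def make_pill(visitor_ip: str, marker: str = MARKER) -> str:
    """Build 'marker-<integer IP>' for the IP that requested the page."""
    return f"{marker}-{int(ipaddress.IPv4Address(visitor_ip))}"


def decode_pill(pill: str) -> str:
    """Recover the dotted IP from a pill found on a scraper's page."""
    _, _, number = pill.rpartition("-")
    return str(ipaddress.IPv4Address(int(number)))


if __name__ == "__main__":
    print(make_pill("127.0.0.1"))
    print(decode_pill("zzxxyyqqzz-2130706433"))
```

This only covers IPv4 addresses. How the pill gets into the page, and how it is hidden with CSS, is left to whatever generates your pages.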

Then I scan my logs for that IP, get the user agent as well.
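The log scan could look something like the sketch below. It assumes the Apache "combined" log format, which may not match what your host writes, and `hits_for_ip` plus the sample lines are made up for illustration.

```python
import re

# Apache "combined" log format (an assumption; adjust the pattern if needed).
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def hits_for_ip(lines, ip):
    """Return (time, request, user agent) for every log line from the given IP."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("ip") == ip:
            hits.append(
                (match.group("time"), match.group("request"), match.group("agent"))
            )
    return hits


# Made-up sample lines; in practice, pass an open log file instead.
sample = [
    '127.0.0.1 - - [10/Apr/2011:23:13:00 +0000] "GET /page.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '10.0.0.5 - - [10/Apr/2011:23:14:00 +0000] "GET /other.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for hit in hits_for_ip(sample, "127.0.0.1"):
        print(hit)
```

Since a file object yields lines, something like `hits_for_ip(open("access.log"), ip)` should work on a real log, where `access.log` stands in for wherever your host keeps it.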

Now I have full trip details proving the idiot scraper scraped my site, hello, ISP, you have an AUP violator, here's my log files, here's the poison pills on his page, hurt him please.


9:14 am on Apr 11, 2011 (gmt 0)

Now I have full trip details proving the idiot scraper scraped my site, hello, ISP, you have an AUP violator, here's my log files, here's the poison pills on his page, hurt him please.

And they will grandma. They'll hurt him. The page or the entire site will go buh bye and sometimes it's just that fast. Bill knows what he's talking about. Better than most actually since he actually studies this.

grandma genie

3:54 pm on Apr 11, 2011 (gmt 0)



Hi Bill and everyone,

Guess what! I got a little braver today and looked for contact information for the adult content site. The site is in Russia. I emailed the owner today and asked him to remove the link. He wrote back right away and showed me what the link said. You need to log in to see his site (I didn't want to do that.) The site somehow combines adult content with stuffed animals. Ugh. But the reference that was bringing all kinds of visitors to my site seems innocent enough. It just indicated that I have a site that sells plushies. So that is why I was getting all those hits yesterday. People who were looking for stuffed animals (hopefully not to have s*x with.) I wrote back and told him not to worry about it. After all, I did have an upswing in sales yesterday. Maybe that was why.

So, it appears no one was trying to steal my site. There was one other troubling issue, however. In investigating the site and its owner, I discovered he also has a love for programming. He has several websites. One of them offers visitors software used to download whole websites. Hmmmmm.

-- Grandma

grandma genie

4:11 pm on Apr 11, 2011 (gmt 0)



One more question:
Along with the initial referer hit, I am seeing this:
GET /default_html/favicon.ico HTTP/1.1"
Does anyone know what that is? I tried Googling it, but didn't find anything.
-- Grandma


4:17 pm on Apr 11, 2011 (gmt 0)



plushies is a name given to a western ( mainly practised in the USA ) sub culture of those who like dressing up as furry animals and having anonymous "relations" with others dressed the same ..there was even a CSI LA episode about it..( amused me immensely ..takes all sorts etc ..;-) this reinforces what I thought about your suspicions ..your "innocent"
They are from Comcast, Road Runner, Qwest, and other phone line owners

Just regular folks

are in fact US citizens with a taste for what they think he has, and who think your site may be about the same thing ;-) maybe some of his members and their friends were downloading your place for offline browsing for ideas for costumes ?..btw..the USA harbors and creates and hosts far more pron of all sorts than the rest of the world..but frequently disguises it to make it seem more eastern European or exotic amateur material..gets more visitors that way.


4:21 pm on Apr 11, 2011 (gmt 0)



re ..your last question ..each time that a page is called for by a browser ..the browser looks for a "favicon" ( the image that lives up in the address bar )..if you don't have one it will result in a huge number of 404's in your logs..

grandma genie

4:42 pm on Apr 11, 2011 (gmt 0)



Hi Leo,
Well, you learn something new every day. Maybe some of these lost souls will find salvation by visiting my site.

I do have a favicon, but I don't have a default_html, so I am still seeing lots of 404s from those hits.

Poor ole USA. So much for being a "light on a hill."
-- Grandma


4:56 pm on Apr 11, 2011 (gmt 0)



Don't worry about the "default_html"..404's in themselves are not necessarily ( someone else will disagree with me here no doubt ;-) a bad thing ..as long as you know why you are seeing them ..you can't for example prevent someone looking at your site for grandmageniefromwebmasterworld.html ..but their "looking" would still throw a 404..

Just because the visitor or a piece of "offline browser" software is looking for something ..doesn't mean you have to provide it.

btw ..if you want an offline browser of your own ? look up "httrack" ( not a typo ) ..works on windows or linux ( unless the site is blocking it ) ..installs and configures easily ..and it is entirely free open source with no hidden nasties ;-)

If someone steals your stuff you can often use an offline browser to rip the site of the thief and then you have even more proof that they took your things ..images are harder to use G to search for "dupes" for ..searching for scraped text is really easy by comparison..thieves often forget to lock their own doors and bar their own shutters ..


5:43 pm on Apr 11, 2011 (gmt 0)

So much for being a "light on a hill."

Nah. You're still the light on the hill. All you need to do is change the bulb. Make it brighter baby.

grandma genie

6:12 pm on Apr 11, 2011 (gmt 0)



Amen, wyweb!

By the way, Leo. Thank you for the httrack info. Very interesting.

I usually take a snippet of text from my site that is particular to my site, type it into Google in quotes, and can find any other site that is using parts of mine. Most of the time I find the text on those index type pages that have many links to many sites, with the Google ads all over the place. I have not found any site that is a mirror of mine, yet, thank God.

Many of the images on my site can be found in Google Images, where they find their way to all types of forums as hotlinks. I have that blocked in htaccess.

The only other issue of late is the typical hack attempts trying to log in to the admin through the categories directory. (admin/categories.php/login.php) Those types of attacks are also blocked. They get the 403s. Thanks to jdMorgan for his help. Thanks also to Bill; lots of valuable info from him. These have no referer, a variety of IPs and a variety of user agents. Lots and lots of those within the last two weeks.

-- Grandma


11:54 pm on Apr 12, 2011 (gmt 0)



Our site is scraped by our own affiliates. Many have been reported and warned. Most comply and remove it. It seems the majority of them use automated scraper plugins that work on WordPress.

