Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Rewrote content 24 hours ago - already scraped. How do I block these bots?

         

SerpsGuy

11:15 pm on Oct 8, 2014 (gmt 0)

10+ Year Member



The bots that are scraping me put my content, sentence by sentence, on total spam garbage sites. Examples:

EDIT: apparently I cannot post any of these spammers' URLs. I think it's wonderful that we are protecting the scum on the internet. Here is a screenshot of the trash my content is being stolen and pasted onto...

A lot of the sites load to a blank page, but viewing the cached page shows a ton of content pasted into a huge paragraph.

One gives an error that says:
Warning: curl_setopt(): CURLOPT_FOLLOWLOCATION cannot be activated when safe_mode is enabled or an open_basedir is set in /home/domicioneto/www/viewer/aunw9067.php on line 44

Why do these garbage sites scrape content like this, and how is it that there is no viable way to prevent them from doing it?

Additionally, why is it that every time someone scrapes my content, Google immediately credits the other sites with my work and puts me in the sandbox? This is seriously destroying my sanity, guys.

I am at the point where I am willing to pay to prevent this from happening. Are there any services that can block these scrapers? How do they even know when I publish my content? Would disabling my RSS feed help? It's already set to summary only and includes links to my site.


[edited by: brotherhood_of_LAN at 11:17 pm (utc) on Oct 8, 2014]

[edited by: brotherhood_of_LAN at 12:23 am (utc) on Oct 9, 2014]
[edit reason] see post below by me [/edit]

seoskunk

12:20 am on Oct 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hey, I am assuming you're using Webmaster Tools to get the page indexed as you create and publish it.

That said, the only real way to protect your content is to cloak it: use AJAX to load the source page for users, and then load the page directly for Googlebot. I wrote a load of code doing this, but in the end threw the idea out the window, as it prevents accessibility and the natural progression of the net.

My advice is to keep ploughing away at content and ignore the scrapers. Try to figure out why your site scores so low in Google's trust that it's outranked so easily.

The curl warning is interesting: they are scraping directly from that page and following links. So you could have some fun. In your .htaccess, instead of banning them, write a script that spits out tons of random code and eats the scraper's bandwidth, then redirect all visits from that IP to the script. You could use a spidertrap to do this automatically.
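A rough sketch of that redirect, assuming mod_rewrite is enabled (the IP address and the tarpit script name are just placeholders):

```apache
# Send one scraper IP to a local tarpit script instead of the real page.
# 203.0.113.45 and /tarpit.php are placeholders - substitute your own.
RewriteEngine On
RewriteCond %{REMOTE_ADDR} ^203\.0\.113\.45$
RewriteCond %{REQUEST_URI} !^/tarpit\.php$
RewriteRule ^ /tarpit.php [L]
```

The second RewriteCond stops the rule from redirecting requests for the tarpit script back to itself in a loop.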

brotherhood of LAN

12:25 am on Oct 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



SerpsGuy, the "outing" of websites in order to get them punished isn't exclusive to linking to them. Having a screenshot of the text is as good as providing the URL, which FWIW would also "out" your website since the content is the same.

seoskunk's suggestion regarding the curl warning sounds good.

seoskunk

12:33 am on Oct 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, sorry, forgot the big one: essentially these are proxies, so reverse-DNS Googlebot in .htaccess. Something like:

<FilesMatch "\.(s?html?|php[45]?)$">
SetEnvIfNoCase User-Agent "!(Googlebot|msnbot|Teoma)" notRDNSbot
#
Order Deny,Allow
Deny from all
Allow from env=notRDNSbot
Allow from googlebot.com
Allow from search.live.com
Allow from ask.com
#
</FilesMatch>

Credit jdMorgan [webmasterworld.com...]

As Googlebot crawls the proxy, all it gets is an error page on the scraper's site.

not2easy

1:36 am on Oct 9, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Run a whois search on the domain, find the host, and submit a DMCA notice there. Then submit one with Google. The host (if they are a host with "safe harbor" status) has to remove your content, and a valid DMCA claim often shuts down the site.

A lot of people just file a DMCA with Google, but that does not stop the copying; it only keeps that site from showing up in Google, and Google receives millions of DMCA notices every month, so it can take time. Other search engines can and do still crawl and show the site if you don't contact the host.
For more details, read here: [webmasterworld.com...] for another discussion about the same kind of problem a few months ago.

SerpsGuy

6:41 pm on Oct 9, 2014 (gmt 0)

10+ Year Member



@seoskunk - does that code really block all spoofed user agents? I want to use it, but I already have this in my .htaccess

<FilesMatch "(\.(bak|config|dist|fla|inc|ini|log|psd|sh|sql|swp)|~)$">
Order allow,deny
Deny from all
Satisfy All
</FilesMatch>


I think I can only have one "Order allow,deny" set in here. They have to be integrated and I have no idea how to do that.

SerpsGuy

6:45 pm on Oct 9, 2014 (gmt 0)

10+ Year Member



I read online about a Bot Blocker script. Essentially, it involves creating a directory of pages on my site that are inaccessible to users, then placing a hidden link someplace on my site to those pages or that directory.

Then I nofollow and noindex that directory, and set up a script of some kind that takes the IP address of everyone / every bot that accesses that information and places it into a deny IP list.

It seems to make good sense, because if I block that page and it is accessed anyway, then whoever that visitor is, I do not want them on my website. Has anyone here tried that?

One more idea I read about is to just block entire countries. China specifically, since that is where most of my spam is hosted. I think that would involve blocking whole blocks of IP addresses. Again, if anyone has tried this idea (or the other), please share your opinion.

aristotle

6:55 pm on Oct 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Then I nofollow and noindex that directory, and set up a script of some kind that takes the IP address of everyone / every bot that accesses that information and places it into a deny IP list.

Just offhand, that sounds risky, in that you could end up blocking legitimate bots like Googlebot, Bingbot, etc.

dstiles

7:05 pm on Oct 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've said this before in here and been ignored every time, but one more effort:

Join WebmasterWorld's "Search Engine Spider and User Agent Identification" and read the past couple of years' postings.

seoskunk

7:12 pm on Oct 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think I can only have one "Order allow,deny" set in here.


Because you are referring to different files (php and html), you should be able to add the second "Order allow,deny" without a problem, but a good place to ask about this would be the Apache forum on here (some real experts there).
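Something like this should work, untested and assuming the same file types as your existing block (Order is scoped to the container it appears in, so the two blocks don't clash):

```apache
# Existing block: protect backup/config files (unchanged).
<FilesMatch "(\.(bak|config|dist|fla|inc|ini|log|psd|sh|sql|swp)|~)$">
Order allow,deny
Deny from all
Satisfy All
</FilesMatch>

# New block: different file pattern, so it gets its own Order directive.
<FilesMatch "\.(s?html?|php[45]?)$">
SetEnvIfNoCase User-Agent "!(Googlebot|msnbot|Teoma)" notRDNSbot
Order Deny,Allow
Deny from all
Allow from env=notRDNSbot
Allow from googlebot.com
Allow from search.live.com
Allow from ask.com
</FilesMatch>
```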

The script you are talking about sounds like a "spidertrap" script; there are loads of these available and you can make your own. The most effective way to prevent good bots being banned is to block the trap pages in robots.txt. I wouldn't rely on rel=nofollow for this. However, that does then give away the spider trap to anyone who looks at robots.txt. So, two alternatives:

1. Cloak robots.txt and show the real one to search engines. There should still be some good info on this on WebmasterWorld, as they did this themselves.

2. Cloak the links in PHP so they only show when the user agent isn't one of the search engine bots.

Both these solutions require RDNS of Googlebot and any others.
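For reference, the robots.txt side of the trap is just a Disallow line so well-behaved bots never touch it (the directory name here is a placeholder):

```
User-agent: *
Disallow: /trap/
```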

Oh you might want to sign up to this site as well [projecthoneypot.org...]

not2easy

7:33 pm on Oct 9, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Do not paste things into your .htaccess file without understanding what they do; the consequences can be far from what you had in mind.

Blocking specific IPs and UAs can be a useful part of defending your content if you know what the damage is and where it is coming from. There is no paste-and-forget solution that protects a site from everything. As dstiles said,
Join WebmasterWorld's "Search Engine Spider and User Agent Identification" and read the past couple of years' postings.
It is all in there, but you need to know what your requirements are, to deal with them. Examine your access logs and see who's doing what on your site.

londrum

7:41 pm on Oct 9, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



there used to be a script in the php library (on this site) that might help. it automatically blocks the IP of anything that grabs too many pages too quickly. presumably the scrapers would fall foul of that, because they aren't going to wait a second or more between page grabs like a human would.
if i remember it correctly, it let you adjust the time and whitelist search engine bots.

have a search for the open-source "bad behavior" spam script on the web as well, because i think it does a similar thing
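if you'd rather do the rate limiting at the server level, apache's mod_evasive module does much the same thing. something like this (the numbers are just example values, tune them for your traffic):

```apache
# block an IP that requests the same page more than 5 times in 1 second,
# or any 50 pages in 1 second; blocks last 60 seconds.
<IfModule mod_evasive20.c>
    DOSHashTableSize    3097
    DOSPageCount        5
    DOSPageInterval     1
    DOSSiteCount        50
    DOSSiteInterval     1
    DOSBlockingPeriod   60
    DOSWhitelist        66.249.64.*
</IfModule>
```

the DOSWhitelist line is there so googlebot (66.249.64.* is one of its ranges) doesn't get caught by accident.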

superclown2

3:38 pm on Oct 10, 2014 (gmt 0)



I used to have a script that led bots on a merry dance, reading auto-generated files that took ages to load. In the end I got bored with it and blocked all of India, China, Turkey, Africa, and South America in .htaccess. That solved the problem for me.
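The .htaccess side of a country block is just a list of Deny lines with IP ranges (the ranges below are documentation placeholders; pull real country allocations from the regional registries or a GeoIP list):

```apache
Order Allow,Deny
Allow from all
# Placeholder ranges - replace with the real allocations you want to block.
Deny from 203.0.113.0/24
Deny from 198.51.100.0/24
```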

DerekT

8:50 pm on Oct 14, 2014 (gmt 0)

10+ Year Member



Here is an old thread that I think could be of use. Obviously you would have to update the IP ranges, etc., but it should be a good starting place.

[webmasterworld.com ]

martaay

1:31 pm on Oct 15, 2014 (gmt 0)

10+ Year Member



- Log all IPs that access your site and build yourself a console to organise the data.

- Send yourself emails when IPs breach hourly/daily/weekly limits; scrapers will stand out quite clearly compared to normal users.

- Either ban the IP or the entire CIDR. I find banning the entire CIDR helps prevent repeat attacks effectively, as scrapers will often just change IP automatically when blocked. I normally vary this by which country is performing the abuse: if it's my home country I'll just ban the individual IP unless it's a particularly bad scrape; if it's somewhere like Russia or China, ban the entire CIDR.

- Google/find the Tor network IP status page and ban them all (updating the list regularly); it will save you a great deal of potential scraping hell.

This is what we employ at our company, which runs a popular business directory where scraping is a daily and massive problem. The above takes a great deal of effort, and you will have to be on hand 24/7 to stop potential scrapers.