
Login to Read Full Article Script

     
12:54 pm on Feb 4, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 13, 2003
posts:693
votes: 0


Flat file 1000 article site pestered by bots, scrapers, and human content thieves.

Rather than a full-blown CMS, a registration wall, or a membership package, is there a secure script we could use and call on pages to insist visitors either log in, or answer a simple non-captcha question, to gain access to an article or two?
3:01 pm on Feb 4, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month

joined:July 3, 2006
posts: 3123
votes: 0


How many users do you have that need access?

"human content thieves"


That's a bit of a tricky one. Unless you have paid subscriptions, what would stop anyone from registering an account and stealing content? You could perhaps moderate registration if numbers are low, but you're now in the realms of "registration wall, or membership package".

Simply checking for the presence of a cookie might block most "pestering" bots (including Google).
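
For illustration only, a rough sketch of that kind of cookie gate at the top of a flat-file article page, in PHP. The cookie name "is_human", the 30-day lifetime, and the question are all invented for the example; this is not a tested script.

<?php
// If the visitor has no "is_human" cookie, ask a simple question before serving the article.
if (!isset($_COOKIE['is_human'])) {
    if (isset($_POST['answer']) && strtolower(trim($_POST['answer'])) === 'blue') {
        // Correct answer: remember this visitor for 30 days, then fall through to the article.
        setcookie('is_human', 'yes', time() + 30 * 24 * 3600, '/');
    } else {
        // No cookie and no correct answer yet: show the question instead of the content.
        echo '<form method="post">What colour is a clear daytime sky? ';
        echo '<input name="answer"> <input type="submit" value="Read article"></form>';
        exit;
    }
}
// ... article content follows here ...

Anything that ignores cookies and never answers the question, which covers most dumb bots, never gets past the form.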
3:25 pm on Feb 4, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 13, 2003
posts:693
votes: 0


Thank you for responding.

"How many users do you have that need access?"

Hard to tell whilst fighting back the pretend humans :)
Guessing around 1000 genuine human enquirers a day.

"human content thieves"

Yes, point taken.

I'm really looking to stop the bots; I should be able to monitor and handle human thieves using htaccess.

So does that make the decision easier?

I'm not a coder, but have managed to build and run the site for over 15 years.

Wise advice and help appreciated :)
7:47 pm on Feb 4, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month

joined:July 3, 2006
posts: 3123
votes: 0


Are you using robots.txt? Or are all these bad bots? Are you still wanting to be indexed by search engines?
7:55 pm on Feb 4, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 13, 2003
posts:693
votes: 0


Yes, I use robots.txt to Disallow all bots except those we whitelist. All bad bots ignore it, of course.

The big 3 SEs love our site, so we let them in via robots.txt, whilst blocking many of their ancillary bots (image bots, preview bots, feed bots, etc.) in htaccess.

Longer term, I'd be willing to let SE traffic go, and rely on people knowing where we are, but not just yet.
11:50 pm on Feb 7, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 13, 2003
posts:693
votes: 0


*Cough*
1:41 am on Feb 8, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time, Top Contributor of the Month

joined:Apr 9, 2011
posts:12693
votes: 244


Thought. Scrapers-- including brainless humans using one of those download-the-whole-site utilities-- can very often be identified by timing. Now, it's possible for someone to load up an article, say "Oops, this isn't the one I wanted" and immediately backtrack to request another one. But if your site has thousands of articles, you should be able to set a timer. If they try to collect more than X articles in a minute, or Y articles in five minutes, put up a barrier. A "convince us you're human" type of interaction is probably better than a fixed time limit.
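
To make that concrete, here is a very rough sketch of such a timer using a PHP session. The limit of 10 articles per 60 seconds and the "/are-you-human.php" page name are invented for the example.

<?php
// Rough sketch: count this visitor's article requests in a PHP session.
session_start();
$now = time();
$times = isset($_SESSION['article_times']) ? $_SESSION['article_times'] : array();
// Keep only the requests from the last 60 seconds, then record this one.
$times = array_filter($times, function ($t) use ($now) { return $now - $t < 60; });
$times[] = $now;
$_SESSION['article_times'] = array_values($times);
if (count($times) > 10) {
    // More than 10 articles in a minute: send them to a "convince us you're human" page.
    header('Location: /are-you-human.php');
    exit;
}
// ... otherwise serve the article as usual ...

One caveat: a bot that refuses cookies gets a fresh session on every request and never trips the counter, so this works best combined with a plain cookie-presence check.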

Also check whether they're arriving at each article via the appropriate entry page-- "click to read full article" or journal index or whatever you've got.

Both of those can be done with cookies. How you set the cookie is up to you. (One of the minor happinesses of my life was finding out that cookies don't require a whole separate "cookie language". Use whatever you already know how to do.)
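
If you would rather set and check that "came via the entry page" cookie in PHP than in htaccess, a minimal sketch might look like this (the cookie name "via_index" and the one-hour lifetime are invented for the example).

On the index or entry page:

<?php
// Mark the visitor as having come through the front door (must run before any HTML output).
setcookie('via_index', '1', time() + 3600, '/');

On each article page:

<?php
// No entry-page cookie usually means a direct grab, so bounce the request to the index.
if (!isset($_COOKIE['via_index'])) {
    header('Location: /index.php');
    exit;
}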
3:10 am on Feb 8, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 13, 2003
posts:693
votes: 0


Thanks Lucy; here you are nudging me towards that cliff-edge beyond the warning sign (remember it?) which reads: "Beware: Geekdom and Nerdiness!"
4:38 am on Feb 8, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 13, 2003
posts:693
votes: 0


Yikes! I peeked over the abyss...

"finding out that cookies don't require a whole separate 'cookie language'"

Translation:
You just need to be a Nerdiphone fluent in Geekspeak.

Set-Cookie: name2=jargon2;
6:33 am on Feb 8, 2014 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time, Top Contributor of the Month

joined:Apr 9, 2011
posts:12693
votes: 244


... for a given definition of "whatever you already know how to do", which is why my cookies are set in htaccess:

RewriteRule ^silence/$ - [CO=silence:yes:.example.com:262800]


You can also set cookies with PHP or with JavaScript or an unknown number of other methods. And read them, of course; not much point in setting a cookie if you don't do anything with it :)

// Only run this when the "silence" cookie has actually been set:
if (isset($_COOKIE["silence"]))
{ longFooter($pagename); }

In general: If I can do something in PHP, it is safe to say that anyone can do it.
7:05 am on Feb 8, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 13, 2003
posts:693
votes: 0


Thanks for the clues Lucy: If only I'd known it was *that* easy. :)
10:30 pm on Feb 8, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 5+ Year Member

joined:May 31, 2006
posts:1207
votes: 7


Longer term, I'd be willing to let SE traffic go, and rely on people knowing where we are, but not just yet.

For real? I would die without SE traffic. And blocking SEs will not stop scrapers or bots.

Personally I wouldn't punish regular visitors with CAPTCHA, etc. because of the bad guys. Punish the bad and not the good.

I'm sure my content gets scraped/copied every month but as long as they're not copying my site entirely and I'm still outranking them in the search results, then I'm winning in the end. If someone copies a large part of my site, they or their host will get a DMCA notice. One page or less and it's not even worth my time to go after them. I have more than 1,000 pages of original, quality content.
2:44 am on Feb 9, 2014 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 13, 2003
posts:693
votes: 0


007:

"For real?"

Indeed, looking forward to it.

"I would die without SE traffic."

You don't have return visitors?

"Personally I wouldn't punish regular visitors with CAPTCHA, etc. because of the bad guys. Punish the bad and not the good."

You're acquainted with a "good" search engine?
We currently only tolerate three, and each of those regularly offends decency, honour, and integrity, as well as our TOS.

"I'm sure my content gets scraped/copied every month but as long as they're not copying my site entirely and I'm still outranking them in the search results, then I'm winning in the end. If someone copies a large part of my site, they or their host will get a DMCA notice. One page or less and it's not even worth my time to go after them."

A tad short-sighted, if I may say so, Sir :)

"I have more than 1,000 pages of original, quality content."

Well done, look after it, and your legitimate visitors.
 
