
PHP Server Side Scripting Forum

Login to Read Full Article Script
Angonasec
msg:4642347 - 12:54 pm on Feb 4, 2014 (gmt 0)

Flat-file site with 1,000 articles, pestered by bots, scrapers, and human content thieves.

Rather than a full-blown CMS, a registration wall, or a membership package, is there a secure script we could use, called on each page, to insist that visitors either log in or answer a simple non-CAPTCHA question to gain access to an article or two?

 

penders
msg:4642367 - 3:01 pm on Feb 4, 2014 (gmt 0)

How many users do you have that need access?

"human content thieves"

That's a bit of a tricky one. Unless you have paid subscriptions, what would stop anyone from registering an account and stealing content? You could perhaps moderate registration if numbers are low, but then you're back in the realm of a "registration wall, or membership package".

Simply checking for the presence of a cookie might block most "pestering" bots (including Google).
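
A minimal sketch of that idea in PHP, assuming an invented cookie name ("human") and a made-up question; none of these names come from the site itself:

<?php
// article-gate.php -- a sketch, not production code. Include at the
// top of each article page. No "human" cookie? Show a simple
// non-CAPTCHA question; a correct answer sets the cookie for a week.
if (!isset($_COOKIE['human'])) {
    $answer = isset($_POST['answer']) ? trim($_POST['answer']) : '';
    if (strcasecmp($answer, 'blue') === 0) {         // the "simple question"
        setcookie('human', 'yes', time() + 7 * 24 * 3600, '/');
        header('Location: ' . $_SERVER['REQUEST_URI']); // reload with cookie
        exit;
    }
    // No cookie and no correct answer: show the challenge instead.
    echo '<form method="post">',
         'What colour is a clear daytime sky? ',
         '<input name="answer"> <input type="submit" value="Read article">',
         '</form>';
    exit;
}
// Cookie present: fall through and serve the article as normal.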

Angonasec
msg:4642374 - 3:25 pm on Feb 4, 2014 (gmt 0)

Thank you for responding.

"How many users do you have that need access?"

Hard to tell whilst fighting back the pretend humans :)
Guessing around 1000 genuine human enquirers a day.

"human content thieves"

Yes, point taken.

I'm really looking to stop the bots; I should be able to monitor and handle human thieves using htaccess.

So does that make the decision easier?

I'm not a coder, but have managed to build and run the site for over 15 years.

Wise advice and help appreciated :)

penders
msg:4642417 - 7:47 pm on Feb 4, 2014 (gmt 0)

Are you using robots.txt? Or are these all bad bots that ignore it? Do you still want to be indexed by search engines?

Angonasec
msg:4642418 - 7:55 pm on Feb 4, 2014 (gmt 0)

Yes, I use robots.txt to Disallow all bots except those we whitelist. All the bad bots ignore it, of course.

The big three SEs love our site, so we let them in via robots.txt, while using htaccess to block many of their ancillary bots (image bots, preview bots, feed bots, etc.).

Longer term, I'd be willing to let SE traffic go, and rely on people knowing where we are, but not just yet.

Angonasec
msg:4643405 - 11:50 pm on Feb 7, 2014 (gmt 0)

*Cough*

lucy24
msg:4643418 - 1:41 am on Feb 8, 2014 (gmt 0)

Thought. Scrapers-- including brainless humans using one of those download-the-whole-site utilities-- can very often be identified by timing. Now, it's possible for someone to load up an article, say "Oops, this isn't the one I wanted" and immediately backtrack to request another one. But if your site has thousands of articles, you should be able to set a timer. If they try to collect more than X articles in a minute, or Y articles in five minutes, put up a barrier. A "convince us you're human" type of interaction is probably better than a fixed time limit.

Also check whether they're arriving at each article via the appropriate entry page-- "click to read full article" or journal index or whatever you've got.

Both of those can be done with cookies. How you set the cookie is up to you. (One of the minor happinesses of my life was finding out that cookies don't require a whole separate "cookie language". Use whatever you already know how to do.)
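
For what it's worth, both checks can be sketched in a few lines of PHP. The cookie names, the five-per-minute threshold, and the wording below are illustrative assumptions, not anything lucy24 prescribed:

<?php
// article-guard.php -- a sketch of both checks. Include it before
// rendering an article.
//
// Entry-page check: the article index sets a session cookie first,
// e.g. on index.php:  setcookie('entered', '1', 0, '/');
if (!isset($_COOKIE['entered'])) {
    exit('Please start from the article index.');
}

// Timing check: a "reads" cookie of the form "count:minute" records
// how many articles this visitor requested in the current minute.
$limit  = 5;                                // X per minute (assumed)
$minute = (int) floor(time() / 60);
$count  = 0;

if (isset($_COOKIE['reads']) && strpos($_COOKIE['reads'], ':') !== false) {
    list($c, $m) = explode(':', $_COOKIE['reads'], 2);
    if ((int) $m === $minute) {
        $count = (int) $c;                  // same one-minute window
    }
}

if ($count >= $limit) {
    // Better, as suggested above: put a "convince us you're human"
    // interaction here rather than a flat refusal.
    header('HTTP/1.1 429 Too Many Requests');
    exit('Too many articles too quickly.');
}

setcookie('reads', ($count + 1) . ':' . $minute, time() + 300, '/');
// ...fall through and render the article...

A bot that simply refuses to keep cookies sails straight past both checks, of course, so this pairs naturally with the cookie-presence test mentioned earlier in the thread: no cookie at all can itself be treated as grounds for the barrier.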

Angonasec
msg:4643422 - 3:10 am on Feb 8, 2014 (gmt 0)

Thanks Lucy; here you are nudging me towards that cliff-edge beyond the warning sign (remember it?) which reads: "Beware: Geekdom and Nerdiness!"

Angonasec
msg:4643423 - 4:38 am on Feb 8, 2014 (gmt 0)

Yikes! I peeked over the abyss...

"finding out that cookies don't require a whole separate 'cookie language'"

Translation:
You just need to be a Nerdiphone fluent in Geekspeak.

Set-Cookie: name2=jargon2;

lucy24
msg:4643433 - 6:33 am on Feb 8, 2014 (gmt 0)

... for a given definition of "whatever you already know how to do", which is why my cookies are set in htaccess:

# Visiting /silence/ sets a "silence=yes" cookie for .example.com,
# valid for 262800 minutes (about six months); "-" means the URL
# itself is left untouched.
RewriteRule ^silence/$ - [CO=silence:yes:.example.com:262800]

You can also set cookies with php or with javascript or an unknown number of other methods. And read them, of course; not much point to setting a cookie if you don't do anything with it :)

// Avoid an "undefined index" notice when the cookie isn't set:
if (!empty($_COOKIE["silence"]))
{ longFooter($pagename); }
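
The same cookie could be set from PHP instead of htaccess; a minimal sketch (note the CO= lifetime above is in minutes, while setcookie() takes a Unix timestamp):

<?php
// Rough PHP counterpart of the CO= rule above: same name, value,
// and domain, with 262800 minutes converted to seconds.
setcookie('silence', 'yes', time() + 262800 * 60, '/', '.example.com');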

In general: If I can do something in php, it is safe to say that anyone can do it.

Angonasec
msg:4643441 - 7:05 am on Feb 8, 2014 (gmt 0)

Thanks for the clues Lucy: If only I'd known it was *that* easy. :)

Swanny007
msg:4643578 - 10:30 pm on Feb 8, 2014 (gmt 0)

"Longer term, I'd be willing to let SE traffic go, and rely on people knowing where we are, but not just yet."

For real? I would die without SE traffic. And blocking SEs will not stop scrapers or bots.

Personally I wouldn't punish regular visitors with CAPTCHA, etc. because of the bad guys. Punish the bad and not the good.

I'm sure my content gets scraped/copied every month but as long as they're not copying my site entirely and I'm still outranking them in the search results, then I'm winning in the end. If someone copies a large part of my site, they or their host will get a DMCA notice. One page or less and it's not even worth my time to go after them. I have more than 1,000 pages of original, quality content.

Angonasec
msg:4643639 - 2:44 am on Feb 9, 2014 (gmt 0)

007:

"For real?"

Indeed, looking forward to it.

"I would die without SE traffic."

You don't have return visitors?

"Personally I wouldn't punish regular visitors with CAPTCHA, etc. because of the bad guys. Punish the bad and not the good."

You're acquainted with a "good" search engine?
We currently tolerate only three, and each of those regularly offends decency, honour, and integrity, as well as our TOS.

"I'm sure my content gets scraped/copied every month but as long as they're not copying my site entirely and I'm still outranking them in the search results, then I'm winning in the end. If someone copies a large part of my site, they or their host will get a DMCA notice. One page or less and it's not even worth my time to go after them."

A tad short-sighted, if I may say so Sir :)

"I have more than 1,000 pages of original, quality content."

Well done, look after it, and your legitimate visitors.
