
Login to Read Full Article Script

     

Angonasec

12:54 pm on Feb 4, 2014 (gmt 0)

Flat-file, 1,000-article site pestered by bots, scrapers, and human content thieves.

Rather than a full-blown CMS, registration wall, or membership package, is there a secure script we could call on pages to insist visitors either log in or answer a simple non-CAPTCHA question to gain access to an article or two?

penders

3:01 pm on Feb 4, 2014 (gmt 0)

How many users do you have that need access?

human content thieves


That's a bit of a tricky one. Unless you have paid subscriptions, what would stop anyone from registering an account and stealing content? You could perhaps moderate registrations if numbers are low, but then you're back in the realm of a "registration wall, or membership package".

Simply checking for the presence of a cookie might block most "pestering" bots (including Google).
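
A minimal sketch in php, assuming you can add a little code at the top of each article page (the cookie name and entry page are invented, not an existing script):

// No cookie? Off to an entry page that sets one after your login
// or simple question. Cookie-less bots stop right here.
if (!isset($_COOKIE['reader'])) {
    header('Location: /welcome.php');
    exit;
}
// Cookie present: serve the article as normal.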

Angonasec

3:25 pm on Feb 4, 2014 (gmt 0)

Thank you for responding.

"How many users do you have that need access?"

Hard to tell whilst fighting back the pretend humans :)
Guessing around 1000 genuine human enquirers a day.

"human content thieves"

Yes, point taken.

I'm really looking to stop the bots; I should be able to monitor and handle the human thieves using htaccess.

So does that make the decision easier?

I'm not a coder, but have managed to build and run the site for over 15 years.

Wise advice and help appreciated :)

penders

7:47 pm on Feb 4, 2014 (gmt 0)

Are you using robots.txt? Or are these all bad bots? Do you still want to be indexed by search engines?

Angonasec

7:55 pm on Feb 4, 2014 (gmt 0)

Yes, I use robots.txt to Disallow all bots except those we whitelist. The bad bots all ignore it, of course.

The big 3 SEs love our site, so we let them in via robots.txt, whilst blocking many of their ancillary bots (image bots, preview bots, feed bots, etc.) in htaccess.
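
The sort of rule I mean (the bot names here are just examples, not our actual list):

# Forbid the ancillary bots; the main crawlers don't match these patterns
RewriteCond %{HTTP_USER_AGENT} (Googlebot-Image|Feedfetcher) [NC]
RewriteRule . - [F]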

Longer term, I'd be willing to let SE traffic go, and rely on people knowing where we are, but not just yet.

Angonasec

11:50 pm on Feb 7, 2014 (gmt 0)

*Cough*

lucy24

1:41 am on Feb 8, 2014 (gmt 0)

Thought. Scrapers-- including brainless humans using one of those download-the-whole-site utilities-- can very often be identified by timing. Now, it's possible for someone to load up an article, say "Oops, this isn't the one I wanted" and immediately backtrack to request another one. But if your site has thousands of articles, you should be able to set a timer. If they try to collect more than X articles in a minute, or Y articles in five minutes, put up a barrier. A "convince us you're human" type of interaction is probably better than a fixed time limit.
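
If php is what you already know, the timer could be as crude as this (a sketch only: the path and the limits are placeholders, and anything busier would want a database instead of flat files):

// Log this visitor's request times in a small flat file, keyed by IP
$file = '/tmp/hits-' . md5($_SERVER['REMOTE_ADDR']);
$now  = time();
$hits = file_exists($file) ? file($file, FILE_IGNORE_NEW_LINES) : array();
// Keep only requests from the last 60 seconds, then add this one
$hits = array_filter($hits, function ($t) use ($now) { return $now - $t < 60; });
$hits[] = $now;
file_put_contents($file, implode("\n", $hits));
if (count($hits) > 5) {
    header('HTTP/1.1 429 Too Many Requests');
    exit('More than five articles a minute looks like a scraper. Convince us you are human.');
}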

Also check whether they're arriving at each article via the appropriate entry page-- "click to read full article" or journal index or whatever you've got.

Both of those can be done with cookies. How you set the cookie is up to you. (One of the minor happinesses of my life was finding out that cookies don't require a whole separate "cookie language". Use whatever you already know how to do.)
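
In php the entry-page version might be no more than this (page and cookie names invented):

// On the index / "click to read full article" page:
setcookie('camein', '1', 0, '/');   // session cookie, dies with the browser

// At the top of each article page:
if (!isset($_COOKIE['camein'])) {
    header('Location: /');          // back to the front door
    exit;
}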

Angonasec

3:10 am on Feb 8, 2014 (gmt 0)

Thanks Lucy; here you are nudging me towards that cliff-edge beyond the warning sign (remember it?) which reads: "Beware: Geekdom and Nerdiness!"

Angonasec

4:38 am on Feb 8, 2014 (gmt 0)

Yikes! I peeked over the abyss...

"finding out that cookies don't require a whole separate 'cookie language'"

Translation:
You just need to be a Nerdiphone fluent in Geekspeak.

Set-Cookie: name2=jargon2;

lucy24

6:33 am on Feb 8, 2014 (gmt 0)

... for a given definition of "whatever you already know how to do", which is why my cookies are set in htaccess:

# CO flag sets a cookie: name:value:domain:lifetime (lifetime in minutes; 262800 is about six months)
RewriteRule ^silence/$ - [CO=silence:yes:.example.com:262800]


You can also set cookies with php or with javascript or an unknown number of other methods. And read them, of course; not much point to setting a cookie if you don't do anything with it :)
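
Setting that same cookie in php is one line (php counts the lifetime in seconds, where the htaccess CO flag counts minutes):

setcookie('silence', 'yes', time() + 262800 * 60, '/', '.example.com');

And reading it back: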

// Only show the long footer when the "silence" cookie is present
if (isset($_COOKIE["silence"]))
{ longFooter($pagename); }

In general: If I can do something in php, it is safe to say that anyone can do it.

Angonasec

7:05 am on Feb 8, 2014 (gmt 0)

Thanks for the clues Lucy: If only I'd known it was *that* easy. :)

Swanny007

10:30 pm on Feb 8, 2014 (gmt 0)

Longer term, I'd be willing to let SE traffic go, and rely on people knowing where we are, but not just yet.

For real? I would die without SE traffic. And blocking SEs will not stop scrapers or bots.

Personally I wouldn't punish regular visitors with CAPTCHA, etc. because of the bad guys. Punish the bad and not the good.

I'm sure my content gets scraped/copied every month, but as long as they're not copying my site entirely and I'm still outranking them in the search results, I'm winning in the end. If someone copies a large part of my site, they or their host will get a DMCA notice. One page or less and it's not even worth my time to go after them. I have more than 1,000 pages of original, quality content.

Angonasec

2:44 am on Feb 9, 2014 (gmt 0)

007:

"For real?"

Indeed, looking forward to it.

"I would die without SE traffic."

You don't have return visitors?

"Personally I wouldn't punish regular visitors with CAPTCHA, etc. because of the bad guys. Punish the bad and not the good."

You're acquainted with a "good" search engine?
We currently tolerate only three, and each of those regularly offends against decency, honour, and integrity, as well as our TOS.

"I'm sure my content gets scraped/copied every month but as long as they're not copying my site entirely and I'm still outranking them in the search results, then I'm winning in the end. If someone copies a large part of my site, they or their host will get a DMCA notice. One page or less and it's not even worth my time to go after them."

A tad short-sighted, if I may say so, Sir :)

"I have more than 1,000 pages of original, quality content."

Well done, look after it, and your legitimate visitors.
 
