Forum Moderators: phranque


VERY frequent user - htaccess?

4 seconds per page


Bradley

11:28 pm on May 26, 2002 (gmt 0)

10+ Year Member




BACKGROUND
My directory/search engine has 100% static pages (no CGI-generated pages), and the links we provide go directly to the external sites (we don't have CGI counters on the links).

PROBLEM
I have a hunch people are using programs to grab all the URLs contained within my search engine. The hard work I put into finding websites is being devoured by my competitors, who are probably scouring the links off my site to add to their own search engines.

EXAMPLE
This IP address, 64.90.185.82.nyinternet.net, is hitting our site every 2 seconds and grabbing just HTML pages (no .js or .gif files are being requested). I think it hit every page on our site within 10 minutes; I'm almost sure of that.

NEED HELP - RESOLUTION?
Could I use .htaccess to prevent competitors' programs from grabbing all of my pages? How can I block certain programs from hitting my site while making sure Google and the other major SEs can still visit successfully? Should I develop something that blocks an IP address that hits "X" number of pages within "X" seconds? Again, how would this impact Google and the other major crawling SEs?

The second alternative is to turn my static links to external sites into dynamic CGI links. That way these bots would not be able to grab all the links to the external sites, right?

What do you think is the better resolution to this problem? How can I protect myself from my competition without also shutting out Google and the major search engines? Thanks so much for your help.

Brad

hurlimann

11:33 pm on May 26, 2002 (gmt 0)

10+ Year Member



The second alternative is to turn my static links to external sites into dynamic CGI links. That way these bots would not be able to grab all the links to the external sites, right?

Yes, but nor would Google et al.

Bradley

11:35 pm on May 26, 2002 (gmt 0)

10+ Year Member



Yes, and that is my concern. So I wonder if there is something I can set up at the server level that would prevent competitors from slashing through my site. Any thoughts?

When 5 pages are grabbed within 2 seconds, that's some superhuman reading skill! ;)

hurlimann

11:59 pm on May 26, 2002 (gmt 0)

10+ Year Member



Sorry, see my post above!
You could try flagging any rapid GETs, then denying those that aren't legitimate SEs, while still allowing the SEs full access.
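A minimal .htaccess sketch of that idea, assuming Apache with mod_setenvif available (the user-agent patterns below are just illustrative examples, not a vetted list):

```apache
# Flag suspected grabbers by their User-Agent string (illustrative patterns)
SetEnvIfNoCase User-Agent "wget|httrack|libwww" bad_bot
# Make sure the major search engines are never flagged
SetEnvIfNoCase User-Agent "googlebot" !bad_bot

<Limit GET>
Order allow,deny
allow from all
deny from env=bad_bot
</Limit>
```

The catch is that download tools can fake their user-agent, so this only stops the lazy ones; persistent offenders would still need their IPs denied by hand.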

Key_Master

12:48 am on May 27, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Two words: spider trap. Try the site search for more info. You should be able to find a script that does this pretty easily.
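For later readers, the rough shape of a spider trap (the /trap/ path here is made up for illustration): disallow a directory in robots.txt, hide a link to it on your pages, and anything that requests it anyway has ignored robots.txt.

```
# robots.txt -- well-behaved spiders will never request /trap/
User-agent: *
Disallow: /trap/
```

A hidden link such as <a href="/trap/"></a> goes on your pages; any IP that shows up in the logs requesting /trap/ can then be added to a "deny from" list in .htaccess.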

Bradley

3:01 am on May 27, 2002 (gmt 0)

10+ Year Member



Key Master,

Hmmm... I performed the search and read up on spider traps. I can tell this spider trap thing isn't for a novice (I know very little about programming; I'm a marketing guy :)).

Would anyone be willing to help educate me on how to write a very basic script / htaccess / ??? to block this:

64.90.185.82.nyinternet.net

Or is this well beyond me, and should I even try? :) Hopefully there are some newbies out there like me who would like to learn alongside me.

Vishal

3:10 am on May 27, 2002 (gmt 0)

10+ Year Member



Would anyone be willing to help educate me on how to write a very basic script / htaccess / ??? to block this:

64.90.185.82.nyinternet.net

Very simple.

In your .htaccess file, just add the lines below:

------ begin lines -----
deny from 123.123.123.123
deny from nyinternet.net
------ end lines -----

Replace 123.123.123.123 with the IP number you want to block. Also remember, you can block as many IPs as you wish; just keep adding lines like "deny from 123.123.123.123" or "deny from .hostname".

Hope this helps.
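For completeness, a sketch of the same file with the surrounding directives spelled out (Apache's defaults make the two deny lines work on their own, but being explicit avoids surprises; the IP and hostname are the ones from this thread):

```apache
# Allow everyone, then deny the listed offenders (denied hosts get a 403)
Order allow,deny
allow from all
deny from 64.90.185.82
deny from .nyinternet.net
```

Note that "deny from .hostname" makes Apache do a DNS lookup on matching requests, so deny by IP where you can.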

Bradley

3:53 am on May 27, 2002 (gmt 0)

10+ Year Member



Well, I don't have an .htaccess file to start with, so I'll begin with what I think I need to do first:

1. First create a text file and then rename it .htaccess

2. When I upload it to the server, upload it in ASCII mode (not binary)

3. I know I have to change the permissions on the file once it's on the server (but I don't know exactly to what). Do you know how I should CHMOD it?

wilderness

4:05 am on May 27, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Brad,
My ".htaccess"
has a chmod value of 644
(rw-r--r--, i.e. chmod 644 .htaccess)

hurlimann

8:26 am on May 27, 2002 (gmt 0)

10+ Year Member



If you have no .htaccess, you could try banning via robots.txt.

wilderness

2:09 pm on May 27, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<snip>you could try banning via robots.txt</snip>

Hey hurlimann,
"Ban" is not really possible with robots.txt. :-(
You can "suggest" in robots.txt that a bot stay out of wherever you choose.
Suggestions don't hinder in any way somebody who doesn't read them or has no interest in following them.
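For reference, everything robots.txt can do amounts to a polite request like this one (the /private/ path is just an example); a scraper is free to read it and carry on regardless:

```
# robots.txt -- a suggestion, not an enforcement mechanism
User-agent: *
Disallow: /private/
```

Actual enforcement has to happen at the server, e.g. with "deny from" rules in .htaccess.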

hurlimann

2:28 pm on May 27, 2002 (gmt 0)

10+ Year Member



Wilderness, you are quite correct; I stand corrected. "Ban" is not the right word!
Bots should obey, but some don't! :)