VERY frequent user - htaccess?

4 seconds per page

     
11:28 pm on May 26, 2002 (gmt 0)

Full Member

10+ Year Member

joined:June 28, 2000
posts:280
votes: 0



BACKGROUND
My directory/search engine has 100% static pages (no CGI-generated pages), and the links we provide go directly to the external sites (we don't run CGI counters on the links).

PROBLEM
I have a hunch people are using programs to grab all the URLs contained within my search engine. The hard work I put into finding websites is being devoured by my competitors. They are probably scraping links off my site to add to their own search engines.

EXAMPLE
This host, 64.90.185.82.nyinternet.net, is hitting our site every 2 seconds and grabbing just the HTML pages (no .js or .gif files are being requested). I think it hit every page on our site within 10 minutes - I'm almost sure of that.

NEED HELP - RESOLUTION?
Would I use .htaccess to prevent competitors / programs from grabbing all of my pages? How can I block certain programs from hitting my site while making sure Google and the other major SEs can still visit successfully? Should I develop something that blocks an IP address that hits "X" number of pages within "X" seconds? Again, how would that impact Google and the other major crawling SEs?

The second alternative is to turn my static links to external sites into dynamic CGI links. That way these bots would not be able to grab all the links to the external sites, right?

Which do you think is the better resolution to this problem? How can I protect myself from my competition without also blocking Google and the major search engines? Thanks so much for your help.

Brad

11:33 pm on May 26, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 29, 2002
posts:558
votes: 0


The second alternative is to turn my static links to external sites into dynamic CGI links. That way these bots would not be able to grab all the links to the external sites, right?

Yes, but neither would Google et al.

11:35 pm on May 26, 2002 (gmt 0)

Full Member

10+ Year Member

joined:June 28, 2000
posts:280
votes: 0


Yes, and that is my concern... so I wonder if there is something I can set up at the server level that would prevent competitors from slashing through my site. Any thoughts?

When 5 pages are grabbed within 2 seconds - that's some superhuman reading skill! ;)

11:59 pm on May 26, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 29, 2002
posts:558
votes: 0


Sorry above me!
You could try flagging any rapid GETs, then disallow those that aren't legitimate SEs, and then allow the SEs full access.
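For example, a rough .htaccess sketch of that idea, assuming Apache's standard mod_setenvif and mod_access are available (the crawler names below are only placeholders - adjust them to what you actually want to let through):

# tag requests from known search-engine crawlers by User-Agent
BrowserMatchNoCase Googlebot search_bot
BrowserMatchNoCase Slurp search_bot

# Deny directives are checked first; a matching Allow then overrides them
Order Deny,Allow
# the host flagged as a rapid scraper
Deny from .nyinternet.net
# anything identifying itself as a listed crawler still gets in
Allow from env=search_bot

Bear in mind the User-Agent header can be faked, and nothing here spots the rapid GETs automatically - you would still be adding offending hosts by hand or from a log-watching script.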

12:48 am on May 27, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Two words: spider trap. Try the site search for more info. You should be able to find a script that does this pretty easily.
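Roughly, a spider trap is just a link on your pages to a URL that robots.txt tells crawlers to stay out of - polite crawlers like Googlebot never fetch it, so anything that does is almost certainly a scraper, and a small script then appends that visitor's IP to your deny list. A minimal sketch, with a made-up /trap/ path:

In robots.txt:

User-agent: *
Disallow: /trap/

Then put a link on your pages that no human would ever follow, e.g. <a href="/trap/index.html"></a>, and have a script watch for hits on /trap/ and add a "deny from" line for that IP to .htaccess.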

3:01 am on May 27, 2002 (gmt 0)

Full Member

10+ Year Member

joined:June 28, 2000
posts:280
votes: 0


Key Master,

Hmmm... I performed the search and read up on spider traps. I can tell this spider trap thing isn't for a novice (I know very, very little about programming - I'm a marketing guy :)).

Would anyone be willing to help educate me on how to write a very basic script / .htaccess rule / whatever it takes to block this:

64.90.185.82.nyinternet.net

Or is this well beyond me and should I even try? :) Hopefully there are some newbies out there like me who would like to learn alongside me...

3:10 am on May 27, 2002 (gmt 0)

Full Member

10+ Year Member

joined:Dec 7, 2000
posts:267
votes: 0


Would anyone be willing to help educate me on how to write a very basic script / .htaccess rule / whatever it takes to block this:

64.90.185.82.nyinternet.net

Very simple.

In your .htaccess file, just add the lines below:

------ begin lines -----
deny from 123.123.123.123
deny from nyinternet.net
------ end lines -----

Replace 123.123.123.123 with the IP address you want to block. Also remember, you can block as many IPs as you wish - just keep adding lines like "deny from 123.123.123.123" or "deny from .hostname".
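You can also deny by User-Agent rather than by address, assuming Apache's mod_setenvif is available - handy against off-the-shelf download tools, though useless against a scraper that fakes a browser User-Agent. A sketch (the agent names are only examples):

# tag requests from known site-download tools
SetEnvIfNoCase User-Agent WebZIP bad_bot
SetEnvIfNoCase User-Agent Teleport bad_bot

# everyone is allowed unless tagged as a bad bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot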

Hope this helps.

3:53 am on May 27, 2002 (gmt 0)

Full Member

10+ Year Member

joined:June 28, 2000
posts:280
votes: 0


Well, I don't have an .htaccess file to start with, so I'll begin with what I think I need to do first:

1. First create a text file and then rename it ".htaccess".

2. When I upload it to the server, upload it in ASCII mode (not binary).

3. I know I have to change the permissions on the file once it's on the server (but I don't know exactly what to set them to). Do you know how I should CHMOD it?

4:05 am on May 27, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5505
votes: 5


Brad,
My ".htaccess" has a chmod value of 644, i.e. rw-r--r--.
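On a typical Unix host that is a single command from a shell prompt (or set owner read/write and group/world read in your FTP client's permissions dialog):

chmod 644 .htaccess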

8:26 am on May 27, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 29, 2002
posts:558
votes: 0


If you have no .htaccess, you could ban from a robots.txt file in the root.

2:09 pm on May 27, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5505
votes: 5


<snip>you could ban from a robots.txt file in the root</snip>

Hey hurlimann,
"ban" is not really possible with robots.txt. :-(
You can "suggest" to a bot in robots.txt that it not go where you don't want it.
Suggestions don't hinder in any way somebody who doesn't read them or has no interest in following them.
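For illustration, such a "suggestion" amounts to nothing more than lines like these in robots.txt (the bot name is made up):

User-agent: SomeScraperBot
Disallow: /

That only has any effect if the bot fetches robots.txt and chooses to honour it - the kind of scraper discussed in this thread usually does neither.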

2:28 pm on May 27, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 29, 2002
posts:558
votes: 0


Wilderness, you are quite correct - I stand corrected, "ban" is not the right word!
Bots should obey, but some don't! :)