
Wget/1.6 again!!

visits daily and ignores my robots.txt file

         

jimbo_mac

10:35 am on Jun 15, 2001 (gmt 0)

10+ Year Member



Hi all,
Firstly, I would like to say what a great and informative site this is.

My problem is I have been getting hit by this "spider", Wget/1.6, on a daily basis :( and it is ignoring my robots.txt file.

After looking at my logs, the only info I could find was: "came here from using Wget/1.6".

Today I found this in my logs: "c.satsop.techline.com came here from using Wget/1.6". Has anyone heard of or seen this before? It looks like someone's homepage space?!

Any help would be appreciated.

Thanks

theperlyking

11:27 am on Jun 15, 2001 (gmt 0)

10+ Year Member



Hello Jimbo_mac, welcome to WebmasterWorld.

Wget is a (mainly) Unix-based tool that downloads sites recursively. It also appears to be used by a spider from [grub.org...]. Either way, the host it comes from could be anywhere, since anyone can run it. It has come up a few times recently, and these WebmasterWorld threads are probably worth reading:

[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]

jimbo_mac

11:48 am on Jun 15, 2001 (gmt 0)

10+ Year Member



theperlyking,

thanks for the help and quick reply

I'm going to try the ban_bot.cgi and see how it goes. Not too happy about having to put the SSI call on every page, though. Does that mean that every page will end with .shtml as opposed to .html?

jimbo_mac

theperlyking

12:14 pm on Jun 15, 2001 (gmt 0)

10+ Year Member



Normally it would mean that you have to use .shtml, but you can configure the server by putting the following line in the .htaccess file:

AddType text/x-server-parsed-html .html

This means that .html pages will be parsed for SSI commands just as .shtml pages normally are. It has a slight effect on performance, since every page has to be processed by the server; having said that, I've never found it to be a problem.
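For reference, the SSI call that then goes on each page usually looks something like the line below; the path to ban_bot.cgi is just an assumption, so use whatever its real location is on your server:

```
<!-- hypothetical SSI call; adjust the path to wherever ban_bot.cgi lives -->
<!--#include virtual="/cgi-bin/ban_bot.cgi" -->
```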

jimbo_mac

12:45 pm on Jun 15, 2001 (gmt 0)

10+ Year Member



I don't have access to the server, so it looks like I will have to take the long route :(

Thanks again for all the help.

jimbo_mac

awoyo

5:03 am on Jun 16, 2001 (gmt 0)

10+ Year Member



Why use SSI? If you can use a CGI script to serve content based on UA and/or IP, a test like if ($ENV{'HTTP_USER_AGENT'} =~ /$ban/) works fine.
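That Perl test is the whole trick; here is a minimal sketch of the same idea as a shell CGI. The ban pattern and the page content are illustrative assumptions, not my actual script; a real CGI server puts the client's UA in the HTTP_USER_AGENT environment variable.

```shell
# Illustrative ban pattern (an assumption, not the real list).
ban='Wget|EmailSiphon|ExtractorPro'

is_banned() {
  # Succeeds (exit 0) when the given user agent matches the ban pattern.
  printf '%s' "$1" | grep -Eiq "$ban"
}

serve() {
  # Always emit the CGI header; send real content only to non-banned UAs,
  # so banned clients get a blank page.
  printf 'Content-type: text/html\r\n\r\n'
  if ! is_banned "$1"; then
    printf '<html>real content here</html>\n'
  fi
}
```

In the CGI itself you would end with `serve "$HTTP_USER_AGENT"`.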

On another note: I took a look at the Wget docs, and here's why it's not respecting your robots.txt file...

# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
# the default!
#robots = on

...and again from the docs...

#Wget works exceedingly well on slow or unstable connections, keeping
#getting the document until it is fully retrieved.

Which may be why it's hitting your server so hard.
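In other words, per the docs quoted above, a user only has to flip that setting; a ~/.wgetrc containing the line below is enough to make Wget skip robots.txt entirely:

```
# in ~/.wgetrc
robots = off
```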

And finally, between Spider Hunter, Axs, and now ban_bot, I've hacked together a pretty neat little security system for my web site. Yep, I've learned quite a bit since coming here to WmW.

Thanks all!

littleman

5:23 am on Jun 16, 2001 (gmt 0)



Awoyo, I agree with you that it is better to run directly from the CGI script than to call it via SSI. But you should also make the effort to hide the fact that the URL is a CGI script, since search engines penalize dynamic content extensions.

awoyo

2:34 pm on Jun 16, 2001 (gmt 0)

10+ Year Member



Littleman, I should have clarified. Thanks to a similar suggestion from you a few months ago, I have .htaccess set to parse .htm files as CGI. Thanks again.

jimbo_mac

4:01 pm on Jun 16, 2001 (gmt 0)

10+ Year Member



awoyo, a security system? Any great tips on how to do this??? :)

awoyo

7:00 pm on Jun 16, 2001 (gmt 0)

10+ Year Member



"any great tips on how to do this ???"

Sure! Security, in the context of keeping unwanted spiders and robots out, is pretty much a full-time job. Using a robots.txt file is great if the robot follows the exclusion protocol, but as I just learned in this thread, Wget can easily be set to ignore robots.txt. I would guess the same is true for software like ExtractorPro, EmailSiphon, etc.
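For reference, the exclusion protocol itself is just a plain-text request. A robots.txt that asks every robot to stay out of the whole site looks like this, and compliance is entirely voluntary:

```
User-agent: *
Disallow: /
```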

Now if a spider gets past robots.txt, I'll capture its IP address and exclude that. And if it's nice enough to leave a User Agent (which Wget can be configured for; I can set the UA in Wget to anything I want), I'll exclude on that too. It ain't perfect, but it affords security in the sense that most "passive" users will hit my sites, get a blank page, and move on. If someone really wanted to go the extra mile and bog down my server, though, I would fall victim, at least the first time. One mistake I made (laughing at myself) was using one of those "spam buster" scripts, the kind that generate thousands of bogus email addresses for a spider like EmailSiphon. That bogged down my server more than any spam I was getting, and sending such spiders a blank page basically accomplishes what I want.
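I do the excluding inside my CGI script, but roughly the same IP and UA bans can usually be expressed in .htaccess as well. This is a sketch in Apache 1.3 syntax; the address and the UA pattern below are placeholders:

```
# hypothetical .htaccess fragment; address and pattern are placeholders
SetEnvIfNoCase User-Agent "Wget" ban
order allow,deny
allow from all
deny from 192.0.2.1
deny from env=ban
```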

As far as any hard-core knowledge and/or tips go, I basically learned everything I'm doing right here. I'm not trying to schmooze anybody, but this is the single most relevant website for these topics I've found to date. And IMO, the hottest tool you can have in your arsenal is the site search feature at the top of the page. LOL, there's so much information here it's frightening.

One thing I would strongly suggest is to test things out before you go live. I have a Win98 setup with Apache and Perl, and I use Spam Cop for UA-based testing; Spam Cop lets you browse a web site using several different User Agent names. When I want to go live on a real server, I have a one-page site for further testing: awoyo is nothing but a spider trap on a Linux setup built for just that. Since all my content is in .html, I set .htaccess to parse .htm files as CGI. Thanks to littleman's suggestion, spiders now believe they're getting served static .htm content.

Just as an added measure of protection, I have my content directories set above my site's document root. Call me paranoid, but nobody is getting at my content unless I want them to! About two months ago I put the setup live on two of my "real" websites, and it's been working like a charm ever since. "Knock on wood."

At any rate, I hope this helps. And don't forget the most important thing. Have fun!

themoff

10:16 am on Jul 16, 2001 (gmt 0)

10+ Year Member



"Now if the spider gets past robots.txt I'll capture his IP address and exclude that"

I've been trying to write a spider-tracking script myself. How do you recognize whether you're dealing with a spider if it isn't giving a UA? And what's to stop a spider writer from setting the UA to a basic IE identification?

Cheers, Robin

awoyo

4:02 am on Jul 17, 2001 (gmt 0)

10+ Year Member



For me, it's just a matter of looking at my log files to determine a pattern. If it's an unknown UA and it's requesting every page on the site in a time frame of, say, 2 minutes, then I know it's not a human. If this happens every day, as it did with Wget, then it gets banned. Some of the "Wget" hits were showing Mozilla UAs, but the pattern was the same as Wget's.

It doesn't really take as much work as it sounds. I have my script writing a log file in the format Axs uses, so I can use Axs to look at the day's hits as "view by user". The script reports all the hits that came from the same host or IP and how far apart they looked at each page. Once I see more than what a "normal" user looks at in one session, I get suspicious and look into it further.
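The per-host counting part of this can be sketched in a few lines of awk. The log format (host or IP in the first field, as in common log format) and the threshold are assumptions about the real setup:

```shell
flag_hosts() {
  # Read access-log lines on stdin (host/IP in the first field) and
  # print any host with more than $1 requests in the log.
  awk -v max="$1" '{ hits[$1]++ } END { for (h in hits) if (hits[h] > max) print h }'
}
```

Something like `flag_hosts 50 < access.log` would then list the hosts worth a closer look.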

I have one IP that comes in every day, requests one page, and leaves. It leaves a UA of "." (which could just be a problem resolving the User Agent) and comes from some run-of-the-mill hosting company's server. It's not mining for email addresses or content, and it's not bogging down the server, so I leave it alone. I guess it all depends on what you're willing to accept. The thing about spider traps, for "cloaking" or any other reason, is that you'll end up reading a lot of log entries, every single day. You learn to assimilate as much information as you can, as fast as you can, and you quickly get to know who's who and what's what.

Have fun.

Jim