My problem is that I've been getting hit by this "spider", Wget/1.6, on a daily basis :( and it is ignoring my robots.txt file.
After looking at my logs, the only info I could find was: came here from using Wget/1.6.
Today I found this in my logs: c.satsop.techline.com came here from using Wget/1.6. Has anyone heard of or seen this before?
Looks like someone's personal homepage space?!
Any help would be appreciated.
Thanks
Wget is a (mainly) Unix-based tool that downloads sites recursively. It also appears to be used in a spider from [grub.org...]. Either way, the host it comes from could be anywhere, since anyone can run it. It has come up a few times recently, and these WebmasterWorld threads are probably worth reading:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
AddType text/x-server-parsed-html .html
This means that .html pages will be parsed for SSI commands, just as .shtml pages normally are. It does have a slight effect on performance, since every page has to be processed by the server, but having said that, I've never found it to be a problem.
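Just as a sketch of what that buys you (the script name here is hypothetical): once .html is parsed for SSI, any otherwise static page can pull in a CGI script, which is handy for logging or screening visitors without renaming everything to .shtml:

<!--#include virtual="/cgi-bin/check-visitor.cgi" -->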
On another note: I took a look at the wget docs, and here's why it's not respecting your robots.txt file...
# Setting this to off makes Wget not download /robots.txt. Be sure to
# know *exactly* what /robots.txt is and how it is used before changing
# the default!
#robots = on
...and again from the docs...
#Wget works exceedingly well on slow or unstable connections, keeping
#getting the document until it is fully retrieved.
Which may be why it's hitting your server so hard.
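For what it's worth, one stopgap on the Apache side is to deny anything that still announces itself as Wget in the User-Agent. A minimal .htaccess sketch, assuming mod_setenvif and mod_access are available and overrides are allowed (the variable name is just a placeholder), and keeping in mind that wget can be told to send any UA it likes:

SetEnvIfNoCase User-Agent "Wget" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot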
And finally, between spider hunter, axis, and now ban bot, I've hacked together a pretty neat little security system for my web site. Yep, I've learned quite a bit since coming here to WmW.
Thanks all!
Sure! Security, in the context of keeping unwanted spiders and robots out, is pretty much a full-time job. Using a robots.txt file is great if the robot follows the exclusion protocol, but as I just learned from this thread, wget can easily be set to ignore robots.txt. I would guess the same is true for software like ExtractorPro, EmailSiphon, etc.
Now if the spider gets past robots.txt I'll capture his IP address and exclude that. And if he's nice enough to leave a User-Agent (which wget can be configured for; the UA in wget can be set to anything you want), I'll exclude on that as well. It ain't perfect, but it affords security in the sense that most "passive" users will hit my sites, get a blank page, and move on. However, if someone really wanted to go the extra mile and bog down my server, I would fall victim the first time. One mistake I made (laughing at myself) was using one of those "spam buster" scripts, the kind that generate thousands of bogus email addresses for a spider like EmailSiphon. That bogged down my server more than any spam I'm getting, and sending them to a blank page basically accomplishes what I want to do.
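If you'd rather have Apache itself do the excluding instead of a script, the same idea looks something like this .htaccess sketch. The IP is a placeholder from the documentation range, the hostname is the one mentioned earlier in the thread, and the UA matches only catch robots that announce themselves; note this denies them outright with a 403 rather than serving a blank page:

SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "ExtractorPro" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 192.0.2.45
Deny from .satsop.techline.com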
As far as any hard-core knowledge and/or tips go, I basically learned everything I'm doing right here. I'm not trying to schmooze anybody, but this is the single most relevant website I've found for these topics to date. And IMO, the hottest tool you can have in your arsenal is the site search feature at the top of the page. LOL, there's so much information here it's frightening.
One thing I would strongly suggest is to test things out before you go live. I have a Win98 setup with Apache and Perl, and I use Spam Cop for testing based on UA; Spam Cop allows you to browse a web site using several different User-Agent names. When I want to go live on a real server, I have a one-page site for further testing: awoyo is nothing but a spider trap on a Linux network, set up for just that. Since all my content is in .html, I set .htaccess to parse .htm files as cgi. Thanks to Littleman's suggestion, spiders now believe they're getting served static .htm content.
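The .htaccess side of that is only a couple of lines. A sketch, assuming the server allows Options and handler overrides for the directory and that each .htm file is actually an executable CGI script; the visitor just sees what looks like a static page:

Options +ExecCGI
AddHandler cgi-script .htm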
Just as an added measure of protection, I have my content directories set above my site's document root. Call me paranoid, but nobody is getting at my content unless I want them to! About two months ago I put the setup live on two of my "real" websites, and it's been working like a charm ever since. "Knock on wood."
At any rate, I hope this helps. And don't forget the most important thing. Have fun!
Now if the spider gets past robots.txt I'll capture his IP address and exclude that
I've been trying to write a spider-tracking script myself. How do you recognize whether you're dealing with a spider if it isn't giving a UA? And what's to stop a spider writer from setting the UA to a basic IE identification?
Cheers, Robin
It doesn't really take as much work as it sounds. I have my script writing a log file in the format Axs uses, so I can use Axs to look at the day's hits as "view by user". The script reports all the hits that came from the same host or IP and how far apart each page was requested. Once I see more than what a "normal" user looks at in one session, I get suspicious and look into it further.
I have one IP that comes in every day, requests one page, and leaves. It leaves a UA of "." (which could just be a problem resolving the User-Agent) and comes from some run-of-the-mill hosting company's server. It's not mining for email addresses or content, and it's not bogging down the server, so I leave it alone. I guess it all depends on what you're willing to accept. The thing about spider traps, for "cloaking" or any other reason, is that you'll end up reading a lot of log entries, every single day. You learn to assimilate as much information as you can, as fast as you can, and you quickly get to know who's who and what's what.
Have fun.
Jim