Just saw this guy, fell into a spider trap:
220.127.116.11 - - [11/Apr/2003:01:31:08 -0600] "GET /a/deep/link.html HTTP/1.1" 200 12589 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
No referer, came in on a deep link (like from a SE), and d/l pages but no images. After about 5 hits, he tried to grab a trap, and got banned. Grabbed a page every 5 secs or so...
IP resolves to Redmond.... did Bill just get himself banned?
I would say that pretty much clinches it as a bot.... on my site, d/l a bunch of HTML and NO Images is WEIRD (My site is PICTURES of WIDGETS!)
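For anyone who wants to scan their own logs for the same pattern (no referer, pages but no images), here is a minimal Python sketch. It assumes the Apache combined log format of the line quoted above; the function name `suspicious_ips`, the five-page threshold and the list of image extensions are my own assumptions, not anything the poster described.

```python
import re
from collections import defaultdict

# Apache "combined" format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

IMAGE_EXTS = (".gif", ".jpg", ".jpeg", ".png")  # assumed image types


def suspicious_ips(lines, min_pages=5):
    """Return IPs with at least min_pages page hits, no image hits, no referer."""
    stats = defaultdict(lambda: {"pages": 0, "images": 0, "referers": 0})
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # skip lines that are not in combined format
        entry = stats[m["ip"]]
        path = m["path"].split("?", 1)[0].lower()
        if path.endswith(IMAGE_EXTS):
            entry["images"] += 1
        else:
            entry["pages"] += 1
        if m["referer"] not in ("", "-"):
            entry["referers"] += 1
    return sorted(
        ip for ip, s in stats.items()
        if s["pages"] >= min_pages and s["images"] == 0 and s["referers"] == 0
    )
```

This only flags candidates. A text-only browser or a visitor with images turned off would look the same, so read the log before banning anything.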
You blocked the whole IP- 131.107.xxx.xxx?
What do you think they are doing?
Not that I am at all surprised that a 'bot out of Redmond is not polite (that is, respects robots.txt), but you would think MS of ALL people would put every possible doo-hicky and thingamabob into something they write. So, do you think the "Redmond Robot" is about 20 megs in source code size? :)
I might better appreciate your despair if I went all the way up to 131. ;)
It doesn't much matter to me what they (whether it's MS or somebody falsely representing themselves as such) are doing.
For me the determining factor is threefold:
1) First they began visits with both referer and UA blank.
2) When the denies began as a result of their actions in point one above, they switched to a UA to get around that, while still not saying whether they are an MS bot or providing a link back to a bot page that gives us the answer we want.
3) Now they change IPs.
I've learned time and again that every time I deny short that it comes back to haunt me. In this instance the 131.107. may even be short :(
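To make "denying short" concrete, here is a small Python sketch using the standard `ipaddress` module. The 131.107.0.0/16 block is just 131.107.xxx.xxx from this thread written in CIDR form; the helper name `is_denied` and the sample addresses are hypothetical.

```python
import ipaddress


def is_denied(client_ip, denied_networks):
    """True if client_ip falls inside any denied network (CIDR strings)."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in denied_networks)


# A narrow deny misses the bot as soon as it moves to another address.
narrow = ["131.107.0.0/24"]
wide = ["131.107.0.0/16"]

print(is_denied("131.107.5.9", narrow))  # hypothetical second address
print(is_denied("131.107.5.9", wide))
print(ipaddress.ip_network("131.107.0.0/16").num_addresses)  # 65536
```

The trade-off is the one argued over in this thread: the wider block survives an IP change, but it also shuts out every other visitor in those 65,536 addresses.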
Do you run a spider trap? For me, that has worked really well. I do not want to get too much into it here, but I run a few different versions. The overlap works great, and it catches them in process, not after the fact- I guess that is why I do not go for blocking whole C, B or -GASP- A blocks. I do, on occasion, when traffic is all over an IP range, and I know I do not have to care at all about the range (read Malaysia, Cyveillance, IA, etc)...
I'm sure there are some bots I haven't seen and perhaps never will due to both the content of my sites and the narrow market. Most of the other malicious ones are already denied.
Thanks for the hint :-)
Well, between Jim and I, I think we have perfected the trap! He comes up with a new idea.... then I add something else.... we have it so it is pretty darn foolproof now. And it gets a lot of what gets past the IP and UA blocks. But then there are some that get by that, too.... and I catch in a bandwidth or CPU throttle. If they get by all that- and I just discovered one that did!- they deserve to get whatever they can! (Kidding)
Jim is a bit more cautious than I am in regards to the trap... I am a bit more, uh, proactive. I am ALWAYS banning Ask Jeeves (which is a very poorly behaved spider), and I know Jim makes allowances for that one.
Anyway, I just see it as another line of defense, and I would recommend you do it!
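The posters deliberately do not describe how their traps are built, so the following is only a generic Python sketch of the idea, not their setup. It assumes a bait URL that is disallowed in robots.txt, so any client that requests it anyway gets banned. The class name `SpiderTrap`, the `/trap/` path and the status codes returned are all hypothetical.

```python
class SpiderTrap:
    """Toy model of a trap: a URL no polite visitor should ever request."""

    def __init__(self, trap_paths):
        self.trap_paths = set(trap_paths)  # hypothetical bait URLs
        self.banned = set()                # IPs caught so far

    def robots_txt(self):
        """robots.txt body telling well-behaved bots to stay out of the bait."""
        lines = ["User-agent: *"]
        lines += ["Disallow: " + p for p in sorted(self.trap_paths)]
        return "\n".join(lines) + "\n"

    def handle(self, ip, path):
        """Return the HTTP status this request would get."""
        if ip in self.banned:
            return 403
        if path in self.trap_paths:
            self.banned.add(ip)  # caught in process, not after the fact
            return 403
        return 200


trap = SpiderTrap(["/trap/"])
print(trap.handle("192.0.2.10", "/a/deep/link.html"))
print(trap.handle("192.0.2.10", "/trap/"))
print(trap.handle("192.0.2.10", "/a/deep/link.html"))
```

A real trap would also need a whitelist for yourself and for any spiders you want to keep, which is where the posters say they differ.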
You have a more varied participation than Dave and myself in these forums.
Not sure how you either miss or do not understand the concept or method.
Each webmaster makes a determination as part of the goals for their website on visitors and use of their content. In the end it's the overall scheme of things rather than a solitary portion, whether it's pennies or buttons ;)
"My bandwidth" rather than defining pennies might better be interpreted as boundaries.
>>>> security or three-cents-worth of bandwidth
I am pretty happy with the security on my site, that is not too much of an issue for me. I have a firewall that blocks ALL but port 80 (and one other port, but ONLY if the request is from my IP). My cgi is secure, and I do NOT solely rely on .htaccess password protection. So I am pretty happy there.
My pages do 2-3 gigs a day in bandwidth. The guy I took down yesterday did 79.60 MB (directly from AWStats)... he was not caught by all the bells and whistles (and bait) I throw out. That number APPROX equals the bandwidth Google has used month to date on my site (It is within 2-3 megs).
My site also has a lot of really good (some very rare!) pictures of widgets on it. My widget pix are ALWAYS turning up in forums (as avatars, or just to illustrate a point). One of my sites- which sells widgets- is always having its widget pix show up on e-Gag, too. I am sure THAT bandwidth, unchecked, would exceed pennies a day.
But my main site- the NPO- is an informational site, and is over 500,000 pages long. This one was a Yahoo site-of-the-day a couple months ago (great PR boost, btw!) It sells NOTHING, and only receives income from donations and banners on the site. One, as its "product" is information, I need to protect that product. If you sell something, you are able to protect yourselves from thieves and such, to the extent and in the manner you see fit. Blocking site d/lers is how one does protect their information product. Also, since it exists and is paid for by banner ads, I want/need people to see those on my site, not to d/l the information and leave me with nothing for my work making the site.
If I had none of this in place, I am sure I would be looking at upwards of a gig a day extra in bandwidth. As it is, I always see some stupid robot chuck down 500-1000 403's before they get wise. I wonder what that bandwidth would be if they were getting real pages (at 12-15k each) rather than a 1.2k 403 page? (Yes, I can do the math!)
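For anyone who wants the math spelled out, here is a quick Python sketch using the numbers in the post. It treats 1k as 1,000 bytes, which is my assumption, and the helper name `bandwidth_range` is made up for the example.

```python
def bandwidth_range(hits_low, hits_high, bytes_low, bytes_high):
    """Return (min, max) bytes sent for a range of hit counts and response sizes."""
    return hits_low * bytes_low, hits_high * bytes_high


# 500-1000 requests answered with a 1.2k 403 page
blocked = bandwidth_range(500, 1000, 1200, 1200)
# the same requests answered with real pages of 12-15k each
unblocked = bandwidth_range(500, 1000, 12000, 15000)

for label, (low, high) in (("403s", blocked), ("real pages", unblocked)):
    print(f"{label}: {low / 1e6:.1f} to {high / 1e6:.1f} MB")
```

So one such robot costs roughly 0.6 to 1.2 MB in 403s, against 6 to 15 MB if it had been served real pages, i.e. ten times or more per bot.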
I agree- things like requests for default.ida are a minor annoyance.... and I deal with them (in a different way) while I go after these bigger site d/lers and other bots. But I ALSO have another reason for actively blocking Bots!
I have been the victim of NameProtect/Cyveillance. Not just the bandwidth drain... but an actual pending lawsuit based on my "possible" trademark infringement. And it was TOTALLY absurd! COMPLETELY! But, I HAD to hire a trademark attorney to defend myself. I would consider that money directly lost due to NOT blocking a spider!
So, you see, I have my reasons- and some quite good- for doing this. I think, to me SPECIFICALLY- it is more than pennies a day. My main site is a NPO, and does accept tax-deductible donations... you are welcome to become a member for only pennies a day! :)
I think it is hard to know how much these rogue spiders drain from an individual site until you block and log them and see for yourself. I know that I have much more of a problem than, say, Jim Morgan (number-wise), and so I would say it varies GREATLY depending on the website. And I do not think I am exaggerating at all to say my bandwidth would be 25% higher (at least) w/o blocking.
I must say that this forum has made me much more zealous about this topic... I have gone from thinking I am presenting a website to the world, to thinking this is MY private property, and I need to protect it as such. So, I do what I can.
Besides, it is so easy to do!
>>> So, you see, I have my reasons- and some quite good- for doing this.
Hi Dave! Good answer, and I agree with your point entirely. I think some people (not anybody who posted in this thread, though!), can get carried away by the spider hunt...
Thanks for your illuminating comments (I'm usually not illuminated before 5 pm, ya' know)!
>>> I agree with your point entirely
I was thinking you might, once I made my case. :)
But I agree also that each individual webmaster needs to make the decisions for their specific website, too, and to what extent they need to ban (if at all). I am doing what is good for my site. Don and I disagree often with the extent of a ban, but we both learn from each other, too. Same with Jim and I. But I figure if we keep throwing the info out here, it will help some people. I know I have gotten more than a few PM's about how to ban this or that, and I am more than happy to help out.
>>> I think some people (not anybody who posted in this thread, though!), can get carried away by the spider hunt
I disagree... I HAVE gotten overzealous myself! And have had to cut back a bit. Banned myself once (damn!). I also once misunderstood the extent of a ban on "_vti_bin" (or something like that).
If one is going to ban, one must also look carefully at what one is banning! You HAVE to read those logs, and make adjustments. Just because I- or Don, or Jim, or Martinibuster- say we spotted something and WE are going to ban it, doesn't mean everyone should. That is why I always post the IP, the UA and what it did... I do not bother banning bots that only ask for robots.txt and move on... for ME, that is a waste of time. For others, that is the first indication of a pending attack, and they act accordingly for their website.
So it is GREAT advice to temper ALL our advice (from whoever) with a grain of salt, and see how it fits in with the goals and traffic for YOUR website!
Now since MS has a new product that could compete with my product, I silently removed their name from my newsletter list and banned the bot till it stopped. Don't want to ban MS just in case they want to license our stuff.
Also just FYI, I went to a big ISP and read their user policy and found out that at least some give users a static IP for DSL. So I have banned bots from DSL on those ISPs by just using the IP they came in on in its entirety.
Wow- this guy is a busy little beaver! Could you possibly PM me the e-mail... I want to see if he signed up for MY newsletter, too.
>>> at least some give users a static IP for DSL
I have one, but I had to ask for it. I have telnet/ssh denied at the firewall for all but my ONE IP... is that security or what?
>>> Could you possibly PM me the e-mail...
18.104.22.168 ... "MicrosoftPrototypeCrawler (please report obnoxious behavior to firstname.lastname@example.org)"
Anyone can get a hotmail address, but the ip-address is owned by Microsoft.
He is crawling a site here now at about one page every 5-10 minutes always reading /robots.txt and then the page.
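Reading /robots.txt before each page is what a polite bot is supposed to do. As a rough illustration of what such a bot checks, here is a Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the example.com URLs and the helper name `allowed` are hypothetical; only the user-agent string comes from this thread.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, url):
    """True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that fences off a bait directory
robots = "User-agent: *\nDisallow: /trap/\n"

print(allowed(robots, "MicrosoftPrototypeCrawler", "http://example.com/a/deep/link.html"))
print(allowed(robots, "MicrosoftPrototypeCrawler", "http://example.com/trap/x.html"))
```

A bot that runs this check and honors the answer never touches a disallowed trap, which is exactly how a trap tells the two kinds of visitor apart.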
His name is Keith Birney and I sent him the url of this thread last evening explaining how he might like to be here before this discussion gets too far afield. You know, damage control.
While I'm here tonight and thinking about it: If Keith does not come to the boards, that does not necessarily reflect upon the legitimacy of this new bot. Rather, it may only mean that he is busy answering all the others who've communicated with him.
I suggested he slow it down a bit.
He did mention: "(It found your site less than three minutes into the crawl.)"