[ws.arin.net...]
Nothing good ever comes from these ranges: scrapers, forum/guestbook spam attempts, and parasite hosting.
This might be a bit harsh as a description of this host, but their customers really got to me at the beginning of 2006.
foof, that was it....
IP: 67.228.175.XX
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, */*
Connection: Keep-Alive
Content-Length: 0
Host: www.example.com
User-Agent: Google Bot 2 Beta
Accept-Language: en-us
Apparently there is a Beta 2 of a scraper tool in the works....
Oddly enough, over the past few days I've been getting exactly the same UA from Google on their 66.249.71.* crawl block. Has Google changed the UA?
74.86.114.zzz - - [22/Sep/2008:14:14:08 -0500] "GET /MyFolder/Sub/Sub/Sub/index.html HTTP/1.1" 200 10170 "-" "+SitiDi.net/SitiDiBot/1.0 (+Have Good Day)"
74.86.114.zzz - - [22/Sep/2008:14:14:09 -0500] "GET /Mypage.html HTTP/1.1" 200 3656 "-" "aranhabot"
74.86.114.zzz - - [22/Sep/2008:14:14:09 -0500] "GET / HTTP/1.1" 200 6096 "-" "+SitiDi.net/SitiDiBot/1.0 (+Have Good Day)"
74.86.114.zzz - - [22/Sep/2008:14:14:11 -0500] "GET /MyOtherPage.html HTTP/1.1" 403 - "-" "ArabyBot (compatible; Mozilla/5.0; GoogleBot; FAST Crawler 6.4; [araby.com;)"...]
74.86.114.zzz - - [22/Sep/2008:14:14:11 -0500] "GET / HTTP/1.1" 200 6096 "-" "AcoiRobot"
only allowing valid IP ranges for those UAs
That doesn't fly anymore; you need to use full-trip DNS checking, just as the SEs recommend. An SE just recently expanded into new IPs - some of you may know which one ;)
However, either full-trip DNS or IP range checking boots the fake Googlebots and eliminates proxy hijacking, so I'm always wondering why this topic keeps popping back up when the solution that totally eliminates fake SE spiders has been officially endorsed for two years now.
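For anyone who hasn't seen it spelled out, the "full trip" check boils down to a reverse lookup followed by a forward lookup. A rough PHP sketch - the is_genuine_crawler() name and the suffix list are my own illustrative choices, not any SE's official code:

<?php
// Rough sketch of a full-trip (reverse-then-forward) DNS check.
// The helper name and the allowed host suffixes are illustrative choices.
function is_genuine_crawler($ip, array $allowed_suffixes)
{
    $host = gethostbyaddr($ip);              // reverse lookup: IP -> host name
    if ($host === false || $host === $ip) {
        return false;                        // no usable PTR record
    }

    $suffix_ok = false;
    foreach ($allowed_suffixes as $suffix) { // e.g. '.googlebot.com'
        if (substr($host, -strlen($suffix)) === $suffix) {
            $suffix_ok = true;
            break;
        }
    }
    if (!$suffix_ok) {
        return false;                        // rDNS doesn't belong to the claimed SE
    }

    // forward lookup: the host name must resolve back to the original IP
    $forward = gethostbynamel($host);
    return is_array($forward) && in_array($ip, $forward, true);
}

// Example: only trust a request claiming to be Googlebot if its IP round-trips
// through a googlebot.com (or google.com) host name.
$trusted = is_genuine_crawler($_SERVER['REMOTE_ADDR'],
                              array('.googlebot.com', '.google.com'));

The second lookup is the whole point: anyone can publish a PTR record that says googlebot.com, but only Google can make that host name resolve back to the requesting IP.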
only allowing valid IP ranges for those UAs
That doesn't fly anymore; you need to use full-trip DNS checking, just as the SEs recommend. An SE just recently expanded into new IPs - some of you may know which one ;)
Bill,
Time and time again in this forum, I've seen multiple solutions which result in the same or similar actions.
What works for one of us may not be usable (by choice or understanding) for another.
If keyplr has an effective method (even though you may perceive flaws), it is, after all, his boat.
A few of us have been doing these access limitations for so long that we've long had in place restrictions on IP ranges and UAs which do not allow the visitor deceptions that others may see.
Perhaps one of the reasons for the lack of use of "full trip DNS" is how often this forum avoids providing valid examples, in order to keep methods hidden from harvesters and others?
I don't use "full trip DNS" for anything, nor do I have bots harvesting my pages endlessly.
Don
not providing valid examples
I think that is a little unfair - jdMorgan went to great lengths (as usual) last year to show how it can be done in .htaccess - though the technique may not work on some shared hosting configurations.
[webmasterworld.com...]
An alternative method using an auto-prepended PHP file has also been posted:
[webmasterworld.com...]
I confess that I too found it all rather baffling, and abandoned several attempts to implement it - though coincidentally I finally got the latter to work yesterday (go me!) after making a couple of adjustments for my hosting environment.
Meanwhile my sites use the "allow valid IP ranges" method, which (as incrediBILL conceded) is effective enough if you can stay up-to-date with the ranges used by the search engines.
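For illustration, that approach boils down to something like the sketch below - the in_cidr() helper and the single Googlebot range shown are just examples, and the range list is exactly the part that has to be maintained by hand:

<?php
// Rough sketch of the "allow valid IP ranges" method. The in_cidr() helper and
// the example range are illustrative; the list must be kept current as the
// engines add new address space.
function in_cidr($ip, $cidr)
{
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

$googlebot_ranges = array('66.249.64.0/19');

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$ip = $_SERVER['REMOTE_ADDR'];

if (stripos($ua, 'googlebot') !== false) {
    $from_google = false;
    foreach ($googlebot_ranges as $range) {
        if (in_cidr($ip, $range)) {
            $from_google = true;
            break;
        }
    }
    if (!$from_google) {
        header('HTTP/1.1 403 Forbidden');    // claims Googlebot, wrong address space
        exit;
    }
}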
I don't know which SE just expanded into new IPs - but I do know how to find out.
...
I don't know which SE just expanded into new IPs - but I do know how to find out.
I'm an unfair guy ;)
Jim has in the past provided, and continues to provide, marvelous insights in this forum and many others. Unfortunately, unless links are marked and/or utilized (as you have done), the references - and the benefit of the sweat of Jim's brow - are limited.
Jeeves is such a pest; who cares how many ranges they add.
I seem to recall another new SE range; however, the "pokes and probes" couldn't have been too overwhelming, or else I'd have made a notation or access adjustment.
More SoftLayer IP Ranges here.
That company is in my top 8 $%^&*@ list when it comes to scrape attempts and MFA site hosting. I don't know what attracts the "evil" to it.
We just got hit with a SoftLayer scraper and reported it as well.
I'm getting really tired of seeing hosts like SoftLayer (SL) bashed. SL does not do any scraping. Its customers do the scraping. Unless you report them to SL, they'll continue to scrape. Every one that I've reported is no longer hosted at SL. They are most likely now hosted by someone else. The Planet, perhaps? How about ServerBeach, or a ton of other hosts? So let's blame the scrapers, not the companies that host them.
Now then, why are we arguing about how to recognize bad bots? We've always had the attitude in this forum that whatever works best for you is what you should use. If you want to ban blocks of IP addresses, do it. If you want to use rDNS to see if a user agent is really from the SE it claims to be from, do it. If you want to rely only on header data, do it. If you want to rely on the pattern of files a bot takes, then that's what's best for you.
Eenie, meenie, miney, mo works also.
and everybody crawled out of the wrong side of the bed
How about ServerBeach
ServerBeach is about as squeaky clean as any data center I've ever found, which is why I use them and Peer 1 exclusively these days.
Any AUP violations or hacked servers get swatted down almost instantly.
As far as SoftLayer is concerned, when my server was under a botnet attack two years ago they took relatively swift action to disable a vulnerable server that kept getting used over and over again.
why are we arguing about how to recognize bad bots
OK, I'll show you why ;)
Here's how I just successfully faked Googlebot, using Google's own servers, in a way that would pass many bot traps.
I set my user agent to "Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...] and started accessing pages via Google's translator.
Here's what showed up on my server:
66.249.85.85 "GET /testgooglebot HTTP/1.1" "Mozilla/5.0 (compatible; Googlebot/2.1; [google.com...] (via translate.google.com)"
If you're just using "66.249.64.0 - 66.249.95.255" or "66.249.64.0/19" to validate that it's a Google IP, plus the fact that it claims to be Googlebot, and your Googlebot UA validation is sloppy, then your site has just been duped by the Google translator.
Now imagine I know another location at Google that does the same thing but doesn't tack on the telltale ",gzip(gfe) (via translate.google.com)" - then you're completely duped.
Similar holes may exist in Yahoo, Live, etc...
If you don't want people to be able to use the SEs' facilities against you - which they do - full-trip DNS is the only way to fly.
The fact remains, however, that lots of unwanted robotic nuisances come from such places, and as far as I am aware nothing useful is lost by blocking their IP ranges.
If I am wrong about this I would welcome enlightenment.
...
Because of my user agent project I want to attract all the bots I can and see how they behave.
I use a combination of methods to see where they're really from, including rDNS. But until they get too abusive I won't turn them away. So for me I need the user agent, IP, rDNS, header data, the pattern of files taken, whether it reads robots.txt, honeypots, and a few other things.
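Purely as an illustration of the kind of raw data that involves, a per-request logging stub might look like this (the log path and the handful of fields are my own choices, not anyone's production setup):

<?php
// Illustrative only: record IP, rDNS, UA, a couple of headers and the URL
// requested, so patterns can be reviewed later.
$ip   = $_SERVER['REMOTE_ADDR'];
$line = array(
    date('c'),
    $ip,
    gethostbyaddr($ip),                                              // rDNS, if any
    isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '-',
    isset($_SERVER['HTTP_ACCEPT'])     ? $_SERVER['HTTP_ACCEPT']     : '-',
    $_SERVER['REQUEST_URI'],                                         // reveals file-taking patterns
);
file_put_contents('/tmp/bot-signals.log', implode("\t", $line) . "\n", FILE_APPEND);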
Others may not need that level of sophistication.
So I have to return to my original statement that we cannot tell others the best way to handle bots.
It's got to be whatever method works best for them, even if that means the occasional phony bot slips through. For them the amount of code that has to be written to handle these bots might not be worth it. Not everybody is dealing with thousands of uniques a day, so the few bad bots that visit just aren't the drain on resources that they might be on other sites.
So now there are a few scenarios to consider.
If you're doing full UA matching, it could break if Googlebot changes a single letter:
"^Mozilla/5\.0 (compatible; Googlebot/2\.1; [google\.com...]
Then of course the looser UA matching can be duped as shown above:
"^Mozilla/5\.0\ \(compatible;\ googlebot/"
So the best bet for full functionality, with the least likelihood of being duped or broken by changes at Google, is the code Jim showed in the other thread, where you specify only the minimum matching value of "googlebot" and let full-trip DNS do its job.
FWIW, all bot-blocking methods are perfectly valid; I'm just trying to show the pitfalls, traps, and loopholes that other methods may present, is all.
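In other words, something along these lines - a sketch that reuses the is_genuine_crawler() helper from the earlier sketch, with the suffix list again being my own illustrative choice:

<?php
// Sketch of the "loose UA match plus full-trip DNS" combination.
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

if (preg_match('/googlebot/i', $ua)) {
    // The UA only has to contain "googlebot"; the DNS round trip does the real
    // work, so a cosmetic change to Google's UA string can't break the rule.
    if (!is_genuine_crawler($_SERVER['REMOTE_ADDR'],
                            array('.googlebot.com', '.google.com'))) {
        header('HTTP/1.1 403 Forbidden');
        exit;
    }
}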
Because of my user agent project I want to attract all the bots I can and see how they behave
Not disagreeing with you, as I too let things in just a little to see how they behave.
Just showing the flaws in various methodologies for those who want their filters as tight as possible.
Some may not need it, or may not be able to do it; it's just education.