Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

your-server.de hosts bad bots --
findfiles.net, heritrix, Mr. X, GrubNG, Eurobot, etc.

 6:56 am on Aug 1, 2009 (gmt 0)


findfiles.net/0.96 (Robot;test_robot@gmx-topmail.de)
robots.txt? Yes BUT ignored it

Since May, partial listing:

Mozilla/5.0 (compatible; heritrix/2.0.2 +http://seekda.com)
robots.txt? YES

Mozilla/5.0 (compatible; heritrix/${pom.version} +http://seekda.com)
robots.txt? YES
Fake ref? YES

Mr. X (Nutch spiderman; [agenteX.googlepages.com...] ; MyEmail)
robots.txt? Yes BUT ignored it

IE 4.01 Win98
(yeah, sure)

Mozilla/5.0 (compatible; proximic; +http://www.proximic.com)
robots.txt? YES

GrubNG 20080128
robots.txt? NO

Eurobot/Nutch-1.0-dev (1.0)
robots.txt? Yes BUT ignored it

Mozilla/5.0 (Windows; U; Windows NT 5.0; de; rv: Gecko/20070713 Firefox/
robots.txt? Yes BUT ignored it



 8:21 pm on Aug 1, 2009 (gmt 0)

A problem with blocking out one of the IP units with nnn is: it's useless trying to resolve anything on that IP if it's reversed. Some of those IP ranges you give are actually of the form nnn.46.75.108 :)

Correct way around ones are those in the ranges below: - -

Any chance of the true initial IP portion so they can be tracked down? I may already have them blocked (as with those above) but it would be nice to check.


 12:12 am on Aug 2, 2009 (gmt 0)

This forum's Charter [webmasterworld.com] requires the obfuscation:

Any IP address or reverse DNS information not expressly belonging to a search engine should be masked as follows:

Example IP: 111.222.333.nnn
Example DNS: nnn.333.222.111.example.com

Additionally, the IPs should be obscured when discussing distributed crawlers that are run from volunteer computers.
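
A minimal Python sketch of the masking convention quoted above, for anyone preparing log lines for posting. The function names are hypothetical, and the sample values are the Charter's own placeholder numbers, not real addresses. It assumes the host name starts with the four IP octets in reverse order, which is not true of every host.

```python
def mask_ip(ip: str) -> str:
    """Replace the last octet of a dotted-quad IP with 'nnn'."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError(f"expected four octets, got {ip!r}")
    return ".".join(octets[:3] + ["nnn"])


def mask_reversed_host(host: str) -> str:
    """Mask the first label of a host name that begins with the IP octets reversed.

    Assumption: the first four labels are the octets, last octet first,
    as in the Charter's 'Example DNS' line.
    """
    labels = host.split(".")
    if len(labels) < 5:
        raise ValueError(f"expected four octet labels plus a domain, got {host!r}")
    return ".".join(["nnn"] + labels[1:])


print(mask_ip("111.222.333.444"))
print(mask_reversed_host("444.333.222.111.example.com"))
```

Both calls hide the same octet: the last one in the IP, which is the first label in a reversed host name.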

I have the full Host info, of course, but bothunting/tracking is too OCD/time-consuming as-is:) so please don't Sticky me for the missinnng details. If you know a server reverses IPs in its Host names, I guess you'll have to swap 'em around yourself as need be, sorry.


 1:48 am on Aug 2, 2009 (gmt 0)

From my tracking script, not just bad bots, but a plethora of open proxies and hacked servers that are constantly used for comment-spam attempts. your-server.de is slowly becoming NETDIREKT.


 2:57 am on Aug 2, 2009 (gmt 0)

Same bot, same webserver space, within two seconds of each other. Note the Hosts...

(If life was like the movie TIMECOP and "The same matter cannot occupy the same space" vis-a-vis scourge hosts/farms/clouds*, these two would be gone in a flash. Forever;)

Mozilla/5.0 (compatible; proximic; +http://www.proximic.com)
robots.txt? YES

08/01 06:20:24

Mozilla/5.0 (compatible; proximic; +http://www.proximic.com)
robots.txt? Yes BUT ignored it

08/01 06:20:25
08/01 06:20:26

*see also:
amazonaws.com plays host to wide variety of bad bots [webmasterworld.com]


 7:01 am on Aug 2, 2009 (gmt 0)

btw, your-server.de is actually hetzner.de, and it's no surprise there are a lot of rogue bots on their net, since they are one of the cheapest providers in Germany, offering quite powerful dedicated servers with unlimited traffic (though bandwidth is reduced after the first two TB in a month, iirc) for a low price.
I've come to find them quite reasonable in dealing with complaints, so if you're tracking open proxies etc anyhow, you might consider handing them a list and asking for action.


 4:21 pm on Aug 2, 2009 (gmt 0)

pfui - I know the charter says to obfuscate, but you obfuscated the wrong bit. It should have been (for some of them) nnn.123.124.125 not 125.124.123.nnn. It is not possible to reverse the IP to detect the offending range if the vital bit is obscured. It is better to give the IP rather than the rDNS.

For example: 47.34.46.nnn resolves to Bell in Canada. The correct IP range should begin nnn.46.34.47, but it's difficult to discover what nnn is and hence which block your-server.de resides on in that instance. nnn could be anything from 80 to 95 but isn't, nor is it in the 21n range. There are several other possibilities including 77, 78, 79 and in fact it appears to resolve to - I actually have the whole block 78.46.nnn.nnn already blocked. :)
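
The swap described here can be sketched in a few lines of Python. This is only an illustration: both function names are hypothetical, the input is the masked example from this post, and the candidate first octets are the ones guessed at above, so the results are things to check by lookup rather than confirmed addresses.

```python
from typing import Iterable, List


def reverse_octets(masked: str) -> str:
    """Flip a four-part dotted value, e.g. host-name order into IP order."""
    parts = masked.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated parts, got {masked!r}")
    return ".".join(reversed(parts))


def fill_candidates(masked_ip: str, candidates: Iterable[int]) -> List[str]:
    """Substitute each candidate value for the first 'nnn' placeholder."""
    return [masked_ip.replace("nnn", str(c), 1) for c in candidates]


ip_order = reverse_octets("47.34.46.nnn")       # value as posted, in host-name order
guesses = fill_candidates(ip_order, [77, 78, 79])  # candidate first octets to look up
print(ip_order)
print(guesses)
```

The point of the sketch is the problem raised in this post: once the leading octet is the masked one, the remaining three octets only narrow things down to a list of guesses.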

I agree it is not always obvious when to reverse the numbers and I appreciate your time is valuable. I also appreciate your postings. :)

blend27 - I almost wrote the same thing about netdirekt, which is a known exploit source. :)

janharders - life is too short to compile a list of exploits from there. :)


 8:02 pm on Aug 3, 2009 (gmt 0)

dstiles: Again I get what you're saying about which nnn bits to block, or not, in posts.

Here's the thing:

We do rDNS on the server so my Apache ELF entries show visitors by Host name. Plain IPs only appear when there's no Host.

That's why, after white- or blacklisting by UA, I then 403 by Host, and thereafter 403 by IP/CIDR if need be. And that's why the majority of my bot-sighting posts show Host info, not IPs: I don't need to WHOIS every bot-running Host I spot prior to blocking. And I don't have time to WHOIS them just to post.
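
For readers who want to see that order of checks spelled out, here is a rough Python sketch. It is not the actual server setup described in this post. The rule lists, the `decide` function and the sample hosts are all hypothetical; the UA tokens, the host suffix and the 78.46 block are simply ones mentioned in this thread. It also assumes a whitelisted UA is let through immediately, which is easy to spoof.

```python
import ipaddress
from typing import Optional

# Hypothetical rule sets, for illustration only.
ALLOWED_UA_TOKENS = ("googlebot",)
BLOCKED_UA_TOKENS = ("grubng", "heritrix", "nutch")
BLOCKED_HOST_SUFFIXES = (".your-server.de",)
BLOCKED_NETWORKS = (ipaddress.ip_network("78.46.0.0/16"),)


def decide(user_agent: str, host: Optional[str], ip: str) -> int:
    """Return an HTTP status: 200 to serve, 403 to refuse."""
    ua = user_agent.lower()
    # Step 1: white- or blacklist by User-Agent (assumption: whitelist wins outright).
    if any(token in ua for token in ALLOWED_UA_TOKENS):
        return 200
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        return 403
    # Step 2: 403 by Host name, when rDNS produced one.
    if host and host.lower().endswith(BLOCKED_HOST_SUFFIXES):
        return 403
    # Step 3: 403 by IP/CIDR as the fallback.
    address = ipaddress.ip_address(ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return 403
    return 200


print(decide("GrubNG 20080128", None, "192.0.2.10"))
print(decide("Mozilla/5.0", "static.example.your-server.de", "192.0.2.10"))
print(decide("Mozilla/5.0", None, "78.46.1.2"))
```

The Host check only helps when the log or request carries a resolved name; with no Host, the CIDR list is the only thing left to match on.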

So where does how we do things leave you in terms of lookups and/or nnn reversals?

You're on your own:)

FWIW, at least vis-a-vis hits to our Class C, the vast majority are by Hosts, and the worst trouble-making Hosts do not reverse IPs in their names.


 10:01 pm on Aug 3, 2009 (gmt 0)

Ah. I now understand.

All of my logs show IP not rDNS. It's faster, although speed is not so much of a problem now (always provided the DNS server doesn't bottleneck).

The only time the server does rDNS is for stats analysis - which I took ages setting up per site and none of the b clients uses! :(

I suppose a problem in blocking by host rather than IP is that server farms often have rDNS set up to the clients' domains (mine all are), so you would need to block a lot of domains instead of a range of IPs. Obviously more selective but in my case much more server-time consuming.

So: I'm on my own. No problem now I understand. :)


 10:53 pm on Aug 3, 2009 (gmt 0)

just to let you guys know: others are watching these threads and do appreciate your efforts, and sometimes we can even glean "tangential info". janharders' post #3964072 pointed me to a hitherto unknown (to me) dedi server facility in Germany, at reasonable prices for the spec.

I was looking for one (not urgently, but for a future project). hetzner.de will do nicely :) and I promise not to run bots off it.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved