
new badbot: EnaBot/1.0


Achernar

1:42 am on Feb 25, 2008 (gmt 0)




Came to one of my sites.
Doesn't understand gzip, doesn't respect Crawl-delay, and doesn't respect disallow rules (with or without wildcards). Worse, a plain-text email (no html, no attachment) sent to their "crawler specialist" was rejected with:
554-'The message was rejected because it contains prohibited virus or spam content'

Too bad they can't even set up a mail server. :)
Their search technology is doomed from day one.
For me this bot is history.

67.202.59.222 - - [25/Feb/2008:00:12:20 +0100] "GET /robots.txt HTTP/1.1" 200 544 "" "EnaBot/1.1 (http://www.enaball.com/crawler.html)"
...

Disallow: /*TN=1
Disallow: /liens.php?s=1
Disallow: /Lnouv.php?t=2&l=1
Disallow: /Lnouv.php?t=2&l=3
Disallow: /Lnouv.php?t=a&l=1
Disallow: /Lnouv.php?t=a&l=3

...
67.202.59.222 - - [25/Feb/2008:00:28:41 +0100] "GET /liens.php?s=1 HTTP/1.1" 200 12441
67.202.59.222 - - [25/Feb/2008:00:38:47 +0100] "GET /Lnouv.php?t=a&l=3 HTTP/1.1" 200 6583
67.202.59.222 - - [25/Feb/2008:00:38:48 +0100] "GET /Lnouv.php?t=a&l=1 HTTP/1.1" 200 16818
67.202.59.222 - - [25/Feb/2008:00:38:49 +0100] "GET /Lnouv.php?t=a&TN=1 HTTP/1.1" 200 18603
67.202.59.222 - - [25/Feb/2008:00:39:17 +0100] "GET /Lnouv.php?t=2&l=3 HTTP/1.1" 200 7446
67.202.59.222 - - [25/Feb/2008:00:39:19 +0100] "GET /Lnouv.php?t=2&l=1 HTTP/1.1" 200 40175
67.202.59.222 - - [25/Feb/2008:00:39:21 +0100] "GET /Lnouv.php?t=2&TN=1 HTTP/1.1" 200 31483
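
For the record, honoring those rules takes only a few lines. A sketch in Python (the rule list is from my robots.txt above; "/*TN=1" relies on the wildcard extension the big engines support, which the original RFC doesn't define):

import re

# Googlebot-style matching: '*' matches any run of characters, everything
# else is literal, and rules are anchored at the start of the path.
def rule_to_regex(rule):
    return re.compile("^" + ".*".join(re.escape(part) for part in rule.split("*")))

rules = [rule_to_regex(r) for r in ["/*TN=1", "/liens.php?s=1"]]

def disallowed(path):
    return any(r.match(path) for r in rules)

print(disallowed("/Lnouv.php?t=2&TN=1"))  # True  -- caught by the "/*TN=1" wildcard
print(disallowed("/liens.php?s=1"))       # True  -- plain prefix rule
print(disallowed("/index.html"))          # False -- allowed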

arknight

8:07 am on Feb 25, 2008 (gmt 0)




enaball.com is registered through GoDaddy to their preferred privacy provider, DomainsByProxy.com, and the domain is served from a GoDaddy server.

My server logs show EnaBot originating from two different IPs so far: 67.202.59.222 and 72.44.33.215

RDNS for 67.202.59.222: ec2-67-202-59-222.compute-1.amazonaws.com
RDNS for 72.44.33.215: ec2-72-44-33-215.compute-1.amazonaws.com
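
Anyone can reproduce the lookups with a few lines of standard-library Python (the IPs are the two from my logs):

import socket

# Reverse-DNS (PTR) lookups for the two EnaBot IPs seen so far.
for ip in ["67.202.59.222", "72.44.33.215"]:
    host, _, _ = socket.gethostbyaddr(ip)
    print(ip, "->", host)

# 67.202.59.222 -> ec2-67-202-59-222.compute-1.amazonaws.com
# 72.44.33.215  -> ec2-72-44-33-215.compute-1.amazonaws.com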

Both EnaBot IPs are listed in whois as part of Amazon Development Centre South Africa

NetRange: 67.202.0.0 - 67.202.63.255
CIDR: 67.202.0.0/18
NetName: AMAZON-EC2-3
NetHandle: NET-67-202-0-0-1

NetRange: 72.44.32.0 - 72.44.63.255
CIDR: 72.44.32.0/19
NetName: AMAZON-EC2-2
NetHandle: NET-72-44-32-0-1

It is an erratic little bugger. At one point tonight (2008.02.25) it grabbed 10 files totaling 752,289 bytes in a 2-second interval.

On 2008.02.24 it spidered the exact same files in the same order in 5 seconds.

Amazon development techs should have better sense than this.

roger enaball

6:13 pm on Feb 25, 2008 (gmt 0)




Achernar, my apologies for our crawler. As soon as we saw your post last night, we pulled down the crawler to investigate.

It seems we weren't handling anything outside of the robots.txt RFC (http://www.robotstxt.org/norobots-rfc.txt) in any sensible or consistent way. In this case, we were stripping everything after a question mark from the URL (because '?' isn't allowed in a deny line), but not from the deny line itself. This caused the URL "/Lnouv.php?t=2&l=1" to be reduced to "/Lnouv.php" before failing a prefix match against the full rule "/Lnouv.php?t=2&l=1". This is now fixed, so we should handle your deny lines as intended.
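
To make the failure concrete, here is a simplified sketch of the buggy match versus the fixed one (illustration only, not our actual crawler code):

def is_disallowed_buggy(url_path, rule):
    # Bug: the query string is stripped from the URL but not from the rule,
    # so a rule containing '?' can never prefix-match anything.
    url_path = url_path.split('?')[0]  # "/Lnouv.php?t=2&l=1" -> "/Lnouv.php"
    return url_path.startswith(rule)

def is_disallowed_fixed(url_path, rule):
    # Fix: compare the full path, query string included.
    return url_path.startswith(rule)

rule = "/Lnouv.php?t=2&l=1"
print(is_disallowed_buggy("/Lnouv.php?t=2&l=1", rule))  # False -- wrongly crawled
print(is_disallowed_fixed("/Lnouv.php?t=2&l=1", rule))  # True  -- correctly skipped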

(As a side note, our earlier behavior came from robotparser in the Python standard library.)

As for why our email server rejected you, I have no idea. GoDaddy is forwarding the crawler email. You can also try me personally at roger@enaball.com, although it is set up the same way. (I'll keep looking into this - missing email really isn't good.)

We first started crawling the open web last Friday, so as you can imagine we learned a lot over the weekend. We also had a problem where we weren't excluding .iso files.

Thanks, and sorry again for any inconvenience,
Roger

roger enaball

10:11 pm on Feb 25, 2008 (gmt 0)




Regarding crawl-delay, how would you expect that to be implemented? What we were doing was a 5-second wait between connections, but fetching 10 pages per connection. The thinking was that this reduces server load, because it avoids much of the overhead of opening and closing connections, and also avoids leaving connections open while unused (some webservers have issues with their number of open connections). But as you noted, this results in 10 pages being fetched one after another, and I suspect you aren't alone in disliking this. So the question becomes: what is best to do?

1) Fetch page. Close connection. Wait crawl-delay. Open new connection. Repeat

2) Fetch page. Leave connection open. Wait crawl-delay. Repeat

3) Fetch X number of pages on a single connection. Close connection. Wait crawl-delay (or x * crawl-delay). Repeat.

Or some other possibility that I haven't thought of? Should we do something different when a crawl-delay is specified in robots.txt, as opposed to our default 5-second delay? I am leaning towards #2 for both - does this seem right?

Regarding why we might be fetching from arknight twice - that is a good question. There are three possibilities. (1) Sometimes it is hard to normalize URLs (on Windows case doesn't matter, but on unix it does; some subdomains are equivalent to the parent domain, some aren't; sometimes .org/.net are the same as .com without a redirect; and so on). So it could be you have multiple valid URLs to your files in a way we aren't currently able to figure out before fetching the actual files. (2) When we pull the crawlers down to make a change (as we did to support Achernar's robots.txt file), some work gets lost and is redone. This probably could be eliminated. (3) We could have a bug. Any of the three is possible. We are working on minimizing #1, eliminating #2, and detecting #3 so we can fix it.
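
To give a flavor of #1, our normalization does something roughly like this (an illustrative sketch, not our real pipeline; the example URL is made up):

from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    # urlsplit already lowercases the scheme and hostname (always safe);
    # we drop a default :80 port and any fragment, but leave the path
    # untouched -- case can matter on unix servers.
    parts = urlsplit(url)
    host = parts.hostname
    if parts.port and parts.port != 80:
        host = "%s:%d" % (host, parts.port)
    return urlunsplit((parts.scheme, host, parts.path or "/", parts.query, ""))

print(normalize("HTTP://WWW.Example.COM:80/Lnouv.php?t=2&l=1#top"))
# http://www.example.com/Lnouv.php?t=2&l=1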

I would also note that Amazon is our hosting provider (hence the aws part of the domain name in the reverse lookup), but they didn't write any of our code. So the blame for the early crawling issues is really ours, not theirs.

Achernar - is your email address on a dyndns domain? We've turned off the spam filter, so hopefully we can receive emails from you now. But it sounds like there might be a separate phishing blacklist that we can't disable, which I suppose might include dyndns.

Anyway, thanks again for bringing up the issues you saw. I'm sure many more people were annoyed by the same things but didn't communicate them. If it weren't for you, we wouldn't have known about them.

Roger

jdMorgan

11:06 pm on Feb 25, 2008 (gmt 0)




Expected behaviour would be similar to the major 'bots that support Crawl-delay, and that is what you called case #1:

1) Fetch page. Close connection. Wait crawl-delay. Open new connection. Repeat

Case number two might be acceptable if the Crawl-delay was one second or explicitly set to zero:
2) Fetch page. Leave connection open. Wait crawl-delay. Repeat

Case number three I can only see as acceptable if the Crawl-delay is explicitly specified as zero:
3) Fetch X number of pages on a single connection. Close connection. Wait crawl-delay (or x * crawl-delay). Repeat.

In the vast majority of cases, the default 'bot behaviour when Crawl-delay is not specified in robots.txt should be case 1, with an interval of at least one second. If a very large site has a problem with its numerous pages being crawled too slowly, its operators can go in and declare a short Crawl-delay.
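
In rough Python terms, case 1 amounts to something like the following (a sketch only - the user-agent token, URLs, and fallback delay are placeholders):

import time
import urllib.request
import urllib.robotparser

robots = urllib.robotparser.RobotFileParser("http://www.example.com/robots.txt")
robots.read()

# Honor a declared Crawl-delay; otherwise default to at least one second.
# (crawl_delay() needs Python 3.6+ and returns None when unspecified.)
delay = robots.crawl_delay("EnaBot") or 1.0

# Placeholder URL list; a real crawler would feed this from its queue.
for url in ["http://www.example.com/a.html", "http://www.example.com/b.html"]:
    if not robots.can_fetch("EnaBot", url):
        continue
    with urllib.request.urlopen(url) as resp:  # one request per connection
        body = resp.read()                     # connection closes on exit
    time.sleep(delay)                          # wait before opening the next one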

The main reasons for the existence of Crawl-delay are:
1) Slow shared servers -- slow due to old hardware, or slow due to being shared among too many sites.
2) Slow connections -- limited bandwidth due to network limitations or due to being over-shared.
3) Slow back-ends, bad database implementations, over-complicated page-generation scripts.

On one hand you can say, "We'll pipeline as fast as we can unless the Webmaster says otherwise," but on the other hand, those very Webmasters who don't know about Crawl-delay may come out in droves asking how to ban a User-agent perceived as abusive.

So, your best defense is to spider widely --fetching only a few pages across many domains-- rather than spidering deeply into each domain. Accepting compressed content and using the If-Modified-Since/Last-Modified client/server protocol will help prevent backlash as well.
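
A conditional fetch is only a few lines; something like this (an illustrative sketch - the URL and stored timestamp are placeholders):

import urllib.error
import urllib.request

url = "http://www.example.com/page.html"
last_modified = "Mon, 25 Feb 2008 00:12:20 GMT"   # saved from an earlier fetch

req = urllib.request.Request(url, headers={
    "If-Modified-Since": last_modified,           # only send if changed
    "Accept-Encoding": "gzip",                    # accept compressed content
})
try:
    with urllib.request.urlopen(req) as resp:
        body = resp.read()                        # 200: changed (may be gzipped)
        last_modified = resp.headers.get("Last-Modified", last_modified)
except urllib.error.HTTPError as err:
    if err.code != 304:                           # 304: unchanged, skip it
        raise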

You'd also do well to stop spidering and review all of the server responses you received over the weekend, checking individual robots.txt file images against your fetch lists; there is no use in continuing to crawl if the robot has any problems with robots.txt handling.

Above all, you'll want to avoid annoying Webmasters; until you have something to give us in return for our bandwidth, we can be a right testy lot. And once your 'bot gets a bad reputation, it will take a conspicuously big benefit to get back into Webmasters' good graces. And of course, threads like this may hang around and haunt you for a very long time.

The winning attitude is to consider your 'bot a guest in our (server) homes. Check robots.txt carefully to see if you need to leave your shoes by the door, and be a polite and respectful guest. :) No matter how generous a guest might eventually be with rewards (e.g. traffic, revenue), no-one wants that guest if he trashes the place every time he shows up.

Understand that your 'bot is unknown to most Webmasters; with all of the scraping and harvesting going on these days, some simply shoot first (permanent 403 for user-agent and/or IP range) and ask questions later - if ever. And in many cases, it is these well-cared-for and well-protected sites that offer uncommon value to your index.

Jim

arknight

4:10 am on Feb 26, 2008 (gmt 0)




Nicely elaborated jdMorgan.

Look, I'm a nice enough guy. So far this little pest has only appeared in the logs of a server that is presently sitting on a great deal of bandwidth slack. Hell, I even let Cyveillance play their stupid games on this server, although on my two main commercial servers I resist them, because they are bandwidth pigs who do not ask nicely first. This bot is uncommonly erratic and hammerheaded.

It first appeared in my logs on 2008.02.22. The first content it spidered was a small WordPress blog, and it seemed to do a nice, tame job of properly spidering it. The bot appeared several times on 2008.02.23, picking up one or two static pages each visit without any apparent logic to the choices. It always downloaded robots.txt first, so I assumed it was a new search engine bot coming in off of inbound links.

On the 24th, it spidered the main index file for a fairly large archive of static pages; the index is the largest file in the archive. It then hit the sub-indexes, which sit in the archive's 5 individual folders and together list the same links. The archive consists of Congressional Daily Records that have been marked up to XHTML and are heavily interlinked, so that specific sections can be hot-linked as pointers from direct citations elsewhere on the web. The main index file is presently almost 1/4 MB in size. This ignorant little bugger has now downloaded these same 6 index files 5 times in 2 days.

On its last visit it spidered 16 files totaling 1,042,379 bytes in 1min 15sec. Even more irritating, these were the same files, in the same sequence, that it had spidered 14min 44sec earlier over what seemed to be a decently paced 8min 10sec - except the pacing was actually erratic as hell.

Intervals between GETs:
1 sec
0 sec
21 sec
1 sec
10 sec
1 sec
8 sec
34 sec
1 sec
6min 45 sec
3 sec
2 sec
1 sec
1 sec
1 sec

What I know is that this bot points to a GoDaddy domain whose ownership is stealthed via Domains by Proxy. That alone is enough to raise a flag. Why would an honest search-engine start-up, which claims to be working on both public and private applications, need to hide ownership of its main domain? In my experience this is extremely aberrant behaviour.

I come back to this forum today and see a post purporting to represent the bot's handler, who wonders why anyone would care if they don't throttle bots down through spiders.txt. This arrogance astounds me. Look here, partner: you are banging on what I consider to be a private server, which I am kind enough to allow public access to. It is presently not commercial in any shape or form; not one ad or referral is to be found anywhere on it. I pay for it out of my own pocket, and aside from using it to host a few pet projects, I use it as a dev testbed/sandbox. It is one hell of a lot easier to block whole CIDR ranges than to constantly tweak htaccess and spiders.txt to suit my personal dev needs.

I cannot conceive of any present personal utility in allowing access to anything originating from AMAZON-EC2-2 or AMAZON-EC2-3. Blocking them is as simple as adding two lines to a text file, and I don't even need to shell in or double-check the Apache manual for syntax first; the host provides an easy-to-use browser-based interface to this server functionality.
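
For anyone wanting to do the same, the two lines amount to something like this in .htaccess (Apache 2.2 mod_authz_host syntax - adapt to your own configuration):

# Block the two Amazon EC2 ranges EnaBot has used; allow everyone else.
Order Deny,Allow
Deny from 67.202.0.0/18
Deny from 72.44.32.0/19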

arknight

8:31 am on Feb 28, 2008 (gmt 0)




I have decided to block EnaBot; initially just at the individual IP addresses which show in my server logs.

The following IPs are personally unverified, but show on publicly accessible logs as also being used by EnaBot. The IPs were gathered from a quick skim of a web search using the string {enabot} as the search term:

67.202.55.112
67.202.49.172
67.202.63.119
67.202.59.222
67.202.48.124

They all fall within:
NetRange: 67.202.0.0 - 67.202.63.255
CIDR: 67.202.0.0/18
NetName: AMAZON-EC2-3
NetHandle: NET-67-202-0-0-1

One observation from the web search, which may be irrelevant: it returned a significantly higher proportion of sites hosted in Russia than I would have predicted.