EdisterBot

     
9:27 pm on Feb 23, 2012 (gmt 0)

keyplyr

rDNS: web4.twinuff.com.
IHNetworks, LLC, Los Angeles
67.222.96.0 - 67.222.111.255
67.222.96.0/20
robots.txt: yes

Well behaved; took 30 pages, but no idea what the purpose is - no info page.
10:08 pm on Feb 23, 2012 (gmt 0)

incrediBILL

You didn't look hard enough:
[edister.com...]
[edister.com...]
10:31 pm on Feb 23, 2012 (gmt 0)

keyplyr

Later hits are now showing that link. The first couple did not.
5:49 am on Feb 24, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member

Hits with, and without, UA:

02-23:
web4.twinuff.com [projecthoneypot.org...]
EdisterBot (http://www.edister.com/bot.html)

02-20:
web1.twinuff.com [projecthoneypot.org...]
-

robots.txt? Yes
7:38 am on Feb 24, 2012 (gmt 0)

lucy24

Is it this guy?

[jonathanleger.com...]
6:47 am on Feb 25, 2012 (gmt 0)

10+ Year Member

Yeah, it's me.

Sorry about the lack of a user-agent. It was set in the INI but wasn't getting passed to CURL for some reason. I didn't realize that until somebody put a support ticket in asking what was up. Fortunately that was caught and corrected very early.
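
For illustration only - this is not the actual EdisterBot code and every name below is made up - the fix boils down to setting the user agent explicitly on the cURL handle for each request instead of trusting a config value to make it through, e.g. with pycurl:

from io import BytesIO
import pycurl

BOT_UA = "ExampleBot (http://www.example.com/bot.html)"   # placeholder UA string

def fetch(url):
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.USERAGENT, BOT_UA)        # set on the handle for every request
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()
    return buf.getvalue()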

As far as what the purpose of the crawling is, think SEOMoz.org with a different focus.

Of course, the crawlers respect robots.txt and will abide if you don't want them around.
9:27 am on Feb 25, 2012 (gmt 0)

incrediBILL

Of course, the crawlers respect robots.txt and will abide if you don't want them around.


Of course, unless your robots.txt is set up in whitelisting format, which virtually nobody does, they don't know who you are initially so there is no disallow to honor, ever.
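
For reference, a whitelisting-format robots.txt looks roughly like this (the allowed bot names are only examples): crawlers you have explicitly named are let in, and everything else falls through to the closing catch-all group and is disallowed before you have ever heard of it.

# allow only the crawlers named here; every other bot is blocked by default
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /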
12:19 pm on Feb 25, 2012 (gmt 0)

lucy24

they don't know who you are initially so there is no disallow to honor, ever.

? Is your site so wide-open that you never have to use a generic "Hey you" ?

User-Agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere

Doesn't take much for a robot to honor or dishonor something in that form. I once had an under-construction site with a phony directory set up purely to trap robots ahead of time.
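
A bare-bones version of that kind of trap, using a hypothetical /botbait/ path: disallow a decoy directory that nothing legitimate links to or needs, and any client that still requests it has read (or skipped) robots.txt and crawled it anyway - the access log does the rest.

# decoy entry - /botbait/ is a made-up path with no real content behind it
User-agent: *
Disallow: /botbait/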
1:11 pm on Feb 25, 2012 (gmt 0)

incrediBILL

@lucy, I wasn't talking about webmasters with a clue; I was talking about the 99.9999% out there that most likely don't even have a robots.txt file in the first place.
7:54 pm on Feb 27, 2012 (gmt 0)

keyplyr

@jonathanleger

Your bot disobeys robots.txt directives. On Feb 23 I added this deny (along with a filter just in case):

User-agent: EdisterBot
Disallow: /

And today your bot came, read robots.txt and disobeyed the directive:

67.222.109.37 - - [26/Feb/2012:16:32:27 -0700] "GET example.com/robots.txt HTTP/1.1" 200 2713 "" "EdisterBot (http://www.edister.com/bot.html)"
67.222.109.37 - - [26/Feb/2012:16:32:28 -0700] "GET example.com HTTP/1.1" 403 1061 "" "EdisterBot (http://www.edister.com/bot.html)"
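
The 200 on robots.txt followed by the 403 on the page is consistent with a user-agent filter along these lines (an Apache 2.2-era sketch; keyplyr's actual rules aren't shown in the thread):

# .htaccess sketch: deny the bot by UA but still let it read robots.txt
SetEnvIfNoCase User-Agent "EdisterBot" deny_bot
SetEnvIf Request_URI "^/robots\.txt$" !deny_bot
Order Allow,Deny
Allow from all
Deny from env=deny_bot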
10:29 pm on Feb 27, 2012 (gmt 0)

10+ Year Member

Can you PM me the domain in question? I need to see if there's anything different, since it's obeying the directive on other sites. I'll get it fixed up, though, I promise.
11:42 pm on Feb 27, 2012 (gmt 0)

keyplyr

I PM'd the info; however, there's really no need to "fix" anything. It's no different from thousands of other bots that request robots.txt and then disobey the directives. That's why I always add a filter to block new bots until I see how they follow the rules.
1:17 am on Feb 28, 2012 (gmt 0)

10+ Year Member

Thanks for the info. There is a need to fix it, because there's a very real difference between my bot and those others. The difference is that I actually want it to obey robots.txt. If it's failing to do that then it has to be fixed, for the benefit of all who don't want it to crawl their sites.
4:31 am on Feb 28, 2012 (gmt 0)

10+ Year Member

FYI, I just attempted a manual crawl on the site you PM'd me and it immediately shut down saying it was denied by robots.txt.
9:58 pm on Feb 29, 2012 (gmt 0)



user-agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere


Thing is, even Googlebot doesn't read that. I found this out tonight when they hit my trap page. I promptly replaced the Googlebot-specific directory disallow list.
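
If a Googlebot-specific group was already present in that file, this is the documented group behavior rather than Googlebot misreading it: a crawler obeys only the most specific User-agent group that matches it, and groups don't combine, so any disallow meant for Googlebot has to be repeated under its own group, e.g.:

User-agent: Googlebot
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere

User-agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere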
10:04 pm on Feb 29, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

EdisterBot hit one of my test sites today. It didn't even bother to fetch robots.txt; it went straight to the index page. Curious, though, where you're getting your crawl lists from (whois?). The test site has no inbound links, and the only visits it gets are from bots scanning IP blocks. It only gets around ten hits a day.

Also, your bot is sending an HTTP_REFERER header, but it's blank.
10:21 pm on Feb 29, 2012 (gmt 0)

10+ Year Member

Sounds like a bogus bot spoofing the user agent. I've seen a few of them already. The first thing EdisterBot gets before crawling a site is robots.txt.

Oh, and all new domains are discovered via external links from other sites.
10:25 pm on Feb 29, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

67.222.109.36
web1.twinuff.com
HTTP_ACCEPT{'*/*'}
HTTP_REFERER{''}
HTTP_USER_AGENT{'EdisterBot (http://www.edister.com/bot.html)'}
10:26 pm on Feb 29, 2012 (gmt 0)

10+ Year Member

And it didn't get robots.txt? I'm looking at the code and it grabs that before doing anything else. What's the domain?
10:39 pm on Feb 29, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Sorry, I can't reveal more than what I've already shared.

Have you thought about caching robots.txt or putting some sort of system in place to verify that EdisterBot is hitting the right robots.txt?
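
A per-host cache plus a pre-request check is only a few lines. A rough sketch - not EdisterBot's actual code, and the names are placeholders - using Python's urllib.robotparser:

from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

BOT_UA = "ExampleBot (http://www.example.com/bot.html)"   # placeholder UA
_robots = {}                                               # host -> parsed robots.txt

def allowed(url):
    parts = urlsplit(url)
    host = parts.netloc
    rp = _robots.get(host)
    if rp is None:
        rp = RobotFileParser("%s://%s/robots.txt" % (parts.scheme, host))
        rp.read()                     # fetched before any page on this host
        _robots[host] = rp
    return rp.can_fetch(BOT_UA, url)  # the crawl loop skips any URL where this is False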
11:43 pm on Feb 29, 2012 (gmt 0)

10+ Year Member

The system I am using is that it doesn't crawl without getting robots.txt first. Was the robots file in a subfolder or the root? I'm trying to narrow possibilities down.
11:47 pm on Feb 29, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

root
 
