
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
EdisterBot
keyplyr




msg:4421085
 9:27 pm on Feb 23, 2012 (gmt 0)


rDNS: web4.twinuff.com.
IHNetworks, LLC, Los Angeles
67.222.96.0 - 67.222.111.255
67.222.96.0/20
robots.txt: yes

Well behaved, took 30 pages but no idea what the purpose is - no info page.

 

incrediBILL




msg:4421103
 10:08 pm on Feb 23, 2012 (gmt 0)

You didn't look hard enough:
[edister.com...]
[edister.com...]

keyplyr




msg:4421115
 10:31 pm on Feb 23, 2012 (gmt 0)

Later hits are now showing that link. The first couple did not.

Pfui




msg:4421223
 5:49 am on Feb 24, 2012 (gmt 0)

Hits with, and without, UA:

02-23:
web4.twinuff.com [projecthoneypot.org...]
EdisterBot (http://www.edister.com/bot.html)

02-20:
web1.twinuff.com [projecthoneypot.org...]
-

robots.txt? Yes

lucy24




msg:4421245
 7:38 am on Feb 24, 2012 (gmt 0)

Is it this guy?

[jonathanleger.com...]

jonathanleger




msg:4421605
 6:47 am on Feb 25, 2012 (gmt 0)

Yeah, it's me.

Sorry about the lack of a user-agent. It was set in the INI but wasn't getting passed to CURL for some reason. I didn't realize that until somebody put a support ticket in asking what was up. Fortunately that was caught and corrected very early.
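A minimal sketch of one way to guard against that kind of slip, written in Python rather than whatever EdisterBot actually runs on. The `[crawler]` section, the `user_agent` key, and both function names are assumptions made for illustration. The idea is to fail loudly when the configured value is empty and to attach the header explicitly to every request.

```python
import configparser
import urllib.request


def load_user_agent(ini_text: str) -> str:
    """Read the user-agent string from INI text, failing loudly if it is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # Section and key names are hypothetical; adapt them to the real INI layout.
    ua = parser.get("crawler", "user_agent", fallback="").strip()
    if not ua:
        raise ValueError("user_agent is not configured; refusing to crawl anonymously")
    return ua


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Attach the user agent explicitly so it cannot be dropped silently."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


EXAMPLE_INI = """
[crawler]
user_agent = EdisterBot (http://www.edister.com/bot.html)
"""

request = build_request("http://example.com/robots.txt", load_user_agent(EXAMPLE_INI))
# urllib normalizes header names, so the lookup key is "User-agent".
print(request.get_header("User-agent"))
```

Checking the header on the built request, rather than only the config value, is what would catch a setting that is read correctly but never passed along.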

As far as what the purpose of the crawling is, think SEOMoz.org with a different focus.

Of course, the crawlers respect robots.txt and will abide if you don't want them around.

incrediBILL




msg:4421625
 9:27 am on Feb 25, 2012 (gmt 0)

"Of course, the crawlers respect robots.txt and will abide if you don't want them around."


Of course, unless a site's robots.txt is set up in whitelist format, which virtually nobody bothers to do, webmasters don't know who you are initially, so there is no Disallow to honor, ever.
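For readers unfamiliar with the whitelist format, here is a rough sketch of what such a file could look like, checked with Python's standard `urllib.robotparser`. The choice of Googlebot and bingbot as the allowed crawlers is only an assumption for illustration. Named bots get an empty `Disallow:` (allow everything) and everyone else falls through to the catch-all `Disallow: /`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt: named bots allowed, all others denied.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("Googlebot", "EdisterBot"):
    print(agent, allowed(WHITELIST_ROBOTS_TXT, agent, "http://example.com/page.html"))
```

With a file like this, a previously unknown bot is denied on its very first visit without the webmaster having to know its name in advance.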

lucy24




msg:4421668
 12:19 pm on Feb 25, 2012 (gmt 0)

"they don't know who you are initially so there is no disallow to honor, ever."

? Is your site so wide-open that you never have to use a generic "Hey you" ?

User-Agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere

Doesn't take much for a robot to honor or dishonor something in that form. I once had an under-construction site with a phony directory set up purely to trap robots ahead of time.
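As an illustration of how such a trap can be checked, the sketch below loads the generic rules above into Python's `urllib.robotparser` and flags any request that lands on a disallowed path. The bot names and requested paths are invented sample data, not hits from a real log.

```python
from urllib.robotparser import RobotFileParser

GENERIC_ROBOTS_TXT = """\
User-Agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere
"""

parser = RobotFileParser()
parser.parse(GENERIC_ROBOTS_TXT.splitlines())


def violations(requests):
    """Return the (user_agent, path) pairs that requested a disallowed path."""
    return [
        (agent, path)
        for agent, path in requests
        if not parser.can_fetch(agent, "http://example.com" + path)
    ]


# Hypothetical requests: one bot stays in bounds, one walks into the trap.
SAMPLE_REQUESTS = [
    ("PoliteBot", "/OKhere/index.html"),
    ("RudeBot", "/keepout/secret.html"),
    ("RudeBot", "/OKhere/dontgohere/"),
]

print(violations(SAMPLE_REQUESTS))
```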

incrediBILL




msg:4421687
 1:11 pm on Feb 25, 2012 (gmt 0)

@lucy, I wasn't talking about webmasters with a clue, I was talking about the 99.9999% out there that most likely don't even have a robots.txt file in the first place.

keyplyr




msg:4422377
 7:54 pm on Feb 27, 2012 (gmt 0)

@jonathanleger

Your bot disobeys robots.txt directives. On Feb 23 I added this deny (along with a filter just in case):

User-agent: EdisterBot
Disallow: /

And today your bot came, read robots.txt and disobeyed the directive:

67.222.109.37 - - [26/Feb/2012:16:32:27 -0700] "GET example.com/robots.txt HTTP/1.1" 200 2713 "" "EdisterBot (http://www.edister.com/bot.html)"
67.222.109.37 - - [26/Feb/2012:16:32:28 -0700] "GET example.com HTTP/1.1" 403 1061 "" "EdisterBot (http://www.edister.com/bot.html)"
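A rough sketch of how this kind of check could be automated, assuming combined-format log lines like the two above and the same obfuscated `example.com` host. The regular expression, `SITE`, and `find_violations` are illustrative names, not part of any existing tool. A client is flagged only if it requests a disallowed path after it has already fetched robots.txt.

```python
import re
from urllib.robotparser import RobotFileParser

SITE = "example.com"  # placeholder host, as obfuscated in the log excerpt

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

LOG_LINES = [
    '67.222.109.37 - - [26/Feb/2012:16:32:27 -0700] "GET example.com/robots.txt HTTP/1.1" 200 2713 "" "EdisterBot (http://www.edister.com/bot.html)"',
    '67.222.109.37 - - [26/Feb/2012:16:32:28 -0700] "GET example.com HTTP/1.1" 403 1061 "" "EdisterBot (http://www.edister.com/bot.html)"',
]

ROBOTS_TXT = "User-agent: EdisterBot\nDisallow: /\n"


def find_violations(lines, robots_txt):
    """List (ip, path, status) for disallowed requests made after a robots.txt fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    fetched_rules = set()  # (ip, agent) pairs that have already read robots.txt
    found = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        target = match["target"]
        # Strip the obfuscated host prefix; a bare host means the root path.
        path = target[len(SITE):] if target.startswith(SITE) else target
        path = path or "/"
        client = (match["ip"], match["agent"])
        if path == "/robots.txt":
            fetched_rules.add(client)
        elif client in fetched_rules and not parser.can_fetch(match["agent"], path):
            found.append((match["ip"], path, match["status"]))
    return found


print(find_violations(LOG_LINES, ROBOTS_TXT))
```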

jonathanleger




msg:4422413
 10:29 pm on Feb 27, 2012 (gmt 0)

Can you PM me the domain in question? I need to see if there's anything different, since it's obeying the directive on other sites. I'll get it fixed up, though, I promise.

keyplyr




msg:4422442
 11:42 pm on Feb 27, 2012 (gmt 0)

I PM'd the info, however there's really no need to "fix" anything. It's no different from thousands of other bots that request robots.txt then disobey the directives. That's why I always add a filter to block new bots until I see how they follow the rules.
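The log excerpt earlier in the thread (200 for robots.txt, 403 for the next request) is consistent with a user-agent filter along these lines. This is only a guess at the behavior, sketched in Python; the token list and the function name are hypothetical, and the real filter is presumably a server rule rather than application code.

```python
# Hypothetical block list: lowercase substrings of user agents still on probation.
BLOCKED_AGENT_TOKENS = {"edisterbot"}


def response_status(user_agent: str, path: str) -> int:
    """Return the HTTP status a probation filter would send for this request."""
    if path == "/robots.txt":
        # Always serve robots.txt so the bot has a chance to read the rules.
        return 200
    agent = user_agent.lower()
    if any(token in agent for token in BLOCKED_AGENT_TOKENS):
        return 403
    return 200


UA = "EdisterBot (http://www.edister.com/bot.html)"
print(response_status(UA, "/robots.txt"), response_status(UA, "/"))
```

Serving robots.txt while refusing everything else makes it possible to see, from the logs alone, whether a new bot keeps requesting pages after reading a Disallow.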

jonathanleger




msg:4422469
 1:17 am on Feb 28, 2012 (gmt 0)

Thanks for the info. There is a need to fix it, because there's a very real difference between my bot and those others. The difference is that I actually want it to obey robots.txt. If it's failing to do that then it has to be fixed, for the benefit of all who don't want it to crawl their sites.

jonathanleger




msg:4422515
 4:31 am on Feb 28, 2012 (gmt 0)

FYI, I just attempted a manual crawl on the site you PM'd me and it immediately shut down saying it was denied by robots.txt.

Seedy




msg:4423284
 9:58 pm on Feb 29, 2012 (gmt 0)

user-agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere


Thing is, even Googlebot doesn't read that. I found this out tonight when they hit my trap page. I promptly replaced the Googlebot-specific directory disallow list.
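One likely explanation, assuming the file also contained a Googlebot-specific group: under the robots exclusion rules a crawler follows only the most specific group that names it and ignores the `*` group entirely. The sketch below reproduces that with Python's `urllib.robotparser`; the `/private` rule in the Googlebot group is invented for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private

User-agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

TRAP_URL = "http://example.com/keepout/trap.html"

# Googlebot matches its own group, so the generic rules never apply to it.
print("Googlebot:", parser.can_fetch("Googlebot", TRAP_URL))
# A bot with no group of its own falls back to the "*" rules.
print("OtherBot:", parser.can_fetch("OtherBot", TRAP_URL))
```

So any path that should be off limits to a bot with its own group has to be repeated inside that group.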

Key_Master




msg:4423293
 10:04 pm on Feb 29, 2012 (gmt 0)

EdisterBot hit one of my test sites today. It didn't even bother to fetch robots.txt. It went straight to the index page. Curious though where you are getting your crawl lists from (whois?). The test site has no inbound links and the only visits it gets are from bots scanning IP blocks. It only gets around ten hits a day.

Also, your bot is sending an HTTP_REFERER header, but it's blank.

jonathanleger




msg:4423305
 10:21 pm on Feb 29, 2012 (gmt 0)

Sounds like a bogus bot spoofing the user agent. I've seen a few of them already. The first thing EdisterBot gets before crawling a site is robots.txt.

Oh, and all new domains are discovered via external links from other sites.

Key_Master




msg:4423309
 10:25 pm on Feb 29, 2012 (gmt 0)

67.222.109.36
web1.twinuff.com
HTTP_ACCEPT{'*/*'}
HTTP_REFERER{''}
HTTP_USER_AGENT{'EdisterBot (http://www.edister.com/bot.html)'}
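Whether a hit like this is the real bot or a spoofer can be checked with forward-confirmed reverse DNS: resolve the IP to a hostname, confirm the hostname belongs to the expected domain, then resolve the hostname back and confirm it returns the same IP. A minimal Python sketch follows. The function name and the stub resolvers are assumptions; the stubs simply mirror the IP and hostname posted above so the example runs offline, and say nothing about what live DNS returns today.

```python
import socket


def forward_confirmed_rdns(ip, expected_suffix,
                           reverse=socket.gethostbyaddr,
                           forward=socket.gethostbyname_ex):
    """Return True if ip reverse-resolves into expected_suffix and back to ip."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False  # no PTR record or lookup failure
    if not hostname.rstrip(".").lower().endswith(expected_suffix.lower()):
        return False  # hostname is outside the expected domain
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses


# Stub resolvers standing in for real DNS (assumed data, taken from the post).
def fake_reverse(ip):
    return ("web1.twinuff.com", [], [ip])


def fake_forward(hostname):
    return (hostname, [], ["67.222.109.36"])


print(forward_confirmed_rdns("67.222.109.36", ".twinuff.com", fake_reverse, fake_forward))
```

Dropping the two stub arguments makes the function use the real resolver. A spoofer can copy the user-agent string, but not the reverse and forward records of the operator's IP range.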

jonathanleger




msg:4423310
 10:26 pm on Feb 29, 2012 (gmt 0)

And it didn't get robots.txt? I'm looking at the code and it grabs that before doing anything else. What's the domain?

Key_Master




msg:4423316
 10:39 pm on Feb 29, 2012 (gmt 0)

Sorry, I can't reveal more than what I've already shared.

Have you thought about caching robots.txt or putting some sort of system in place to verify that EdisterBot is hitting the right robots.txt?
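A minimal sketch of what such a system might look like, in Python and not based on EdisterBot's actual code. The `RobotsCache` class, its one-hour TTL, and the injected `fetch` callable are all assumptions. The robots.txt URL is always derived from the scheme and host of the page being requested, which is one way to make sure the rules come from the right site, and the parsed rules are cached per host.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Per-host cache of parsed robots.txt rules with a time-to-live."""

    def __init__(self, fetch, ttl_seconds=3600.0, clock=time.monotonic):
        self._fetch = fetch        # callable: robots.txt URL -> file text
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}           # (scheme, host) -> (expiry, parser)

    def _parser_for(self, url):
        parts = urlsplit(url)
        key = (parts.scheme, parts.netloc)
        entry = self._cache.get(key)
        if entry is None or entry[0] <= self._clock():
            # Always build the robots.txt URL from the target's own host root.
            robots_url = f"{parts.scheme}://{parts.netloc}/robots.txt"
            parser = RobotFileParser()
            parser.parse(self._fetch(robots_url).splitlines())
            entry = (self._clock() + self._ttl, parser)
            self._cache[key] = entry
        return entry[1]

    def allowed(self, user_agent, url):
        """Return True only if the cached rules for url's host permit the fetch."""
        return self._parser_for(url).can_fetch(user_agent, url)


# Stub fetcher standing in for a real HTTP request; it records each call.
calls = []


def fake_fetch(robots_url):
    calls.append(robots_url)
    return "User-agent: EdisterBot\nDisallow: /\n"


cache = RobotsCache(fake_fetch)
print(cache.allowed("EdisterBot", "http://example.com/"))
print(cache.allowed("EdisterBot", "http://example.com/page.html"))
print(calls)
```

Routing every page request through `allowed` means no URL can be crawled unless rules for its host have been loaded first, and the second call above reuses the cached copy instead of fetching robots.txt again.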

jonathanleger




msg:4423346
 11:43 pm on Feb 29, 2012 (gmt 0)

The system I am using is that it doesn't crawl without getting robots.txt first. Was the robots file in a subfolder or the root? I'm trying to narrow possibilities down.

Key_Master




msg:4423347
 11:47 pm on Feb 29, 2012 (gmt 0)

root


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved