
Forum Moderators: Ocean10000 & incrediBILL


EdisterBot

9:27 pm on Feb 23, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5797
votes: 64



rDNS: web4.twinuff.com.
IHNetworks, LLC, Los Angeles
67.222.96.0 - 67.222.111.255
67.222.96.0/20
robots.txt: yes

Well behaved; it took 30 pages, but I have no idea what the purpose is - no info page.
10:08 pm on Feb 23, 2012 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14621
votes: 85


You didn't look hard enough:
[edister.com...]
[edister.com...]
10:31 pm on Feb 23, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5797
votes: 64


Later hits are now showing that link. The first couple did not.
5:49 am on Feb 24, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


Hits with, and without, UA:

02-23:
web4.twinuff.com [projecthoneypot.org...]
EdisterBot (http://www.edister.com/bot.html)

02-20:
web1.twinuff.com [projecthoneypot.org...]
-

robots.txt? Yes
7:38 am on Feb 24, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:Apr 9, 2011
posts:12693
votes: 244


Is it this guy?

[jonathanleger.com...]
6:47 am on Feb 25, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 23, 2004
posts: 435
votes: 0


Yeah, it's me.

Sorry about the lack of a user-agent. It was set in the INI but wasn't getting passed to CURL for some reason. I didn't realize that until somebody put a support ticket in asking what was up. Fortunately that was caught and corrected very early.
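As an aside for anyone debugging a similar problem: the failure mode (a UA configured in one place but never reaching the HTTP layer) is cheap to guard against by asserting on the outgoing request itself. A minimal Python sketch of the idea - the original bot reportedly used cURL with an INI setting, so the names and library here are illustrative, not Edister's actual code:

```python
import urllib.request

# Set the User-Agent explicitly on the request rather than trusting that
# a config value propagated down to the HTTP layer.
UA = "ExampleBot (http://www.example.com/bot.html)"

def make_request(url: str) -> urllib.request.Request:
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    # Fail fast if the header was not actually attached to the request.
    assert req.get_header("User-agent") == UA, "User-Agent not set!"
    return req

req = make_request("http://www.example.com/")
print(req.get_header("User-agent"))
```

The same check works in any HTTP stack: inspect the request object (or a packet capture) rather than the config file, since that is where the two can disagree.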

As far as what the purpose of the crawling is, think SEOMoz.org with a different focus.

Of course, the crawlers respect robots.txt and will abide if you don't want them around.
9:27 am on Feb 25, 2012 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14621
votes: 85


Of course, the crawlers respect robots.txt and will abide if you don't want them around.


Of course, unless your robots.txt is set up in whitelisting format - which virtually nobody does - they don't know who you are initially, so there is no disallow to honor, ever.
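For anyone unfamiliar with the term, a whitelisting-format robots.txt denies everything by default and opens the door only to named crawlers (the bot names below are just examples):

```
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
```

An unknown bot matching only the `*` record sees a full disallow, which is the point being made: without this format, a brand-new crawler has no record addressed to it and nothing specific to honor.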
12:19 pm on Feb 25, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:Apr 9, 2011
posts:12693
votes: 244


they don't know who you are initially so there is no disallow to honor, ever.

? Is your site so wide-open that you never have to use a generic "Hey you" ?

User-Agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere

Doesn't take much for a robot to honor or dishonor something in that form. I once had an under-construction site with a phony directory set up purely to trap robots ahead of time.
1:11 pm on Feb 25, 2012 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14621
votes: 85


@lucy, I wasn't talking about webmasters with a clue, I was talking about the 99.9999% out there that most likely don't even have a robots.txt file in the first place.
7:54 pm on Feb 27, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5797
votes: 64


@jonathanleger

Your bot disobeys robots.txt directives. On Feb 23 I added this deny (along with a filter just in case):

User-agent: EdisterBot
Disallow: /

And today your bot came, read robots.txt and disobeyed the directive:

67.222.109.37 - - [26/Feb/2012:16:32:27 -0700] "GET /robots.txt HTTP/1.1" 200 2713 "" "EdisterBot (http://www.edister.com/bot.html)"
67.222.109.37 - - [26/Feb/2012:16:32:28 -0700] "GET / HTTP/1.1" 403 1061 "" "EdisterBot (http://www.edister.com/bot.html)"
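The directive itself is unambiguous; for instance, Python's standard-library robots.txt parser evaluates that exact record as a full disallow for the EdisterBot token (a quick sketch, not part of the thread's evidence):

```python
import urllib.robotparser

# The record keyplyr added on Feb 23:
rules = """\
User-agent: EdisterBot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# A compliant crawler must run this check before every fetch.
print(rp.can_fetch("EdisterBot (http://www.edister.com/bot.html)", "http://example.com/"))  # False
print(rp.can_fetch("SomeOtherBot", "http://example.com/"))                                  # True
```

So a bot that fetches robots.txt and then requests the index page anyway is either not parsing what it fetched or not consulting the result.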
10:29 pm on Feb 27, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 23, 2004
posts: 435
votes: 0


Can you PM me the domain in question? I need to see if there's anything different, since it's obeying the directive on other sites. I'll get it fixed up, though, I promise.
11:42 pm on Feb 27, 2012 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:5797
votes: 64


I PM'd the info, however there's really no need to "fix" anything. It's no different from thousands of other bots that request robots.txt then disobey the directives. That's why I always add a filter to block new bots until I see how they follow the rules.
1:17 am on Feb 28, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 23, 2004
posts: 435
votes: 0


Thanks for the info. There is a need to fix it, because there's a very real difference between my bot and those others. The difference is that I actually want it to obey robots.txt. If it's failing to do that then it has to be fixed, for the benefit of all who don't want it to crawl their sites.
4:31 am on Feb 28, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 23, 2004
posts: 435
votes: 0


FYI, I just attempted a manual crawl on the site you PM'd me and it immediately shut down saying it was denied by robots.txt.
9:58 pm on Feb 29, 2012 (gmt 0)

Junior Member

joined:Feb 28, 2012
posts: 54
votes: 0


user-agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere


Thing is, even Googlebot doesn't read that. I found this out tonight when they hit my trap page. I promptly restored the Googlebot-specific directory disallow list.
10:04 pm on Feb 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


EdisterBot hit one of my test sites today. It didn't even bother to fetch robots.txt; it went straight to the index page. Curious, though, where you are getting your crawl lists from (whois?). The test site has no inbound links, and the only visits it gets are from bots scanning IP blocks. It only gets around ten hits a day.

Also, your bot is sending an HTTP_REFERER header, but it's blank.
10:21 pm on Feb 29, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 23, 2004
posts: 435
votes: 0


Sounds like a bogus bot spoofing the user agent. I've seen a few of them already. The first thing EdisterBot gets before crawling a site is robots.txt.

Oh, and all new domains are discovered via external links from other sites.
10:25 pm on Feb 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


67.222.109.36
web1.twinuff.com
HTTP_ACCEPT{'*/*'}
HTTP_REFERER{''}
HTTP_USER_AGENT{'EdisterBot (http://www.edister.com/bot.html)'}
10:26 pm on Feb 29, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 23, 2004
posts: 435
votes: 0


And it didn't get robots.txt? I'm looking at the code and it grabs that before doing anything else. What's the domain?
10:39 pm on Feb 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


Sorry, I can't reveal more than what I've already shared.

Have you thought about caching robots.txt, or putting some sort of system in place to verify that EdisterBot is hitting the right robots.txt?
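A per-host cache is the usual way to make that robust. A rough Python sketch of the idea - the expiry policy, the fetcher hook, and all names here are assumptions for illustration, not Edister's actual design:

```python
import time
import urllib.robotparser

class RobotsCache:
    """Cache one parsed robots.txt per host, re-fetched after max_age seconds."""

    def __init__(self, fetcher, max_age=3600):
        self.fetcher = fetcher      # callable: host -> robots.txt text
        self.max_age = max_age
        self._cache = {}            # host -> (fetched_at, parser)

    def allowed(self, host, url, useragent):
        entry = self._cache.get(host)
        if entry is None or time.time() - entry[0] > self.max_age:
            rp = urllib.robotparser.RobotFileParser()
            rp.parse(self.fetcher(host).splitlines())
            entry = (time.time(), rp)
            self._cache[host] = entry
        return entry[1].can_fetch(useragent, url)
```

Keeping the fetch behind a single `allowed()` gate also answers the verification question: every crawl decision demonstrably goes through the same parsed copy of the same file, and logging inside that one method shows exactly which robots.txt was consulted for each URL.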
11:43 pm on Feb 29, 2012 (gmt 0)

Preferred Member

10+ Year Member

joined:Jan 23, 2004
posts: 435
votes: 0


The system I am using is that it doesn't crawl without getting robots.txt first. Was the robots file in a subfolder or the root? I'm trying to narrow possibilities down.
11:47 pm on Feb 29, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
posts:1472
votes: 0


root