EdisterBot

     
9:27 pm on Feb 23, 2012 (gmt 0)

keyplyr

rDNS: web4.twinuff.com.
IHNetworks, LLC, Los Angeles
67.222.96.0 - 67.222.111.255
67.222.96.0/20
robots.txt: yes

Well behaved; took 30 pages, but no idea what the purpose is - no info page.
10:08 pm on Feb 23, 2012 (gmt 0)

incrediBILL

You didn't look hard enough:
[edister.com...]
[edister.com...]
10:31 pm on Feb 23, 2012 (gmt 0)

keyplyr

Later hits are now showing that link. The first couple did not.
5:49 am on Feb 24, 2012 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member

Hits with, and without, UA:

02-23:
web4.twinuff.com [projecthoneypot.org...]
EdisterBot (http://www.edister.com/bot.html)

02-20:
web1.twinuff.com [projecthoneypot.org...]
-

robots.txt? Yes
7:38 am on Feb 24, 2012 (gmt 0)

lucy24

Is it this guy?

[jonathanleger.com...]
6:47 am on Feb 25, 2012 (gmt 0)

10+ Year Member

Yeah, it's me.

Sorry about the lack of a user-agent. It was set in the INI but wasn't getting passed to CURL for some reason. I didn't realize that until somebody put a support ticket in asking what was up. Fortunately that was caught and corrected very early.
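
For illustration only - this is not the actual EdisterBot code and every name below is made up - the fix boils down to setting the user agent explicitly on the cURL handle for each request instead of trusting a config value to make it through, e.g. with pycurl:

from io import BytesIO
import pycurl

BOT_UA = "ExampleBot (http://www.example.com/bot.html)"   # placeholder UA string

def fetch(url):
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.USERAGENT, BOT_UA)        # set on the handle for every request
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()
    return buf.getvalue()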

As far as what the purpose of the crawling is, think SEOMoz.org with a different focus.

Of course, the crawlers respect robots.txt and will abide if you don't want them around.
9:27 am on Feb 25, 2012 (gmt 0)

incrediBILL

Of course, the crawlers respect robots.txt and will abide if you don't want them around.


Of course, unless your robots.txt is set up in whitelisting format, which virtually nobody does, they don't know who you are initially so there is no disallow to honor, ever.
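
For reference, a whitelisting-format robots.txt looks roughly like this (the allowed bot names are only examples): crawlers you have explicitly named are let in, and everything else falls through to the closing catch-all group and is disallowed before you have ever heard of it.

# allow only the crawlers named here; every other bot is blocked by default
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /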
12:19 pm on Feb 25, 2012 (gmt 0)

lucy24

they don't know who you are initially so there is no disallow to honor, ever.

? Is your site so wide-open that you never have to use a generic "Hey you" ?

User-Agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere

Doesn't take much for a robot to honor or dishonor something in that form. I once had an under-construction site with a phony directory set up purely to trap robots ahead of time.
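
A bare-bones version of that kind of trap, using a hypothetical /botbait/ path: disallow a decoy directory that nothing legitimate links to or needs, and any client that still requests it has read (or skipped) robots.txt and crawled it anyway - the access log does the rest.

# decoy entry - /botbait/ is a made-up path with no real content behind it
User-agent: *
Disallow: /botbait/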
1:11 pm on Feb 25, 2012 (gmt 0)

incrediBILL

@lucy, I wasn't talking about webmasters with a clue; I was talking about the 99.9999% out there that most likely don't even have a robots.txt file in the first place.
7:54 pm on Feb 27, 2012 (gmt 0)

keyplyr

@jonathanleger

Your bot disobeys robots.txt directives. On Feb 23 I added this deny (along with a filter just in case):

User-agent: EdisterBot
Disallow: /

And today your bot came, read robots.txt and disobeyed the directive:

67.222.109.37 - - [26/Feb/2012:16:32:27 -0700] "GET example.com/robots.txt HTTP/1.1" 200 2713 "" "EdisterBot (http://www.edister.com/bot.html)"
67.222.109.37 - - [26/Feb/2012:16:32:28 -0700] "GET example.com HTTP/1.1" 403 1061 "" "EdisterBot (http://www.edister.com/bot.html)"
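
The 200 on robots.txt followed by the 403 on the page is consistent with a user-agent filter along these lines (an Apache 2.2-era sketch; keyplyr's actual rules aren't shown in the thread):

# .htaccess sketch: deny the bot by UA but still let it read robots.txt
SetEnvIfNoCase User-Agent "EdisterBot" deny_bot
SetEnvIf Request_URI "^/robots\.txt$" !deny_bot
Order Allow,Deny
Allow from all
Deny from env=deny_bot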
10:29 pm on Feb 27, 2012 (gmt 0)

10+ Year Member

Can you PM me the domain in question? I need to see if there's anything different, since it's obeying the directive on other sites. I'll get it fixed up, though, I promise.
11:42 pm on Feb 27, 2012 (gmt 0)

keyplyr

I PM'd the info; however, there's really no need to "fix" anything. It's no different from thousands of other bots that request robots.txt and then disobey the directives. That's why I always add a filter to block new bots until I see how they follow the rules.
1:17 am on Feb 28, 2012 (gmt 0)

10+ Year Member

Thanks for the info. There is a need to fix it, because there's a very real difference between my bot and those others. The difference is that I actually want it to obey robots.txt. If it's failing to do that then it has to be fixed, for the benefit of all who don't want it to crawl their sites.
4:31 am on Feb 28, 2012 (gmt 0)

10+ Year Member

FYI, I just attempted a manual crawl on the site you PM'd me and it immediately shut down saying it was denied by robots.txt.
9:58 pm on Feb 29, 2012 (gmt 0)



user-agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere


Thing is, even Googlebot doesn't read that. I found this out tonight when they hit my trap page. I promptly replaced the Googlebot-specific directory disallow list.
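
If a Googlebot-specific group was already present in that file, this is the documented group behavior rather than Googlebot misreading it: a crawler obeys only the most specific User-agent group that matches it, and groups don't combine, so any disallow meant for Googlebot has to be repeated under its own group, e.g.:

User-agent: Googlebot
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere

User-agent: *
Disallow: /keepout
Disallow: /getlost
Disallow: /OKhere/dontgohere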
10:04 pm on Feb 29, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

EdisterBot hit one of my test sites today. It didn't even bother to fetch robots.txt; it went straight to the index page. Curious, though, where you're getting your crawl lists from (whois?). The test site has no inbound links, and the only visits it gets are from bots scanning IP blocks. It only gets around ten hits a day.

Also, your bot is sending an HTTP_REFERER header, but it's blank.
10:21 pm on Feb 29, 2012 (gmt 0)

10+ Year Member

Sounds like a bogus bot spoofing the user agent. I've seen a few of them already. The first thing EdisterBot gets before crawling a site is robots.txt.

Oh, and all new domains are discovered via external links from other sites.
10:25 pm on Feb 29, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

67.222.109.36
web1.twinuff.com
HTTP_ACCEPT{'*/*'}
HTTP_REFERER{''}
HTTP_USER_AGENT{'EdisterBot (http://www.edister.com/bot.html)'}
10:26 pm on Feb 29, 2012 (gmt 0)

10+ Year Member

And it didn't get robots.txt? I'm looking at the code and it grabs that before doing anything else. What's the domain?
10:39 pm on Feb 29, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

Sorry, I can't reveal more than what I've already shared.

Have you thought about caching robots.txt or putting some sort of system in place to verify that EdisterBot is hitting the right robots.txt?
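
A per-host cache plus a pre-request check is only a few lines. A rough sketch - not EdisterBot's actual code, and the names are placeholders - using Python's urllib.robotparser:

from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

BOT_UA = "ExampleBot (http://www.example.com/bot.html)"   # placeholder UA
_robots = {}                                               # host -> parsed robots.txt

def allowed(url):
    parts = urlsplit(url)
    host = parts.netloc
    rp = _robots.get(host)
    if rp is None:
        rp = RobotFileParser("%s://%s/robots.txt" % (parts.scheme, host))
        rp.read()                     # fetched before any page on this host
        _robots[host] = rp
    return rp.can_fetch(BOT_UA, url)  # the crawl loop skips any URL where this is False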
11:43 pm on Feb 29, 2012 (gmt 0)

10+ Year Member

The system I am using is that it doesn't crawl without getting robots.txt first. Was the robots file in a subfolder or the root? I'm trying to narrow possibilities down.
11:47 pm on Feb 29, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member

root
 
