| 10:08 pm on Feb 23, 2012 (gmt 0)|
You didn't look hard enough:
| 10:31 pm on Feb 23, 2012 (gmt 0)|
Later hits are now showing that link. The first couple did not.
| 5:49 am on Feb 24, 2012 (gmt 0)|
Hits with, and without, UA:
| 7:38 am on Feb 24, 2012 (gmt 0)|
Is it this guy?
| 6:47 am on Feb 25, 2012 (gmt 0)|
Yeah, it's me.
Sorry about the lack of a user-agent. It was set in the INI but wasn't getting passed to CURL for some reason. I didn't realize that until somebody put a support ticket in asking what was up. Fortunately that was caught and corrected very early.
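For anyone chasing the same kind of bug, here is a minimal sketch of a guard against it, written in Python rather than the crawler's actual stack. The `[crawler]` section, the `user_agent` key, and the `build_request` helper are assumptions made for illustration, not EdisterBot's real configuration or code. The idea is to refuse to crawl when the setting is empty and to check the header on the outgoing request itself, not just the config value.

```python
import configparser
from urllib.request import Request


def build_request(url: str, ini_text: str) -> Request:
    """Build a request that is guaranteed to carry the configured User-Agent."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    # Hypothetical section and key names; adjust to the real INI layout.
    user_agent = config.get("crawler", "user_agent", fallback="").strip()
    if not user_agent:
        raise ValueError("user_agent missing from config; refusing to crawl")

    request = Request(url, headers={"User-Agent": user_agent})
    # Verify the header actually landed on the request object.
    # urllib stores header names capitalized as "User-agent".
    if request.get_header("User-agent") != user_agent:
        raise RuntimeError("User-Agent was not attached to the request")
    return request
```

Nothing is sent over the network here; the function only builds and checks the request object.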
As far as what the purpose of the crawling is, think SEOMoz.org with a different focus.
Of course, the crawlers respect robots.txt and will abide if you don't want them around.
| 9:27 am on Feb 25, 2012 (gmt 0)|
|Of course, the crawlers respect robots.txt and will abide if you don't want them around. |
Of course, unless your robots.txt is set up in whitelisting format, which virtually nobody uses, a new bot isn't named in it when it first shows up, so there is no disallow for it to honor, ever.
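To make "whitelisting format" concrete, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `is_allowed` helper are made up for illustration: named crawlers get an empty `Disallow`, and everyone else falls through to a catch-all group that denies everything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt: only named crawlers are allowed.
WHITELIST_ROBOTS = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A bot nobody has heard of is denied without being named anywhere.
print(is_allowed(WHITELIST_ROBOTS, "Googlebot", "http://example.com/page"))
print(is_allowed(WHITELIST_ROBOTS, "EdisterBot", "http://example.com/page"))
```

With a file like this, a well-behaved new bot has a directive to honor on its very first visit.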
| 12:19 pm on Feb 25, 2012 (gmt 0)|
|they don't know who you are initially so there is no disallow to honor, ever. |
? Is your site so wide-open that you never have to use a generic "Hey you" ?
Doesn't take much for a robot to honor or dishonor something in that form. I once had an under-construction site with a phony directory set up purely to trap robots ahead of time.
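A trap like that can be checked with very little code. The sketch below assumes a hypothetical `/trap/` directory that is disallowed in robots.txt, and assumes the requests have already been reduced to (IP, path) pairs; neither detail comes from the site described above.

```python
from typing import Iterable, Set, Tuple

# Hypothetical trap directory, disallowed in robots.txt.
TRAP_PREFIX = "/trap/"


def trap_offenders(requests: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the client IPs that requested anything under the trap directory."""
    return {ip for ip, path in requests if path.startswith(TRAP_PREFIX)}
```

Any IP in the result asked for a path that a compliant robot should never have requested.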
| 1:11 pm on Feb 25, 2012 (gmt 0)|
@lucy, I wasn't talking about webmasters with a clue, I was talking about the 99.9999% out there that most likely don't even have a robots.txt file in the first place.
| 7:54 pm on Feb 27, 2012 (gmt 0)|
Your bot disobeys robots.txt directives. On Feb 23 I added this deny (along with a filter just in case):
And today your bot came, read robots.txt and disobeyed the directive:
18.104.22.168 - - [26/Feb/2012:16:32:27 -0700] "GET example.com/robots.txt HTTP/1.1" 200 2713 "" "EdisterBot (http://www.edister.com/bot.html)"
22.214.171.124 - - [26/Feb/2012:16:32:28 -0700] "GET example.com HTTP/1.1" 403 1061 "" "EdisterBot (http://www.edister.com/bot.html)"
| 10:29 pm on Feb 27, 2012 (gmt 0)|
Can you PM me the domain in question? I need to see if there's anything different, since it's obeying the directive on other sites. I'll get it fixed up, though, I promise.
| 11:42 pm on Feb 27, 2012 (gmt 0)|
I PM'd the info, however there's really no need to "fix" anything. It's no different from thousands of other bots that request robots.txt then disobey the directives. That's why I always add a filter to block new bots until I see how they follow the rules.
| 1:17 am on Feb 28, 2012 (gmt 0)|
Thanks for the info. There is a need to fix it, because there's a very real difference between my bot and those others. The difference is that I actually want it to obey robots.txt. If it's failing to do that then it has to be fixed, for the benefit of all who don't want it to crawl their sites.
| 4:31 am on Feb 28, 2012 (gmt 0)|
FYI, I just attempted a manual crawl on the site you PM'd me and it immediately shut down saying it was denied by robots.txt.
| 9:58 pm on Feb 29, 2012 (gmt 0)|
|user-agent: * |
Thing is, even Googlebot doesn't read that. I found this out tonight when they hit my trap page. I promptly replaced the Googlebot-specific directory disallow list.
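One likely explanation, assuming the file had both a generic group and a Googlebot-specific one: a crawler that finds a group naming it follows only that group and ignores `user-agent: *` entirely. Python's standard `urllib.robotparser` resolves groups the same way, so it can reproduce the effect. The paths below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the trap is only listed under the generic group.
ROBOTS = """\
User-agent: *
Disallow: /trap/

User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# Googlebot matches its own group, so the generic /trap/ rule never applies to it.
googlebot_may_hit_trap = parser.can_fetch("Googlebot", "http://example.com/trap/page.html")
# A bot with no group of its own falls back to the generic rules.
other_bot_may_hit_trap = parser.can_fetch("EdisterBot", "http://example.com/trap/page.html")

print(googlebot_may_hit_trap, other_bot_may_hit_trap)
```

If that is what happened, the fix is to repeat the trap path inside the Googlebot-specific group.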
| 10:04 pm on Feb 29, 2012 (gmt 0)|
EdisterBot hit one of my test sites today. It didn't even bother to fetch robots.txt. It went straight to the index page. Curious, though, where you are getting your crawl lists from (whois?). The test site has no inbound links and the only visits it gets are from bots scanning IP blocks. It only gets around ten hits a day.
Also, your bot is sending an HTTP_REFERER header, but it's blank.
| 10:21 pm on Feb 29, 2012 (gmt 0)|
Sounds like a bogus bot spoofing the user agent. I've seen a few of them already. The first thing EdisterBot gets before crawling a site is robots.txt.
Oh, and all new domains are discovered via external links from other sites.
| 10:25 pm on Feb 29, 2012 (gmt 0)|
| 10:26 pm on Feb 29, 2012 (gmt 0)|
And it didn't get robots.txt? I'm looking at the code and it grabs that before doing anything else. What's the domain?
| 10:39 pm on Feb 29, 2012 (gmt 0)|
Sorry, I can't reveal more than what I've already shared.
Have you thought about caching robots.txt or putting some sort of system in place to verify that EdisterBot is hitting the right robots.txt?
| 11:43 pm on Feb 29, 2012 (gmt 0)|
The system I am using is that it doesn't crawl without getting robots.txt first. Was the robots file in a subfolder or the root? I'm trying to narrow possibilities down.
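For readers following along, a rough sketch of what "robots.txt first, with a cache" could look like. This is not EdisterBot's code: the `RobotsGate` class, the injected `fetch` callable, and the one-hour TTL are all assumptions. The file is always requested from the root of the host, never from a subfolder, and is reused until the cache entry expires.

```python
import time
from typing import Callable, Dict, Tuple
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsGate:
    """Check robots.txt (fetched from the host root and cached) before any crawl."""

    def __init__(self, fetch: Callable[[str], str], user_agent: str,
                 ttl_seconds: float = 3600.0) -> None:
        self._fetch = fetch            # hypothetical: returns the body of a URL
        self._user_agent = user_agent
        self._ttl = ttl_seconds        # assumed cache lifetime
        self._cache: Dict[str, Tuple[float, RobotFileParser]] = {}

    def _rules_for(self, url: str) -> RobotFileParser:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        cached = self._cache.get(origin)
        if cached and time.monotonic() - cached[0] < self._ttl:
            return cached[1]
        parser = RobotFileParser()
        # robots.txt only counts at the root of the host.
        parser.parse(self._fetch(origin + "/robots.txt").splitlines())
        self._cache[origin] = (time.monotonic(), parser)
        return parser

    def allowed(self, url: str) -> bool:
        """Return True only if the cached rules permit this user agent to fetch url."""
        return self._rules_for(url).can_fetch(self._user_agent, url)
```

A real crawler would also need to decide what to do when the fetch fails or returns an error status; this sketch leaves that out.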
| 11:47 pm on Feb 29, 2012 (gmt 0)|