Forum Moderators: open
The log shows "http://www.boitho.com/dcbot.html". The page says it's building thumbnail indexes of the whole web.
It didn't seem very well mannered: it hit directories that I had renamed, and then when I added these to my disallow list it never even opened robots.txt, even though the site claims you can add an entry to disallow it.
And it ends up hitting my site from 10 different addresses pretty close together (maybe because my site is a robot's wet dream with a huge number of images?). I had to deny it. Anyone else have problems with this project?
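For anyone else who ends up denying it at the server level, one common approach is an .htaccess rule keyed on the User-Agent string. This is only a sketch, assuming Apache 2.2 with mod_setenvif enabled and assuming the bot sends "boitho.com-dc" in its User-Agent header (the token from the robots.txt entry below; check your own access logs for the exact string):

```apache
# .htaccess sketch: deny requests whose User-Agent contains the Boitho token
SetEnvIfNoCase User-Agent "boitho\.com-dc" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Since the bot hits from many IP addresses, matching on the user-agent is more practical than maintaining an IP blocklist, as long as the bot identifies itself honestly.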
[edited by: encyclo at 8:20 pm (utc) on Mar. 24, 2008]
[edit reason] fixed side-scroll [/edit]
The Boitho robot does follow the robots exclusion protocol, and should not crawl pages that are excluded by:
User-agent: boitho.com-dc
Disallow: /
However, like most crawlers, we do cache robots.txt files. So if you don't see a request for robots.txt right before the request to a page, it means we are using an earlier cached version of the robots.txt file. It will also take some time from when you make changes to your robots.txt file until we discover them.
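The rule quoted above can be sanity-checked with Python's standard urllib.robotparser, which implements the same exclusion protocol most crawlers follow. This is just an illustrative sketch; the example.com URLs are made up:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt entry from the post above
rules = """\
User-agent: boitho.com-dc
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The Boitho crawler is shut out of the whole site...
boitho_allowed = rp.can_fetch("boitho.com-dc", "http://example.com/images/photo.jpg")
# ...while agents not named in robots.txt are unaffected
browser_allowed = rp.can_fetch("Mozilla/5.0", "http://example.com/images/photo.jpg")
print(boitho_allowed, browser_allowed)
```

Of course, this only tells you what a compliant crawler *should* do; whether the bot actually refetches robots.txt often enough is exactly the caching question raised above.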
If you have any questions you think I can answer, don't hesitate to email us at: tech [att] searchdaimon [dot] com .
Regards
Runar Buvik
Could you elaborate on how Boitho got an initial list of my pages?
I whitelist all bots, so Boitho was blocked by default when it first accessed my site on 03/31/2007, which is the first reference I have in my logs. Yet it was already aware of many pages, pages that people on other sites wouldn't link to, pages that only other search engines would index.
Additionally, some of my page names are dated, and Boitho was trying to access pages dated a month before its first access.
So my question is, how could Boitho know about pages dated before your first attempted access (which was blocked) that only exist in other search engines?
Thanks in advance for any light you can shed on this.
[edited by: incrediBILL at 8:19 pm (utc) on Mar. 21, 2008]
Maybe support for the robots.txt standard is fixed now, but I categorically dislike all distributed UAs due to the lack of accountability. Looking through the Boitho index, it appears to be stuffed with spam anyway.