Boitho ignoring robots.txt

Megaclinium

5:10 pm on Feb 25, 2008 (gmt 0)

I was getting a lot of robot hits on my site from addresses 129.241.50.* and 129.241.104.*, which resolve to the Norwegian Institute of Science and Technology according to RIPE whois.

The log shows "http://www.boitho.com/dcbot.html". That page says it's creating thumbnail indexes of the whole web.

It didn't seem very well mannered: it was hitting directories that I had renamed, and when I added these to Disallow, it never even requested robots.txt, even though the site claims you can add an entry to disallow it.

It also ends up hitting my site from 10 different addresses in quick succession (maybe because my site is a robot's wet dream with a huge amount of images?). I had to deny it. Anyone else have problems with this project?
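For anyone who wants to do the same deny in application code rather than in the server config, here is a minimal sketch. It assumes the two wildcard ranges quoted above map to /24 networks; the `is_blocked` helper and the `BLOCKED_NETWORKS` name are made up for illustration.

```python
import ipaddress

# Ranges taken from the post above. The /24 masks are an assumption
# based on the "129.241.50.*" and "129.241.104.*" wildcards.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("129.241.50.0/24"),
    ipaddress.ip_network("129.241.104.0/24"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside one of the blocked ranges."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)
```

A request handler would call `is_blocked()` with the remote address and return a 403 when it is true. Blocking by range rather than by user agent also covers the case where the crawler comes in from many addresses at once.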

wilderness

6:07 pm on Feb 25, 2008 (gmt 0)

Here are some old threads [google.com]


Megaclinium

7:19 pm on Feb 25, 2008 (gmt 0)

Thanks! That was helpful.

runarb

2:12 pm on Mar 3, 2008 (gmt 0)

I am CTO at Boitho. I am sorry to hear that we have caused you problems. The Boitho bot is a distributed web crawler, and is being used to crawl the whole internet, to build the index that boitho.com and our partners use for searching.

The Boitho robot does follow the robots exclusion protocol, and should not crawl pages that are excluded by:

User-agent: boitho.com-dc
Disallow: /

However, like most crawlers, we cache robots.txt files. So if you don't see a request for robots.txt right before the request for a page, it means we are using an earlier copy of the robots.txt file. It will also take some time from when you change your robots.txt file until we discover the change.
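The effect of the rule and of the caching can be sketched with Python's standard `urllib.robotparser`. This is only an illustration, not Boitho's actual code: the `CachedRobots` class, its `max_age` parameter and the `fetch` callback are hypothetical, and the real cache lifetime is not stated here.

```python
import time
from urllib.robotparser import RobotFileParser

# The two-line rule quoted above.
ROBOTS_TXT = """\
User-agent: boitho.com-dc
Disallow: /
"""


def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


class CachedRobots:
    """Hypothetical cache: re-read robots.txt only after max_age seconds."""

    def __init__(self, fetch, max_age, clock=time.time):
        self.fetch = fetch      # callable returning the current robots.txt text
        self.max_age = max_age  # assumed cache lifetime in seconds
        self.clock = clock
        self.parser = None
        self.fetched_at = None

    def allowed(self, user_agent: str, path: str) -> bool:
        now = self.clock()
        # Refresh only when nothing is cached or the cached copy has expired.
        if self.parser is None or now - self.fetched_at >= self.max_age:
            self.parser = parse_rules(self.fetch())
            self.fetched_at = now
        return self.parser.can_fetch(user_agent, path)
```

With a cache like this, a Disallow added just after the crawler's last fetch has no effect until the cached copy expires, which would match seeing page requests with no robots.txt request in between.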

If you have any questions you think I can answer, don't hesitate to email us at: tech [att] searchdaimon [dot] com .

Regards
Runar Buvik

incrediBILL

8:13 pm on Mar 21, 2008 (gmt 0)

Hi Runar,

Could you elaborate on how Boitho got an initial list of my pages?

I whitelist all bots so Boitho was blocked by default when it first accessed my site on 03/31/2007, which is the first reference I have in my logs, yet it was aware of many pages, pages that people on other sites wouldn't link to, pages that only other search engines would index.

Additionally, some of my page names are dated and Boitho was trying to access pages dated a month before its first access.

So my question is, how could Boitho know about pages dated before your first attempted access (which was blocked) that only exist in other search engines?

Thanks in advance for any light you can shed on this.


keyplyr

2:12 am on Mar 22, 2008 (gmt 0)

As I recall, several months ago boitho.com-dc ignored my robots.txt where it was disallowed, hence it is now banned via alternative methods.

Maybe support for the robots.txt standard is fixed now, but I categorically dislike all distributed UAs due to the lack of accountability. Looking through the Boitho index, it appears to be stuffed with spam anyway.

Megaclinium

8:17 pm on Mar 24, 2008 (gmt 0)

I see robots requesting directory listings in the web logs, so that is probably how they would find out about un-indexed pages?
- the entries ending in ?N=A, ?N=D or something... not sure what the actual entries mean.

(unless you have directory hidden or protected)
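Those query strings look like the column-sort links that an auto-generated directory listing produces (Apache's mod_autoindex, for example), although that reading is an assumption. A rough sketch for flagging such requests in a log follows; the `is_listing_sort_request` name and both patterns are illustrative and may need adjusting to what your server actually emits.

```python
import re
from urllib.parse import urlsplit

# Assumed patterns: "?N=A"-style links (older listings) and
# "?C=N;O=A"-style links (newer ones).
SORT_QUERY = re.compile(r"^(?:[NMSD]=[AD]|C=[NMSD];O=[AD])$")


def is_listing_sort_request(request_target: str) -> bool:
    """Return True for a directory URL carrying a listing sort query."""
    parts = urlsplit(request_target)
    return parts.path.endswith("/") and bool(SORT_QUERY.match(parts.query))
```

Running the request path of each log line through this would show which robots are walking the listings. If they are, every file in those directories is discoverable without any inbound link.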