boia.org - Crawler, Spider, and User Agent ID forum at WebmasterWorld - WebmasterWorld

boia.org

tools request, ignore robots.txt

Pfui

10:52 pm on Jan 6, 2012 (gmt 0)

The org's aims are honorable, but their tools dishonor robots.txt. Every. Single. Time.

Examples from the past few days, where the robots.txt they GET is always, and only --

User-agent: *
Disallow: /

-- and is immediately and repeatedly ignored:

www.boia.org
BOIA-Scan-Agent/LC 1.0 (www.boia.org)
06:55:17 /robots.txt
06:55:18 /homepage.html

boia.org
BOIA-Scan-Agent/LC 1.0 (www.boia.org)
06:12:25 /robots.txt
06:12:26 /homepage.html

www.boia.org
LinkChecker/7.3 (+http://linkchecker.sourceforge.net/)
01/05 15:55:44 /robots.txt
01/05 15:55:45 /homepage.html
01/05 15:55:52 /robots.txt
01/05 16:09:58 /robots.txt
01/05 16:09:59 /homepage.html
01/05 16:10:00 /robots.txt

Note hits from both Hosts:

www.boia.org
= 98.174.83.170
= Mendon Cox Communications

boia.org
= 98.191.56.241
= Cumberland Cox Communications

Apparent referrers (from registered users?) are typically .edu, and also repetitive. But I seriously doubt individuals are sitting there entering my site's home page into boia.org's 'free scan' box over and over again, let alone for months on end.

Bottom Line:

Regardless of Host/IP, UA, and/or REF, robots.txt is always ignored.
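
For comparison, here is roughly what a compliant tool would do with that robots.txt. This is a minimal sketch using Python's standard urllib.robotparser, not anything boia.org or LinkChecker actually runs. The function name may_fetch and the example.com URL are placeholders; the two UA strings are the ones from the logs above.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt served in the examples above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt allows user_agent to request url.

    may_fetch is a hypothetical helper name used only for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# User-agent strings copied from the log excerpts in this thread.
AGENTS = [
    "BOIA-Scan-Agent/LC 1.0 (www.boia.org)",
    "LinkChecker/7.3 (+http://linkchecker.sourceforge.net/)",
]

if __name__ == "__main__":
    for agent in AGENTS:
        # example.com stands in for the real site, which is not named here.
        allowed = may_fetch(agent, "http://example.com/homepage.html")
        print(f"{agent}: allowed={allowed}")
```

With "Disallow: /" under "User-agent: *", a parser like this reports the home page as off limits for both agents, so a well-behaved tool would stop after the robots.txt request.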

DeeCee

7:06 am on Jan 7, 2012 (gmt 0)

10+ Year Member

The organizations that think they have an "honorable" purpose are the worst offenders of all.

Info Trackers and Mark Scanners, as I usually call them.
The Mark Scanners especially, which are out hunting for trademark abuse and stolen copyrighted images and content.

They think that because they serve a "righteous" purpose they have the right to rip off whole websites and every image on a site to check it all. Over and over.

Mark Monitor is one of the worst of all, although there are quite a few of them. Cyveillance and Name Intelligence (which also owns Domain Tools [dot] com), just to mention a couple.

Plus, Mark Monitor has now added at least one shell company (recently named "Brand Certified", with different IP ranges) to hide some of its activities behind.

I have added all the IP ranges I know for them to my DNSBL blocks as policy blocks. I do not want to see them, and they are met with nothing but 403s.
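
The idea of a policy block can be sketched in a few lines of Python. This is an in-process check with the standard ipaddress module, not a real DNSBL lookup, and it is only an illustration of the principle. The Mark Monitor ranges are not listed in this thread, so the two boia.org addresses reported earlier stand in as /32 entries; BLOCKED_NETWORKS and status_for are hypothetical names.

```python
import ipaddress

# Stand-in policy list: the two boia.org addresses reported earlier in the
# thread, as single-host networks. A real list would hold whole CIDR ranges.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("98.174.83.170/32"),
    ipaddress.ip_network("98.191.56.241/32"),
]


def status_for(remote_addr: str) -> int:
    """Return the HTTP status a request from remote_addr would get.

    status_for is a hypothetical helper: 403 for any address inside a
    blocked network, 200 otherwise.
    """
    addr = ipaddress.ip_address(remote_addr)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return 403
    return 200


if __name__ == "__main__":
    # 192.0.2.1 is a documentation-only address used as the "normal visitor".
    for ip in ("98.174.83.170", "98.191.56.241", "192.0.2.1"):
        print(ip, status_for(ip))
```

In practice the same check usually lives in the web server or firewall configuration rather than in application code, which keeps blocked requests from costing much of anything.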

I have absolutely no stolen content, and they are welcome to "investigate" by hand using normal human visitors.
But just as I would not let all these "private detectives" into my house to rifle through all my closets and drawers to check if I might own counterfeit Nikes or fancy handbags, or have a drawer full of stolen gold watches, without the police and a court order at their side, I will not have them steal my server and network bandwidth either, "just in case I might be a thief". They are really abusive, running much faster than any normal bot.

Having a "righteous" cause does not make network and server bandwidth theft legal.
By Texas law, I could haul them off to court for hacking and illegal access. The same goes for many other states that have similar laws against "unauthorized" access to a network or server.