
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
boia.org
tools request, ignore robots.txt
Pfui




msg:4404643
 10:52 pm on Jan 6, 2012 (gmt 0)

The org's aims are honorable, but their tools dishonor robots.txt. Every. Single. Time.

Examples from the past few days where the robots.txt they Get is always, and only --

User-agent: *
Disallow: /

-- and is immediately and repeatedly ignored:

www.boia.org
BOIA-Scan-Agent/LC 1.0 (www.boia.org)
06:55:17 /robots.txt
06:55:18 /homepage.html

boia.org
BOIA-Scan-Agent/LC 1.0 (www.boia.org)
06:12:25 /robots.txt
06:12:26 /homepage.html

www.boia.org
LinkChecker/7.3 (+http://linkchecker.sourceforge.net/)

01/05 15:55:44 /robots.txt
01/05 15:55:45 /homepage.html
01/05 15:55:52 /robots.txt
01/05 16:09:58 /robots.txt
01/05 16:09:59 /homepage.html
01/05 16:10:00 /robots.txt

Note hits from both Hosts:

www.boia.org
= 98.174.83.170
= Mendon Cox Communications

boia.org
= 98.191.56.241
= Cumberland Cox Communications

Apparent referrers (by registered users?) are typically .edu, and also repetitive. But I seriously doubt individuals are sitting there entering my site's home page into boia.org's 'free scan' box over and over and over again at all, let alone for months on end.

Bottom Line:

Regardless of Host/IP, UA, and/or REF, robots.txt is always ignored.
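
For anyone who wants to check their own logs for the same pattern, here is a rough sketch using Python's standard urllib.robotparser. It assumes the robots.txt and the (user agent, path) hits quoted above; the function name find_violations and the HITS list are made up for illustration, and real log parsing is left out.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt served to every visitor, as quoted in the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# (user_agent, path) pairs copied from the log excerpts in the post.
# Timestamps are omitted; parsing a real access log is not shown here.
HITS = [
    ("BOIA-Scan-Agent/LC 1.0 (www.boia.org)", "/robots.txt"),
    ("BOIA-Scan-Agent/LC 1.0 (www.boia.org)", "/homepage.html"),
    ("LinkChecker/7.3 (+http://linkchecker.sourceforge.net/)", "/robots.txt"),
    ("LinkChecker/7.3 (+http://linkchecker.sourceforge.net/)", "/homepage.html"),
]


def find_violations(robots_txt, hits):
    """Return the hits that the given robots.txt does not permit."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = []
    for user_agent, path in hits:
        # Fetching robots.txt itself is expected, so it is never counted.
        if path == "/robots.txt":
            continue
        if not parser.can_fetch(user_agent, path):
            violations.append((user_agent, path))
    return violations


if __name__ == "__main__":
    for user_agent, path in find_violations(ROBOTS_TXT, HITS):
        print(f"disallowed fetch: {path} by {user_agent}")
```

With "Disallow: /" in place, every fetch other than robots.txt itself is reported, which matches what the logs show.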

 

DeeCee




msg:4404717
 7:06 am on Jan 7, 2012 (gmt 0)

The organizations that think they have an "honorable" purpose are the worst offenders of all.

Info Trackers and Mark Scanners, as I usually call them.
The Mark Scanners especially, out hunting for trademark abuse and stolen copyrighted images and content.

They think that because they serve a "righteous" purpose they have the right to rip off whole websites, and every image on a site, to check it all. Over and over.

Mark Monitor is one of the worst of all, although there are quite a few of them: Cyveillance and Name Intelligence (which also owns Domain Tools [dot] com), just to mention a couple.

Plus, Mark Monitor has now added at least one shell company (recently named "Brand Certified", with different IP ranges) to hide some of their activities behind.

I have added all the IP ranges I know for them to my DNSBL blocks as policy blocks. I do not want to see them, and they are met with nothing but 403s.
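
The DNSBL setup itself is not shown in this thread, so the following is only a simplified stand-in for the same idea: look up the visitor's address in a list of blocked networks and answer 403 on a match. It uses Python's standard ipaddress module. The list BLOCKED_NETWORKS and the function status_for are hypothetical names, and the two entries are just the boia.org addresses reported in the first post, entered as single-host networks; the scanner companies' actual ranges are not given here.

```python
import ipaddress

# Hypothetical blocklist. These are the two addresses reported earlier in
# this thread, written as /32 (single-host) networks. Real policy blocks
# would normally list whole ranges, which are not given in the thread.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("98.174.83.170/32"),
    ipaddress.ip_network("98.191.56.241/32"),
]


def status_for(remote_addr):
    """Return the HTTP status to send: 403 if the address is blocked, else 200."""
    addr = ipaddress.ip_address(remote_addr)
    if any(addr in network for network in BLOCKED_NETWORKS):
        return 403
    return 200


if __name__ == "__main__":
    for ip in ("98.174.83.170", "192.0.2.10"):
        print(ip, status_for(ip))
```

In practice the same check usually lives in the web server or firewall configuration rather than in application code, so the blocked request costs as little as possible.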

I have absolutely no stolen content, and they are welcome to "investigate" by hand using normal human visitors.
But just as I would not let all these "private detectives" into my house to rifle through all my closets and drawers to check whether I might own counterfeit Nikes or fancy handbags, or have a drawer full of stolen gold watches, without the police and a court order at their side, I will not have them steal my server and network bandwidth either, "just in case I might be a thief". They are really abusive, running much faster than any normal bot.

Having a "righteous" cause does not make network and server bandwidth theft legal.
By Texas law, I could haul them off to court for hacking and illegal access. The same goes for many other states that have similar laws against "unauthorized" access to a network or server.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved