Their bot identifies itself as Mozilla/4.5 and the last time it hit my site it came from the class C range 216.237.119.*
A WHOIS of the domain indicates the range belongs to a server farm and thus is safe to block.
I tried contacting them but couldn't get past the captcha on the contact form (it didn't work right with Firefox).
Proper bot identification = No
Robots.txt = No
Meta tags = No
Illegitimate = Yes
From Nov., 2009, and last week:
mail.diigomail.com
HttpComponents/1.1
robots.txt? NO
Reqs: HEADs
And from 2008 (Oct.), using a laughably bogus UA:
mail.diigomail.com
Mac Safari
robots.txt? NO
Reqs: n/a
FWIW:
Last year, they promo'd themselves as: "Diigo - Web Highlighter and Sticky Notes, Social Bookmarking" etc. The site came to my (really, really annoyed) attention when a visitor copy-pasted humongous quantities of our copyrighted content into Diigo's quasi-communal bookmark lists/recs/whatevers.
(The person hijacked hidden and other graphics galore in the process and the errors at our end caught my eye. I subsequently persuaded him to remove everything. He was miffed. I was much more so.)
Nowadays, I'm not sure what Diigo is up to with its atypical, stealth and/or fake UAs, and HEAD reqs. Link-checking, a la Facebook? But because of what they're doing, bot-wise -- and what they allowed at least one user to do, post-wise -- well, I'll only accept referers from diigo.com so I can then track back and check for problems.
Add to .htaccess:
# BLOCK SITE SCRAPING
#=====================================
RewriteCond %{HTTP_REFERER} diigo\.com [NC]
RewriteRule Primary(.*)\.js$ /JavaScripts/Diigo.js
RewriteCond %{HTTP_REFERER} diigo\.com [NC]
RewriteRule \.(css|png|jpg|gif)$ - [F,L]
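For anyone who wants to sanity-check what those two rules are meant to do before uploading them, here is a rough model of the decision logic in JavaScript. It is only a sketch, not Apache itself: the function name `decide` and the sample paths are made up for illustration, and it assumes the path is matched the way a per-directory RewriteRule sees it (no leading slash).

```javascript
// Simplified model of the two rewrite rules above. NOT Apache itself;
// "decide" is a hypothetical name used only for this illustration.
function decide(referer, path) {
  // RewriteCond %{HTTP_REFERER} diigo\.com [NC]  -> case-insensitive match
  const fromDiigo = /diigo\.com/i.test(referer || "");
  if (!fromDiigo) {
    return "pass"; // neither condition holds, request is served normally
  }
  // RewriteRule Primary(.*)\.js$ /JavaScripts/Diigo.js  (case-sensitive)
  if (/Primary(.*)\.js$/.test(path)) {
    return "rewrite:/JavaScripts/Diigo.js";
  }
  // RewriteRule \.(css|png|jpg|gif)$ - [F,L]  -> 403 Forbidden
  if (/\.(css|png|jpg|gif)$/.test(path)) {
    return "forbidden";
  }
  return "pass";
}
```

So a request for a stylesheet or image with a Diigo referer gets a 403, a request for a script whose name contains "Primary" gets Diigo.js instead, and everything else goes through untouched.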
Create a Diigo.js file with the following contents:
document.write('<'+'script'+'>'+'var base=document.getElementsByTagName("base")[0];var l=base.getAttribute("href");window.location=l'+'<'+'/script'+'>');
Upload both files to your server and there you go. Any pages cached by Diigo will break out of their frames AND redirect to the actual page on your server. The reason this works is that they insert a base tag at the top of their pages; the script reads that tag's href and sends the browser there.
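If the one-liner is hard to read, here is an unrolled sketch of the same idea. The function names `findBaseHref` and `redirectToOriginal` are mine, not anything Diigo or the original script uses. It drops the document.write wrapper and adds a guard for pages with no base tag, so treat it as an illustration of the logic rather than a drop-in replacement.

```javascript
// Hypothetical helper: return the href of the first <base> element, or null.
function findBaseHref(doc) {
  const bases = doc.getElementsByTagName("base");
  if (!bases || bases.length === 0) {
    return null; // no <base> tag, so nothing to redirect to
  }
  const href = bases[0].getAttribute("href");
  return href || null;
}

// Hypothetical helper: send the window to the <base> href if one exists.
// Returns true when a redirect target was found and assigned.
function redirectToOriginal(doc, win) {
  const target = findBaseHref(doc);
  if (target === null) {
    return false;
  }
  win.location = target;
  return true;
}

// Only run automatically inside a browser.
if (typeof document !== "undefined" && typeof window !== "undefined") {
  redirectToOriginal(document, window);
}
```

Passing the document and window in as arguments is just so the logic can be tried outside a browser with stand-in objects.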
Now we just need someone to sue them into non-existence for copyright infringement. I think there would be a pretty good case for it since they ignore meta tags and robots.txt, they alter page source code to keep pages from breaking out of their frames, and they cloak their bots' user agent strings to prevent detection. All we need would be for them to develop a workaround to the above code and we'll have them dead to rights for flagrant copyright violations.