Diigo caching pages in violation of meta tags & cloaking bot

KenB

4:01 am on Jan 7, 2010 (gmt 0)

I've discovered a pseudo "research" site called Diigo that caches pages from websites while ignoring meta-tag instructions that prohibit caching (think Alexa with a commercial bent). They also hide behind an improperly identified bot.

Their bot identifies itself as Mozilla/4.5, and the last time it hit my site it came from the class C range 216.237.119.*

A WHOIS of the domain indicates the range belongs to a server farm and thus is safe to block.

I tried contacting them but couldn't get past the captcha on the contact form (it didn't work right with Firefox).

Proper bot identification = No
Robots.txt = No
Meta tags = No
Illegitimate = Yes
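
For anyone who wants to flag these hits in a log script before deciding on a block, here is a minimal sketch of the fingerprint described above. The address prefix and the "Mozilla/4.5" string come from this post; the function name and the choice to require an exact user-agent match are assumptions, so loosen or tighten them to suit your own logs.

```javascript
// Hypothetical helper (not from the original post): returns true when a request
// matches the fingerprint reported above.
// Assumption: the bot sends exactly "Mozilla/4.5" with nothing after it.
function looksLikeDiigoBot(remoteAddr, userAgent) {
  // Any host in the reported class C range 216.237.119.*
  const inRange = /^216\.237\.119\.\d{1,3}$/.test(String(remoteAddr || ""));
  // Bare user agent with no platform or browser details
  const bareUa = String(userAgent || "").trim() === "Mozilla/4.5";
  return inRange && bareUa;
}
```

Requiring both conditions keeps ordinary visitors on old browsers from being flagged on the user agent alone.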

Pfui

7:05 am on Jan 7, 2010 (gmt 0)

In addition to diigo.com, be on the lookout for the following...

From Nov., 2009, and last week:

mail.diigomail.com
HttpComponents/1.1

robots.txt? NO
Reqs: HEADs

And from 2008 (Oct.), using a laughably bogus UA:

mail.diigomail.com
Mac Safari

robots.txt? NO
Reqs: n/a

FWIW:

Last year, they promo'd themselves as: "Diigo - Web Highlighter and Sticky Notes, Social Bookmarking" etc. The site came to my (really, really annoyed) attention when a visitor copy-pasted humongous quantities of our copyrighted content into Diigo's quasi-communal bookmark lists/recs/whatevers.

(The person hijacked hidden and other graphics galore in the process and the errors at our end caught my eye. I subsequently persuaded him to remove everything. He was miffed. I was much more so.)

Nowadays, I'm not sure what Diigo is up to with its atypical, stealth and/or fake UAs, and HEAD reqs. Link-checking, a la Facebook? But because of what they're doing, bot-wise -- and what they allowed at least one user to do, post-wise -- well, I'll only accept referers from diigo.com so I can then track back and check for problems.

KenB

5:06 pm on Jan 7, 2010 (gmt 0)

Here is something else that makes them really bad players: when they cache pages, they break the JavaScript that lets a page break out of frames, so that they can keep the cached page within their own frames. To that end, I spent some time figuring out exactly how to reclaim users and get them to uncached, unframed pages served directly from my server. This solution requires that your pages already reference at least one JavaScript file on your server from the head section. In my case I call a file named Primary.js; you'll need to change the file reference in the .htaccess instructions to match your JS naming scheme.

Add to .htaccess:


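# BLOCK SITE SCRAPING
#=====================================
# Assumes "RewriteEngine On" is already set earlier in this file.
# Requests for Primary*.js referred by diigo.com get Diigo.js instead.
RewriteCond %{HTTP_REFERER} diigo\.com [NC]
RewriteRule Primary(.*)\.js$ /JavaScripts/Diigo.js
# Stylesheets and images referred by diigo.com are refused (403).
RewriteCond %{HTTP_REFERER} diigo\.com [NC]
RewriteRule \.(css|png|jpg|gif)$ - [F,L]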

Create a Diigo.js file with the following contents:

document.write('<'+'script'+'>'+'var base=document.getElementsByTagName("base")[0];var l=base.getAttribute("href");window.location=l'+'<'+'/script'+'>');

Upload both files to your server and there you go. Any pages cached by Diigo will break out of their frames AND redirect to the actual page on your server. The reason this works is that they insert a base tag at the top of their pages.
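
A slightly more defensive variant of the same idea, as a sketch rather than a drop-in replacement: it reads the href of the first base tag and redirects only if one is actually present, so the script does nothing (instead of throwing) should the base tag ever disappear. The function name findOriginalUrl is made up for illustration, and the assumption that the base href still points at the original page is taken from the description above, not verified independently.

```javascript
// Hypothetical helper: returns the href of the first <base> tag, or null if none.
// `doc` is any object exposing getElementsByTagName (the real `document` in a browser).
function findOriginalUrl(doc) {
  const bases = doc.getElementsByTagName("base");
  if (!bases || bases.length === 0) {
    return null; // no <base> tag was inserted, nothing to redirect to
  }
  const href = bases[0].getAttribute("href");
  return href ? href : null;
}

// Browser-only part, guarded so the file can also be loaded outside a browser.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  const target = findOriginalUrl(document);
  if (target) {
    // Same effect as the one-liner above: send the visitor to the original page.
    window.location.href = target;
  }
}
```

Keeping the lookup in its own function makes it easy to check against a mock document before uploading.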

Now we just need someone to sue them into non-existence for copyright infringement. I think there would be a pretty good case for it, since they ignore meta tags and robots.txt, they alter page source code to keep pages from breaking out of their frames, and they cloak their bots' user agent string to prevent detection. All we need is for them to develop a workaround to the above code and we'll have them dead to rights for flagrant copyright violations.

wilderness

6:18 pm on Jan 7, 2010 (gmt 0)

It would NOT hinder your visitors (IMO) to deny access to the provider's range all the way up to the backbone, which is a dead domain name (per multiple searches at ARIN and the search engines).
In fact, it appears to me that the backbone is owned by the sub-net.