Forum Moderators: phranque


Lawyers Using Your Own Web Site Against You

Your Website May Incriminate You

         

incrediBILL

11:13 pm on Aug 23, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



PITFALLS OF SAVING YOUR SITE FOR POSTERITY

Search engines automatically cache your pages and something called the Internet Archive, or Wayback Machine, also comes along and makes a permanent copy of your site for "posterity". The problem starts when you realize you may have content on your web site that could result in legal issues. You may act quickly to resolve those issues yet the problems still remain without your knowledge because you didn't act as quickly as all the robots crawling your site.

Unfortunately, legal beagles love that your site was saved for "posterity" when gearing up to file a lawsuit. So although you've already done the right thing by cleaning potentially harmful things off your site, the tireless automatons crawling the internet have made sure there's plenty of evidence, and the next thing you know, you're about to get hung out to dry.

If you think the lawyers aren't technically savvy, think again:

Browsing a party's Web site will only show the information that the Web site owner currently wants visitors to see. Sometimes, the most valuable information about an opposing party is the information that has been changed or removed. Fortunately, there are ways to see older versions of Web pages. Pages that were changed recently can be viewed through Google's cache feature. Pages that were changed months or years ago may be available through the Internet Archive, also known as the Wayback Machine.

[law.com...]

Not only can they find your content, they do it under cloak without your knowing about it!

Viewing these older versions of Web pages avoids the privacy risks discussed above: The copied pages are not on the company's Web site, so the company has no record of the researcher's activities.

You can forget your rights, just throw them out the window, because the history of your website is already busy squealing on you without your knowledge or permission.

HOW DO YOU PROTECT YOUR SITE FROM HISTORICAL SNOOPING?

Obviously the simplest way is to keep your nose clean so nobody has a reason to be snooping in the first place.

However, this is the internet and you have to OPT-OUT of things to protect your rights.

Here are a few preventive ways to stop your website from being archived and being used as a snitch:

USE NOARCHIVE

Make sure you include the NOARCHIVE meta tag in each web page so that there is no cache in any of the major search engines.

<META NAME="ROBOTS" CONTENT="NOARCHIVE">
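If you want to confirm the tag actually made it into your templates, a quick check along these lines may help. This is only a minimal sketch using Python's standard library; the names RobotsMetaParser and has_noarchive are made up for illustration, and it only tells you the tag is present, not whether any given crawler honors it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noarchive(html):
    """Return True if the page declares NOARCHIVE in a robots meta tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


if __name__ == "__main__":
    page = '<html><head><META NAME="ROBOTS" CONTENT="NOARCHIVE"></head></html>'
    print(has_noarchive(page))
```

Run it against the saved HTML of a few pages from each template rather than just the home page, since the tag has to be on every page to do any good.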

USE ROBOTS.TXT

Block all of the archive site spiders, such as the one used by the Internet Archive, in your site's robots.txt file with an entry as follows:

User-agent: ia_archiver
Disallow: /
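To sanity-check that an entry like this reads the way you intend, you can run it through a robots.txt parser. The sketch below uses Python's standard urllib.robotparser; the helper name is_blocked is hypothetical. It only shows how a well-behaved robot would interpret the rules and does nothing to stop a crawler that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The same entry as above, held in a string for the check.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt, user_agent, path="/"):
    """Return True if a compliant robot with this user agent may not fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(is_blocked(ROBOTS_TXT, "ia_archiver"))  # the archive spider
    print(is_blocked(ROBOTS_TXT, "Googlebot"))    # a regular search spider
```

With only this entry in the file, the archive spider is shut out of the whole site while other spiders are unaffected, which is the point: you opt out of the archive without opting out of search.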

The Heritrix software [crawler.archive.org] used by the Internet Archive is open source, which means there are more archives out there, possibly using variations of Heritrix that ignore robots.txt and cloak their access to your site.

HELP FOR HOSTED BLOGGER ISSUES

If you're running a blog hosted on a 3rd party service like Blogger or WordPress, your options may be limited to just embedding NOARCHIVE, which the Internet Archive ignores, meaning anyone running stock Heritrix code would also ignore it by default.

The only way you can exclude your site, according to their site [archive.org], is to contact them directly. Obviously too few businesses and sites in general are aware of the perils posed by the Internet Archive, or the Archive would honor the NOARCHIVE tag for those sites with limited access and no robots.txt just to avoid a flood of emails.

OTHER POTENTIAL RISKS

Snap.com has taken screen shots of every web page, then Ask started taking limited screenshots, as did some new completely graphical search engines like SearchMe. Some screen shots have minimal resolution and are too tiny to read, but others, like those from Snap and SearchMe, are big enough to read, and these too can be called evidence in a lawsuit. Even the tiniest thumbnail can still show a licensed trademark being used without permission.

Some of the social bookmarking sites, such as Kaboodle, Jeteye, and Eurekster, allow large chunks of content to be copied, some using tools like Heritrix (see above) to make small archive copies of specific content.

SUMMARY

Obviously there's no way you can completely stop anyone from making copies of your site, but it may pay to be diligent in keeping technologies that provide any form of archive off your site.

This is just another form of insurance that could, in the end, save your business, your house, your car, your family...

buckworks

12:01 am on Aug 26, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I call it "educating".

phranque

12:41 am on Aug 26, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



or advocacy, perhaps.
it's far from an organized operation...

Quadrille

12:57 am on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I call it "educating".

or advocacy, perhaps.

If it was just this one thread, I'd join you :)

wheel

1:35 am on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



lol. the more I think about it, the more I'm inclined to agree with incredibill's concerns in general. In fact, next order of business when I have a moment, block the wayback machine from all my sites. I've already started to block the Google cache.

Considering how I use the wayback machine - competitive research generally - copyright or other legal issues aside, I just don't need my competition snooping on the history of my site.

skipfactor

1:46 am on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>>I think it's weird that some people in this thread are trying to spin incrediBILL's posts to make him look on the fringe, paranoid or having something to hide. I disagree, and I actually think they're the ones who are overreacting.

>>lol. the more I think about it

I don't get it either. Anybody that has really run a blog, forum, or been threatened/sued has to think of this on almost every sketchy post?! Am I insured with no cache/no wayback? Ah yes, no lawyer calls for now; life is good, return to beach activities...

phranque

1:50 am on Aug 26, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If it was just this one thread, I'd join you

did it ever occur to you that the moderator of the Spider ID and Cloaking forums [webmasterworld.com] might be involved in most such discussions on WebmasterWorld?

Quadrille

2:15 am on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes thanks. I noted the connection, too ;)

mr_chill

2:34 am on Aug 26, 2008 (gmt 0)

10+ Year Member



Let me get this straight: I use this meta, <META NAME="ROBOTS" CONTENT="NOARCHIVE">, on each page and
User-agent: ia_archiver
Disallow: /
in the robots.txt, and I'm safe? I agree what I publish is mine and no one else's to display, whether they call it "archive or cache or hey we have it here, why go there". Those who say "don't do it in the first place" must have never made a mistake. Oops! There, I said it, I make mistakes; it must be nice to be so perfect as to "never have done it in the first place".

bluntforce

3:48 am on Aug 26, 2008 (gmt 0)

10+ Year Member



I've used Internet Archive more than once as a "quick and dirty" method of showing that I published before an infringing site. It's fairly clear that it has value for that purpose, if no other.

I'm sure it would be possible for an attorney, or their researcher, to find something that might be potentially actionable on my site by using the internet archive. That's a calculated risk, which is part and parcel of running an internet business.

Do I really care if my competition examines the history of my site? Not really. If they care to dissect and attempt to duplicate what I was doing 3 years, 1 year or even 3 months ago, good for them. I try to envision where I need to be 3 months, 1 year or 3 years from now.

I respect the opinions of everyone who has posted, I've learned some new things, but in a way this is reminding me of why I can't read the Google Search News threads too much. It makes ME paranoid.

skipfactor

4:27 am on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



++"JUMP!"++

incrediBILL

4:59 am on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do I really care if my competition examines the history of my site? Not really. If they care to dissect and attempt to duplicate what I was doing 3 years, 1 year or even 3 months ago, good for them. I try to envision where I need to be 3 months, 1 year or 3 years from now.

Actually, it tells your competition what path you took to get where you are today.

They can see when you try various things, rollback, etc. so why let them avoid any mistakes you made along the way?

FWIW, I too have used the Archive once upon a time in an infringing case but it really doesn't matter because all the DMCA requires is your affidavit claiming it's yours to get the ball rolling. Since disabling the archives I've had no trouble getting stolen content removed because staking a legal claim and signing your name to it carries quite a bit of weight until the other side makes a counter claim which thieves rarely tend to do.

bluntforce

5:53 am on Aug 26, 2008 (gmt 0)

10+ Year Member



@incrediBILL:

My path is varied, if they choose that path, there are many nuances that surely won't show up on an Archive view. They're welcome to play chase.

Archive use works well when not dealing with SE DMCA. Industry specific hosting sites can understand and deal with infringers quickly when presented with independent third party "evidence".

We all have our different takes on what is critical, and what isn't, which is why I referenced the calculated risk we all take.

I appreciate you bringing the topic up, whether I agree or disagree, it's a valid concern that should be discussed.

Tastatura

8:17 am on Aug 26, 2008 (gmt 0)

10+ Year Member



Just a side note tidbit:
For example, the web archive will not store your image if you have a base tag and a relative path to images, such as

<BASE HREF="http://example.com">
and
<IMG SRC="/images/myimage.jpg">

i.e., the archive will not save and display your image, and people "snooping" will not be able to see it (but the source code for it will still be there)

Quadrille

8:43 am on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One possible consequence of this campaign, if it succeeds, is that Google may have to rethink their opt out policy.

It has been hinted (I have no idea how realistically), that wholesale deprivation of Google cache will make Google's life difficult; the logical next step for Google would then be to treat 'cache opt out' as 'Google opt out'. That would be fun :)

Of course a much better SE future would have webmasters opting in - just like the early days - but hopefully with options, such as cache / no cache / no images / sitemap / whatever / ...

Such a system would make it much easier for the SEs to use positive encouragement to webmasters, rather than the current negative graded penalty system, and webmasters would know exactly where they stand. So a future SEO might be saying "I can get you much higher than you are, but you've opted out of cache and you don't have a sitemap, so you'll not make the front page" etc., etc.

sandyk20

12:12 pm on Aug 26, 2008 (gmt 0)



Make sure you include the NOARCHIVE meta tag in each web page so that there is no cache in any of the major search engines

Is this really helpful ?

Lord Majestic

12:48 pm on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Last night I had a nightmare that someone had stolen my content :(

Must not succumb...

BeeDeeDubbleU

3:26 pm on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The week before last I sent out 70 emails to companies who had stolen my terms and conditions more or less verbatim. I had archive.org to prove this.

pageoneresults

3:50 pm on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It has been hinted (I have no idea how realistically), that wholesale deprivation of Google cache will make Google's life difficult; the logical next step for Google would then be to treat 'cache opt out' as 'Google opt out'. That would be fun.

No it wouldn't. That would be a disaster for the SEs and those who have been following protocol.

There are very logical reasons why many would not want their pages cached. Product pages are primary candidates. Pages where content changes frequently, etc. We just haven't really given it much thought over the years as we were in the mindset of getting "everything" indexed, cached, whatever may assist in maintaining visibility within the SEs.

These past couple of years have been a real eye opener for me as I've ventured into different black holes and learned a little in the process. Enough for me to decide that it was time to start blocking the indexing and/or caching of content.

There are too many topics about "this" and "that" with the words "cache" and "archive" mentioned for me not to take notice. And, instead of trying to backtrack and understand what may be going on, I'm just going to opt out for now since the option is available.

Until I see in writing, from the search engines, that they "guarantee" there is no funny business going on with cache and/or Internet archives, I'm taking a proactive stance.

Bad data push?
Oh-oh, there go 100,000 pages indexed and into the cache with a 30% discount on our entire product line.

Author gone postal?
Oh-oh, there goes an entire topic indexed and cached for all to see.

Glitch in user profiles?
Oh-oh, how the hell did that happen? When did it happen? Lawsuit incoming...

The list goes on and on. I'm concerned for the integrity of my clients' data, I really am. Not because there is anything fishy going on there, but because of what may happen to it "after the fact".

incrediBILL

3:54 pm on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



deprivation of Google cache will make Google's life difficult

How so?

Showing cache pages doesn't change how the search works.

The only thing it might make difficult for Google or any other SE is to retain OUR customers longer. Search engines were always supposed to help visitors find us faster, not serve up our content to those visitors. That's when the SE crosses the line between helping a website and competing with the website for its own customer. The longer the SE holds the customer, the longer they have a chance at the PPC revenue.

You think that's fair that they use your site in that manner with your own cache?

Maybe you don't make your money online and it doesn't matter to you that the biggest fish in the pond is trying to become the actual pond itself, but it matters to me!

So a future SEO might be saying "I can get you much higher than you are, but you've opted out of cache

If it actually comes to that, you know we're being literally blackmailed into turning over control of our copyright, because the cache is always there; the only question is whether we allow it to be displayed or not.

Since you brought it up, Google's description of cache even suggests that it needs the cache to judge the relevance of the page, almost implying that if you disable cache it can't tell the relevance, which isn't true at all.

[Google.com...]

Google takes a snapshot of each page examined as it crawls the web and caches these as a back-up in case the original page is unavailable. If you click on the "Cached" link, you will see the web page as it looked when we indexed it. The cached content is the content Google uses to judge whether this page is a relevant match for your query.

I shut off all my cache 3 years ago and it never hurt my site whatsoever.

So did WebmasterWorld and it still dominates the SERPs.

tangor

5:49 pm on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've just implemented the .htaccess suggestion (earlier in the thread). Opted out of wayback several years ago... my earlier sites on hosts like geocities etc archived in wayback were competing with my final website. Had not thought of google cache until recently when log examinations kept turning up odd url entries.

My question is: what time frame would be reasonable to expect google cache (and others) to clear?

I have nothing to hide, but site has a different structure than previous versions and redirecting all those changes would be PITA.

koan

6:22 pm on Aug 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The cached content is the content Google uses to judge whether this page is a relevant match for your query.

Google can still use a cache of our pages for its own purposes, just not display it publicly if we opt out. The phrasing may suggest otherwise, but like you said, I really doubt it has any effect on ranking.

Busynut

4:07 am on Aug 27, 2008 (gmt 0)

10+ Year Member



I really appreciate the info in this thread - thank you, incrediBILL. I'd opted out of ia archiver a long time ago for my hobby sites - not because I had anything to 'hide' really - well, actually I did want to hide something - I was embarrassed how silly my sites looked "way back" then - my graphics were childish, everything about my sites was unprofessional. Still is, probably. No matter, though - it is my choice to attempt some degree of control over my own stuff if I can - although it gets harder every week.

bluntforce

6:49 am on Aug 27, 2008 (gmt 0)

10+ Year Member




"Paranoia runs deep
Into your heart it will creep
It starts when you're always afraid
Step outta line, the Man come and take you away"

From memory, Buffalo Springfield.

greenleaves

6:21 pm on Aug 27, 2008 (gmt 0)

10+ Year Member



Until recently, I was sitting on the fence on this one.

While I was tempted to block archiving and caching of my sites, I was lazy and not really convinced I should.

Then...

One of my sites got hacked and a lot of scraped content was put on certain parts of my site (directories and thousands of pages were created).

I was quick to change passwords and erase the bad content, but when I checked Google, they had just as quickly picked up on the scraped content and it is now in their archive.

God knows when they will update it, and I don't have the time to waste trying to get them to take it down... I'll just let it happen naturally.

But man, I wish I hadn't sat on that fence. Thank you to those "campaigning" against the cache... I wish I had listened earlier.

amznVibe

4:57 am on Aug 28, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



What *really* bothers me is that ia_archiver (run by amazon for alexa) completely ignores the NOARCHIVE meta. You either have to block it entirely or accept its cache. Why don't they just fix that to behave?

At least they aren't running spoofing stealth agents (yet) like Google and MSN
