
General Search Engine Marketing Issues Forum

    
Using "noarchive" to combat content theft
Risks and rewards?
encyclo




msg:229838
 3:38 pm on Jan 24, 2005 (gmt 0)

I have come across a case of content theft from a website I control. I believe that the contents are being copied from the Google cache rather than directly from the original website. Even if they aren't, I have nevertheless decided to add
<meta name="robots" content="noarchive"> to the site as an experiment in order to reduce the number of copies of the site around. It seems to me that the SERPs cache has very little value to the general user, and several disadvantages to the site owner - and not just the fact that it could be considered a copyright infringement in itself. But I have a few questions:

1. I am aware that Google and Yahoo claim that using the noarchive value will not affect spidering or ranking. Does anyone have any hard evidence to the contrary? The site in question does not use cloaking in any form.

2. I believe both Yahoo and Google respect the above meta tag, but the last I heard MSN does not. Is that still the case? I may ban msnbot completely if they continue to show a cached version.

3. If there is no disadvantage to using noarchive, why isn't everyone doing it? The only large site I know of that uses noarchive is WebmasterWorld.

 

too much information




msg:229839
 3:51 pm on Jan 24, 2005 (gmt 0)

You could also use a JavaScript to redirect to your site if it is being displayed on another URL, such as a cached version. I had a frame-busting script that actually did that very effectively. Then it doesn't matter if it is archived or not.

The good thing about archiving is that it gives you proof that you had the content first. Check archive.org and you can usually find the approximate time that someone added your content to their site, if it has been crawled before.

encyclo




msg:229840
 4:03 pm on Jan 24, 2005 (gmt 0)

Thanks for your reply, TMI. I'm aware of the Javascript trick, but it is easy to bypass (just toggle Javascript).

archive.org

Banned them, too! ;) I can easily prove the ownership of the content. However, I'm not really asking about methods to stop the theft, rather whether using "noarchive" is accompanied by particular problems when the site is not cloaked.

walrus




msg:229841
 4:28 pm on Jan 24, 2005 (gmt 0)

<Check archive.org and you can usually find the approximate time that someone added your content to their site, if it has been crawled before>
Good advice, I try that sometimes too, but it is probably getting less reliable, as I bet many unscrupulous webmasters are getting wise to this and blocking archive.org robots on purpose now just so we can't tell when they did it.

balam




msg:229842
 5:14 pm on Jan 24, 2005 (gmt 0)

> last I heard MSN does not. Is that still the case?

I'm using "noarchive" and MSN is respecting it.

<added>
Google will only add a "freshdate" if they are allowed to cache the page; MSN adds a freshdate regardless; Yahoo doesn't have freshdates.
</added>

Imaster




msg:229843
 6:07 pm on Jan 24, 2005 (gmt 0)

Do you think the noarchive tag will solve the hijack problem?

encyclo




msg:229844
 6:18 pm on Jan 24, 2005 (gmt 0)

Do you think the noarchive tag will solve the hijack problem?

No. The noarchive is far from a panacea, and the particular situation is quite complex (not just the current wave of scraping, this is more of a plagiarism problem).

With the noarchive I simply want to reduce the number of places where the site content is available. I can keep an eye on visits to the site itself, but I am surrendering control when the content appears on a third-party site such as in a SE cache.

keyplyr




msg:229845
 6:18 pm on Jan 24, 2005 (gmt 0)

I do not know if the noarchive tag solves the hijack problem,
but I do know that using it has not interfered with (at least one of) my pages' PR or SERP placement.

encyclo




msg:229846
 6:36 pm on Jan 24, 2005 (gmt 0)

balam:
I'm using "noarchive" and MSN is respecting it.

That's good news.

keyplyr:
I do know that using it has not interfered with (at least one of) my pages' PR or SERP placement.

Even better news!

So, is it simply inertia which is stopping people from using noarchive, or is it continued concern that it will cause problems with ranking? Is there any advantage for the site owner to allow the search engines to offer a cached version of a site?

iThink




msg:229847
 6:40 pm on Jan 24, 2005 (gmt 0)

I can keep an eye on visits to the site itself

Sounds interesting...but how many sites do you own or manage? Can you really keep an eye on the visits to the site and know the intention of visitors? Some of them may be genuine visitors/customers and some may be looking to steal something from your site. Some of those thieves may be using proxies and programs that change proxies for a browser in just a few mouse clicks. Some of the more expert thieves may be using automated bots to extract content from your site and so on.

I may ban msnbot completely if they continue to show a cached version.

I don't know what your site(s) are about, but I do know that MSN traffic is the best converting traffic for B2C sites. So banning msnbot will be like throwing the baby out with the bathwater, IMHO of course.

encyclo




msg:229848
 6:50 pm on Jan 24, 2005 (gmt 0)

iThink, I agree with many of your comments. However, even if it is sometimes hard to detect rogue bots copying content from your site, if they are copying from the Google/Yahoo/MSN cache you've got no chance. At least if they have to come to my server I can put rewrites in place to stop blatant abuse (by banning certain IP addresses and user agents). As I said, noarchive is not a solution to the problem in itself, simply part of the arsenal.
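The kind of filtering described above (refusing requests from certain IP addresses and user agents) can be sketched roughly as follows. The post refers to server rewrite rules; this is the same idea expressed as application code instead, and the function name `is_blocked`, the address range and the user-agent substrings are all placeholders chosen for illustration, not values taken from this thread.

```python
import ipaddress

# Placeholder ban lists for illustration only (assumptions, not from the thread).
BANNED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
BANNED_AGENT_TOKENS = ("wget", "httrack")


def is_blocked(remote_addr: str, user_agent: str) -> bool:
    """Return True if the request should be refused."""
    addr = ipaddress.ip_address(remote_addr)
    # Refuse anything coming from a banned address range.
    if any(addr in network for network in BANNED_NETWORKS):
        return True
    # Refuse user agents containing a banned substring (case-insensitive).
    agent = user_agent.lower()
    return any(token in agent for token in BANNED_AGENT_TOKENS)
```

As iThink points out in this thread, a determined scraper can change both its address and its user-agent string, so a check like this only stops the blatant cases.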

For the MSN question, balam has confirmed that the noarchive tag is respected, so I've got no problem in allowing their bot in. It would be a shame to ban them, as you have said.

iThink




msg:229849
 7:12 pm on Jan 24, 2005 (gmt 0)

I'm still not convinced that noarchive can be used to block rogue bots, and I am not saying that just for the sake of saying it. Let us say your site has 2000 pages and I want to make a copy of all the content on your site. My web hosting server with a 10 Mbps connection can make a copy of all the text content of your site in just 2-3 minutes. Once I have a copy of your site, I don't need to visit your site again. You can ban my IP; I no longer care. If I ever need to copy your site again, I'll use another server with a different IP, etc. So noarchive simply doesn't help.

So your effort will be like closing the stable door after the horse has bolted.

The only 100% sure solution to this problem looks like the use of cloaking, but then that opens another Pandora's box.

HitProf




msg:229850
 7:43 pm on Jan 24, 2005 (gmt 0)

Is there any advantage for the site owner to allow the search engines to offer a cached version of a site?

I can think of two:
1) when your site is slow or down, some visitors will still be able to see your content
2) easy detection of which version of a page has been indexed (old, new, newest, etc.)

jim2003




msg:229851
 7:50 pm on Jan 24, 2005 (gmt 0)

Hello,

Assuming the site contains internal links, I think having the site cached can be an advantage. As a search engine user, I use the cached versions to take advantage of the keyword highlight feature. Often when I find what I am looking for on the initial cached page, I subsequently visit other pages in the site.

Regards,

musicales




msg:229852
 8:13 pm on Jan 24, 2005 (gmt 0)

another use of the cache - emergency backup when your server and backup crash at the same time (; (unfortunately I'm not entirely joking!)

I can't see the noarchive tag at webmasterworld..?

Imaster




msg:229853
 8:28 pm on Jan 24, 2005 (gmt 0)

I can't see the noarchive tag at webmasterworld..?

Probably server-side coding.

encyclo




msg:229854
 12:59 am on Jan 25, 2005 (gmt 0)

I agree there are some marginal aspects of the "Cached" link which may be useful, but I'm not convinced that any are particularly compelling. As I said in my original message, I've added noarchive to the entire site (apart from a few fluff pages where I added noindex), and I'll see what happens.

On the front page description to this thread, there is the comment:

the NoArchive tag has become a requirement for most commercial sites

Are there many here who use noarchive systematically on some or all sites in particular sectors? Do you use it site-wide or just on select pages?

sun818




msg:229855
 4:49 am on Jan 25, 2005 (gmt 0)

> Do you use it site-wide or just on select pages?

If you sell a service or widget, it is a requirement, at least on your "buy it now" pages. There's been discussion here before where clients will use cached pages to make purchases, or use them as a bargaining tool to receive services for less money. If you also carry inventory information, buyers will buy a widget from the cached page when it is now out of stock.

Visit Thailand




msg:229856
 4:59 am on Jan 25, 2005 (gmt 0)

We use the noarchive tag on all pages on all sites. We started with just the Google tag but then had to change as more SEs started caching.

Have not seen any negative results from doing this.

internetheaven




msg:229857
 6:40 am on Jan 25, 2005 (gmt 0)

I have also found MSN disobey most of the regular robot rules (though I'm sure if confronted they will have some excuse prepared) but since adding this:

<meta name="MSSmartTagsPreventParsing" content="TRUE">

they seem to be behaving themselves a bit more. Maybe it says "I know what you're up to and I don't like it" to the bot and he pays attention to the other tags ...

... p.s. if you don't hear from me again, check Bill Gates' alibi for the night I was murdered ...

keyplyr




msg:229858
 9:41 am on Jan 25, 2005 (gmt 0)


RE: <meta name="MSSmartTagsPreventParsing" content="TRUE">

This meta tag is useless. M$ changed their minds and reversed their plans about implementing the extended features of SmartTags when the web community voiced a very loud objection.

Although the SmartTag framework still exists for those using Windows/IE and there is a possibility that some level of this technology may be put into use in the future, currently the tag does absolutely nothing, and everyone I know who initially installed it across their website has since removed it.

However, if you have evidence that the tag actually does something, I think we'd all be interested.

balam




msg:229859
 2:28 pm on Jan 25, 2005 (gmt 0)

> I have also found MSN disobey most of the regular robot rules

Now I haven't gone through all my logs, but I'm not aware of MSNbot having done anything untoward (on my site - YMMV!). I also just checked their index and found nothing that shouldn't be there. How recently did you catch them misbehaving, internetheaven?

> they seem to be behaving themselves a bit more.

I highly suspect that this is a coincidence, given that...

> This meta tag is useless.

...and it was/is a tag that works against Microsoft technology. It flies in the face of logic, but I do realize we're talking about Microsoft.

> Do you use it site-wide or just on select pages?

Myself, I do not use it on select pages. That is, most all pages have the attribute but a select few are cachable by Google. This is for the/any psychological effect a freshdate may have on a searcher.

SEO means playing tech games with the engines and mind games with searchers.

encyclo




msg:229860
 3:05 pm on Jan 25, 2005 (gmt 0)

An update on the noarchive: I added it to the site early yesterday morning, and the site was visited by both Googlebot and msnbot (Yahoo don't love me!) during the day. This morning, MSN Search is showing the same pages in the same positions with a fresh date and no cache link. Absolutely no movement whatsoever, no negative effect at all for MSN at least. Google isn't showing anything new as yet.

It's obviously very early days, but things are looking good so far.

balam




msg:229861
 3:19 pm on Jan 25, 2005 (gmt 0)

> Absolutely no movement whatsoever, no negative effect at all for MSN at least.

Excellent news!

encyclo




msg:229862
 1:13 am on Jan 26, 2005 (gmt 0)

I have also found MSN disobey most of the regular robot rules

I've never had any problem with msnbot, either with robots.txt or robots meta tags. You might want to validate your robots.txt to make sure there's no problem.
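One way to sanity-check a robots.txt file along these lines is to feed it to Python's standard `urllib.robotparser` module and ask whether a given bot may fetch a given URL. This is only a rough sketch: the rules, the helper name `allowed` and the example.com URLs are made up for illustration, and a real crawler's interpretation may differ from the standard library's.

```python
from urllib.robotparser import RobotFileParser

# Example rules only (an assumption); substitute your own robots.txt contents.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Print the verdict for a few sample paths as seen by msnbot.
    for path in ("/index.html", "/private/report.html", "/cgi-bin/search"):
        url = "http://www.example.com" + path
        print(path, allowed(ROBOTS_TXT, "msnbot", url))
```

Note that a bot with its own `User-agent` section follows that section only, so in this example msnbot is not bound by the `/cgi-bin/` rule in the `*` section. That is a common source of surprises when a bot appears to ignore a rule.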

For my site, Google are now omitting the Cached link next to some pages. Again, just like MSN, there has been no perceptible change whatsoever in ranking.

aleatrix




msg:229863
 8:25 pm on Jan 29, 2005 (gmt 0)

I would like to try adding "noarchive" to my ebusiness pages, but I'm not a code expert so I'm not sure where to put it. Does it go in the HTML code in the head, or where?

encyclo




msg:229864
 8:28 pm on Jan 29, 2005 (gmt 0)

aleatrix, just place the following between your
<head> and </head> tags:

<meta name="robots" content="noarchive">
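If you want to double-check that the tag ended up in the right place, a small script along these lines can help. It is a minimal sketch using Python's standard `html.parser` module; the class and function names are made up for this example, and it only looks at `<meta name="robots">` tags inside `<head>`. A misspelled attribute (for example `contents` instead of `content`) will show up as "not found".

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect directives from <meta name="robots"> tags found inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "robots":
                # content may hold several comma-separated directives.
                content = attr.get("content") or ""
                for part in content.split(","):
                    if part.strip():
                        self.directives.add(part.strip().lower())

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def has_noarchive(html: str) -> bool:
    """Return True if the page's head carries a robots noarchive directive."""
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noarchive" in finder.directives
```

For example, `has_noarchive(open("index.html").read())` would check a saved copy of a page (the file name here is just a placeholder).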

vabtz




msg:229865
 3:40 pm on Mar 10, 2005 (gmt 0)

hate to bring this up from the dead but have you seen any negative effects yet?

oddsod




msg:229866
 3:48 pm on Mar 10, 2005 (gmt 0)

And, to go slightly Off topic - how do you do this on the server side a la webmasterworld?

encyclo




msg:229867
 4:04 pm on Mar 10, 2005 (gmt 0)

have you seen any negative effects yet?

No problems: I've dropped a couple of slots on my primary keyword, but I've got a bit of a 302 problem from a couple of directories and an authority site has jumped ahead of me with two pages on my subject (I'm still in position 3 and 4). Other keywords are fine if not better, and traffic from Google and MSN has increased. The site has always done badly in Yahoo, no change there.

Of course, the only search engines I know support this tag are Google, MSN and Yahoo - it won't block caching by other sites.

how do you do this on the server side

Cloaking ;) Either add it for known bots, or do what I'm doing for a new forum I'm launching soon and make it appear only when a user is not logged in.
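A rough sketch of that server-side logic, for anyone wondering what it might look like. This is not how WebmasterWorld does it (nobody in this thread has seen their code); the function name, the list of bot substrings and the `logged_in` flag are all assumptions made for the example, and here the two approaches are simply combined with an "or".

```python
# Assumed user-agent substrings for the three engines discussed in this thread.
KNOWN_BOT_TOKENS = ("googlebot", "msnbot", "slurp")

NOARCHIVE_TAG = '<meta name="robots" content="noarchive">'


def robots_meta_tag(user_agent: str, logged_in: bool) -> str:
    """Return the tag to print inside <head>, or an empty string."""
    agent = user_agent.lower()
    is_known_bot = any(token in agent for token in KNOWN_BOT_TOKENS)
    # Emit the tag for known bots, and for any visitor who is not logged in.
    if is_known_bot or not logged_in:
        return NOARCHIVE_TAG
    return ""
```

Since spiders do not log in, the "not logged in" test alone already covers them; the user-agent check only matters if you want the tag shown to bots and hidden from everyone else.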
