|too much information|
The good thing about archiving is that it gives you proof that you had the content first. Check archive.org and you can usually find the approximate time that someone added your content to their site, if it has been crawled before.
Banned them, too! ;) I can easily prove the ownership of the content. However, I'm not really asking about methods to stop the theft, rather whether using "noarchive" is accompanied by particular problems when the site is not cloaked.
<Check archive.org and you can usually find the approximate time that someone added your content to their site, if it has been crawled before>
Good advice, I try that sometimes too, but it is probably getting less reliable, as I bet many unscrupulous webmasters are getting wise to this and blocking the archive.org robots on purpose just so we can't tell when they did it.
> last I heard MSN does not. Is that still the case?
I'm using "noarchive" and MSN is respecting it.
Google will only add a "freshdate" if they are allowed to cache the page; MSN adds a freshdate regardless; Yahoo doesn't have freshdates.
Do you think noarchive tag will solve the hijack problem?
|Do you think noarchive tag will solve the hijack problem? |
No. The noarchive is far from a panacea, and the particular situation is quite complex (not just the current wave of scraping, this is more of a plagiarism problem).
With the noarchive I simply want to reduce the number of places where the site content is available. I can keep an eye on visits to the site itself, but I am surrendering control when the content appears on a third-party site such as in a SE cache.
I do not know if the noarchive tag solves the hijack problem,
but I do know that using it has not interfered with (at least one of) my pages PR or SERP placement.
|I'm using "noarchive" and MSN is respecting it. |
That's good news.
|I do know that using it has not interfered with (at least one of) my pages PR or SERP placement. |
Even better news!
So, is it simply inertia that is stopping people from using noarchive, or is it continued concern that it will cause problems with ranking? Is there any advantage for the site owner in allowing the search engines to offer a cached version of a site?
|I can keep an eye on visits to the site itself |
Sounds interesting...but how many sites do you own or manage? Can you really keep an eye on the visits to the site and know the intention of visitors? Some of them may be genuine visitors/customers and some may be looking to steal something from your site. Some of those thieves may be using proxies and programs that change proxies for a browser in just a few mouse clicks. Some of the more expert thieves may be using automated bots to extract content from your site and so on.
|I may ban msnbot completely if they continue to show a cached version. |
I don't know what your site(s) are about, but I do know that MSN traffic is the best-converting traffic for b2c sites. So banning msnbot would be like throwing the baby out with the bathwater, IMHO of course.
iThink, I agree with many of your comments. However, even if it is sometimes hard to detect rogue bots copying content from your site, if they are copying from the Google/Yahoo/MSN cache you've got no chance. At least if they have to come to my server I can put rewrites in place to stop blatant abuse (by banning certain IP addresses and user agents). As I said, noarchive is not a solution to the problem in itself, simply part of the arsenal.
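For anyone who wants to see what "banning certain IP addresses and user agents" boils down to, here is a minimal sketch in Python rather than rewrite rules. The address ranges are documentation-only example ranges and the user-agent fragments are made-up names, so treat every value below as a placeholder, not as a list of real offenders.

```python
import ipaddress

# Placeholder values: documentation-only address ranges and invented bot names.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]
BLOCKED_AGENT_FRAGMENTS = ["examplescraper", "sitecopier"]


def is_blocked(remote_ip, user_agent):
    """Return True if the request should be refused (e.g. with a 403)."""
    ip = ipaddress.ip_address(remote_ip)
    if any(ip in network for network in BLOCKED_NETWORKS):
        return True
    agent = (user_agent or "").lower()
    return any(fragment in agent for fragment in BLOCKED_AGENT_FRAGMENTS)


print(is_blocked("192.0.2.44", "Mozilla/4.0"))
print(is_blocked("203.0.113.9", "ExampleScraper/2.1"))
print(is_blocked("203.0.113.9", "Mozilla/4.0"))
```

A check like this only catches visitors who actually hit the server, which is the point being made above: nothing comparable is possible for copies taken from a search engine cache.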
For the MSN question, balam has confirmed that the noarchive tag is respected, so I've got no problem in allowing their bot in. It would be a shame to ban them, as you have said.
I'm still not convinced that noarchive can be used to block rogue bots, and I am not saying that just for the sake of saying it. Let us say your site has 2000 pages and I want to make a copy of all the content on your site. My web hosting server with a 10 Mbps connection can make a copy of all the text content of your site in just 2-3 minutes. Once I have a copy of your site, I don't need to visit your site again. You can ban my IP; I no longer care. If I ever need to copy your site again, I'll use another server with a different IP, etc. So noarchive simply doesn't help.
So your effort will be like closing the stable door after the horse has bolted.
The only 100% sure solution to this problem looks like cloaking, but that opens another Pandora's box.
|Is there any advantage for the site owner to allow the search engines to offer a cached version of a site? |
I can think of two:
1) when your site is slow or down, some visitors will still be able to see your content
2) easy detection of which version of a page has been indexed (old, new, newest, etc.)
Assuming the site contains internal links, I think having the site cached can be an advantage. As a search engine user, I use the cached versions to take advantage of the keyword highlight feature. Often when I find what I am looking for on the initial cached page, I subsequently visit other pages in the site.
another use of the cache - emergency backup when your server and backup crash at the same time (; (unfortunately I'm not entirely joking!)
I can't see the noarchive tag at webmasterworld..?
|I can't see the noarchive tag at webmasterworld..? |
Probably done with server-side coding.
I agree there are some marginal aspects of the "Cached" link which may be useful, but I'm not convinced that any are particularly compelling. As I said in my original message, I've added noarchive to the entire site (apart from a few fluff pages where I added noindex), and I'll see what happens.
On the front page description to this thread, there is the comment:
|the NoArchive tag has become a requirement for most commercial sites |
Are there many here who use noarchive systematically on some or all sites in particular sectors? Do you use it site-wide or just on select pages?
> Do you use it site-wide or just on select pages?
If you sell a service or widget, it is a requirement, at least on your "buy it now" pages. There's been discussion here before where clients will use cached to make purchases or use it as a bargaining tool to receive services for less money. If you also carry inventory information, buyers will buy the widget from the cached page that is now out of stock.
We use the noarchive tag on all pages on all sites. We started with just the Google tag but then had to change as more SE's started caching.
Have not seen any negative results from doing this.
I have also found MSN disobey most of the regular robot rules (though I'm sure if confronted they will have some excuse prepared) but since adding this:
<meta name="MSSmartTagsPreventParsing" content="TRUE">
they seem to be behaving themselves a bit more. Maybe it says "I know what you're up to and I don't like it" to the bot and he pays attention to the other tags ...
... p.s. if you don't hear from me again, check Bill Gates' alibi for the night I was murdered ...
RE: <meta name="MSSmartTagsPreventParsing" content="TRUE">
This meta tag is useless. M$ changed their minds and reversed their plans about implementing the extended features of SmartTags when the web community voiced a very loud objection.
Although the SmartTag framework still exists for those using Windows/IE, and there is a possibility that some level of this technology may be put into use in the future, currently the tag does absolutely nothing, and everyone I know who initially installed it across their website has since removed it.
However, if you have evidence that the tag actually does something, I think we'd all be interested.
> I have also found MSN disobey most of the regular robot rules
Now I haven't gone through all my logs, but I'm not aware of MSNbot having done anything untoward (on my site - YMMV!). I also just checked their index and found nothing that shouldn't be there. How recently did you catch them misbehaving, internetheaven?
> they seem to be behaving themselves a bit more.
I highly suspect that this is a coincidence, given that...
> This meta tag is useless.
...and it was/is a tag that works against Microsoft technology. It flies in the face of logic, but I do realize we're talking about Microsoft.
> Do you use it site-wide or just on select pages?
Myself, I use it on almost all pages but leave it off a select few. That is, nearly every page has the attribute, but a select few are cacheable by Google. This is for any psychological effect a freshdate may have on a searcher.
SEO means playing tech games with the engines and mind games with searchers.
An update on the noarchive: I added it to the site early yesterday morning, and the site was visited by both Googlebot and msnbot (Yahoo don't love me!) during the day. This morning, MSN Search is showing the same pages in the same positions with a fresh date and no cache link. Absolutely no movement whatsoever, no negative effect at all for MSN at least. Google isn't showing anything new as yet.
It's obviously very early days, but things are looking good so far.
> Absolutely no movement whatsoever, no negative effect at all for MSN at least.
|I have also found MSN disobey most of the regular robot rules |
I've never had any problem with msnbot, either with robots.txt or robots meta tags. You might want to validate your robots.txt to make sure there's no problem.
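One way to sanity-check a robots.txt before blaming the bot is to feed it to a parser and ask what a given user agent may fetch. The sketch below uses Python's standard `urllib.robotparser`; the rules and paths are invented for illustration, and a real crawler's own parser may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Invented example rules; substitute the contents of your own robots.txt.
robots_txt = """\
User-agent: msnbot
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""


def allowed(robots_text, user_agent, path):
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, path)


for agent, path in [
    ("msnbot", "/private/page.html"),
    ("msnbot", "/index.html"),
    ("Googlebot", "/cgi-bin/search"),
]:
    print(agent, path, allowed(robots_txt, agent, path))
```

Note that a bot with its own `User-agent` group follows only that group, so in this example msnbot is not bound by the rules listed under `*`.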
For my site, Google are now omitting the Cached link next to some pages. Again, just like MSN, there has been no perceptible change whatsoever in ranking.
I would like to try adding "noarchive" to my ebusiness pages, but I'm not a code expert so I'm not sure where to put it. Does it go in the html code in the head or where?
aleatrix, just place the following between your <head> and </head> tags:
<meta name="robots" content="noarchive">
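If you want to confirm the tag really made it into a page, a quick check along these lines can help. This is only a sketch using Python's standard `html.parser`; the sample page and the `has_noarchive` helper are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # content may hold several comma-separated directives
        for part in attr.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def has_noarchive(html):
    """Return True if the page carries a robots meta tag with noarchive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


# Made-up sample page with the tag inside the head section.
page = """<html>
<head>
<title>Widgets</title>
<meta name="robots" content="noarchive">
</head>
<body><p>Buy it now</p></body>
</html>"""

print(has_noarchive(page))
```

The check is case-insensitive and also accepts combined values such as `content="noindex, noarchive"`.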
Hate to bring this up from the dead, but have you seen any negative effects yet?
And, to go slightly off topic - how do you do this on the server side a la webmasterworld?
|have you seen any negative effects yet? |
No problems: I've dropped a couple of slots on my primary keyword, but I've got a bit of a 302 problem from a couple of directories and an authority site has jumped ahead of me with two pages on my subject (I'm still in position 3 and 4). Other keywords are fine if not better, and traffic from Google and MSN has increased. The site has always done badly in Yahoo, no change there.
Of course, the only search engines I know support this tag are Google, MSN and Yahoo - it won't block caching by other sites.
|how do you do this on the server side |
Cloaking ;) Either add it for known bots, or do what I'm doing for a new forum I'm launching soon and make it appear only when a user is not logged in.
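As a rough sketch of the two options just described, the server-side decision could look something like this in Python. The function names and the list of bot fragments are assumptions for illustration (Slurp being Yahoo's crawler); this is not how webmasterworld or any particular forum software actually does it.

```python
NOARCHIVE_TAG = '<meta name="robots" content="noarchive">'

# Assumed user-agent fragments for the three engines discussed in this thread.
KNOWN_BOT_FRAGMENTS = ("googlebot", "msnbot", "slurp")


def tag_for_known_bots(user_agent):
    """Option 1: emit the tag only when the visitor looks like a known bot."""
    agent = (user_agent or "").lower()
    if any(fragment in agent for fragment in KNOWN_BOT_FRAGMENTS):
        return NOARCHIVE_TAG
    return ""


def tag_for_guests(logged_in):
    """Option 2: emit the tag for everyone who is not logged in."""
    return "" if logged_in else NOARCHIVE_TAG


def render_head(title, robots_tag):
    """Build a head section, including the robots tag only when present."""
    lines = ["<head>", "<title>" + title + "</title>"]
    if robots_tag:
        lines.append(robots_tag)
    lines.append("</head>")
    return "\n".join(lines)


print(render_head("Forum", tag_for_guests(logged_in=False)))
print(render_head("Forum", tag_for_guests(logged_in=True)))
```

The second option has the advantage of not depending on user-agent sniffing at all, since bots never log in and so always receive the tag.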