Staying off the radar.
Search engines can store results in their "cache" for between a month and forever. As archiving improves, it will get harder to clean up what's been revealed. Rarely are leaks intentional: Somebody at work might post a file on a server to download at home, a wrongly configured server might make too much of a hard drive searchable or a Web site's password-protection might be flimsy enough to be accessible to search engines.
File types never meant to be made public (doc, xls, ppt) should only be indexed when indicated. Only files meant for the web (html, php, asp, xhtml, etc) should be indexed freely. Some are ambiguous, but most are not.
There are only files that are on the web -- i.e. have a link to them -- and those that aren't.
If it is on the web, then it is fair game for a search engine -- unless you tell the engine differently by:
Google will honor those instructions. Other search engines and most humans won't. For real protection, put anything you don't want the world to see behind a password-required barrier. Or don't put it on the web at all.
It is up to the file owner to decide whether a file should be secured against public viewing or not - a search engine bot cannot make such a distinction.
I agree that the average user has no idea how powerful the search engines are and what type of information can be obtained. If it is on a public server you have to assume it will be accessed by someone or something.
this will also reduce significantly the number of useless pages in the google index. look at all the server stats indexed by google that have no use except to the server's owner.
(opt-in) will also reduce significantly the number of useless pages in the google index. look at all the server stats indexed by google that have no use except to the server's owner.
Although some of these webstats are explicitly linked by site owners, I think Google would be doing us all a favour by making webstat-type pages opt-in only. Not only would it squelch a lot of spam, it would make searching for information on robots and user-agents a whole lot easier.
If documents are private, then don't publish them. If your network is insecure, then it is your responsibility - a search bot cannot be expected to distinguish between an intentionally-public server and an inadvertently-public server.
making webstat-type pages opt-in only
1. How can the search bot tell that the page is a web stats page (which you think shouldn't be indexed) and not an example by the sales team of a stats program (which they would want indexed), or any other table of text and figures? The bots can't and don't read, analyse or understand - a web stats page is just the same as any other HTML page.
2. I regularly visit a site which has an explicit link to their stats page in their main menu. They want their stats to be public (for whatever reason). Why should the search engines refuse?
The makers of the stats programs could add a robots meta tag to their generated pages by default, but how do they know that they are doing what their customers want (see 2.)? Furthermore, why should they be responsible any more than the search engines?
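To make the "robots meta tag by default" idea concrete, here is a minimal sketch of what a stats generator could do, assuming it builds its report as an HTML string. The function name `add_robots_meta` and the `indexable` switch are made up for illustration and are not taken from any real stats package. The switch exists because of point 2: the site owner, not the vendor, should have the final say.

```python
import re

# The standard robots meta tag asking compliant crawlers not to index or cache the page.
ROBOTS_META = '<meta name="robots" content="noindex, noarchive">'

# Matches an opening <head> tag (with or without attributes), but not <header>.
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def add_robots_meta(html: str, indexable: bool = False) -> str:
    """Return the page with a robots meta tag inserted after <head>.

    `indexable` is a hypothetical owner setting: True means the owner
    wants the stats page public, so the page is returned unchanged.
    """
    if indexable:
        return html
    match = HEAD_OPEN.search(html)
    if match is None:
        return html  # no <head> found; leave the page untouched
    insert_at = match.end()
    return html[:insert_at] + "\n" + ROBOTS_META + html[insert_at:]


if __name__ == "__main__":
    page = "<html><head><title>Stats</title></head><body>...</body></html>"
    print(add_robots_meta(page))
```

Note that this only asks crawlers to stay away; it does nothing against a crawler or a person who ignores the tag.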
if i left my house door open
Renee, I don't agree. It's not your house door. Your "house" is your personal computer. A webserver is on the World Wide Web. If you deliberately leave your personal documents on a table in the public library then you are by default saying they're public, which would be different to leaving them in the private care of the librarian.
if my house door is open, anybody can look but anybody who takes anything without my permission is stealing. And the thief cannot use the excuse that my door was open.
user-agent: *
disallow:
disallow-cache: /images/
disallow-cache: /presentations/
Note that you cannot use meta tags to prevent caching of images etc.
disallow-archive: would be a sensible synonym.
Kaled.
And if these documents are so sensitive that they wouldn't want the world to know what is in them, they are *stupid*. Googlebot isn't into hacking. It just follows links. If someone puts documents on a web server with links SE bots can find, this isn't just security by obscurity. It's security based on stupidity. DON'T DO THIS!
If it is on the web, then it is fair game for a search engine -- unless you tell the engine differently by:
That's very false, I shouldn't have to tell them not to come to my site! Why should I have to create a robots.txt page and waste more space on my server? I don't care if they come to my site, but when someone says it's fair game then you better think again, I created my site, not Google, Yahoo! or the new MSN!
1. You, because these days no one should be stupid enough to wear just one set of underwear.
2. The inventor, because for many decades people didn't have to wear two sets of underwear.
rfgdxm1, I'm not talking about security but about common sense. Of course one shouldn't place the most sensitive data on the web - that's your own responsibility.
Much to my surprise I just found a reference to GoogleGuy's very first post on Webmasterworld, and look what it's about (and read Brett's answer in msg # 30 as well):
[webmasterworld.com...]
the protocol today is opt-out - meaning if you do not explicitly specify that your site or pages are not to be indexed, then you are implicitly giving your permission.
the issue is whether the protocol should be changed. will it be better for the internet?
this protocol is not a substitute for security or privacy. whichever protocol you use there will always be spammers and hackers. we should continue and always be concerned with security. just like email spam - just because the accepted protocol now is opt-out, it does not mean that email spam will go away. but now we have reason to tell the perpetrators that you did not give your permission to receive spam.
If you want to put up information which is "restricted access" for friends and family, why are you putting it on a completely open system without defining any boundaries? If you stuck a notice on a tree in your street for your friends, how can you expect to keep it private from anyone else passing?
Another analogy: you buy a field in the middle of a public park, and you don't put up a fence, signpost or anything which allows someone to distinguish it from the public space that surrounds it. Can you blame people who inadvertently trespass?
I'll say it again: if you put it on the web, you are already giving explicit permission to access that document by any means. If you don't want unfettered access then you must use available tools like .htaccess and robots.txt to define the rules of access which correspond to your needs.
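For anyone who has never seen the robots.txt side of this in action, here is a rough sketch of the check a well-behaved crawler performs before fetching a URL, using Python's standard `urllib.robotparser`. The rules, the bot name `ExampleBot` and the example.com URLs are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /stats/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/private/notes.doc",
        "http://example.com/stats/2004-03.html",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

The check is entirely voluntary on the crawler's side, which is why it defines rules of access for polite bots and is no replacement for a password.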
I think there is some fundamental misunderstanding of the consequences of the technology here.
If you want to use your website to access personal, private documents via the web without risking them being indexed by search engines then put them in a password-protected folder.
That pretty much says it all. It's the freaking Internet, people - by its very nature everything is "public" unless defined otherwise by the server administrator. If you don't want anyone to see it, don't put it in a public place.
this will also reduce significantly the number of useless pages in the google index.
I think it would have the opposite effect, because a great deal of useful information is created by academics and other subject experts who aren't into SEO, have never heard of robots.txt, and wouldn't know how to (or even that they should) opt in by taking any step beyond the obvious one of transferring their files to a public server.
Public websites are just that, public. The various (legitimate) spiders out there don't actively try to find information you haven't disclosed to anyone else. The whole argument that the public shouldn't be allowed into an area that is not cordoned off from the public seems silly. The whole "X-ray" camera thing isn't even an argument, it's moot because legitimate engines don't do anything that sophisticated.
If folks insist on the house metaphor, leaving unsecured open documents on an equally unsecured server is like dancing naked in your living room with the curtains open and a freeway running by outside, and then being "shocked" that someone talked about it.
Look, the web is public, so is anonymous FTP and (frankly) unsecured email. Your privacy is ultimately your responsibility, not someone else's. If you have materials you wish to keep secure, then by all means secure them.
Cripes, do you people wear your credit-card number on your t-shirts too?