I think Google and other search engines are stepping over a line when indexing non-web file types. Indexing of those documents should be opt-in. A spider doesn't have the right to open my post, or even read it, even if I left it lying open on the table.
File types never meant to be made public (doc, xls, ppt) should only be indexed when explicitly indicated. Only files meant for the web (html, php, asp, xhtml, etc.) should be indexed freely. Some are ambiguous, but most are not.
I saw this article last week in the magazine and the thing that amazed me the most was that there are now articles about how to stay out of Google.
Hitprof - great analogy about the mail. I guess that PDFs would be one of the more ambiguous file types.
Unfortunately, hitprof, there is no such thing as non-web file types.
There are only files that are on the web -- i.e. have a link to them -- and those that aren't.
If it is on the web, then it is fair game for a search engine -- unless you tell the engine differently by: putting the file in a folder banned by robots.txt, or
having the linking page carry a robots meta tag for noindex and/or nofollow and/or noarchive.
Google will honor those instructions. Other search engines and most humans won't. For real protection, put anything you don't want the world to see behind a password-required barrier - or don't put it on the web at all.
Any personal files transferred anywhere should be encrypted, though it may only be a matter of time before SEs start using supercomputers to break into even those.
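The crawler's side of that bargain can be sketched with Python's standard-library robots.txt parser. The example.com URLs and the /private/ rule below are made up for illustration:

```python
# Sketch of how a compliant crawler consults robots.txt before fetching.
# The rules and URLs are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A banned folder is off-limits to every compliant user-agent...
print(rp.can_fetch("Googlebot", "http://example.com/private/budget.xls"))  # False
# ...while everything else remains fair game.
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))          # True
```

The per-page equivalent is a robots meta tag such as `<meta name="robots" content="noindex,nofollow,noarchive">` in the linking page's head.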
It seems more a case of blaming Google and other search engines for one's own lax security procedures. We are talking about publicly accessible web servers here. The idea of "non-web" documents won't work either - if you place a document in a public place, you should expect it to be read: by search engine bots, curious visitors, crackers and the rest. There is no reason why a Word document, a PDF file or anything else shouldn't be indexed when the person has chosen to make it available to all.
It is up to the file owner to decide whether a file should be secured against public viewing or not - a search engine bot cannot make such a distinction.
I don't think this is anything particularly new to regular readers of this forum. Surely most here have typed robots.txt, /stats, /weblog, etc. into a competitor's URL.
I agree that the average user has no idea how powerful the search engines are and what type of information can be obtained. If it is on a public server, you have to assume it will be accessed by someone or something.
I think the protocol should be changed to opt-in, not opt-out - just like email lists.
If I left my house door open, it does not mean I am giving permission for anybody and everybody to come in and do what they please with my property.
|i think the protocol should be changed to opt-in, not opt-out - just like email lists. |
It is opt-in. If you opt-in, you make your document accessible. If you don't want it indexed, don't make it accessible. Sounds simple enough to me.
No. Opt-in means you give explicit permission to be included in the index. Opt-out is the current protocol, whereby you give explicit instructions to be excluded through robots.txt and the meta tags.
Opt-in will also significantly reduce the number of useless pages in the Google index. Look at all the server stats indexed by Google that have no use except to the server's owner.
|(opt-in) will also reduce significantly the number of useless pages in the google index. look at all the server stats that are indexed by google that has no use except to the server's owner. |
Although some of these webstats are explicitly linked by site owners, I think Google would be doing us all a favour by making webstat-type pages opt-in only. Not only would it squelch a lot of spam, it would make searching for information on robots and user-agents a whole lot easier.
Opt-in is when you explicitly place documents on a public web server - in doing so, you are indicating that such documents are for public consumption, and therefore the search bots are completely correct in indexing them. If they are not for public viewing, what are they doing there? To reuse your analogy, you are opening your house to the public for viewing (not for stealing) without any restriction on who can enter. If there are rooms you don't want people to view, then you need to lock the doors. If there are certain visitors you don't want at all, then you need someone controlling access at the front door (.htaccess, robots.txt).
If documents are private, then don't publish them. If your network is insecure, then it is your responsibility - a search bot cannot be expected to distinguish between an intentionally public server and an inadvertently public server.
|making webstat-type pages opt-in only |
1. How can the search bot tell that the page is a web stats page (which you think shouldn't be indexed) and not an example by the sales team of a stats program (which they would want indexed), or any other table of text and figures? The bots can't and don't read, analyse or understand - a web stats page is just the same as any other HTML page.
2. I regularly visit a site which has an explicit link to their stats page in their main menu. They want their stats to be public (for whatever reason). Why should the search engines refuse?
The makers of the stats programs could add a robots meta tag to their generated pages by default, but how do they know that they are doing what their customers want (see 2.)? Furthermore, why should they be responsible any more than the search engines?
|if i left my house door open |
Renee, I don't agree. It's not your house door. Your "house" is your personal computer. A webserver is on the World Wide Web. If you deliberately leave your personal documents on a table in the public library then you are by default saying they're public, which would be different to leaving them in the private care of the librarian.
By your definition, an email address published on a web page is fair game to any and all spammers. This is the reason the concepts of opt-in and opt-out came about: opt-in is simply giving explicit permission for a third party to utilize/incorporate what belongs to me. Absence of explicit permission does not mean permission. This is a matter of protocol. I'm not disagreeing that this is the current protocol; I'm just proposing that it should be changed.
If my house door is open, anybody can look, but anybody who takes anything without my permission is stealing. And the thief cannot use the excuse that my door was open.
>>if my house door is open, anybody can look
Here you go. They are just taking a photo of your house (the cache), not taking the actual object (i.e. your pages still exist on your server - your house).
If search engines are going to display cached pages, they should support an enhanced robots.txt standard. The following would be ok
Note that you cannot use meta tags to prevent caching of images etc.
disallow-archive: would be a sensible synonym.
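For illustration only, here is a sketch of what such an extension might look like. Neither directive below is part of the actual robots.txt convention, which only defines User-agent and Disallow lines; the robots meta tag is the only real way to request noarchive today, and it only works for HTML pages, not images:

```
# HYPOTHETICAL robots.txt extension -- no search engine supports these:
User-agent: *
Noarchive: /reports/
Disallow-archive: /reports/   # the proposed synonym
```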
Are you giving permission for anybody to take pictures of you doing private things in your house and publish them on the web and in all media? And they'll say you're fair game - you didn't draw your window drapes!
I fully agree with renee. Many people place documents on the web for themselves and for friends. That's not the same as for the rest of the world. Why would anyone bother to find it? Why should anyone find it? In a normal search, I mean - I'm not talking about "burglars" (hackers, for example).
>I fully agree with renee. Many people place documents on the web for themselves and for friends. That's not the same as for the rest of the world. Why would anyone bother to find it? Why should anyone find it? In a normal search I mean, I'm not talking about "burglers" (hackers e.g.).
And if these documents are sensitive, such that they wouldn't want the world to know what is in them, they are *stupid*. Googlebot isn't into hacking; it just follows links. If someone puts documents on a web server with links SE bots can find, this isn't just security by obscurity - it's security based on stupidity. DON'T DO THIS!
That's just not true - I shouldn't have to tell them not to come to my site! Why should I have to create a robots.txt file and waste more space on my server? I don't care if they come to my site, but when someone says it's fair game, then you'd better think again. I created my site, not Google, Yahoo! or the new MSN!
|If it is on the web, then it is fair game for a search engine -- unless you tell the engine differently by: |
If someone invented an x-ray camera that was able to take pictures of people and see through their clothes, unless these people had on two sets of underwear, who is responsible for making sure you aren't photographed?
1. You, because these days no one should be stupid enough to wear just one set of underwear.
2. The inventor, because for many decades people didn't have to wear two sets of underwear.
Well said, Scarecrow.
rfgdxm1, I'm not talking about security but about common sense. Of course one shouldn't place the most sensitive data on the web - that's your own responsibility.
Much to my surprise I just found a reference to GoogleGuy's very first post on WebmasterWorld, and look what it's about (and read Brett's answer in msg #30 as well):
Somebody invented email. Anybody can send you an email. You are now getting a lot of spam. Whose fault is it?
1. You, because you are stupid enough to have an email address.
2. The inventor of email.
Ridiculous reasoning, isn't it?
If you want to use your website to access personal, private documents via the web without risking them being indexed by search engines then put them in a password-protected folder.
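With Apache, for instance, that can be as simple as an .htaccess file. This is a minimal sketch; the AuthUserFile path is an assumption, and the password file must first be created with the htpasswd utility:

```
# .htaccess -- require a login for this folder and everything below it
AuthType Basic
AuthName "Private documents"
# Hypothetical path; point this at your own htpasswd file
AuthUserFile /home/example/.htpasswd
Require valid-user
```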
The issue is whether the protocol should be opt-in or opt-out for search engines.
The protocol today is opt-out - meaning if you do not explicitly specify that your site or pages are not to be indexed, then you are implicitly giving your permission.
The issue is whether the protocol should be changed. Would it be better for the internet?
This protocol is not a substitute for security or privacy. Whichever protocol you use, there will always be spammers and hackers, so we should continue to be concerned with security. Just like email spam: just because the accepted protocol now is opt-in, it does not mean that email spam will go away. But now we have reason to tell the perpetrators that you did not give your permission to receive spam.
If you put a document online, do I have to contact you to get permission to look at it? Am I forced to use one particular way of accessing that document (eg. via a browser) or can I use other methods (eg. wget or another spider-like download tool)? How am I supposed to know what your restrictions are if you don't specify them? I can't contact every site in advance before clicking on a link - the web just doesn't work like that.
If you want to put up information which is "restricted access" for friends and family, why are you putting it on a completely open system without defining any boundaries? If you stuck a notice on a tree in your street for your friends, how could you expect to keep it private from anyone else passing?
Another analogy: you buy a field in the middle of a public park, and you don't put up a fence, signpost or anything else which allows someone to distinguish it from the public space that surrounds it. Can you blame people who inadvertently trespass?
I'll say it again: if you put it on the web, you are already giving explicit permission to access that document by any means. If you don't want unfettered access then you must use available tools like .htaccess and robots.txt to define the rules of access which correspond to your needs.
I think there is some fundamental misunderstanding of the consequences of the technology here.
|If you want to use your website to access personal, private documents via the web without risking them being indexed by search engines then put them in a password-protected folder. |
That pretty much says it all. It's the freaking Internet, people - by its very nature everything is "public" unless defined otherwise by the server administrator. If you don't want anyone to see it, don't put it in a public place.
|this will also reduce significantly the number of useless pages in the google index. |
I think it would have the opposite effect, because a great deal of useful information is created by academics and other subject experts who aren't into SEO, have never heard of robots.txt, and wouldn't know how to (or even that they should) opt in by taking any step beyond the obvious one of transferring their files to a public server.
I'm confused by why this is an issue.
Public websites are just that: public. The various (legitimate) spiders out there don't actively try to find information you haven't disclosed to anyone else. The whole argument that the public shouldn't be allowed into an area that is not cordoned off from the public seems silly. The whole "X-ray camera" thing isn't even an argument; it's moot, because legitimate engines don't do anything that sophisticated.
If folks insist on the house metaphor: leaving unsecured documents on an equally unsecured server is like dancing naked in your living room with the curtains open and a freeway passing by outside, and then being "shocked" that someone talked about it.
Look, the web is public, so is anonymous FTP and (frankly) unsecured email. Your privacy is ultimately your responsibility, not someone else's. If you have materials you wish to keep secure, then by all means secure them.
Cripes, do you people wear your credit-card number on your t-shirts too?