Forum Moderators: open
Staying off the radar.
Search engines can store results in their cache for anywhere from a month to forever, and as archiving improves it will only get harder to clean up what has been revealed. Leaks are rarely intentional: somebody at work might post a file on a server to download at home, a wrongly configured server might make too much of a hard drive searchable, or a Web site's password protection might be flimsy enough for search engines to get past it.
If folks insist on the house metaphor, leaving unsecured open documents on an equally unsecured server is like dancing naked in your living room with the curtains open and a freeway running past outside, and then being "shocked" that someone talked about it.
If I leave my car running, does that give you a right to get in and test drive it?
I swear, webmasters come up with the most lame and irrelevant analogies I've ever seen. You are really reaching, dude. But to answer your question, if you leave your car running in a place that was designed for anyone to jump in and test drive it, then yes.
Can zombies search for credit-card numbers using those clever Visa number spans, or whatever? Why not?
And the search engines aren't stealing it anyway...you still have the pages like someone else said. Most people want the traffic from search engines and the people who don't want it have the responsibility to secure their stuff.
Now let me ask you this: how many webmasters have NO CLUE about search engines? Tons of sites would be out of the search engines if they had to specifically let search engines in.
[edited by: Psycho111 at 7:27 pm (utc) on Aug. 8, 2004]
If you go into someone's backyard without permission you are trespassing, it's the same with a site.
Another bad analogy. Your backyard is private property whereas the public documents on your website are not.
Going back to the story Brett posted about, people will need to learn sooner or later that the web is an open place. If you put a document out in the open some user agent is going to find it and maybe make use of it.
Ignorance of the law is no excuse for breaking it, and the same holds true for the Internet. If you're going to use the web, make sure you understand the implications of posting private material on an open or improperly secured website.
In my opinion it's good advice in general to learn all you can about something before you make use of it.
If I leave my car running does that give you a right to get in and test drive it? No, it's called stealing! The same way with a site.
Your analogy is fundamentally flawed - because the web simply is not like what you describe. It's more like taking your car to a car show, opening all the doors and windows, placing a big sign in front saying "Come take a look!", then complaining when someone takes a photograph.
It is absolutely not stealing, because stealing implies the taking away of something you had. Here we're talking about viewing documents - the fact that one person looks doesn't stop another from doing so.
On a more general point, you can't make a meaningful distinction between "human" and "robot" traffic - one implies the other to at least some extent.
I shouldn't have to post my restrictions; if you don't see any, then leave. If you go into someone's backyard without permission you are trespassing, and it's the same with a site.
How do you propose I get permission to browse your site? Instead of a robots.txt defining who you don't want on your site, how about a file listing every single entity on the Internet you do give permission to access your site?
By putting your site on the Internet, you are the one making it accessible to every entity on the net. If you don't like that, find somewhere else to play.
how about a file listing every single entity on the Internet you do give permission to access your site?
I know you were just kidding but what a great way to bring the flow of new visitors and potential business to a screeching halt. Not to mention what parsing that file would do to server response times for each requested page. ;)
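Tongue-in-cheek aside, robots.txt can in fact be written as an allow-list of sorts, though it only binds crawlers that choose to honor it. A minimal sketch (the crawler name here is just an example):

```text
# Let one named crawler in; ask everything else that reads robots.txt to stay out
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```

An empty Disallow line means "nothing is disallowed," so the named bot may fetch everything while every other compliant bot is asked to stay out. Rogue bots, of course, ignore the file entirely.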
"Furthermore, Your Honor, every zombie under our control checked for a zombies.txt file at Google before commencing scraping operations. None of them found one. Our protocol for zombies.txt has been posted at one of our websites, totally.obscure-dot-com, for weeks now.
"Therefore, Your Honor, we request a dismissal of Google's lawsuit."
It is not a private place like a house or a safe or your table.
If you put something in a public place the public have access to it. That was your choice.
But they don't have rights to steal it or damage it.
If I'm painting in my local park, I cannot stop other people looking at what I am doing. I choose a public place to do it. That was my free choice.
But just being in a public place does not mean my property has become public. The law will (or should) stop passers-by from stealing or vandalising my easel.
Similarly, if I put something on a webserver, I cannot stop people or their agents looking at it. But I do retain all my rights over it -- so the law will (one day!) prevent them from vandalising my computer with viruses or republishing my work without my permission.
What I've lost is a degree of privacy. I'm happy with that. If you are not, several previous posts have suggested ways to increase the degree of privacy. But remember: the web was never designed as a private place.
Does a robot have the right to look at, index, and sometimes reproduce (i.e. cache) anything it finds?
Of course they can - on your web server, you are running a program (IIS, Apache, whatever) designed specifically to do this.
The web is Allow All by default - by placing a document on a public server running a service on port 80 serving such documents you have given permission.
The only other possibility would be Deny All by default - you need to ask permission every time you click a link. Do you think that would work?
There is no real distinction between robots and human visitors - a robot is just an automated tool sent by a human, and a human visitor can use an automated tool unless you specifically exclude them from doing so.
x filetype:htaccess
domain filetype:sql
or whatever filetype is considered important.
So Google indexes not only the file types listed in the advanced search but any file type that is on the server and not protected.
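On Apache, one stopgap for exactly this problem is to refuse to serve the sensitive extensions at all. A sketch only; the extension list is illustrative, and Apache already refuses to serve .ht* files by default:

```apache
# Deny dump and backup file types to all visitors, bots included
<FilesMatch "\.(sql|bak|log|old)$">
    Order allow,deny
    Deny from all
</FilesMatch>
```

This only stops the files being served; the safer fix is still not to leave them on a public server in the first place.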
I guess it could be removed by putting a blank-ish page up at the same URL and getting Google to index it, but that assumes you have control over the site in question.
If Google can come up with a special tag in robots.txt that ALLOWs them to search areas others might be DISALLOWed from it's not too much of a stretch to think there's a way to expand that to specify what file extensions to leave alone.
It's an impractical solution if for no other reason than it would take years for all the bots to recognize and respect the new standard. And there would still be the bots that ignore robots.txt and take whatever they want.
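For the bots that do honor the standard, the compliance check is simple. A sketch using Python 3's standard urllib.robotparser; the rules, bot name, and paths are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks a private directory for all bots.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A well-behaved bot asks before fetching; a rogue bot simply skips this step.
print(parser.can_fetch("AnyBot", "/private/passwords.doc"))  # False
print(parser.can_fetch("AnyBot", "/public/index.html"))      # True
```

The point stands either way: the file is a request, not an access control, and only the polite crawlers ever run this check.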
If you want documents to stay out of Google or other search engines then the only guaranteed solutions are a properly password protected folder, or, and this makes much more sense to me, keep them off the web entirely.
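The "properly password protected folder" route on Apache is classic HTTP Basic authentication. The paths and user name below are hypothetical, and note that Basic auth sends credentials unencrypted unless the connection itself is secured:

```apache
# .htaccess placed in the folder to protect
# (first create the password file: htpasswd -c /home/example/.htpasswd someuser)
AuthType Basic
AuthName "Private documents"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

Crawlers cannot supply credentials, so a folder set up this way never makes it into an index in the first place.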
First, I don't see how any file is intrinsically non-web. Doc files have been listed above as non-web, but I have made very successful education sites based on .doc files that teachers could download. I am happy that Google indexes these file types.
Second, this issue is basically the same as the right to deep link, because that is what search engines do: they create pages with deep links to your server. That issue has been hashed out, and I thought most people agreed that the web is all about deep linking. In the cases that came up, didn't the various courts say that since the technical means to prevent deep linking (or, in our case, to secure documents) exist, it is up to the site owners to use those methods? Website owners do not have a fundamental right to prevent other parties from deep linking. Again, if you don't want it seen in public, don't put it on the Internet, or use technical means to prevent access by any Tom, Dick or Harry.
The web is Allow All by default - by placing a document on a public server running a service on port 80 serving such documents you have given permission.
If someone places a song on a server does anyone have the right to copy it, and post it on another server by default?
i.e. without explicit permission?
If someone places a song on a server does anyone have the right to copy it, and post it on another server by default?
I think no. I guess that is the real issue. It's not whether search engines have the right to index your webpage but whether they have the right to cache it and serve it up to other people.
If you consider each page as a work of art, which I believe web pages could be considered art, then Search Engines have no right to use your art to make money without your permission.
Think of it this way. If I took a public work like a book, copied the cover, gave a short summary, displayed every page in full, put it all on a website, and made advertising money from displaying it, I would be in trouble.
Google is not doing a public service; they are making money off of displaying other people's work.
To take it a step further, would you argue that you could display the books because the publisher didn't tell you that you couldn't? No. You have to get permission from the publisher, just as you should have to get permission from the website owner.
You really need to do some reading about how copyright and Fair Use works.
Commercial use CAN be considered Fair Use. Whether or not the use is commercial is only one of the factors involved in determining Fair Use.
Does Google go beyond Fair Use? Feel free to be the test case and file suit. But before you do, I suggest you learn more about copyright than you currently know.
People can look into your car. If you don't like that, don't leave your house (i.e. keep it on an intranet). You can put tint on the windows to slow people down (i.e. passwords, encryption, etc.), but in the end, if someone wants to walk up to your car in a parking lot and put their nose up against the glass, they are gonna see into your car. (Now, if they put a brick through the window to take something belonging to you out of the car, that's entirely different - like webpage content theft.)
Having a website on the internet requires either certain technical knowledge or using the knowledge of techs. You can't just magically be on the internet. Claiming that you should not have to know about robots.txt is like claiming you should be able to have a web page without knowing how to create or post a page. (You can drive a car without having a license but sooner or later you are gonna get into major trouble - should you have known better? should someone have told you that you need to know the rules of driving or did you just assume you could get into a car and start driving around?)
If you do end-runs around the knowledge required to post and operate a website (i.e. by using FrontPage without learning what is really happening), you are still subject to "the rules" of how things work. Ignorance is no excuse; you are the one that wanted a website and did it without learning or purchasing the right knowledge.
Now if search engines did not offer a way to remove something that was indexed that you didn't want to be, *that* would be an outrage. But it's not the case.
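And for prevention rather than after-the-fact removal, the major engines also honor a page-level robots meta tag placed in the page's head; for example:

```html
<!-- keep the page out of the index entirely -->
<meta name="robots" content="noindex">
<!-- or allow indexing but suppress the cached copy -->
<meta name="robots" content="noarchive">
```

So even a site owner who can't touch robots.txt on the server root has a per-page opt-out available.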
Keeping a cached copy of a page and making it public is like someone taking a photograph of your car in public and showing someone else what was inside your car when you took it out that day. Again, if you don't like it, don't take your car out on the road. They could not have legally taken that picture in your garage.
Stop whining.
I thought we were having a discussion among adults, not a 5-year-old's "crybaby" conversation.
I realize I'm not a copyright expert, but I don't think anybody could get away with doing as I mention above with the books whether it is for commercial use or not.
Having a website on the internet requires either certain technical knowledge or using the knowledge of techs. You can't just magically be on the internet. Claiming that you should not have to know about robots.txt is like claiming you should be able to have a web page without knowing how to create or post a page.
You can write a book without knowing anything except how to work a pencil and paper (maybe an eraser). Just because you don't know how to print it, type it, or know the copyright laws doesn't mean that other people can use it. Could somebody steal my book and say "well, you didn't know how to set the press, choose the ink, or apply the copyright laws, therefore I can use an exact copy of it to make money"? Can any person now take a copy of my book and copy it? NO! I don't need to know copyright laws to know that.
Keeping a cached copy of a page and making it public is like someone taking a photograph of your car in public and showing someone else what was inside your car when you took it out that day. Again, if you don't like it, don't take your car out on the road. They could not have legally taken that picture in your garage.
So are you saying that I could go check a book out of the library, take photos of every page, run copies, staple them together, and legally sell the book because the library is a public place? I don't think so. I could be sued by the publisher. If I went to court and said, "well, the publisher never contacted me to tell me that I couldn't do this", or "the author didn't know the copyright laws, thus I can do this," I would be laughed out of court.
I'm not saying I'm against search engines indexing my site or caching pages; in fact, I allow them to do both. All I'm arguing is the legality of it.
the legality of it
The web is the most public medium on earth - that's the beauty of it. Webservers serve the web and Google indexes it, and as part of their service they cache the documents that people deliberately put there in a public place. That's how it works and I doubt if there's any illegality involved, especially as these are the generally understood and accepted working principles of the medium. And the cached document is still presented as your document and not Google's. The rest of it is just an index of documents that were deliberately published. I don't think copyright comes into this at all.
By the way, taking a photograph (xerox) of pages in a book for PERSONAL use is actually legal, I believe (IANAL), but selling it is the illegal part. If not, then every public library and college library in the USA is aiding and abetting criminals.
[edited by: amznVibe at 9:12 am (utc) on Aug. 9, 2004]