Forum Moderators: open

Google Me Not

Staying off the Radar or Trashing the Cache

         

Brett_Tabke

12:48 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



[forbes.com...]

Staying off the radar.

Search engines can store results in their "cache" for between a month and forever. As archiving improves, it will get harder to clean up what's been revealed. Rarely are leaks intentional: Somebody at work might post a file on a server to download at home, a wrongly configured server might make too much of a hard drive searchable or a Web site's password-protection might be flimsy enough to be accessible to search engines.

robotsdobetter

6:43 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I leave my car running does that give you a right to get in and test drive it? No, it's called stealing! It's the same with a site.

Psycho111

6:47 pm on Aug 8, 2004 (gmt 0)

10+ Year Member



If folks insist on the house metaphor, leaving unsecured open documents on an equally unsecured server is like dancing naked in your living room with the curtains open and a freeway running by outside, and then being "shocked" that someone talked about it.

LOL, very well put. This indeed is a silly argument... if you don't want someone taking your household goods, YOU LOCK YOUR HOME. The same thing applies to the internet... if you don't want everyone looking at it, then password protect the file(s). Search engines aren't burglars; they will comply with your requests, but if you leave it out in the open, they will crawl it... simple as that.

digitalv

7:02 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I leave my car running does that give you a right to get in and test drive it

I swear, webmasters come up with the most lame and irrelevant analogies I've ever seen. You are really reaching, dude. But to answer your question, if you leave your car running in a place that was designed for anyone to jump in and test drive it, then yes.

robotsdobetter

7:07 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why should I have to learn about a robots.txt just to keep a robot out? I shouldn't have to post my restrictions; if you don't see any, then leave. If you go into someone's backyard without permission you are trespassing; it's the same with a site. The web was meant for people, not for robots to take my information. What's lame is people coming up with reasons why Google should be able to index my site without permission.

renee

7:12 pm on Aug 8, 2004 (gmt 0)

10+ Year Member



well put, robotsdobetter. thank you.

Scarecrow

7:13 pm on Aug 8, 2004 (gmt 0)

10+ Year Member



If Google makes their search engine available for public use, do zombie PCs have the right to search for email addresses on Google? Not if it amounts to a denial-of-service, certainly. But what if it's done at a lower level, so that it's not a denial-of-service?

Can zombies search for credit-card numbers using those clever Visa number spans, or whatever? Why not?

Psycho111

7:24 pm on Aug 8, 2004 (gmt 0)

10+ Year Member



How hard is it to password protect something or use robots.txt? Even if you had no idea how to do it, there are tons of examples online. It's your responsibility to secure your stuff. Are you telling me you're going to leave your car running in downtown DC with the doors open and expect it to be there when you get back from your errand? If you don't want it stolen, you secure it.

And the search engines aren't stealing it anyway...you still have the pages like someone else said. Most people want the traffic from search engines and the people who don't want it have the responsibility to secure their stuff.

Now let me ask you this... how many webmasters HAVE NO CLUE about search engines? Tons of sites would be out of the search engines if they had to specifically let search engines in.
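For reference, here is roughly what "securing your stuff" takes. This is just a sketch; the paths and realm name are made up for illustration, and the .htaccess block assumes Apache with Basic authentication enabled:

```
# robots.txt at the site root - asks all compliant robots to stay out
User-agent: *
Disallow: /

# .htaccess in the directory to protect (Apache, illustrative paths)
AuthType Basic
AuthName "Private Area"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

The robots.txt half only works against robots that choose to honor it; the password half actually enforces the restriction.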

[edited by: Psycho111 at 7:27 pm (utc) on Aug. 8, 2004]

GaryK

7:27 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you go into someone's backyard without permission you are trespassing; it's the same with a site.

Another bad analogy. Your backyard is private property whereas the public documents on your website are not.

Going back to the story Brett posted about, people will need to learn sooner or later that the web is an open place. If you put a document out in the open some user agent is going to find it and maybe make use of it.

As an example, ignorance of the law is no excuse for breaking said law. The same thing holds true for the Internet. If you're going to use the web, make sure you understand the implications of posting private stuff on an open or improperly secured website.

In my opinion it's good advice in general to learn all you can about something before you make use of it.

encyclo

7:29 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If I leave my car running does that give you a right to get in and test drive it? No, it's called stealing! It's the same with a site.

Your analogy is fundamentally flawed - because the web simply is not like what you describe. It's more like taking your car to a car show, opening all the doors and windows, placing a big sign in front saying "Come take a look!", then complaining when someone takes a photograph.

It is absolutely not stealing, because stealing implies the taking away of something you had. Here we're talking about viewing documents - the fact that one person looks doesn't stop another from doing so.

On a more general point, you can't make a meaningful distinction between "human" and "robot" traffic - one implies the other to at least some extent.

py9jmas

7:33 pm on Aug 8, 2004 (gmt 0)

10+ Year Member



I shouldn't have to post my restrictions; if you don't see any, then leave. If you go into someone's backyard without permission you are trespassing; it's the same with a site.

How do you propose I get permission to browse your site? Instead of a robots.txt defining who you don't want on your site, how about a file listing every single entity on the Internet you do give permission to access your site?

By putting your site on the Internet, you are the one making it accessible to every entity on the net. If you don't like that, find somewhere else to play.

GaryK

7:53 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



how about a file listing every single entity on the Internet you do give permission to access your site?

I know you were just kidding but what a great way to bring the flow of new visitors and potential business to a screeching halt. Not to mention what parsing that file would do to server response times for each requested page. ;)

bcc1234

7:54 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A spider doesn't have the right to open my post or even read it even if I left it lying open on the table.

I guess it's called "publishing on the web" and not "being left lying in the open on the web" for a reason.

Scarecrow

7:59 pm on Aug 8, 2004 (gmt 0)

10+ Year Member



"Your Honor, every one of our PC zombies had the permission of the PC owner to scrape Google. All the owner had to do was read the fine print, and they could have found out how to disable the download that installed our scraping software.

"Furthermore, your Honor, every zombie under our control checked for a zombies.txt file at Google before commencing scraping operations. None of them found one. Our protocol for zombies.txt has been posted at one of our websites, totally.obscure-dot-com, for weeks now.

"Therefore, your honor, we request a dismissal of Google's lawsuit."

victor

8:08 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The web is a public place -- like a highway or park.

It is not a private place like a house or a safe or your table.

If you put something in a public place the public have access to it. That was your choice.

But they don't have rights to steal it or damage it.

If I'm painting in my local park, I cannot stop other people looking at what I am doing. I chose a public place to do it. That was my free choice.

But just being in a public place does not mean my property has become public. The law will (or should) stop passers-by stealing or vandalising my easel.

Similarly, if I put something on a webserver, I cannot stop people or their agents looking at it. But I do retain all my rights over it -- so the law will (one day!) prevent them from vandalising my computer with viruses or republishing my work without my permission.

What I've lost is a degree of privacy. I'm happy with that. If you are not, several previous posts have suggested ways to increase the degree of privacy. But remember: the web was never designed as a private place.

gpmgroup

8:33 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There are three cases

1) A robots.txt which explicitly denies access
2) A robots.txt which explicitly permits access
3) No robots.txt

The 3rd case is the most interesting... Does a robot have the right to look at, index, and sometimes reproduce (i.e. cache) anything it finds?
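For concreteness, the first two cases look like this in a robots.txt file (a sketch, not any real site's rules):

```
# Case 1: explicitly deny all robots access to everything
User-agent: *
Disallow: /

# Case 2: explicitly permit access
# (an empty Disallow line means "nothing is off-limits")
User-agent: *
Disallow:
```

The 3rd case is simply the absence of the file; in practice, crawlers have treated a missing robots.txt (a 404) the same as case 2.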

encyclo

8:44 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Does a robot have the right to look at, index, and sometimes reproduce (i.e. cache) anything it finds?

Of course they can - on your web server, you are running a program (IIS, Apache, whatever) designed specifically to do this.

The web is Allow All by default - by placing a document on a public server running a service on port 80 serving such documents you have given permission.

The only other possibility would be Deny All by default - you would need to ask permission every time you click a link. Do you think that would work?

There is no real distinction between robots and human visitors - a robot is just an automated tool sent by a human, and a human visitor can use an automated tool unless you specifically exclude them from doing so.
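The robot's half of the bargain is easy to sketch. This is not what any particular crawler runs, just an illustration of the convention using Python's standard urllib.robotparser module against a made-up rule set ("ExampleBot" and the URLs are hypothetical):

```python
from urllib import robotparser

# Parse a hypothetical site's robots.txt, supplied inline here
# instead of being fetched over HTTP, to keep the sketch self-contained.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A compliant robot checks the rules before every fetch.
print(rp.can_fetch("ExampleBot", "http://example.com/private/notes.doc"))  # False
print(rp.can_fetch("ExampleBot", "http://example.com/index.html"))         # True
```

The point of the sketch: the check is entirely voluntary. Nothing stops a robot from skipping it, which is why password protection, not robots.txt, is the actual access control.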

Ruben

8:54 pm on Aug 8, 2004 (gmt 0)

10+ Year Member



Well, with Google you can search for any filetype you want, not only .PS or .PDF but any file. For example:

x filetype:htaccess
domain filetype:sql

or whatever filetype is considered important.
So Google indexes not only the filetypes listed in the advanced search, but any file type which is on the server and not protected.
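One countermeasure sketch: the web server can simply refuse to serve such files, which beats hoping nobody searches for them. This assumes Apache (1.3/2.0-era syntax), and the list of extensions is only illustrative:

```
# .htaccess - refuse to serve database dumps and backup files
<FilesMatch "\.(sql|bak|old|inc)$">
    Order allow,deny
    Deny from all
</FilesMatch>
```

Files Apache never serves can't end up in anyone's index, regardless of what the crawler does or doesn't honor.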

abates

9:56 pm on Aug 8, 2004 (gmt 0)

10+ Year Member



It's even worse now, with "Supplemental Results". Even if you take a page down, it can come back months later complete with cache...

I guess it could be removed by putting a blank-ish page up at the same URL and getting Google to index it, but that assumes you have control over the site in question.
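Worth noting as a sketch: if you do control the site, there is a per-page robots meta tag for this, which asks engines not to keep a cached copy (or not to index the page at all). Whether every engine honors it is another matter:

```html
<!-- Allow indexing, but ask engines not to keep a cached copy -->
<meta name="robots" content="noarchive">

<!-- Or keep the page out of the index entirely -->
<meta name="robots" content="noindex">
```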

HarryM

10:14 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I find this thread amazing! Are we seriously discussing the pros and cons of changing the system?

Perhaps someone who advocates a change to opt-in from opt-out for certain page formats would like to explain how this could be achieved?

GaryK

10:31 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I am not an advocate of changing the current system. I'm just the Devil's advocate for a moment.

If Google can come up with a special tag in robots.txt that ALLOWs them to search areas others might be DISALLOWed from, it's not too much of a stretch to think there's a way to expand that to specify which file extensions to leave alone.

It's an impractical solution if for no other reason than it would take years for all the bots to recognize and respect the new standard. And there would still be the bots that ignore robots.txt and take whatever they want.

If you want documents to stay out of Google or other search engines, then the only guaranteed solutions are a properly password-protected folder or, and this makes much more sense to me, keeping them off the web entirely.
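The extension-based idea could be sketched like this. Note the * and $ patterns are a Google extension to robots.txt, not part of the original 1994 convention, so exactly as said above, many bots would ignore them:

```
# Hypothetical: keep "non-web" document formats out of the index
User-agent: *
Disallow: /*.doc$
Disallow: /*.xls$
Disallow: /*.pdf$
```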

creepychris

11:17 pm on Aug 8, 2004 (gmt 0)

10+ Year Member



"There are non-web file types"

First, I don't see how any file is intrinsically non-web. Doc files have been listed above as non-web, but I have made very successful education sites based on doc files that teachers could download. I am happy that Google indexes these file types.

Second, this issue is basically the same as the right to deep link, because that is what search engines basically do: they create pages with deep links to your server. That issue has been hashed out, and I thought most people agreed that the web was all about deep linking. And in the cases that came up, didn't the various courts say that since the technical means to prevent deep linking (or in our case, to have secure documents) exist, it is up to the site owners to use those methods? Website owners do not have a fundamental right to prevent other parties from deep linking. Again, if you don't want it seen in public, don't put it on the internet OR use technical means to prevent access by any Tom, Dick or Harry.

gpmgroup

11:59 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The web is Allow All by default - by placing a document on a public server running a service on port 80 serving such documents you have given permission.

If someone places a song on a server does anyone have the right to copy it, and post it on another server by default?
i.e. without explicit permission?

creepychris

1:04 am on Aug 9, 2004 (gmt 0)

10+ Year Member



If someone places a song on a server does anyone have the right to copy it, and post it on another server by default?

I think no. I guess that is the real issue. It's not whether search engines have the right to index your webpage but whether they have the right to cache it and serve it up to other people.

yowza

1:27 am on Aug 9, 2004 (gmt 0)

10+ Year Member



It should be opt-in. Why are they allowed to cache your page without your permission?

If you consider each page as a work of art, and I believe web pages could be considered art, then search engines have no right to use your art to make money without your permission.

Think of it this way. If I took a public work like a book, copied the cover, gave a short summary, showed all complete pages, put them on a website, and made advertising money from displaying these books, I would be in trouble.

Google is not doing a public service; they are making money off of displaying other people's work.

To take it a step further, would you argue that you could display the books because the publisher didn't tell you that you couldn't? No. You have to get permission from the publisher, just as you should have to get permission from the website owner.

digitalv

3:55 am on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You know you can remove your entire website from Google, right?

[google.com...]

Stop whining.

BigDave

4:56 am on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



yowza,

You really need to do some reading about how copyright and Fair Use works.

Commercial use CAN be considered Fair Use. Whether or not the use is commercial is only one of the factors involved in determining Fair Use.

Does Google go beyond Fair Use? Feel free to be the test case and file suit. But before you do, I suggest you learn more about copyright than you currently do.

amznVibe

5:10 am on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Being on the internet is a public thing, like driving your car down a public road.

People can look into your car. If you don't like that, don't leave your house (i.e. keep it on an intranet). You can put tint on the windows to slow people down (i.e. passwords, encryption, etc.) but in the end, if someone wants to walk up to your car in a parking lot and put their nose up against the glass, they are gonna see into your car. (Now if they put a brick through the window to take something out of the car belonging to you, that's entirely different - like webpage content theft, etc.)

Having a website on the internet requires either certain technical knowledge or using the knowledge of techs. You can't just magically be on the internet. Claiming that you should not have to know about robots.txt is like claiming you should be able to have a web page without knowing how to create or post a page. (You can drive a car without having a license, but sooner or later you are gonna get into major trouble. Should you have known better? Should someone have told you that you need to know the rules of driving, or did you just assume you could get into a car and start driving around?)

If you do end-runs around the knowledge required to post and operate a website (i.e. by using FrontPage without learning what is really happening) you are still subject to "the rules" of how things work. Ignorance is no excuse; you are the one that wanted a website and did it without learning or purchasing the right knowledge.

Now if search engines did not offer a way to remove something that was indexed that you didn't want to be, *that* would be an outrage. But it's not the case.

Keeping a cached copy of a page and making it public is like someone taking a photograph of your car in public and showing someone else what was inside your car when you took it out that day. Again, if you don't like it, don't take your car out on the road. They could not have legally taken that picture in your garage.

yowza

7:45 am on Aug 9, 2004 (gmt 0)

10+ Year Member



Stop whining.

hmmm... I don't recall whining...

I thought we were having a discussion among adults and not a 5 year old "crybaby" conversation.

I realize I'm not a copyright expert, but I don't think anybody could get away with doing as I mention above with the books whether it is for commercial use or not.

Having a website on the internet requires either certain technical knowledge or using the knowledge of techs. You can't just magically be on the internet. Claiming that you should not have to know about robots.txt is like claiming you should be able to have a web page without knowing how to create or post a page.

You can write a book without knowing anything except how to work a pencil and paper (maybe an eraser). Just because you don't know how to print it, type it, or know the copyright laws doesn't mean that other people can use it. Could somebody steal my book and say "well, you didn't know how to set the press, choose the ink, or apply the copyright laws, therefore I can use an exact copy of it to make money"? Can any person now take a copy of my book and copy it? NO! I don't need to know copyright laws to know that.

Keeping a cached copy of a page and making it public is like someone taking a photograph of your car in public and showing someone else what was inside your car when you took it out that day. Again, if you don't like it, don't take your car out on the road. They could not have legally taken that picture in your garage.

So are you saying that I could go check a book out of the library, take photos of every page, run copies, staple them together, and legally sell the book because the library is a public place? I don't think so. I could be sued by the publisher. If I went to court and said, "well, the publisher never contacted me to tell me that I couldn't do this", or "the author didn't know the copyright laws, thus I can do this," I would be laughed out of court.

I'm not saying I'm against search engines indexing my site or caching pages; in fact, I allow them to do both. All I'm arguing is the legality of it.

Patrick Taylor

8:57 am on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



the legality of it

The web is the most public medium on earth - that's the beauty of it. Webservers serve the web and Google indexes it, and as part of their service they cache the documents that people deliberately put there in a public place. That's how it works and I doubt if there's any illegality involved, especially as these are the generally understood and accepted working principles of the medium. And the cached document is still presented as your document and not Google's. The rest of it is just an index of documents that were deliberately published. I don't think copyright comes into this at all.

amznVibe

9:12 am on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



yowza, you keep equating cached content with selling your content. Where is a cached copy making someone money? Google does not run ads next to your cached content, and neither does any other archiver that I know of.

By the way, taking a photograph (xerox) of pages in a book for PERSONAL use is actually legal I believe (IANAL), but selling it is the illegal part. If not, then every public library and college library in the USA is aiding and abetting criminals.

[edited by: amznVibe at 9:12 am (utc) on Aug. 9, 2004]

This 118 message thread spans 4 pages.