
Google News Archive Forum

This 118 message thread spans 4 pages.
Google Me Not
Staying off the Radar or Trashing the Cache
Brett_Tabke

WebmasterWorld Administrator brett_tabke us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 25213 posted 12:48 pm on Aug 8, 2004 (gmt 0)

[forbes.com...]

Staying off the radar.
Search engines can store results in their "cache" for between a month and forever. As archiving improves, it will get harder to clean up what's been revealed. Rarely are leaks intentional: Somebody at work might post a file on a server to download at home, a wrongly configured server might make too much of a hard drive searchable or a Web site's password-protection might be flimsy enough to be accessible to search engines.

 

HitProf

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 25213 posted 1:38 pm on Aug 8, 2004 (gmt 0)

I think Google and other search engines are stepping over a line when indexing non web file types. Indexing of those documents should be opt in. A spider doesn't have the right to open my post or even read it even if I left it lying open on the table.

File types never meant to be made public (doc, xls, ppt) should only be indexed when indicated. Only files meant for the web (html, php, asp, xhtml, etc) should be indexed freely. Some are ambiguous, but most are not.

christopher w

10+ Year Member



 
Msg#: 25213 posted 1:51 pm on Aug 8, 2004 (gmt 0)

I saw this article last week in the magazine and the thing that amazed me the most was that there are now articles about how to stay out of Google.

Hitprof - great analogy about the mail. I guess that PDFs would be one of the more ambiguous file types.

victor

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 25213 posted 1:57 pm on Aug 8, 2004 (gmt 0)

Unfortunately, hitprof, there is no such thing as non web file types

There are only files that are on the web -- i.e. have a link to them -- And those that aren't.

If it is on the web, then it is fair game for a search engine -- unless you tell the engine differently by:

  • putting the file in a folder banned by robots.txt
  • giving the linking page a robots meta tag for noindex and/or nofollow and/or noarchive.

    Google will honor those instructions. Other search engines and most humans won't. For real protection, put anything you don't want the world to see behind a password-required barrier. Or not on the web at all.
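As a rough illustration of the robots.txt route mentioned above, here is a minimal sketch of how a well-behaved crawler might decide whether it may fetch a URL. It uses Python's standard `urllib.robotparser`; the `/private/` folder, the example.com URLs and the `is_crawlable` helper are assumptions made up for the example, not anything from this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /private/ folder is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/report.doc"):
        print(url, is_crawlable(ROBOTS_TXT, "Googlebot", url))
```

Note that this only models a crawler that chooses to obey the file. Nothing in robots.txt stops a client that ignores it, which is why the password barrier is the real protection.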

    longen

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 25213 posted 1:59 pm on Aug 8, 2004 (gmt 0)

    Any personal files transferred anywhere should be encrypted, but it might only be a matter of time before SE's start using super-computers to break into even those.

    encyclo

    WebmasterWorld Senior Member encyclo us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 25213 posted 2:22 pm on Aug 8, 2004 (gmt 0)

    It seems more a case of blaming Google and other search engines for one's own lax security procedures. We are talking about publicly accessible web servers here. The idea about "non-web" documents won't work either - if you place a document in a public place, you should expect it to be read - by search engine bots, curious visitors, crackers and the rest. There is no reason why a Word document, a PDF file or anything else shouldn't be indexed when the person has chosen to make it available to all.

    It is up to the file owner to decide whether a file should be secured against public viewing or not - a search engine bot cannot make such a distinction.

    skippy

    10+ Year Member



     
    Msg#: 25213 posted 2:59 pm on Aug 8, 2004 (gmt 0)

    I don't think this is anything particularly new to regular readers of this forum. Surely most here have typed robots.txt into the URL of a competitor, and stats or weblog etc.

    I agree that the average user has no idea how powerful the search engines are and what type of information can be obtained. If it is in a public server you have to assume it will be accessed by someone or something.

    renee

    10+ Year Member



     
    Msg#: 25213 posted 3:02 pm on Aug 8, 2004 (gmt 0)

    i think the protocol should be changed to opt-in, not opt-out - just like email lists.

    if i left my house door open, it does not mean i am giving permission for everybody, anybody to come in and do what they please with my property.

    volatilegx

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 25213 posted 3:09 pm on Aug 8, 2004 (gmt 0)

    i think the protocol should be changed to opt-in, not opt-out - just like email lists.

    It is opt-in. If you opt-in, you make your document accessible. If you don't want it indexed, don't make it accessible. Sounds simple enough to me.

    renee

    10+ Year Member



     
    Msg#: 25213 posted 3:18 pm on Aug 8, 2004 (gmt 0)

    no. opt-in means you give explicit permission to be included in the index. opt-out is the current protocol whereby you give explicit instructions to be excluded through the robots.txt and the meta-tags.

    this will also reduce significantly the number of useless pages in the google index. look at all the server stats that are indexed by google that has no use except to the server's owner.

    Rosalind

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 25213 posted 3:35 pm on Aug 8, 2004 (gmt 0)


    (opt-in) will also reduce significantly the number of useless pages in the google index. look at all the server stats that are indexed by google that has no use except to the server's owner.

    Although some of these webstats are explicitly linked by site owners, I think Google would be doing us all a favour by making webstat-type pages opt-in only. Not only would it squelch a lot of spam, it would make searching for information on robots and user-agents a whole lot easier.

    encyclo

    WebmasterWorld Senior Member encyclo us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 25213 posted 3:37 pm on Aug 8, 2004 (gmt 0)

    Opt-in is when you explicitly place documents on a public web server - in doing so, you are indicating that such documents are for the public consumption, and therefore the search bots are completely correct in indexing them. If they are not for public viewing, what are they doing there? To reuse your analogy, you are opening your house to the public for viewing (not for stealing) without any restriction on who can enter. If there are rooms you don't want people to view, then you need to lock the doors. If there are certain visitors you don't want at all, then you need someone controlling access at the front door (.htaccess, robots.txt).

    If documents are private, then don't publish them. If your network is insecure, then it is your responsibility - a search bot cannot be expected to distinguish between an intentionally-public server and an inadvertently-public server.

    encyclo

    WebmasterWorld Senior Member encyclo us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 25213 posted 3:43 pm on Aug 8, 2004 (gmt 0)

    making webstat-type pages opt-in only

    1. How can the search bot tell that the page is a web stats page (which you think shouldn't be indexed) and not an example by the sales team of a stats program (which they would want indexed), or any other table of text and figures? The bots can't and don't read, analyse or understand - a web stats page is just the same as any other HTML page.

    2. I regularly visit a site which has an explicit link to their stats page in their main menu. They want their stats to be public (for whatever reason). Why should the search engines refuse?

    The makers of the stats programs could add a robots meta tag to their generated pages by default, but how do they know that they are doing what their customers want (see 2.)? Furthermore, why should they be responsible any more than the search engines?
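For readers unfamiliar with the robots meta tag discussed here, the following is a minimal sketch of how a crawler could read it from a generated page. It relies only on Python's standard `html.parser`; the class name, the helper function and the sample page are hypothetical, and real search engines may interpret the tag differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when written without a value.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical stats page that opts out of indexing and caching.
SAMPLE_PAGE = (
    '<html><head><meta name="robots" content="noindex, noarchive">'
    "</head><body>stats</body></html>"
)

if __name__ == "__main__":
    print(robots_directives(SAMPLE_PAGE))
```

A stats program that emitted such a tag by default would make its pages opt-out, and a site owner who wants the stats public (case 2 above) would have to remove it.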

    Patrick Taylor

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 25213 posted 3:43 pm on Aug 8, 2004 (gmt 0)

    if i left my house door open

    Renee, I don't agree. It's not your house door. Your "house" is your personal computer. A webserver is on the World Wide Web. If you deliberately leave your personal documents on a table in the public library then you are by default saying they're public, which would be different to leaving them in the private care of the librarian.

    renee

    10+ Year Member



     
    Msg#: 25213 posted 3:55 pm on Aug 8, 2004 (gmt 0)

    by your definition, an email address published in a web page is fair game to any and all spammers. this is the reason that the concept of opt-in and opt-out came about. this is simply giving explicit permission for a third party to utilize/incorporate what belongs to me. no explicit permission does not mean permission. this is a matter of protocol. i'm not disagreeing that this is the current protocol. i'm just proposing that it should be changed.

    if my house door is open, anybody can look but anybody who takes anything without my permission is stealing. And the thief cannot use the excuse that my door was open.

    Chndru

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 25213 posted 3:58 pm on Aug 8, 2004 (gmt 0)

    >>if my house door is open, anybody can look

    Here you go. They are just taking a photo of your house (cache) not taking the actual object (i.e. your pages still exist in your server - your house)

    kaled

    WebmasterWorld Senior Member kaled us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 25213 posted 4:04 pm on Aug 8, 2004 (gmt 0)

    If search engines are going to display cached pages, they should support an enhanced robots.txt standard. The following would be ok

    user-agent: *
    disallow:
    disallow-cache: /images/
    disallow-cache: /presentations/

    Note that you cannot use meta tags to prevent caching of images etc.

    disallow-archive: would be a sensible synonym.

    Kaled.
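To make the proposal concrete, here is a minimal sketch of how a crawler might read the suggested `disallow-cache:` lines. The directive is only the proposal from this post, not part of any robots.txt standard, and the function names are hypothetical. For brevity the sketch ignores `user-agent:` grouping and treats every rule as applying to all bots.

```python
# The robots.txt example from the post above.
EXAMPLE = """\
user-agent: *
disallow:
disallow-cache: /images/
disallow-cache: /presentations/
"""


def parse_cache_rules(robots_txt: str) -> list:
    """Return the path prefixes listed under the proposed no-cache directives."""
    prefixes = []
    for line in robots_txt.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        # disallow-archive is accepted as the suggested synonym.
        if field.strip().lower() in ("disallow-cache", "disallow-archive"):
            value = value.strip()
            if value:
                prefixes.append(value)
    return prefixes


def may_cache(robots_txt: str, path: str) -> bool:
    """Return True if no proposed no-cache rule covers the given path."""
    return not any(path.startswith(p) for p in parse_cache_rules(robots_txt))


if __name__ == "__main__":
    print(parse_cache_rules(EXAMPLE))
    print(may_cache(EXAMPLE, "/images/logo.gif"))
```

Because the rule is matched by path prefix rather than by page markup, it would cover images and other non-HTML files that cannot carry a meta tag.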

    renee

    10+ Year Member



     
    Msg#: 25213 posted 4:04 pm on Aug 8, 2004 (gmt 0)

    chndru,

    are you giving permission for anybody to take pictures of you doing private things in your house and publish them in the web and all media? and they'll say you're fair game - you did not draw down your window drapes!

    HitProf

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 25213 posted 4:37 pm on Aug 8, 2004 (gmt 0)

    I fully agree with renee. Many people place documents on the web for themselves and for friends. That's not the same as for the rest of the world. Why would anyone bother to find it? Why should anyone find it? In a normal search I mean, I'm not talking about "burglars" (hackers e.g.).

    rfgdxm1

    WebmasterWorld Senior Member rfgdxm1 us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 25213 posted 5:04 pm on Aug 8, 2004 (gmt 0)

    >I fully agree with renee. Many people place documents on the web for themselves and for friends. That's not the same as for the rest of the world. Why would anyone bother to find it? Why should anyone find it? In a normal search I mean, I'm not talking about "burglars" (hackers e.g.).

    And if these documents are so sensitive that they wouldn't want the world to know what is in them, they are *stupid*. Googlebot isn't into hacking. It just follows links. If someone puts documents on a web server with links SE bots can find, this isn't just security by obscurity. It's security based on stupidity. DON'T DO THIS!

    robotsdobetter

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 25213 posted 5:19 pm on Aug 8, 2004 (gmt 0)

    If it is on the web, then it is fair game for a search engine -- unless you tell the engine differently by:
    That's very false, I shouldn't have to tell them not to come to my site! Why should I have to create a robots.txt page and waste more space on my server? I don't care if they come to my site, but when someone says it's fair game then you better think again. I created my site, not Google, Yahoo! or the new MSN!
    Scarecrow

    10+ Year Member



     
    Msg#: 25213 posted 5:22 pm on Aug 8, 2004 (gmt 0)

    If someone invented an x-ray camera that was able to take pictures of people and see through their clothes, unless these people had on two sets of underwear, who is responsible for making sure you aren't photographed?

    1. You, because these days no one should be stupid enough to wear just one set of underwear.

    2. The inventor, because for many decades people didn't have to wear two sets of underwear.

    HitProf

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 25213 posted 5:29 pm on Aug 8, 2004 (gmt 0)

    Well said, Scarecrow.

    rfgdxm1, I'm not talking about security but about common sense. Of course one shouldn't place the most sensitive data on the web - that's your own responsibility.

    Much to my surprise I just found a reference to GoogleGuy's very first post on WebmasterWorld and look what it's about (and read Brett's answer in msg # 30 as well):
    [webmasterworld.com...]

    renee

    10+ Year Member



     
    Msg#: 25213 posted 5:30 pm on Aug 8, 2004 (gmt 0)

    Scarecrow,

    somebody invented email. anybody can send you an email. you are now getting a lot of spam. whose fault is it?

    1. you because you are stupid enough to have an email address.
    2. the inventor of email.

    ridiculous reasoning, isn't it?

    GaryK

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 25213 posted 5:34 pm on Aug 8, 2004 (gmt 0)

    If you want to use your website to access personal, private documents via the web without risking them being indexed by search engines then put them in a password-protected folder.
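As a rough picture of what a password-protected folder does at the HTTP level, here is a minimal sketch of checking an HTTP Basic `Authorization` header. In practice the web server's own access controls would handle this; the function name and the credentials below are assumptions for illustration only.

```python
import base64
import hmac


def check_basic_auth(header_value, expected_user, expected_password):
    """Return True if an HTTP Basic Authorization header matches the credentials."""
    scheme, _, encoded = (header_value or "").partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:
        # Covers malformed base64 as well as undecodable bytes.
        return False
    user, sep, password = decoded.partition(":")
    if not sep:
        return False
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode(), expected_user.encode())
    pass_ok = hmac.compare_digest(password.encode(), expected_password.encode())
    return user_ok and pass_ok


if __name__ == "__main__":
    # Hypothetical credentials, as a browser would send them.
    token = base64.b64encode(b"alice:s3cret").decode()
    print(check_basic_auth("Basic " + token, "alice", "s3cret"))
```

A spider that follows a link into such a folder sends no credentials, so it is refused before it ever sees the documents, regardless of whether it honors robots.txt.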

    renee

    10+ Year Member



     
    Msg#: 25213 posted 5:51 pm on Aug 8, 2004 (gmt 0)

    the issue is whether the protocol should be opt-in or opt-out for search engines.

    the protocol today is opt-out - meaning if you do not explicitly specify that your site or pages are not to be indexed, then you are implicitly giving your permission.

    the issue is whether the protocol should be changed. will it be better for the internet?

    this protocol is not a substitute for security or privacy. whichever protocol you use there will always be spammers and hackers. we should continue and always be concerned with security. just like email spam - just because the accepted protocol now is opt-out, it does not mean that email spam will go away. but now we have reason to tell the perpetrators that you did not give your permission to receive spam.

    encyclo

    WebmasterWorld Senior Member encyclo us a WebmasterWorld Top Contributor of All Time 10+ Year Member



     
    Msg#: 25213 posted 6:16 pm on Aug 8, 2004 (gmt 0)

    If you put a document online, do I have to contact you to get permission to look at it? Am I forced to use one particular way of accessing that document (eg. via a browser) or can I use other methods (eg. wget or another spider-like download tool)? How am I supposed to know what your restrictions are if you don't specify them? I can't contact every site in advance before clicking on a link - the web just doesn't work like that.

    If you want to put up information which is "restricted access" for friends and family, why are you putting it on a completely open system without defining any boundaries? If you stuck a notice on a tree in your street for your friends, how can you expect to keep it private from anyone else passing?

    Another analogy: you buy a field in the middle of a public park, and you don't put a fence, signpost or anything which allows someone to distinguish it from the public space that surrounds it. Can you blame people who inadvertently trespass?

    I'll say it again: if you put it on the web, you are already giving explicit permission to access that document by any means. If you don't want unfettered access then you must use available tools like .htaccess and robots.txt to define the rules of access which correspond to your needs.

    I think there is some fundamental misunderstanding of the consequences of the technology here.

    digitalv

    WebmasterWorld Senior Member 10+ Year Member



     
    Msg#: 25213 posted 6:20 pm on Aug 8, 2004 (gmt 0)

    If you want to use your website to access personal, private documents via the web without risking them being indexed by search engines then put them in a password-protected folder.

    That pretty much says it all. It's the freaking Internet, people - by its very nature everything is "public" unless defined otherwise by the server administrator. If you don't want anyone to see it, don't put it in a public place.

    europeforvisitors



     
    Msg#: 25213 posted 6:26 pm on Aug 8, 2004 (gmt 0)

    this will also reduce significantly the number of useless pages in the google index.

    I think it would have the opposite effect, because a great deal of useful information is created by academics and other subject experts who aren't into SEO, have never heard of robots.txt, and wouldn't know how to (or even that they should) opt in by taking any step beyond the obvious one of transferring their files to a public server.

    technoatheist

    10+ Year Member



     
    Msg#: 25213 posted 6:40 pm on Aug 8, 2004 (gmt 0)

    I'm confused by why this is an issue.

    Public websites are just that, public. The various (legitimate) spiders out there don't actively try to find information you haven't disclosed to anyone else. The whole argument that the public shouldn't be allowed into an area that is not cordoned off from the public seems silly. The whole "X-ray" camera thing isn't even an argument, it's moot because legitimate engines don't do anything that sophisticated.

    If folks insist on the house metaphor, leaving unsecured open documents on an equally unsecured server is like dancing naked in your living room with the curtains open and a freeway running by outside, and then being "shocked" that someone talked about it.

    Look, the web is public, so is anonymous FTP and (frankly) unsecured email. Your privacy is ultimately your responsibility, not someone else's. If you have materials you wish to keep secure, then by all means secure them.

    Cripes, do you people wear your credit-card number on your t-shirts too?


    All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
    WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
    © Webmaster World 1996-2014 all rights reserved