
Forum Moderators: open


Google Me Not

Staying off the Radar or Trashing the Cache

     
12:48 pm on Aug 8, 2004 (gmt 0)

Administrator from US 

WebmasterWorld Administrator brett_tabke is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 21, 1999
posts:38048
votes: 12


[forbes.com...]

Staying off the radar.

Search engines can store results in their "cache" for between a month and forever. As archiving improves, it will get harder to clean up what's been revealed. Rarely are leaks intentional: Somebody at work might post a file on a server to download at home, a wrongly configured server might make too much of a hard drive searchable or a Web site's password-protection might be flimsy enough to be accessible to search engines.
1:38 pm on Aug 8, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 30, 2002
posts:1377
votes: 0


I think Google and other search engines are stepping over a line when indexing non-web file types. Indexing of those documents should be opt-in. A spider doesn't have the right to open my post, or even to read it, even if I left it lying open on the table.

File types never meant to be made public (doc, xls, ppt) should only be indexed when explicitly indicated. Only files meant for the web (html, php, asp, xhtml, etc.) should be indexed freely. Some are ambiguous, but most are not.

1:51 pm on Aug 8, 2004 (gmt 0)

Full Member

10+ Year Member

joined:Mar 7, 2004
posts:285
votes: 0


I saw this article last week in the magazine and the thing that amazed me the most was that there are now articles about how to stay out of Google.

Hitprof - great analogy about the mail. I guess that PDFs would be one of the more ambiguous file types.

1:57 pm on Aug 8, 2004 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 4, 2002
posts:1314
votes: 0


Unfortunately, hitprof, there is no such thing as a non-web file type.

There are only files that are on the web -- i.e. files that have a link pointing to them -- and files that aren't.

If it is on the web, then it is fair game for a search engine -- unless you tell the engine differently by:

  • putting the file in a folder banned by robots.txt
  • giving the linking page a robots meta tag with noindex and/or nofollow and/or noarchive.

    Google will honor those instructions. Other search engines and most humans won't. For real protection, put anything you don't want the world to see behind a password-required barrier. Or don't put it on the web at all.
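For reference, the two opt-out mechanisms just described look like this in their standard form (the /private/ path is purely illustrative):

```
# robots.txt at the site root -- compliant crawlers will skip /private/
User-agent: *
Disallow: /private/
```

and, on an individual page:

```html
<!-- in the <head> of the page -->
<meta name="robots" content="noindex, nofollow, noarchive">
```

As noted above, these are voluntary conventions: they keep well-behaved bots out, but they are no substitute for real access control.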

    1:59 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:Dec 7, 2003
    posts:788
    votes: 0


    Any personal files transferred anywhere should be encrypted, but it might only be a matter of time before SE's start using super-computers to break into even those.
    2:22 pm on Aug 8, 2004 (gmt 0)

    Senior Member from CA 

    WebmasterWorld Senior Member encyclo is a WebmasterWorld Top Contributor of All Time 10+ Year Member

    joined:Aug 31, 2003
    posts:9063
    votes: 2


    It seems more a case of blaming Google and other search engines for one's own lax security procedures. We are talking about publicly accessible web servers here. The idea of "non-web" documents won't work either - if you place a document in a public place, you should expect it to be read - by search engine bots, curious visitors, crackers and the rest. There is no reason why a Word document, a PDF file or anything else shouldn't be indexed when the person has chosen to make it available to all.

    It is up to the file owner to decide whether a file should be secured against public viewing or not - a search engine bot cannot make such a distinction.

    2:59 pm on Aug 8, 2004 (gmt 0)

    Full Member

    10+ Year Member

    joined:Jan 30, 2004
    posts:260
    votes: 0


    I don't think this is anything particularly new to regular readers of this forum. Surely most here have typed robots.txt onto the end of a competitor's URL, or /stats/ or /weblog/, etc.

    I agree that the average user has no idea how powerful the search engines are and what type of information can be obtained. If it is in a public server you have to assume it will be accessed by someone or something.

    3:02 pm on Aug 8, 2004 (gmt 0)

    Full Member

    10+ Year Member

    joined:Nov 25, 2002
    posts:207
    votes: 0


    i think the protocol should be changed to opt-in, not opt-out - just like email lists.

    if i left my house door open, it does not mean i am giving permission for everybody, anybody to come in and do what they please with my property.

    3:09 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:Mar 22, 2001
    posts:2450
    votes: 0


    i think the protocol should be changed to opt-in, not opt-out - just like email lists.

    It is opt-in. If you opt-in, you make your document accessible. If you don't want it indexed, don't make it accessible. Sounds simple enough to me.

    3:18 pm on Aug 8, 2004 (gmt 0)

    Full Member

    10+ Year Member

    joined:Nov 25, 2002
    posts:207
    votes: 0


    no. opt-in means you give explicit permission to be included in the index. opt-out is the current protocol whereby you give explicit instructions to be excluded through the robots.txt and the meta-tags.

    this will also significantly reduce the number of useless pages in the google index. look at all the server stats pages indexed by google that have no use except to the server's owner.

    3:35 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:May 16, 2003
    posts:992
    votes: 0



    (opt-in) will also reduce significantly the number of useless pages in the google index. look at all the server stats that are indexed by google that has no use except to the server's owner.

    Although some of these webstats are explicitly linked by site owners, I think Google would be doing us all a favour by making webstat-type pages opt-in only. Not only would it squelch a lot of spam, it would make searching for information on robots and user-agents a whole lot easier.

    3:37 pm on Aug 8, 2004 (gmt 0)

    Senior Member from CA 

    WebmasterWorld Senior Member encyclo is a WebmasterWorld Top Contributor of All Time 10+ Year Member

    joined:Aug 31, 2003
    posts:9063
    votes: 2


    Opt-in is when you explicitly place documents on a public web server - in doing so, you are indicating that such documents are for public consumption, and therefore the search bots are completely correct in indexing them. If they are not for public viewing, what are they doing there? To reuse your analogy, you are opening your house to the public for viewing (not for stealing) without any restriction on who can enter. If there are rooms you don't want people to view, then you need to lock the doors. If there are certain visitors you don't want at all, then you need someone controlling access at the front door (.htaccess, robots.txt).

    If documents are private, then don't publish them. If your network is insecure, then it is your responsibility - a search bot cannot be expected to distinguish between an intentionally-public server and an inadvertently-public server.

    3:43 pm on Aug 8, 2004 (gmt 0)

    Senior Member from CA 

    WebmasterWorld Senior Member encyclo is a WebmasterWorld Top Contributor of All Time 10+ Year Member

    joined:Aug 31, 2003
    posts:9063
    votes: 2


    making webstat-type pages opt-in only

    1. How can the search bot tell that the page is a web stats page (which you think shouldn't be indexed) and not an example by the sales team of a stats program (which they would want indexed), or any other table of text and figures? The bots can't and don't read, analyse or understand - a web stats page is just the same as any other HTML page.

    2. I regularly visit a site which has an explicit link to their stats page in their main menu. They want their stats to be public (for whatever reason). Why should the search engines refuse?

    The makers of the stats programs could add a robots meta tag to their generated pages by default, but how do they know that they are doing what their customers want (see 2.)? Furthermore, why should they be responsible any more than the search engines?

    3:43 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:May 30, 2003
    posts:932
    votes: 0


    if i left my house door open

    Renee, I don't agree. It's not your house door. Your "house" is your personal computer. A webserver is on the World Wide Web. If you deliberately leave your personal documents on a table in the public library then you are by default saying they're public, which would be different to leaving them in the private care of the librarian.

    3:55 pm on Aug 8, 2004 (gmt 0)

    Full Member

    10+ Year Member

    joined:Nov 25, 2002
    posts:207
    votes: 0


    by your definition, an email address published on a web page is fair game to any and all spammers. this is the reason that the concept of opt-in and opt-out came about. this is simply about giving explicit permission for a third party to utilize/incorporate what belongs to me. the absence of explicit permission does not mean permission. this is a matter of protocol. i'm not disagreeing that this is the current protocol. i'm just proposing that it should be changed.

    if my house door is open, anybody can look but anybody who takes anything without my permission is stealing. And the thief cannot use the excuse that my door was open.

    3:58 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:June 16, 2003
    posts:1298
    votes: 0


    >>if my house door is open, anybody can look

    Here you go. They are just taking a photo of your house (cache) not taking the actual object (i.e. your pages still exist in your server - your house)

    4:04 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member kaled is a WebmasterWorld Top Contributor of All Time 10+ Year Member

    joined:Mar 2, 2003
    posts:3710
    votes: 0


    If search engines are going to display cached pages, they should support an enhanced robots.txt standard. Something like the following would be OK:

    user-agent: *
    disallow:
    disallow-cache: /images/
    disallow-cache: /presentations/

    Note that you cannot use meta tags to prevent caching of images etc.

    disallow-archive: would be a sensible synonym.

    Kaled.
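A side note on backwards compatibility: existing robots.txt parsers simply skip directives they don't recognise, so an extension like disallow-cache would not break them -- it would just be ignored until implemented. A quick sketch with Python's standard-library parser, fed kaled's example file:

```python
from urllib import robotparser

# Feed kaled's proposed file to the stdlib robots.txt parser. The parser
# only understands the standard directives; the "disallow-cache" lines
# are silently skipped as unknown.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow:",
    "Disallow-cache: /images/",
    "Disallow-cache: /presentations/",
])

# Crawling is still allowed everywhere: the unknown directive changed nothing.
print(rp.can_fetch("AnyBot", "http://example.com/images/photo.jpg"))  # True
```

This is exactly why a new directive would need explicit adoption by each engine before it did anything at all.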

    4:04 pm on Aug 8, 2004 (gmt 0)

    Full Member

    10+ Year Member

    joined:Nov 25, 2002
    posts:207
    votes: 0


    chndru,

    are you giving permission for anybody to take pictures of you doing private things in your house and publish them on the web and in all media? and they'll say you're fair game - you did not draw your window drapes!

    4:37 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:Aug 30, 2002
    posts:1377
    votes: 0


    I fully agree with renee. Many people place documents on the web for themselves and for friends. That's not the same as for the rest of the world. Why would anyone bother to find it? Why should anyone find it? In a normal search, I mean - I'm not talking about "burglars" (hackers, for example).
    5:04 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member rfgdxm1 is a WebmasterWorld Top Contributor of All Time 10+ Year Member

    joined:May 12, 2002
    posts:4479
    votes: 0


    >I fully agree with renee. Many people place documents on the web for themselves and for friends. That's not the same as for the rest of the world. Why would anyone bother to find it? Why should anyone find it? In a normal search I mean, I'm not talking about "burglars" (hackers e.g.).

    And if these documents are sensitive such that they wouldn't want the world to know what is in them, they are *stupid*. Googlebot isn't into hacking. It just follows links. If someone puts documents on a web server with links SE bots can find, this isn't just security by obscurity. It's security based on stupidity. DON'T DO THIS!

    5:19 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:Jan 31, 2004
    posts:710
    votes: 0


    If it is on the web, then it is fair game for a search engine -- unless you tell the engine differently by:
    That's very false. I shouldn't have to tell them not to come to my site! Why should I have to create a robots.txt file and waste more space on my server? I don't care if they come to my site, but when someone says it's fair game, then you'd better think again. I created my site, not Google, Yahoo! or the new MSN!
    5:22 pm on Aug 8, 2004 (gmt 0)

    Full Member

    10+ Year Member

    joined:Jan 13, 2004
    posts:208
    votes: 0


    If someone invented an x-ray camera that was able to take pictures of people and see through their clothes, unless these people had on two sets of underwear, who is responsible for making sure you aren't photographed?

    1. You, because these days no one should be stupid enough to wear just one set of underwear.

    2. The inventor, because for many decades people didn't have to wear two sets of underwear.

    5:29 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:Aug 30, 2002
    posts:1377
    votes: 0


    Well said, Scarecrow.

    rfgdxm1, I'm not talking about security but about common sense. Of course one shouldn't place the most sensitive data on the web - that's your own responsibility.

    Much to my surprise I just found a reference to GoogleGuy's very first post on WebmasterWorld and look what it's about (and read Brett's answer in msg #30 as well):
    [webmasterworld.com...]

    5:30 pm on Aug 8, 2004 (gmt 0)

    Full Member

    10+ Year Member

    joined:Nov 25, 2002
    posts:207
    votes: 0


    Scarecrow,

    somebody invented email. anybody can send you an email. you are now getting a lot of spam. whose fault is it?

    1. you because you are stupid enough to have an email address.
    2. the inventor of email.

    ridiculous reasoning, isn't it?

    5:34 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:Sept 17, 2002
    posts:2251
    votes: 0


    If you want to use your website to access personal, private documents via the web without risking them being indexed by search engines then put them in a password-protected folder.
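On an Apache server (the common setup at the time), that password-protected folder amounts to two small files; a minimal sketch, with the paths, realm name and username all illustrative:

```
# .htaccess inside the folder to protect
AuthType Basic
AuthName "Private files"
# the password file is created with: htpasswd -c /home/user/.htpasswd myname
AuthUserFile /home/user/.htpasswd
Require valid-user
```

Search engine bots cannot answer the authentication prompt, so nothing behind it gets crawled, indexed or cached.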
    5:51 pm on Aug 8, 2004 (gmt 0)

    Full Member

    10+ Year Member

    joined:Nov 25, 2002
    posts:207
    votes: 0


    the issue is whether the protocol should be opt-in or opt-out for search engines.

    the protocol today is opt-out - meaning if you do not explicitly specify that your site or pages are not to be indexed, then you are implicitly giving your permission.

    the issue is whether the protocol should be changed. will it be better for the internet?

    this protocol is not a substitute for security or privacy. whichever protocol you use there will always be spammers and hackers. we should continue to be concerned with security. just like email spam - just because the accepted protocol now is opt-out, it does not mean that email spam will go away. but now we have grounds to tell the perpetrators that you did not give your permission to receive spam.

    6:16 pm on Aug 8, 2004 (gmt 0)

    Senior Member from CA 

    WebmasterWorld Senior Member encyclo is a WebmasterWorld Top Contributor of All Time 10+ Year Member

    joined:Aug 31, 2003
    posts:9063
    votes: 2


    If you put a document online, do I have to contact you to get permission to look at it? Am I forced to use one particular way of accessing that document (eg. via a browser) or can I use other methods (eg. wget or another spider-like download tool)? How am I supposed to know what your restrictions are if you don't specify them? I can't contact every site in advance before clicking on a link - the web just doesn't work like that.

    If you want to put up information which is "restricted access" for friends and family, why are you putting it on a completely open system without defining any boundaries? If you stuck a notice on a tree in your street for your friends, how could you expect to keep it private from anyone else passing?

    Another analogy: you buy a field in the middle of a public park, and you don't put up a fence, signpost or anything else which allows someone to distinguish it from the public space that surrounds it. Can you blame people who inadvertently trespass?

    I'll say it again: if you put it on the web, you are already giving explicit permission to access that document by any means. If you don't want unfettered access then you must use available tools like .htaccess and robots.txt to define the rules of access which correspond to your needs.

    I think there is some fundamental misunderstanding of the consequences of the technology here.

    6:20 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    WebmasterWorld Senior Member 10+ Year Member

    joined:Oct 15, 2003
    posts:1418
    votes: 0


    If you want to use your website to access personal, private documents via the web without risking them being indexed by search engines then put them in a password-protected folder.

    That pretty much says it all. It's the freaking Internet, people - by its very nature everything is "public" unless defined otherwise by the server administrator. If you don't want anyone to see it, don't put it in a public place.

    6:26 pm on Aug 8, 2004 (gmt 0)

    Senior Member

    joined:Oct 27, 2001
    posts:10210
    votes: 0


    this will also reduce significantly the number of useless pages in the google index.

    I think it would have the opposite effect, because a great deal of useful information is created by academics and other subject experts who aren't into SEO, have never heard of robots.txt, and wouldn't know how to (or even that they should) opt in by taking any step beyond the obvious one of transferring their files to a public server.

    6:40 pm on Aug 8, 2004 (gmt 0)

    New User

    10+ Year Member

    joined:June 12, 2003
    posts:12
    votes: 0


    I'm confused by why this is an issue.

    Public websites are just that, public. The various (legitimate) spiders out there don't actively try to find information you haven't disclosed to anyone else. The whole argument that the public shouldn't be allowed into an area that is not cordoned off from the public seems silly. The whole "X-ray camera" thing isn't even an argument; it's moot, because legitimate engines don't do anything that sophisticated.

    If folks insist on the house metaphor, leaving unsecured documents on an equally unsecured server is like dancing naked in your living room with the curtains open and a freeway right outside, and then being "shocked" that someone talked about it.

    Look, the web is public, so is anonymous FTP and (frankly) unsecured email. Your privacy is ultimately your responsibility, not someone else's. If you have materials you wish to keep secure, then by all means secure them.

    Cripes, do you people wear your credit-card number on your t-shirts too?

    This 118-message thread spans 4 pages.