Google Me Not

Staying off the Radar or Trashing the Cache

         

Brett_Tabke

12:48 pm on Aug 8, 2004 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



[forbes.com...]

Staying off the radar.

Search engines can store results in their "cache" for between a month and forever. As archiving improves, it will get harder to clean up what's been revealed. Rarely are leaks intentional: Somebody at work might post a file on a server to download at home, a wrongly configured server might make too much of a hard drive searchable or a Web site's password-protection might be flimsy enough to be accessible to search engines.

StupidScript

5:29 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Quickly: We are also having trouble with the Yahoo bot spidering our httpd.conf file to come up with domain names we haven't published, yet.

How would anyone propose protecting the data in THAT directory, which has no links to it and is traditionally considered to be sensitive? .htaccess files? robots.txt files? It's not even "in" the web "server" section of the computer!

digitalv

5:31 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



spidering our httpd.conf

Why would this ever be publicly accessible?

greyhat

5:38 pm on Aug 9, 2004 (gmt 0)

10+ Year Member



StupidScript, how can Yahoo get to your httpd.conf file? If it's not in the "Web Server" section of your computer, why on earth is your web server allowing user-agents to request it?

BigDave

5:38 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google cannot access your .htaccess or your httpd.conf file unless you did something insanely stupid with your setup.

No one was saying to use .htaccess to "tell" google not to go somewhere, they were suggesting that you use it to block googlebot.
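For what it's worth, the .htaccess approach BigDave describes looks roughly like this (a sketch only, assuming Apache with mod_setenvif loaded and AllowOverride permitting these directives; remember any bot can fake its User-Agent, so this only keeps out the honest ones):

```apache
# Sketch: refuse requests whose User-Agent string contains "Googlebot".
# Requires mod_setenvif; directive names are Apache 1.3/2.0-era syntax.
SetEnvIfNoCase User-Agent "Googlebot" blocked_bot
Order Allow,Deny
Allow from all
Deny from env=blocked_bot
```

Unlike robots.txt, this is enforced by the server rather than requested of the bot.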

StupidScript

5:41 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Right. I'm agreeing with you three. So why did Yahoo spider it? Where's the "key" I can use to keep them out if they decide to do it, even when locked?

(PS: BigDave...you imply that there is something I did that allowed the spider into that directory. Silly me. What might that have been?)

And doing a search for "x filetype:htaccess" on Google turns up hundreds of such files. They are obviously indexing them.

kaled

5:45 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I think Brett misses the days of the super-long dance threads. His choice of hot potato to fire up everyone was inspired.

Kaled.

technoatheist

5:48 pm on Aug 9, 2004 (gmt 0)

10+ Year Member



Uhm, StupidScript, you have an error in your server configuration.

.htaccess is a special file that the Apache web server will not distribute. It is used by Apache to configure attributes for that directory. If you can read that file, you may have an error in your configuration, or you may not be running Apache.

In addition, you should NEVER store your *.conf files in the document root portion of your directory. If you must do so due to a bad host configuration, configure your server to deny access to *.conf files, and go find a proper hosting company.
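In case it helps anyone reading along: a stock Apache install already ships with a `<Files ~ "^\.ht">` block that refuses to serve .htaccess files, and the same mechanism can cover config files. A sketch (Apache 2.0-era syntax; the exact directives depend on your version):

```apache
# Refuse to serve any .ht* file or any *.conf file, wherever it
# ends up under the document root (belt and braces for a bad host setup)
<FilesMatch "(^\.ht|\.conf$)">
    Order allow,deny
    Deny from all
</FilesMatch>
```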

As for the rest; Look, the search engines don't cache every single page, in fact they only cache a small percentage of pages they feel are important. Likewise, most don't crawl the entire site, but stop when the page becomes overly complex or questionably formatted.

If you want to "protect" your material, simply do things like attach long meaningless arguments to the URL (to make it look like a session key) or route all your pages through a redirector. Your page rank will drop like a rock and you won't have to worry about any of these problems.

StupidScript

5:49 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My point (perhaps lost in my babbling) is in support of renee.

There should be SOME kind of control and rule structure that ALL bots MUST follow if we are ever to be secure in the knowledge that files placed on a computer with a NIC will be secure from roving bands of spiders that seek to enhance their bottom line by placing copies of our electronic documents in a public area.

BigDave

5:51 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



And doing a search for "x filetype:htaccess" on Google turns up hundreds of such files. They are obviously indexing them

And did you look at the results? None of them were actual .htaccess files. They were all files about .htaccess or badly done 404 pages that return 200.

I don't know what you could have done wrong to allow access, because I cannot imagine doing such a thing. But if it is actually getting it instead of just asking for it, then something is wrong.

StupidScript

5:53 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



technoatheist: I understand all of this and have been very careful with every server I have managed over the past 14 years.

I am running Apache on an RH9 dedicated server. The web root directory is (by default) /var/www and the config directory is (by default) /etc/httpd/conf. I don't know how Yahoo got in there unless the bot was programmed to adopt a fake root persona and crawl back up the directory tree. Frankly, it's quite a mystery to me.

StupidScript

5:54 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



BigDave: There are actual .htaccess files in there.

vabtz

6:03 pm on Aug 9, 2004 (gmt 0)



Yet another case of blaming Google because someone doesn't know how to do their job.

greyhat

6:12 pm on Aug 9, 2004 (gmt 0)

10+ Year Member



StupidScript, I also cannot see any actual .htaccess files on Google--only example files called, for example, old.htaccess, which is NOT a .htaccess file. I also see a number of .htaccess files from CVS repositories, but those are not functional and seem to all have things appended to the name, like "?rev=1.2". By default Apache will forbid any requests for ".htaccess" (even if it's not there).

I really don't know what you mean by "adopt a fake root persona." It simply should not be possible for anyone to access files in /etc over the web if your public files are in /var/www. And I don't see how this relates to search engine spidering. Are you saying that you should be able to tell spiders not to crawl files that your web server is not supposed to be serving just in case the server itself is configured wrong or insecure?

StupidScript

6:14 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hey, I have no gripes with Google. I simply don't EVER put anything I wouldn't want to share with the world on any system that has a web server installed...running or not.

And as I'm always in the market to learn new things, please inform us of what we should do with an installation of Apache on RH9 to keep every rogue spider from probing our innermost secrets.

I'm looking for a comprehensive solution, here, not the superficial (well...use .htaccess and robots.txt and don't have any authorized users and keep the hackers out, too) type of response. How do you do YOUR job?

Seriously. I'll cop to not being the uber-security guy. Help! :)

BigDave

6:21 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Did they actually *successfully* get a copy of your httpd.conf file?

Not "did they ask for it", but did your system send it back out?

I can put together a GET to ask your server to send me GoogleGuy's home phone number, but that doesn't mean that it can or will supply it to me.

StupidScript

6:31 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Referring to Yahoo's bot "using" information in my httpd.conf file: The company I work for owns 109 domains. We actively use 14 of them. Out of the rest, 73 use similar domain names (misspellings, different hyphenation, etc.) and 301 redirect to the related primary 14 domains.

This leaves 22 domains which we own, and have registered, but do not "use". ALL of the domains are listed as Virtual Domains in our httpd.conf file.

A short while back, as I was doing SEO research, I did repeated searches for link:yaddayadda.com (our active domains) to confirm my efforts at gaining inbound-links.

Lo and behold...ALL of our domains turned up in the searches...repeatedly. Not only that, but the TITLEs and descriptions displayed from Yahoo's index were being used on every domain, depending on which domain I was checking link: for. I.e. yadda.com showed yaddayadda.com's TITLE, etc.

Even the domains that have no home directory, no redirect, and no links-in (heck...there are no pages to link TO!) were displayed in the search results.

Clearly there is a problem with the Yahoo index, but more clearly...they had gotten hold of our httpd.conf file...the only place on earth where the un-used 22 domains were listed apart from the registries.

After investigating my brains out, I am left with the conclusion that Yahoo's bot indexed my httpd.conf file. Perhaps I am in error? I've come to the (perhaps erroneous) conclusion that this was done intentionally (although I could not tell you how) in order to crack down on PPC advertisers who use different domains for redirects and such. I simply do not know. But that's what happened.

BigDave

6:40 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



the only place on earth where the un-used 22 domains were listed apart from the registries.

I think you just answered your own question.

StupidScript

6:49 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So you think Yahoo got their info from the registrars and associated that data with our active domains and placed the (mixed up) results into their index?

Wow. That's a neat trick. Even associating domains that have not been added to our nameservers by virtue of the registry entries? Whew.

Thanks! :)

I guess the lesson for renee and myself is: go ahead and get yourself a domain, but if you want to keep sensitive material on the machine that hosts that domain...don't register the domain name with a registrar. In fact...don't bother with the domain, just use the IP address...I guess.

BigDave

7:24 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah, it is a neat trick. But nowhere near as neat as getting a file that they cannot possibly get to through normal channels.

You are suggesting that a major search engine is taking advantage of a crack to get at non-publicly accessible files. That is simply not happening.

Show me the GET that they used to get your httpd.conf. Where is the log entry?

Oh yeah, it's on a virtual server, which domain did they use to get the conf file? Or did they request it by IP address?

Your only "proof" that this happened, as far as I can tell, is that Yahoo spidered pages and you don't know how it got them.

Are you absolutely sure that they are not on the NS? Have you done a lookup?

Is anyone that is supposed to have access running the yahoo toolbar?

Have you tried searching the web for those domain names?

Until you show me the GET, I will remain convinced that you are wrong, and jumping to totally wild conclusions.

If you don't know how to read your log files, it's time you learned.
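BigDave's "show me the GET" test can be scripted. The sketch below (Python; the log format and the sample lines are assumptions for illustration, not StupidScript's real logs) scans Apache Combined Log Format lines and reports only the requests the server actually fulfilled with a 2xx status — which is exactly the distinction BigDave is drawing between a bot asking for a file and getting it:

```python
import re

# One line of Apache's Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def successful_fetches(lines, fragment):
    """Return (host, path, agent) for every request whose path contains
    `fragment` AND which the server answered with a 2xx status -- i.e.
    requests the server actually fulfilled, not merely received."""
    hits = []
    for line in lines:
        m = LOG_RE.match(line)
        if m and fragment in m.group('path') and m.group('status').startswith('2'):
            hits.append((m.group('host'), m.group('path'), m.group('agent')))
    return hits

# Made-up log lines: a crawler asks for httpd.conf and is refused (403),
# then fetches an ordinary page (200).
sample = [
    '66.196.90.1 - - [09/Aug/2004:17:41:02 +0000] "GET /httpd.conf HTTP/1.0" 403 - "-" "Yahoo! Slurp"',
    '66.196.90.1 - - [09/Aug/2004:17:41:03 +0000] "GET /index.html HTTP/1.0" 200 1234 "-" "Yahoo! Slurp"',
]
print(successful_fetches(sample, 'httpd.conf'))   # prints [] -- the 403 was a request, not a fetch
```

An empty result for a sensitive path means the bot may have knocked, but the server never handed the file over.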

StupidScript

7:36 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



'Nuff said. I take it back. :)

Yahoo's bot is probably behaving properly, and it must have gone down like you said. Regardless of the fact that my web server could not have delivered the httpd.conf file as it is nowhere in the "public" web server directories and therefore would not have been (and was not) logged, your explanation, although weird, is more likely than my infiltration scenario.

I know they are not in our nameservers because I didn't put them there. I just checked, and I still didn't put them there.

I do sincerely appreciate having a plausible explanation, BigDave.

I remain committed to not placing sensitive material on a web-connected appliance, though. It just seems prudent.

As Sun Microsystems Chairman Scott McNealy is quoted as saying in the Forbes article quoted at the start of this thread, "You already have zero privacy. Get over it." (Amusingly, the banner ad playing at the Forbes site states "The scariest thing about a spider...is not having one in your portfolio.")

digitalv

8:22 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To get back on topic here ...

There should be SOME kind of control and rule structure that ALL bots MUST follow

There isn't and never will be - robots.txt is the closest we'll ever get to it. Just because a rule/law/whatever exists doesn't mean I can't write a program that ignores it. The only rules my home-made spider has to follow are the ones YOU set when you set up your server's security.
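For the bots that do honor the convention, the whole "rule structure" fits in a few lines of robots.txt at the site root. A sketch (Slurp is Yahoo's crawler; the /private/ path is made up for illustration, and compliance is entirely voluntary):

```
# Ask Yahoo's crawler to stay out entirely...
User-agent: Slurp
Disallow: /

# ...and ask every other robot to skip one directory
User-agent: *
Disallow: /private/
```

A rogue spider simply never reads this file, which is the point: robots.txt is the doorbell, the server's access controls are the lock.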

It's kinda like saying there should be a rule that no one can come into your house without your permission ... we already have that too, but it doesn't mean you shouldn't have a lock.

Speaking of which, how come convenience stores that are open 24/7/365 have locks on the doors? :)

GaryK

9:45 pm on Aug 9, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Speaking of which, how come convenience stores that are open 24/7/365 have locks on the doors?

To keep out spiders if they see one about to come through the door?

dsandall

3:56 pm on Aug 12, 2004 (gmt 0)

10+ Year Member



..That's very false, I shouldn't have to tell them not to come to my site! Why should I have to create a robots.txt page and waste more space on my server?

Following that logic, and picking up on the open house analogy being bandied about, here's my thinking...

Does your home address show up on a map, or even an aerial photo of the region you live in? So, in other words, you are proposing that people should not be able to see your house's PUBLIC exterior unless you give them explicit permission to look at it?

Humans have been trying to map their environment for as long as there has been written history; indexing the web is an extension of that inner urge to map out and understand our environment. Does anyone remember the Henderson directories? (Or maybe that was just a Canadian thing?) They were street-by-street listings of who lived where: 123 Main Street was Bob Jones, 125 Main was... etc. The robots.txt equivalent was to not give your info when they came to the door to ask. But was there any moral, legal or other reason to prevent them from coming to your public door and knocking? No, not at all.

Was there anything to prevent the Henderson researchers from opening your door, walking in and reading your mail, diary or rooting around in your fridge to see what you ate? Yes there was! There was a lock on the door, a law or two that would make it a criminal offense to do so, and the general moral code that prevented it.

Now for the sticky part of it... If you had a big sign on your lawn saying "the Jones Family lives here", is that public? Can they write that down and publish it in their directory? Probably yes. Now, what if you left some pictures of an office party out on the lawn? Can they take a picture or copy of them and publish them? Dicey at best. But the neighbors could definitely see them and have a few thoughts of their own.

Now, how does the spider (or the door-to-door directory researcher) find those pictures lying on the lawn? They have to be there. If the gate has a big sign forbidding interlopers and is locked (i.e. passwords and robots.txt), then there is no way they should be seen. If there is no fence (i.e. a link to them exists), then there is a chance they will be seen if the spider comes along at the right time.

Long winded, but I hope I made the point.
Dwayne

ergophobe

6:49 pm on Aug 17, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This may be the most interesting thing I've ever seen here (seriously). I would never have imagined that people who build and optimize websites would think that indexes should be opt-in (the cache question, which started this all off, is a slightly different issue).

Maybe it's been done to death and nobody seems convinced, but FWIW

1. Your web space is not your house, your car, your backyard or your email address, it is a place where you publish documents.

If I publish a book, do libraries have the right to index that book and then keep a copy? Yes. Do they infringe on my copyright by "caching" a copy in the library? No. Does a patron who comes along and photocopies large sections infringe on my copyright? Probably. If libraries didn't "cache" copies for public use in addition to indexing them in their online or card catalog, why would anyone use libraries? How would they have caught on in the first place as one of the most important institutions in the history of humanity? If I don't want my book to end up in the library, I don't publish it. Putting something in webspace is publishing.

2. How does Google differ from a library?
Because it profits from cached content? So does every major university library. Without a huge amount of "cached" content, nobody in my field (history) will consider going to your graduate program, and the best scholars (those with other offers) will not teach at your institution. Hey, Harvard and Stanford didn't even ask me whether they could put my books on the stacks, and now they're using them to attract grad students. Not only that, the stupid errors I made in my first book are available to anyone now. In my case, sales are low enough that I probably could handle opt-in requests from every library that wants to cache my books, but I think I like the current system (which does not even allow opt-out, I'll have you know).

3. Not everyone wants to be a webmaster

Most people put their stuff on the web because they want it available to the public. Otherwise, it would still be on their non-networked hard drive. A huge percentage of these people, many of whom are creative people with tons of information to share, have no desire to learn SEO and how to get on the search engines. If indexing were an opt-out service, we would all be poorer for it.

Bonus analogy

If I leave the keys in my car, and aliens from Alpha Centauri can start and drive the vehicle using telepathy without getting into the driver's seat, does that mean that highway billboards should be opt-in only?

HitProf

11:32 am on Aug 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would never have imagined that people who build and optimize websites would think that indexes should be opt-in

I very much doubt the type of information in question is found on a normal web page.

Google wants to index all types of information, not even limited to the web. The question is: do we all agree with that, or do we draw a line somewhere?

My personal opinion is that search engines should stick to web pages (Flash included; Flash is made for the web!), unless the owner specifies otherwise.

digitalv

1:25 pm on Aug 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



search engines should stick to web pages (Flash included, Flash is made for the web!)

Just a comment first, I would personally rather see Flash not indexed over all of the other types out there. The reason is that Flash docs could contain text, keywords, etc. but actually do nothing more on the page than show a logo. That kind of crap will be used and abused by search engine spammers.

But that aside, don't you think restricting document file types goes against everything that the web has or will become? HTML was only created because an efficient/universal method of delivering documents and linking to other documents beyond text files didn't exist. That isn't the case today - PDF, Word, Excel, all of them now have built-in web capabilities. They can contain clickable hyperlinks, they can share data (import/export) with other web-only formats like HTML and XML.

Maybe once a long time ago they could have been considered "non web file types" but that doesn't hold true anymore. Today they are all web file types. I'm just amazed that there are actually webmasters out there trying to limit growth like this.

ergophobe

3:07 pm on Aug 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month




But that aside, don't you think restricting document file types goes against everything that the web has or will become?

Exactly. Think of what's going on the web. The problem is not that too many file formats are being indexed, but too few - there is no decent way to search for images, movies and so on. Word or at least RTF documents are a perfectly acceptable way to put information on the web.

To continue my earlier analogy - when libraries were created, they only had copies of codices (manuscripts in codex form). Was there a controversy when they started collecting and indexing books, then printed materials, then newspapers and periodicals, then images, then sound recordings, etc etc etc?


I very much doubt the type of information in question is found on a normal web page.

That's not the point I was making. I was trying to say that any web developer who knows anything should realize that if you put a document in your web root, it has been published. That is why the web exists. If you don't want it published, keep it out of your web root and Google will not go there. I can see that this might be lost on non-techies, but I'm amazed that this has been lost on the readers of this forum.

And as many have pointed out, please define a normal web page? Is it my html? My Word documents? My images (gif, jpeg, png)? My videos? My proprietary database? My PDF files? So far I can answer yes to all of these. If I didn't want them indexed, they would be outside of web root and password-protected.

Also, let's be clear here, Google does not search your hard drive, Google follows links. So even if you're completely lax about security and you have private docs in web root and unencrypted, pure laziness would keep them out of Google by virtue of the fact that there are no incoming links.

Tom

HitProf

5:22 pm on Aug 18, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>any web developer who knows anything should realize that if you put a document in your web root, it has been published.

There you tackle the problem. They should know, but they don't. And many, many, many hobby publishers are by no means web developers. They only want to share "something" with their friends.

It's not too much to ask that those who are savvy (the ones who really should know) have a means of saying "hey, this .something is OK to index", instead of demanding that all those clueless hobby webmasters do something special to prevent it.

This 118 message thread spans 4 pages.