Forum Moderators: Robert Charlton & goodroi
At any rate, robots.txt will serve to block Gbot (and other compliant 'bots) from fetching resources from a server with only these two lines:
User-agent: *
Disallow: /
Jim
Just because you don't have an "http:..." link pointing to something doesn't mean your server isn't serving up a directory listing. Remember, you also have to provide information to someone in order for a domain to resolve, and (insert search engine name here) has access to that information too.
Adding -Indexes to the Options directive in Apache can prevent this. I'd bet there's something similar for the IIS folks.
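For the Apache folks, a minimal sketch of that directive (where you put it depends on your setup, and .htaccess only works if AllowOverride permits Options):

```apache
# In httpd.conf or a per-directory .htaccess:
# turn off auto-generated directory listings, so a folder
# without an index file returns 403 instead of a file list.
Options -Indexes
```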
Also, some of the folks you are allowing to look at things might have posted links somewhere, or may be running less-than-trustworthy software on their systems.
Oh and then there are the proxies.
[edited by: theBear at 9:09 pm (utc) on July 18, 2007]
Once I tested a fetching script called "Wiki Reflection." It is a PHP script that fetches wiki content for user-defined titles; for example, "fetchingScript.php?title=UNIX" fetches the Wiki page on Unix and displays it.
Last week I happened to check site:mysite.com, and voila! Google had indexed 700+ pages with the pattern fetchingScript.php?title=X.
There is no link to this script page, and it is absolutely isolated. So there is no possibility for Google to reach it by following a link pointing to that page.
The only possibility is that Google fetched the whole root directory (the script resides on the root directory).
Some of these pages I have sent email links to people to view - pages with pics for the family. When I redid my old website, I saved the old files under a different name, like contactold.html instead of contact.html. All these old pages are now indexed. I thought Google didn't like duplicate content? When I make updates on clients' pages, I usually rename the old page to something else, because sometimes they change their mind and want to go back, and because I like to keep a backup. Google is out of control!
I know for a fact there are no links. This has happened on more than one server. One is a brand new site, on a new server. I have a front page up with no links that says "under construction." I sent my client a link to get in the back door so they could view the site....
mumbles - As several people have tried to tell you in several ways, when you let pages sit on your server and don't block them, they're liable to get spidered.
Jim Morgan's scenario is likely enough that I'm going to repeat it and emphasize the relevant text...
...Or you visited *their* site from one of your pages, and ended up in their publicly-available "stats" page as a referrer link.
I always have developers I'm working with either use a password protected directory, or put the meta robots noindex tag in all temporary or development pages that are in a public directory...
<meta name="robots" content="noindex, nofollow">
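For the password-protected-directory option, a rough Apache sketch (the directives are standard Basic Auth; the paths and realm name here are just placeholders):

```apache
# .htaccess in the development directory - requires a login
# before anything (including search engine bots) can fetch pages.
AuthType Basic
AuthName "Dev area"
AuthUserFile /path/to/.htpasswd
Require valid-user
```

Basic Auth keeps crawlers out entirely, which is stronger than noindex: the pages never get fetched in the first place.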
Google indirectly suggests that it's not the Toolbar in this "Google Information for Webmasters FAQ"...
[google.com...]
Why is Googlebot downloading information from our "secret" web server?
It's almost impossible to keep a web server secret by not publishing any links to it. As soon as someone follows a link from your "secret" server to another web server, your "secret" URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. So, if there's a link to your "secret" web server or page anywhere on the web, it's likely that Googlebot and other web crawlers will find it.
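The referrer leak Google describes can be sketched in a few lines of Python (the example.com URLs are placeholders): when someone clicks a link on your "secret" page, the browser sends that page's URL in the Referer header, and the other site's public logs or stats page can then expose it.

```python
# Sketch of the referrer leak: a browser-style request from a
# "secret" page carries that page's URL in the Referer header.
import urllib.request

secret_url = "http://secret.example.com/private/page.html"
req = urllib.request.Request(
    "http://other-site.example.com/",
    headers={"Referer": secret_url},  # what a browser would send
)

# The other server's access log now contains secret_url; if its
# stats/referrer page is public, crawlers can pick the URL up there.
print(req.get_header("Referer"))
```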
So, I don't think it's Google that's out of control. When you leave old files in a public area on your webserver, they're public. At the least, create a folder, call it "old", block it somehow, and put your old stuff in there.
Note:
Google has apparently moved its relevant explanation on this topic. New location here...
Why is Googlebot downloading information from our "secret" web server?
[scholar.google.com...]
[edited by: Robert_Charlton at 3:22 am (utc) on April 20, 2008]
The robots.txt "block" still allows the URLs to appear as URL-only entries in the SERPs. The noindex meta tag, or proper password protection, is much more robust.
Yes, the noindex meta tag is much better than robots.txt to keep references to your URLs from being indexed.
One further point on this...
robots.txt is to prevent spidering. The noindex meta tag is to prevent indexing. Don't use robots.txt in addition to the noindex meta tag to prevent indexing of a particular page or pages.
Because the use of robots.txt will prevent Googlebot from spidering the page with the noindex meta, Google will never see the noindex meta and therefore might end up indexing the URL to the robots.txt-blocked page if it finds it somewhere on the web.
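Put concretely, to get a specific page out of the index the setup would look something like this (the page path is just an example):

```
# robots.txt - do NOT Disallow the page you want de-indexed,
# or Googlebot can never fetch it and see the meta tag:
User-agent: *
Disallow:

# in the <head> of the page itself (e.g. /old/contact.html):
<meta name="robots" content="noindex, nofollow">
```

The crawler has to be allowed in so it can read the noindex instruction; blocking it in robots.txt defeats the purpose.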