Googlebot Indexing Anything it Finds

Forum Moderators: Robert Charlton & goodroi

Google needs to stop!

atlrus

1:01 pm on Jul 6, 2006 (gmt 0)

I did a site: search and what do I see? My FTP log is cached by Google...and there has never been a link to it, ever!

I have a few more pages like that cached by Google - image directory, articles directory - pages with no links to them, pages and files I did not know existed...

Obviously Google is just crawling everything in my website's directory - pages and files and all.
To me - this is a privacy invasion.

Anyone else see this with their sites?

cheesehead2

3:20 pm on Jul 6, 2006 (gmt 0)

It is your fault. Anything that is private should be password protected. Google wouldn't have found it if there was not a link.

ogletree

3:23 pm on Jul 6, 2006 (gmt 0)

Cheesehead2, that is not true. Google has many ways of finding things. You need to tell Google not to index stuff you don't want found. Anything that is available to the public is fair game for Google to find.

g1smd

9:11 pm on Jul 6, 2006 (gmt 0)

Really funny. Just minutes ago I read this other post while looking for something else:
use.perl.org/~gabor/journal/18601.

colin_h

1:04 am on Jul 7, 2006 (gmt 0)

lol ... Someone is after us ;-)

Look out, Googlebot is behind you ...

theBear

1:22 am on Jul 7, 2006 (gmt 0)

It is also possible that your server provided google with a complete directory listing for the directory your ftp-log was in.

ansible

2:41 am on Jul 7, 2006 (gmt 0)

If you don't want certain stuff of yours to be indexed, you need to disallow it in your robots.txt file.

It sounds like you were using the google toolbar when your ftp log was pulled up on your end, and you don't have it disallowed in your robots.txt file.

When you use the google toolbar and have the pagerank bar enabled, it sends url data to google, so google knows what urls exist out there, even if they are not linked to anywhere. So you have to be careful about what links you pull up when the pagerank bar is enabled.

Komodo_Tale

6:15 am on Jul 7, 2006 (gmt 0)

<?php
if ($GoogleBot == "PacMan"){
echo "<h1>" . "Run for your lives!" . "</h1>" ;
}
?>

Lexur

7:06 am on Jul 7, 2006 (gmt 0)

There is some checkbox to stop Apache from serving a directory listing when there is no index HTML file.

followgreg

7:25 am on Jul 7, 2006 (gmt 0)

You can also use .htaccess to prevent directories from being listed.

Romeo

11:59 am on Jul 7, 2006 (gmt 0)

... and if you can't poke around in your apache's config and are unsure about the .htaccess, just put an index.htm (may even be empty) into those directories where there isn't one yet.

Disallowing in robots.txt may keep the Googlebot out but may attract other bots (the bad ones) and individuals to explicitly search for stuff you just tagged 'interesting' this way.

Kind regards,
R.
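
For anyone who wants to try the empty-index approach on a local copy of a site, here is a rough Python sketch that lists the directories lacking an index file, so you know where a placeholder would be needed. The function name and the list of index filenames are assumptions made for illustration; which filenames actually count as an index depends on how the server is configured.

```python
import os

# Assumed index filenames; the real list depends on the server's configuration.
INDEX_NAMES = ("index.htm", "index.html")

def dirs_without_index(root):
    """Return sorted paths of directories under root that contain no index file."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(root):
        present = set(filenames)
        # A directory is "covered" if at least one assumed index name exists in it.
        if not any(name in present for name in INDEX_NAMES):
            missing.append(dirpath)
    return sorted(missing)
```

Calling dirs_without_index() on the folder you upload from returns the folders a visitor (or a bot) could potentially get a listing for, assuming listings are enabled on the server.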

atlrus

1:28 pm on Jul 7, 2006 (gmt 0)

Thanks for the response, but it's not that simple.
I use WS_FTP, and obviously this program will leave a file WS_FTP.log in my main directory showing the last pages transferred. It seems to be a temporary .txt file, it stays for a day and then it's gone.

Now, I HAVE NEVER seen this file before I saw it on Google...There are no links, NO LINKS, pointing to this file.

Also, I decided to dig a little deeper today, and I found some crazy sh*t - supplemental parent directories cached, for example:

domainname.com/foldername/?C=D;O=A
domainname.com/foldername/?C=S;O=A

with the following snippet from Google's description of the pages (pagename and foldername are replacements):

LOG 15-Feb-2005 23:11 588 pagename.htm 15-Jul-2005 11:50 68K foldername/ 07-Jul-2005 11:05 -. Apache/2.0.46 (Red Hat) FrontPage/5.0.2.2634 mod_ssl/2.0.46 ...

Now, I am no programmer - I design the page with FrontPage and upload with WS_FTP.

And Google has crawled my image file:
domainname.com/images/

Disallowing all this with the robots would not work, since I have no idea what pages google will crawl - I cannot disallow this: domainname.com/foldername/?C=S;O=A because this means I will block all access to that folder, and I cannot do that as it's part of my site, and I want the pages crawled, but just the pages, no made-up parent directories.

Obviously I am doing something wrong, but for the life of me - I don't know what - the rest of my websites look just fine...

atlrus

1:31 pm on Jul 7, 2006 (gmt 0)

Placing an index page in those dirs sounds good.
Would it work if I have empty index.htm pages, and then disallow them from the robots?

PCInk

1:38 pm on Jul 7, 2006 (gmt 0)

Your file permissions on the log file must be set to allow the public to read that file. Try changing the permissions.
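
If you have shell or script access on the server, a quick way to check this is to look at the "others" read bit on the file. The following is only a minimal Python sketch under that assumption; the function name is made up for illustration, and it only makes sense on Unix-style permissions. Note that a file can still be served if the web server reads it as owner or group, so this is a hint rather than a full diagnosis.

```python
import os
import stat

def is_world_readable(path):
    """Return True if the 'others' read bit is set on path (Unix-style permissions)."""
    mode = os.stat(path).st_mode
    return bool(mode & stat.S_IROTH)
```

For example, is_world_readable("WS_FTP.log") run from the site's main directory would tell you whether every account on the machine, typically including the web server, can read the log.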

europeforvisitors

2:22 pm on Jul 7, 2006 (gmt 0)

Have you checked the log settings of your WS_FTP program? In my version of WS_FTP Pro, you can reach them as follows:

1) Click "Options" from the "Tools" menu at the top of the WS_FTP window.

2) Select the "Logging" tab.

You'll see checkboxes for "Enable Session Logging" and "Enable Transfer Log." On my system, both are set to the program's defaults of C:\Documents and Settings\All Users\Application Data\Ipswitch\WS_FTP\Logs; if you see something different, change the log locations or uncheck the checkboxes and see if that keeps the logs from uploading. (The logs don't belong on your server anyway.)

atlrus

3:00 pm on Jul 7, 2006 (gmt 0)

I have unchecked the logging, I hope it'll work.
But the rest of my websites don't have that kind of cached pages on Google, and I use WS_FTP for them as well...

And the fact that Google is crawling it - that's what bothers me the most. Unless there is a link to it (which there is none) Google has no business crawling that "deep", especially when almost all of my pages still got the supplemental problem. The site ranks good though, maybe the number of links triggers this kind of behaviour?

ansible

11:16 pm on Jul 8, 2006 (gmt 0)

You can use robots.txt to disallow all of something, like logs (in your case). It would just look like this:

Disallow: /*.log

Or you can just disallow that one file:

Disallow: /WS_FTP.log

Once you do this, with some time, it should remove your log file from search engine indexes.
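
Before uploading a robots.txt, it can help to test the rule locally. Below is a small Python sketch using the standard library's urllib.robotparser. Two assumptions to flag: a Disallow line has to sit under a User-agent line, so the sample adds "User-agent: *", and the sample only uses the exact-file rule, because the standard library parser matches plain path prefixes and should not be relied on for wildcard patterns such as /*.log. The helper name is_allowed and the domainname.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content; the User-agent line is an assumption added here
# because a Disallow rule needs one to apply to.
ROBOTS_TXT = """\
User-agent: *
Disallow: /WS_FTP.log
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the sample above, is_allowed(ROBOTS_TXT, "Googlebot", "http://domainname.com/WS_FTP.log") reports the log as blocked, while ordinary pages in other folders stay crawlable. This only tells you how a well-behaved bot would read the file; it does not stop anyone from requesting the URL directly.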

bruceh

12:07 am on Jul 9, 2006 (gmt 0)

IndexIgnore */* in .htaccess will prevent showing files in a folder without an index. Alternatively, create an index that could serve as a thin sitemap with links to your most important pages. If you have the google toolbar installed, then google can find any file you view. On a positive note, google only gets this aggressive about indexing everything it finds if your site reaches a certain threshold of importance. So if google lists obscure pages, it means you must be doing something right.

dudibob

12:51 pm on Jul 17, 2006 (gmt 0)

this is crazy, my little site that isn't designed got indexed with a robots.txt on it with no links not too long ago.

Last week for a test, I removed the robots.txt, put some pages up, viewed them all with the Google toolbar, few days later all were indexed and today they don't exist!?

crazy indeed

whoisgregg

1:37 pm on Jul 17, 2006 (gmt 0)

Unless there is a link to it (which there is none) Google has no business crawling that "deep"

I'm pretty sure the Privacy Policy of the Google Toolbar [google.com] does give them the right to use the URL data you provide them in any way they see fit:

In addition, we use log information about aggregate Toolbar usage to improve the quality of Toolbar and other Google services.

Of course, you needed to have turned on advanced features for that to happen:

The Google Toolbar automatically sends only standard, limited information to Google, which may be retained in Google's server logs. It does not send any information about the web pages you visit (e.g., the URL), unless you use Toolbar's advanced features.

Added: My comments only apply in this case if the OP was using the Toolbar when viewing these pages, but are included for anyone with similar issues/concerns. :)

cabowabo

1:44 pm on Jul 17, 2006 (gmt 0)

I design the page with FrontPage and upload with WS_FTP. Obviously I am doing something wrong, but for the life of me - I don't know what

Well, for starters, you are using FrontPage. :-)

I would suggest modifying your .htaccess, as it will give you more control than your robots.txt, and outsiders can't just request your .htaccess to see it. A suggestion to disallow *.log was an excellent one ... and you can go beyond that to only allow certain extensions to be indexed. That should limit the phoney "string" URL from showing up in the index.

However, if your .htaccess file gets too large, you will start to see the loading time of your site extended. That is when you start transferring information over to your config file.

Best of luck to you getting a handle on this one. Googlebot misbehaving again...

CaboWabo

rohitj

2:15 pm on Jul 17, 2006 (gmt 0)

google will go through urls in referral logs as well. So if you ever visit a site, and it gets your ftp site as a referral, there's a possibility that google may in turn get to your FTP site.

That's my understanding, since i've seen first hand that google had to have gotten my urls from referral logs. When i did searches, I eventually found a foreign referral log, listing the unique url in question.

Tapolyai

2:49 pm on Jul 17, 2006 (gmt 0)

This brings an interesting question.

I hear "it's your fault" or "you need to..." with various good technical information how to forcefully disallow Google to read information.

Is a web site like a property, where Google has an "easement" to anything on the property?

On my property (let's say a store), I do not have to put up a fence to have the right of no intrusion. It is inherently wrong to wander on my property without my consent.

More importantly, it is not the person entering my property that defines how I need to notify them.

So how does this work with web property? Do Google, and other information collecting systems inherently have the "right-of-way" to collect my web site?

I am always leery of opt-out solutions, because I am forced into an action despite non-participatory action. i.e. If I had zero interest in the first place, I still have to act. It feels like spam.

mattg3

2:54 pm on Jul 17, 2006 (gmt 0)

If you don't want certain stuff of yours to be indexed, you need to disallow it in your robots.txt file.

So it's there for all the freak stalkers to look at that download the robots.txt .... and go then exactly to these pages ..

What we need is a submittable robots.txt not an open one ...

OK you can play around with access controlling IP addresses and Referrers, it's still no use.

[edited by: mattg3 at 2:57 pm (utc) on July 17, 2006]

whoisgregg

2:55 pm on Jul 17, 2006 (gmt 0)

The web is public. Without restricting who can enter your site or what you post on your site, it is all available for anyone, human or robot, to download and process.

Be thankful the nice bots have procedures and policies that make it easier to restrict their access. Not all bots have such courtesy.

On my property (let's say a store), I do not have to put up a fence to have the right of no intrusion. It is inherently wrong to wander on my property without my consent.

Actually, in the U.S., it isn't wrong to wander until the property owner (or representative) tells you to leave either verbally or by posting a notice and you refuse to leave. Trespassing is only trespassing if, once notified, the trespasser fails to leave the property.

mcavic

4:01 pm on Jul 17, 2006 (gmt 0)

The web is public. Without restricting who can enter your site or what you post on your site, it is all available for anyone, human or robot, to download and process.

Yes. And:

domainname.com/foldername/?C=D;O=A
domainname.com/foldername/?C=S;O=A

This shows where Googlebot found the FTP log. It visited domainname.com/foldername/ and there the Web server gave it a directory listing. Visit that link yourself, and you'll see it. Googlebot then visited all of the links on that page. The ?C=D;O=A part is a link that re-sorts the directory listing.
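
To spot these re-sort links in a list of indexed URLs or in your access log, something like the following Python sketch could work. It assumes the Apache 2.x autoindex query format seen above, where C picks the sort column (N name, M last modified, S size, D description) and O picks the order (A ascending, D descending). The function name is invented for the example.

```python
import re
from urllib.parse import urlsplit

# C = sort column (N name, M last modified, S size, D description)
# O = sort order (A ascending, D descending)
SORT_QUERY = re.compile(r"C=[NMSD];O=[AD]")

def is_autoindex_sort_link(url):
    """Return True if the URL's query string looks like a directory-listing re-sort link."""
    query = urlsplit(url).query
    return SORT_QUERY.fullmatch(query) is not None
```

Any URL this flags is a strong hint that the folder in front of the question mark is serving a directory listing, which is the real thing to fix.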

Would it work if I have empty index.htm pages, and then disallow them from the robots?

Placing an empty index.htm page will prevent anyone from discovering what files are in the directory. But it won't remove any files that are already in Google's index. But you could create the empty file, then rename the directory and change all of your links.

rohitj

5:25 pm on Jul 17, 2006 (gmt 0)

We've been through the analogies about the web numerous times, and it seems that it's always been clear that you can't compare the web to your home. In your home, if you don't lock your doors, you can still expect a person to knock before entering. The web isn't like that and it's a clear assumption that is inherent in how the web operates. If the web didn't function like that, it'd be much harder to navigate and be far less useful to us.

whoisgregg

6:40 pm on Jul 17, 2006 (gmt 0)

In your home, if you don't lock your doors, you can still expect a person to knock before entering.

On the web, there aren't doors unless you put them there. Even if you put up a door, there are no locks until you install them.

To continue the analogy, not linking to a page on your site is like putting something on your back porch. Sure, they have to walk around the back or drive down the alley to see that stuff, but it's still out in the open.

g1smd

6:41 pm on Jul 17, 2006 (gmt 0)

>> So it's there for all the freak stalkers to look at that download the robots.txt .... and go then exactly to these pages... <<

Not if you disallow a folder, then make sure the folder does not serve a directory index, and put files with un-guessable names in there, or maybe even put the files one folder deeper in an un-guessable sub-folder name.
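
As an illustration of what "un-guessable" could mean in practice, here is a minimal Python sketch that generates a random folder or file name from a cryptographically strong source. The function name and the 16-byte default are assumptions for the example. Keep in mind this is obscurity, not access control: anything truly private still belongs behind a password.

```python
import secrets

def unguessable_name(num_bytes=16):
    """Return a random hex string (two hex characters per byte) for use as a name."""
    return secrets.token_hex(num_bytes)
```

A name produced this way can be used for the sub-folder, with only the parent folder listed in robots.txt, so the robots.txt file itself gives nothing useful away.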

mattg3

2:30 am on Jul 18, 2006 (gmt 0)

Not if you disallow a folder, then make sure the folder does not serve a directory index, and put files with un-guessable names in there, or maybe even put the files one folder deeper in an un-guessable sub-folder name.

Yes obviously there are workarounds ... it's still a solution from a time when the net was all fluffy and web pages could be ranked by the number of links pointing to them.. ;)

This 45 message thread spans 2 pages.