Forum Moderators: Robert Charlton & goodroi
I have a few more pages like that cached by Google - image directory, articles directory - pages with no links to them, pages and files I did not know existed...
Obviously Google is just crawling everything in my website's directory - pages and files and all.
To me - this is a privacy invasion.
Anyone else see this with their sites?
Lookout Googlebot is behind you ...
It sounds like you were using the Google Toolbar when your FTP log was pulled up on your end, and you don't have it disallowed in your robots.txt file.
When you use the Google Toolbar and have the PageRank bar enabled, it sends URL data to Google, so Google knows what URLs exist out there, even if they are not linked to anywhere. So you have to be careful about what links you pull up when the PageRank bar is enabled.
Disallowing in robots.txt may keep the Googlebot out but may attract other bots (the bad ones) and individuals to explicitly search for stuff you just tagged 'interesting' this way.
Kind regards,
R.
Now, I HAVE NEVER seen this file before I saw it on Google... There are no links, NO LINKS, pointing to this file.
Also, I decided to dig a little deeper today, and I found some crazy sh*t - supplemental parent directories cached, for example:
domainname.com/foldername/?C=D;O=A
domainname.com/foldername/?C=S;O=A
with the following snippet from Google's description of the pages (pagename and foldername are replacements):
LOG 15-Feb-2005 23:11 588 pagename.htm 15-Jul-2005 11:50 68K foldername/ 07-Jul-2005 11:05 -. Apache/2.0.46 (Red Hat) FrontPage/5.0.2.2634 mod_ssl/2.0.46 ...
Now, I am no programmer - I design the page with FrontPage and upload with WS_FTP.
And Google has crawled my image file:
domainname.com/images/
Disallowing all this in robots.txt would not work, since I have no idea what pages Google will crawl. I cannot disallow this: domainname.com/foldername/?C=S;O=A because that means I will block all access to that folder, and I cannot do that as it's part of my site. I want the pages crawled, but just the pages, no made-up parent directories.
Obviously I am doing something wrong, but for the life of me - I don't know what - the rest of my websites look just fine...
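On the worry about blocking the whole folder: robots.txt rules are matched as prefixes of the URL path, so a rule ending in `?` should only cover the `?C=...;O=...` listing URLs (these look like the column-sort links that an Apache directory listing generates). Here is a minimal Python sketch of that prefix rule, not how Googlebot is actually implemented. The `disallowed_prefixes` and `is_blocked` helpers and the sample rule are made up for illustration, and real crawlers layer Allow lines and wildcards on top.

```python
# Simplified sketch of robots.txt prefix matching: a URL path is blocked
# when it starts with any Disallow value. User-agent grouping, Allow lines
# and wildcards are deliberately ignored here.

# Hypothetical robots.txt body; "foldername" is the placeholder from the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /foldername/?
"""

def disallowed_prefixes(robots_txt):
    """Collect the non-empty Disallow values from a robots.txt body."""
    prefixes = []
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            prefixes.append(value.strip())
    return prefixes

def is_blocked(path, prefixes):
    """True when the path (query string included) starts with a blocked prefix."""
    return any(path.startswith(prefix) for prefix in prefixes)

PREFIXES = disallowed_prefixes(ROBOTS_TXT)

for path in ("/foldername/?C=S;O=A",
             "/foldername/?C=D;O=A",
             "/foldername/pagename.htm",
             "/foldername/"):
    print(path, "->", "blocked" if is_blocked(path, PREFIXES) else "crawlable")
```

Under this rule the sorted listing URLs are blocked while pagename.htm stays crawlable. The bare /foldername/ URL is not covered, so the plain directory listing would still need to be switched off on the server side.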
1) Click "Options" from the "Tools" menu at the top of the WS_FTP window.
2) Select the "Logging" tab.
You'll see checkboxes for "Enable Session Logging" and "Enable Transfer Log." On my system, both are set to the program's defaults of C:\Documents and Settings\All Users\Application Data\Ipswitch\WS_FTP\Logs; if you see something different, change the log locations or uncheck the checkboxes and see if that keeps the logs from uploading. (The logs don't belong on your server anyway.)
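To double-check that no logs are still sitting in the local site folder before the next upload, a quick scan along these lines might help. This is a rough Python sketch, assuming the stray logs end in `.log`; `find_log_files` is a hypothetical helper name and the `.` path is only a placeholder for the local copy of the site.

```python
import os

def find_log_files(site_root):
    """Return relative paths of *.log files (any letter case) under site_root."""
    hits = []
    for folder, _subfolders, files in os.walk(site_root):
        for name in files:
            # Compare in lower case so WS_FTP.LOG and ws_ftp.log both match.
            if name.lower().endswith(".log"):
                full_path = os.path.join(folder, name)
                hits.append(os.path.relpath(full_path, site_root))
    return sorted(hits)

if __name__ == "__main__":
    # "." is a placeholder; point this at the folder you upload from.
    for path in find_log_files("."):
        print("stray log:", path)
```

Anything it prints is a candidate to delete or move before uploading. It only looks at the local copy, so logs already on the server still have to be removed there.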
And the fact that Google is crawling it - that's what bothers me the most. Unless there is a link to it (which there is none) Google has no business crawling that "deep", especially when almost all of my pages still have the supplemental problem. The site ranks well though, maybe the number of links triggers this kind of behaviour?
Unless there is a link to it (which there is none) Google has no business crawling that "deep"
I'm pretty sure the Privacy Policy of the Google Toolbar [google.com] does give them the right to use the URL data you provide them in any way they see fit:
In addition, we use log information about aggregate Toolbar usage to improve the quality of Toolbar and other Google services.
Of course, you needed to have turned on advanced features for that to happen:
The Google Toolbar automatically sends only standard, limited information to Google, which may be retained in Google's server logs. It does not send any information about the web pages you visit (e.g., the URL), unless you use Toolbar's advanced features.
Added: My comments only apply in this case if the OP was using the Toolbar when viewing these pages, but are included for anyone with similar issues/concerns. :)
I design the page with FrontPage and upload with WS_FTP. Obviously I am doing something wrong, but for the life of me - I don't know what
Well, for starters, you are using FrontPage. :-)
I would suggest modifying your .htaccess, as it will give you more control than your robots.txt, and outsiders can't just request your .htaccess to see it. The suggestion to disallow *.log was an excellent one ... and you can go beyond that to only allow certain extensions to be indexed. That should limit the phony "string" URLs from showing up in the index.
However, if your .htaccess file gets too large, you will start to see your site's loading time increase. That is when you start transferring information over to your config file.
Best of luck to you getting a handle on this one. Googlebot misbehaving again...
CaboWabo
That's my understanding, since I've seen first hand that Google had to have gotten my URLs from referral logs. When I did searches, I eventually found a foreign referral log listing the unique URL in question.
I hear "it's your fault" or "you need to..." with various good technical information how to forcefully disallow Google to read information.
Is a web site like a property, where Google has an "easement" to anything on the property?
On my property (let's say a store), I do not have to put up a fence to have the right of no intrusion. It is inherently wrong to wander on my property without my consent.
More importantly, it is not the person entering my property that defines how I need to notify them.
So how does this work with web property? Do Google, and other information collecting systems inherently have the "right-of-way" to collect my web site?
I am always leery of opt-out solutions, because I am forced to act even though I never chose to participate, i.e. if I had zero interest in the first place, I still have to act. It feels like spam.
If you don't want certain stuff of yours to be indexed, you need to disallow it in your robots.txt file.
So it's there for all the freak stalkers who download the robots.txt to look at ... and then go exactly to these pages ...
What we need is a submittable robots.txt, not an open one ...
OK, you can play around with access control by IP addresses and referrers, it's still no use.
[edited by: mattg3 at 2:57 pm (utc) on July 17, 2006]
Be thankful the nice bots have procedures and policies that make it easier to restrict their access. Not all bots have such courtesy.
On my property (let's say a store), I do not have to put up a fence to have the right of no intrusion. It is inherently wrong to wander on my property without my consent.
Actually, in the U.S., it isn't wrong to wander until the property owner (or representative) tells you to leave either verbally or by posting a notice and you refuse to leave. Trespassing is only trespassing if, once notified, the trespasser fails to leave the property.
The web is public. Without restricting who can enter your site or what you post on your site, it is all available for anyone, human or robot, to download and process.
domainname.com/foldername/?C=D;O=A
domainname.com/foldername/?C=S;O=A
Would it work if I have empty index.htm pages, and then disallow them from the robots?
In your home, if you don't lock your doors, you can still expect a person to knock before entering.
On the web, there aren't doors unless you put them there. Even if you put up a door, there are no locks until you install them.
To continue the analogy, not linking to a page on your site is like putting something on your back porch. Sure, they have to walk around the back or drive down the alley to see that stuff, but it's still out in the open.
Not if you disallow a folder, then make sure the folder does not serve a directory index, and put files with un-guessable names in there, or maybe even put the files one folder deeper in an un-guessable sub-folder name.
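For the un-guessable names, one way to get them is a random generator rather than something made up by hand. A minimal Python sketch, assuming a disallowed parent folder called `/private/` and a file called `report.htm` (both hypothetical); `unguessable_name` is just an illustrative helper around the standard `secrets` module.

```python
import secrets

def unguessable_name(num_bytes=16):
    """Return a random hex string (two characters per byte) for use as a folder or file name."""
    return secrets.token_hex(num_bytes)

# Hypothetical layout: robots.txt disallows /private/ only, and the random
# sub-folder name itself never appears in robots.txt or in any link.
folder = unguessable_name()
print(f"/private/{folder}/report.htm")
```

The name only stays secret as long as the URL never ends up in a referral log or gets pulled up with the Toolbar's advanced features turned on.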
Not if you disallow a folder, then make sure the folder does not serve a directory index, and put files with un-guessable names in there, or maybe even put the files one folder deeper in an un-guessable sub-folder name.
Yes obviously there are workarounds ... it's still a solution from a time when the net was all fluffy and web pages could be ranked by the number of links pointing to them.. ;)