How does Google handle directories? How should it?

         

GilbertZ

10:30 pm on Nov 8, 2002 (gmt 0)



If there is a link from a page in G's index to:

domain.com/folder1/folder2/mypage.htm

Does Google automatically add domain.com/folder1/ and domain.com/folder1/folder2/ to the database to crawl?

Just wondering, because a web host can change its default directory-listing setting without warning. A directory that was invisible before suddenly becomes browsable, and potentially private content might get indexed...

So once we have the answer about how G handles it, the next question is: how do you guys think Google should handle it?

My vote, reluctantly, would be not to index those folders unless there was a direct link to them...

onionrep

10:36 pm on Nov 8, 2002 (gmt 0)



If there are links to the files within the directory and they offer no resistance, then Google will crawl it.

If you do not wish Google to have access to these files, then you should use .htaccess or a robots.txt file specifying the disallowed dirs.
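
For anyone who wants to check a rule before relying on it, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content and the is_allowed helper are made up for illustration, with /folder1/ borrowed from the example URL in this thread. It only shows how a rule would be read, not how Googlebot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; /folder1/ is the example directory from this thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /folder1/
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A file inside the disallowed directory, then a file outside it.
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://domain.com/folder1/folder2/mypage.htm"))
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://domain.com/index.htm"))
```

Keep in mind that robots.txt is only a request that well-behaved crawlers honor. Files that must stay private need real access control, such as password protection set up through .htaccess.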

GilbertZ

4:55 am on Nov 9, 2002 (gmt 0)



I was probably unclear..so I'll try again...

The link is to

[webmasterworld.com...]

So of course that file will be indexed...but will Google automatically look for an index.html file in

[webmasterworld.com...]

and

[webmasterworld.com...]

See what I mean? It was the "folder" "up" option in the toolbar that made me think about this.
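
To make the question concrete, here is a rough sketch of how the parent folders could be derived from a single link. The parent_directories function is hypothetical and assumes the URL points at a file rather than a folder. It says nothing about what Googlebot actually does, only which URLs are in question.

```python
from urllib.parse import urlsplit, urlunsplit

def parent_directories(url):
    """Return the directory URLs above the linked file, deepest first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    dirs = []
    # Skip the last segment (assumed to be the file) and walk up toward the root.
    for depth in range(len(segments) - 1, 0, -1):
        path = "/" + "/".join(segments[:depth]) + "/"
        dirs.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return dirs

for folder in parent_directories("http://domain.com/folder1/folder2/mypage.htm"):
    print(folder)
```

For the example link this lists the folder2 directory first and then folder1, which are exactly the two URLs I am asking about.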

dkoller

6:33 am on Nov 9, 2002 (gmt 0)

10+ Year Member



Gilbert,

Funny you should mention this. I was looking at my logs yesterday for a new site that Google just crawled.
It has some folders without index files, but other linked content within them, like your site does. Googlebot did eventually try to retrieve the index in my case (mysite.com/folder/), and got a 403 Forbidden. It did not, however, seem to slow down or 'annoy' Googlebot whatsoever. It continued to fetch pages within the folder normally after the 403.

(I'm going to add the missing index pages in for next month anyways, though =p)

Since there are no links directly to those folders that I know of, my assumption would be yes, Googlebot automatically checks folders it knows of as part of its crawling procedure.
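
If you want to look for the same thing in your own logs, something like the sketch below could pull out Googlebot requests for folder URLs along with the status code returned. It assumes the Apache combined log format. The googlebot_directory_hits helper and the sample line are made up for illustration and are not taken from my actual logs.

```python
import re

# Combined log format: host ident user [time] "request" status size "referer" "agent"
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_directory_hits(lines):
    """Return (path, status) pairs for Googlebot requests to folder URLs."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # Not in the assumed log format; skip it.
        path = m["path"]
        # Folder URLs end in "/"; the site root is left out as uninteresting.
        if "Googlebot" in m["agent"] and path.endswith("/") and path != "/":
            hits.append((path, int(m["status"])))
    return hits

# Made-up sample line, for illustration only.
sample = [
    '192.0.2.1 - - [08/Nov/2002:12:00:00 +0000] "GET /folder/ HTTP/1.0" '
    '403 289 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
]
print(googlebot_directory_hits(sample))
```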

GilbertZ

5:38 pm on Nov 9, 2002 (gmt 0)



That's amazing!

Very interesting...You would think a 403 would slow them down...

I'm not too worried for myself, we got domains on a bunch of different servers but the meat is on a dedicated server...

It did once happen to me on an unimportant domain: one day directory browsing was turned off, and the next day, without warning, they turned it on...and I thought that non-savvy users might not understand that their personal data may be visible in a search engine...

I suppose, legally speaking, if it's on the web in a non-password-protected directory, it is legal to index the content...having an orphan page is not enough to keep it private..

troels nybo nielsen

6:29 pm on Nov 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I recently had a small problem of a similar kind. A page of mine is called mysite.dk/folder1/folder2/index.htm. For the time being there is nothing else in these two folders.

My link from the front page to this page jumps over an empty level, so to speak. I had some unfinished pages on that level, with no links to them, but decided to remove them because of the risk of having them indexed.

T

PS: It seems that there are only new users in this thread. Welcome to WW all of you. (I hope I am not the first to write that?)

jomaxx

11:00 pm on Nov 9, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I once did a search on the name of a person I happened to be corresponding with, and found a .txt file containing his CREDIT CARD number, home address, etc., along with data for a couple of dozen other people. He had bought something online at some small business, and they left their data file unprotected in just this way.

GilbertZ

12:54 am on Nov 10, 2002 (gmt 0)



Yup...you said it..and that's exactly where I was going..

I was managing a project once where one of the most talented kids on my team was just starting out and still very much in the phase where he loved "hacking", always downloaded the latest warez, and boasted of his "extra-curricular" activities...I always made sure to encrypt all my data when on the same network as him ;)

When you're in this business in your 30s, sometimes it's hard to keep up with what the teenagers know...so to make sure they still respect you, you need to show them a trick or two. I had my group bet that I couldn't find people's credit cards on the net in clear text, using only AltaVista, within 15 minutes.

It only took me 5, and I found a stock site and a page of other sites with credit cards in clear text...

I emailed them all, telling them to protect their files...they were very grateful...but yes, your example really shows the privacy issues surrounding search engines..