Stopping people breadcrumbing via cpanel

Does it stop search engines spidering

Lame_Wolf

11:30 pm on Nov 26, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello all,
I am not an expert on cPanel or anything behind the scenes, so I am sorry if I sound stupid.

A while back I noticed Yahoo was listing results like this...
http://www.example.com/folder/

Then I found some sites linking to such pages.

Now Google is doing it too.

I didn't want this. I want people to find... http://www.example.com/folder/filname.html

Someone told me to block them in index manager, which I did.
If they now type http://www.example.com/folder/ it produces a 403 error, which is good for me if it is a visitor, but Webmaster Tools is now showing 4 http errors. These errors all end in a folder name and not a .html extension.

I only blocked the first level, i.e. http://www.example.com/blockedfolder/

WMT shows errors deeper than that, e.g. http://www.example.com/blockedfolder/folder/folder/

I do not want to stop Google etc. from caching what is inside those folders (mainly pics).

WMT doesn't tell me if they were errors from another site (possible), or if they happened whilst spidering my site (more likely).

Does anyone know if this affects spidering by search engines, or is it just a case of them dropping the folder URLs alone from the SERPs?

Thank you

wilderness

2:59 am on Nov 27, 2008 (gmt 0)




Perhaps the most effective improvement you could make for "honorable SEs" is to learn how to create a valid robots.txt, which is a "suggestion" to the bot about whether to enter or leave your site(s), and which folders or pages you do not wish the bots to view.

Of course this will NOT help you with the rogue bots.

This may help
[robotstxt.org...]
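For illustration, a minimal robots.txt along these lines sits at the site root (the folder name below is a placeholder, not one from this thread):

```
# http://www.example.com/robots.txt
# "private" is a hypothetical folder name; substitute your own.
User-agent: *
Disallow: /private/
```

Note that Disallow works on URL prefixes, so this keeps honorable bots away from everything beneath /private/ as well -- which is exactly why a Disallow would be the wrong tool if, as here, you still want the images inside the folders crawled.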

Samizdata

3:18 am on Nov 27, 2008 (gmt 0)




Firstly, I don't use cPanel and can't comment on its functions.

Yahoo Slurp has long had an annoying habit of requesting the directory root even when no links point to it. I have never seen Google do this, but I wouldn't be surprised if they did.

It's probably best to ensure that all directories allowed by robots.txt contain a default index file - even if it's just an empty one with a link to home. If you want people to find "filname.html" then you need to ensure that the actual filename is specified as a default option (this can be done in .htaccess).
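As a sketch of the .htaccess approach mentioned above, assuming an Apache server with overrides enabled (the filename is the one from this thread):

```apache
# .htaccess in the directory (or at the site root)
# Serve filname.html for a bare directory request, falling back to index.html.
DirectoryIndex filname.html index.html
```

With this in place, a request for http://www.example.com/folder/ serves the named file instead of a listing or a 403.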

I wouldn't say that serving a 403 to a human "type-in" was a good idea at all.

...

[edited by: Samizdata at 3:22 am (utc) on Nov. 27, 2008]

daveVk

5:43 am on Nov 27, 2008 (gmt 0)




Use "Index Manager" in cPanel to set "no index" on all directories; I think you need to do each directory individually. Directories containing an index.html file will normally not be listed.

Lame_Wolf

7:38 am on Nov 27, 2008 (gmt 0)




First, I'd like to say thank you for the replies. Secondly, my 403 is a customised one. It has the menu on the left etc.

My robots.txt file allows bots to access all the site.
My main concern is that Google etc. will not cache the images inside those folders. I think they will, but I wanted expert advice rather than waiting a few months and seeing all my images dropped from the search engines.

If it is just a case of the SEs listing a folder as a SERP rather than a .html page, then that's okay with me. I'd rather they didn't do it, but it's no big deal.

Lame_Wolf

7:48 am on Nov 27, 2008 (gmt 0)




Use "Index Manager" in cPanel to set "no index" on all directories; I think you need to do each directory individually. Directories containing an index.html file will normally not be listed.

I did use Index Manager in cPanel, and did set it to "no index".
The site has a number of folders. Some have folders within folders, and some are folders within folders within folders. So I only did it to each of the top-level folders.

Since mentioning it here, I made an index.html and placed it in one of the top-level folders. Now www.example.com/blockedfolder/ serves the new index.html file, and anything deeper than www.example.com/blockedfolder/ shows the customised 403 page.

What I am still not clear on is... will Google etc. still be able to cache the images / html pages inside those folders? I hope so.

I take it that when I see a SERP for www.example.com/blockedfolder/imagefolder/ it means the search engine thinks that URL is real, rather than breadcrumbed.
But that wouldn't stop the search engine listing www.example.com/blockedfolder/imagefolder/image.html

daveVk

11:44 am on Nov 27, 2008 (gmt 0)




Setting "no index" on a directory or subdirectory will stop a directory listing being produced. I take it you no longer get directory listings at any level, which is good.

Provided all pages still have links to them, and all images are displayed at least once, there should be no problem with content being found by the search engines. Indeed, your results may improve, as the remaining links have better context.

jdMorgan

4:19 am on Dec 2, 2008 (gmt 0)




To clarify this, be aware that search engines assign meaning only to URLs -- not pages, not directories, and not images. If you have links on your own site to things that don't exist, that's bad as far as the search engines are concerned. But they know full well that they should not and cannot 'blame' you for bad links to your site on other sites.

What I'm saying here is that returning a 403 for a request for example.com/images/ won't have any effect on search engines' treatment of example.com/images/image.jpg, because example.com/images and example.com/images/image.jpg are not the same URL.

I have many sites which do not have an index page in subdirectories -- The only index page is at root. And I have set the -Indexes Option in my .htaccess file(s) on these sites so that directory indexes cannot be viewed (this is a 'manual' method similar to using the cPanel controls, but more customizable).
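As a sketch of that 'manual' method, assuming Apache with the relevant overrides allowed (the 403 page path is a placeholder):

```apache
# .htaccess at the site root
# Disable auto-generated directory listings for this directory and below.
Options -Indexes
# Optionally serve a customised 403 page instead of the server default.
ErrorDocument 403 /custom403.html
```

A directory request then gets a 403 (or the custom page) while the files inside the directory remain directly requestable.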

As Samizdata said above, Yahoo's Slurp spider has been requesting non-existent and un-linked subdirectory index URLs --and consequently, eating 403s-- for several years. But that does not affect the spiderability or ranking of pages or images in those subdirectories at all.

robots.txt processing uses prefix-matching, which may be confusing to some readers here. But robots.txt processing and server-side access restrictions are two different things, and the rules are different.
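To illustrate the prefix-matching point with a hypothetical rule:

```
User-agent: *
Disallow: /images
# The rule is a raw URL prefix, so it matches all of:
#   /images
#   /images/
#   /images/image.jpg
#   /images-old/photo.jpg   (also matched, because it starts with "/images")
# By contrast, a server-side 403 on /images/ alone blocks none of the
# files beneath it -- each file URL is handled on its own.
```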

Jim