Forum Moderators: goodroi


Do I need to disallow files that aren't linked to?

for example, 'included' PHP & javascript pages

         

sssweb

8:26 pm on Nov 30, 2006 (gmt 0)

10+ Year Member



Do I need to disallow robots from spidering pages that aren't directly linked to on any web pages?

For example:

php pages accessed via include statements

javascript files loaded via <script language="JavaScript" src="/script.js"></script> tags

stylesheets linked in the <head> tag.

Maybe the short question is: do search engines spider as though they're a typical user, visiting only the files explicitly linked in the displayed pages, or do they read your code and access every file they come across?

phranque

10:05 am on Dec 1, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



you should consider the robot as if it were a nongraphical browser.
when requesting a page, the php includes will be handled by the server and the results will be sent to the robot.
however, when most robots see a
<script language="JavaScript" src="/script.js"></script>

tag, they ignore the js.
images and other embedded objects are also ignored, since these typically involve visual rather than textual elements.
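to make the include point concrete, here's a rough sketch (the filenames are hypothetical): the server resolves the include before anything is sent, so the robot only ever receives the merged html.

```php
<?php
// page.php (hypothetical) -- the include is resolved on the server
include 'header.inc.php';  // merged server-side; the robot never requests this file
?>
<p>page content</p>
```

a robot fetching /page.php sees one finished html document; /header.inc.php never appears in the page source and is never requested as part of spidering that page.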

sssweb

5:13 pm on Dec 1, 2006 (gmt 0)

10+ Year Member



when requesting a page the php includes will be handled by the server and the results will be sent to the robot.

So you're saying I need to disallow those pages?

Also, what about style sheets:

<link rel="stylesheet" href="/style.css" type="text/css">

Just out of curiosity, you say robots ignore images; how do they turn up in a Google image search then? Or is that done by a special (separate) spider?

Demaestro

5:22 pm on Dec 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



For Google there is a separate bot that does the images cataloging.

I wouldn't worry about PHP files and such, but you may not want .doc or .pdf files indexed. Even if they aren't linked from your site, the bot may still find them if someone views them with the Google toolbar installed, or if someone else who knows the URL links to the document from their site.

If you don't want the files to end up in Google, then restrict the bot; if you don't care, don't bother.
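a sketch of what restricting the bot might look like in robots.txt (the paths here are hypothetical, and keep in mind robots.txt is advisory -- well-behaved bots honor it, but it doesn't actually secure the files):

```
User-agent: *
Disallow: /downloads/annual-report.pdf
Disallow: /docs/
```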

[edited by: Demaestro at 5:37 pm (utc) on Dec. 1, 2006]

jdMorgan

6:21 pm on Dec 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



An included file is part of the including page -- The server "includes" it to create the page, serves the completed page, and that finished page is what the spider sees.

If you don't change your pages based on the requesting User-agent, then an easy way to see what the spider sees is View->Page Source in any browser.

There is no need to Disallow these included files or their directories unless you have reason to suspect that some third party may know or find their URLs and link to them for some malicious reason. If you're in a competitive market segment, then Disallow your include files directory and be done with it. But I'd be more worried about how someone found a URL that could be successfully used to reach them in the first place. In other words, this is not a robots/search ranking problem, but rather a security problem.
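For what it's worth, Disallowing an include-files directory is a one-line rule (the directory name below is hypothetical -- use whatever path your includes actually live under):

```
User-agent: *
Disallow: /includes/
```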

Jim

sssweb

8:32 pm on Dec 1, 2006 (gmt 0)

10+ Year Member



Thanks guys. I'll assume style sheets don't need to be disallowed either (unless someone posts different).

Oh yeah, what about /cgi-bin/?

phranque

1:16 pm on Dec 3, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



i would think robots use the style sheets in some cases to determine which areas are visible.

goodroi

3:06 pm on Dec 4, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



as jdMorgan wisely said
There is no need to Disallow these included files or their directories unless you have reason to suspect that some third party may know or find their URLs and link to them for some malicious reason.

since i am a paranoid person i would not take a chance. i assume a competitor will eventually try to cause me trouble so i would block these files. by leaving files hanging out in the open you take a risk (albeit a very small risk).

phranque

3:34 pm on Dec 4, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month




included files are not the same as css files.
unless you are blocking unreferred access to the css file, anybody who works out its url can request it directly and browse that file.
a server side include file, on the other hand, can be included from a file path that is not web accessible at all.
a css file must necessarily be web accessible, since the browser requests it directly in the normal course of loading a page.
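a rough sketch of the distinction (all paths hypothetical): the include source can sit entirely outside the document root, so no url reaches it and robots.txt is irrelevant for it; the stylesheet has no such option.

```php
<?php
// the include source lives OUTSIDE the document root (hypothetical path),
// so there is no url that can reach it -- no Disallow needed
include '/home/example/private/functions.inc.php';

// a stylesheet, by contrast, must sit under the web root, e.g.
// /home/example/public_html/style.css -- reachable by anyone as /style.css
?>
```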