Forum Moderators: open
Yes, I'm looking for a "complete guide"... anyone have a good link to one?
I've been reading and searching, but I only find bits and pieces, and then one article contradicts another.
To be specific in what I want to know:
1. I don't want certain directories and the sub pages in a search engine's database.
2. How do SE's treat .PHP pages?
If a .php file has NO html code, will a SE index it?
If a .php file has code that dynamically creates HTML code, will a SE index it?
3. I don't want to exclude directories via robots.txt file because the robots.txt file TELLS a "hacker" exactly what I'm trying to keep secure! (Am I paranoid?)
4. What exactly does htaccess prevent or not prevent when it comes to spiders and bots?
One thing I believe I have found is:
If a web page has a login and needs a username/password, a spider, bot, etc. will not index that page ((( as long as there are no links going to the sub pages, and no link includes the username/password )))...
Is this correct?
Thanks for your help
1) Robots.txt is the easiest method for keeping SEs out of places they shouldn't go before they get there - you could also use .htaccess to enforce those rules.
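For reference, a minimal robots.txt sketch (the directory names here are placeholders, not from this thread):

```
# robots.txt must sit at the root of the domain.
# Compliant robots fetch it before crawling anything else.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
```

Bear in mind robots.txt is advisory - well-behaved spiders honour it, but nothing enforces it; that's where .htaccess comes in.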
2) SE's see whatever gets output by the script so if PHP outputs HTML then the SE will see that HTML, and index it, unless you tell them otherwise.
I'm intrigued by your question about a script not producing HTML and SEs - if the script is an integral part of your site, then visiting it must do something! Nine times out of ten that something can be understood by the SE (i.e. redirects, access denied, etc.); equally, if the script is not designed for stand-alone access (e.g. include files) then it should be excluded so that SEs won't request it. SEs may recognise the .php extension and note that the page is generated by a scripting language, but aside from that they will treat it like a normal page.
The only thing to be aware of is how SEs treat dynamic pages that use lots of parameters (e.g. /catalogue.php?a=1&b=2&c=3&d=4) - they might limit their crawling earlier than you expect. If this would affect you, there are a lot of really good solutions on these forums for getting crawled when you have lots of parameters, so I'm not going to go there...
3) Are you being overly paranoid about robots.txt? Possibly... but there are three solutions to your worry;
My 2c says that "hackers" looking for vulnerable scripts wouldn't bother targeting your site uniquely unless you are a big-name site or you have something they really want. Most would just use a search engine to find a pre-packaged script with a known vulnerability and exploit that - it's much easier and guarantees them a high success rate in a short timespan!
4) Sadly .htaccess isn't my area of expertise, but whenever I've dabbled I've been amazed by the tricks it can do - I seem to remember that it can apply things like denies and redirects by a variety of criteria, including IP addresses and user-agents, to name but a few...
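As a rough illustration of those tricks (this assumes an Apache server of that era with mod_rewrite available; the IP address and user-agent pattern are made up):

```apache
# Deny one IP address outright
Order Allow,Deny
Allow from all
Deny from 192.0.2.10

# Refuse requests from a particular user-agent
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F]
```

The [F] flag returns a 403 Forbidden, so the matched bot never sees any page content at all.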
...and finally
If a web page has a login and needs a username/password, a spider, bot, etc. will not index that page ((( as long as there are no links going to the sub pages, and no link includes the username/password )))...
That'll depend on the type of login you are talking about. If it's an .htaccess type login then there are potential reasons for SEs to avoid including "secured" pages in their index as they understand what is going on.
However if someone linked to an "insecure" page beneath the "secure" front page then I fail to see a reason why an SE wouldn't index it.
If it is a purely web-based login then SEs would have a harder time understanding its true purpose so it might end up in the index...
- Tony
If a .php file has NO html code, will a SE index it?
I am not quite sure why you would want to do that, unless you intend to deliver different content to the spider than you do to the surfer. I would STRONGLY advise against that approach, as it could get your domain banned for cloaking!
If a .php file has code that dynamically creates HTML code, will a SE index it?
Yes, subject to the concerns about CGI params in the URL (as already mentioned by Dreamquick). The HTML is generated at the server end, it looks like straight HTML to the spider. Same thing is true for ASP pages.
With regards to .htaccess, if you use it to password-protect a directory, then the spider can't index those pages unless you give it a way to reach them, like putting the username and password in a link. But if you don't, then your visitors will have to "login" to this area as well.
A possible solution to this problem would be to put the login information in a form that the spider can't follow, such as a JavaScript-generated link containing the username/password for the .htaccess-protected directory.
That way, the users don't have to bother with login, but the spider still can't get to the pages!
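A sketch of that idea (the URL and credentials are placeholders; the whole trick relies on spiders of the time not executing JavaScript - but note the credentials are still visible to anyone reading the page source):

```html
<script type="text/javascript">
// Spiders that don't run JavaScript never see this link;
// browsers build it and pass the .htaccess credentials in the URL.
document.write('<a href="http://user:pass@www.example.com/secure/">Members area</a>');
</script>
```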
You are right (nice going spotting this oddity, BTW) in as much as that's the way it is supposed to work, and if we were talking about smaller, more manageable datasets (i.e. enterprise engines) that's probably true, since everything can be crawled and catalogued.
However I would suspect that something like the GoogleBot spider needs to "second guess" the content of URLs quite often to determine which (if any) of the new links found on a page it will crawl without further investigation. This is where maintaining a list of common extensions and their projected content-types would come into play.
e.g.
Let's say that googlebot finds a link to a .exe file on a site - do you think it will bother to crawl that link? What about .zip?
Now, I don't have any .exe files, but I do have several .zip files as part of my site, and I can't find any log entries for November/December 2002 where GoogleBot has ever crawled them, despite the fact that it retrieved all the other items relating to one of the code samples on the site and had crawled around other pages over that period.
For all it knows they might be another HTML extension on my site, but since it doesn't even bother to make the request and learn the content-type, it would never spider them.
- Tony
I'm trying to process all of this... (not going so well :)
---
So, this is my situation:
1. website has this directory (public/secure/123)
2. "secure" is a directory along with its files (.html and .php pages) that I do NOT want a SE to find or index
3. Every .html page will require a user login screen
4. The content in the .html pages is valuable BUT I DON'T want SE to find them because it is only for members.
5. The .php pages contain no "DIRECT" html content. Meaning the php code will dynamically create the html code OR it may not. It may just contain business logic, database stuff, etc.
6. POINT BLANK... I do not want non-members or hackers to find this directory nor do I want SEs to find this directory or the pages in the directory.
a. adding a robots.txt will point people directly to the secure directory (not good)
b. all secure pages have login required ( should keep SEs out?)... that would be good!
c. no links to the secure pages from outside web pages (this is good)
d. no password/user names in the links (all database controlled)... that is good
e. the "secure" directory has htaccess user/password security... BUT NOT the secure/123 directory (2nd layer to keep out SEs, non-members, hackers)
So, are a-e correct?
thanks
Welcome to WebmasterWorld [webmasterworld.com]!
a. adding a robots.txt will point people directly to the secure directory (not good)
b. all secure pages have login required ( should keep SEs out?)... that would be good!
c. no links to the secure pages from outside web pages (this is good)
d. no password/user names in the links (all database controlled)... that is good
e. the "secure" directory has htaccess user/password security... BUT NOT the secure/123 directory (2nd layer to keep out SEs, non-members, hackers)
a - true
b - true
c - no links at all from any page that SEs are allowed to index!
d - true
e - normally, any subdirectory under a protected directory is also protected
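That inheritance is why a single .htaccess file in the protected directory is usually enough - a sketch, with hypothetical paths:

```apache
# /secure/.htaccess - applies to /secure/ and every
# subdirectory beneath it, including /secure/123/,
# unless a deeper .htaccess overrides it
AuthType Basic
AuthName "Members Only"
AuthUserFile /home/example/.htpasswd
Require valid-user
```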
"Security through obscurity" is not security. The only way to guarantee that any search engine spider won't find or index a page is to remove it from all web-accessible servers. Some don't follow the rules because of implementation problems, and a few spiders even break the rules intentionally.
That's a hard fact to swallow, but it's true. If a search engine finds a single link to one of your "private" pages, it may list that link, even if it doesn't index the page.
Also, the behaviour of search engines with respect to robots.txt and on-page <meta name="robots" content="noindex,nofollow"> tags is inconsistent. For example, Google and Ask Jeeves will list a link (but no title or description) in their results if they find that link anywhere, even if the linked-to page is disallowed in robots.txt and they do not index (load and analyze) that page.
Once you adjust to these hard truths, here are some suggestions which may help:
Dreamquick's suggestion is to protect a subdirectory - let's call it "protected" - and then place your content in a subdirectory of yourdomain.com/protected, so the path is yourdomain.com/protected/secure. In robots.txt, you disallow access to /protected, and therefore to any of the subdirectories below it - which need not be listed in robots.txt. One thing you can do is to name the "protected" subdirectory something that is as utterly uninteresting as possible to a potential intruder on your site.
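So the robots.txt only ever names the outer directory - something like this, with placeholder names:

```
User-agent: *
Disallow: /protected/

# /protected/secure/ is never mentioned, so the file
# reveals nothing about what actually lives below
```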
But then there remains the problem of Google and Ask Jeeves, who will list any links they find, regardless of whether they actually access the page. The only ways I've found to stop this annoying behaviour are these: first, "funnel" all access to "private pages" through a "doorway page" which you allow them to access, but which contains the on-page <meta name="robots" content="noindex,nofollow"> tag. These search engines - and the others - will then ignore the page and all pages it links to. The other way is to cloak the pages based on user-agents and IP addresses. Cloaking is not recommended unless you have the time to do it thoroughly and carefully. (In a case like this, we are cloaking with the intent to protect our property, rather than any intent to mislead search engines or visitors. Therefore we violate the word, but not the spirit, of search engines' anti-cloaking rules. But we still must be careful not to set off some cloaking-detector alarm the search engines might come up with, because they may not be capable of taking intent into account.)
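A minimal sketch of such a doorway page (title and path are placeholders):

```html
<html>
<head>
<title>Members Area</title>
<!-- Compliant engines will neither index this page
     nor follow the link below -->
<meta name="robots" content="noindex,nofollow">
</head>
<body>
<a href="/protected/secure/">Enter the members area</a>
</body>
</html>
```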
Bear in mind that links operate independently of site directory structure, that directory structure is what controls .htaccess password protection and what is used for control in robots.txt, and that file names and URLs need not have any relationship to each other (you can use Redirect and mod_rewrite on Apache, similar controls on other servers, and scripting tricks on all servers to map "purely-fictional" URLs to actual directory and file paths).
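On Apache, that URL-to-path mapping might look like this (a sketch with invented names, assuming mod_rewrite is enabled):

```apache
RewriteEngine On
# The public URL /articles/widgets.html never hints at the real path
RewriteRule ^articles/([a-z]+)\.html$ /protected/secure/$1.php [L]
```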
Using that fact and the technique above, you may be able to protect your content against script kiddies and amateur criminal hackers, and to keep the search engines from mentioning any links to that content, but is it truly secure? No, not as long as it's web-accessible... :(
HTH,
Jim
Your post answered my questions.
Now I get it. Can't believe I did not think of having 2 levels of directory and only list the FIRST one in the robots.txt.
I think that, along with my other "security" features... I can do what I need to.
Have a good one!
PS. Great :( --- Now I have to redo my directory structure and recode my logic for paths.... <smile>