SteveWh - 9:46 pm on Nov 2, 2010 (gmt 0)
I'm getting ready to take my first Perl-based web application live, and would like to know whether there are any security precautions I've overlooked. The server stack is Linux/Apache/PHP/Perl.
Its functionality is different, but the interface is similar to the W3C HTML validator's "by direct input" form: the user selects various options and then pastes a large block of text, which may be the source code of a web page, into an HTML textarea.
PHP then writes the text block (in its raw unscrubbed state) to a temporary file with a random name (that is never revealed to the user) in a directory that is blocked from web access (.htaccess: deny from all).
PHP then invokes my Perl script to scrub and perform its processing on the text that's in the temporary file:
// Each argument is quoted individually with escapeshellarg(); wrapping
// the assembled command in escapeshellcmd() as well could interfere
// with the quoting that escapeshellarg() adds.
$cmd = 'perl -wT ' . escapeshellarg("/path/to/myscript.pl") . ' ' .
    [various args, each passed through escapeshellarg(), including the name of the temporary file];
$perlresult = shell_exec($cmd);
The Perl script re-checks the option values and restricts them to legal ones. Then...
use Encode;          # character-encoding conversions (for from_to below)
use HTML::Scrubber;  # strips HTML tags
use HTML::Entities;  # converts HTML entities
# [...code omitted...]
# READ ALL INPUT INTO A SINGLE STRING
# SO THAT SUBSEQUENT SEARCHES FOR TAGS
# CAN SUCCEED EVEN IF OPENING AND CLOSING
# TAGS ARE ON DIFFERENT LINES.
my $intext = '';       # the entire text in one scalar string
while (<>) {           # THIS READS THE TEXT FROM THE TEMP FILE
    $intext .= lc($_); # lower case all
}
# ---- FILTER THE INPUT TEXT
# The next line's "round-trip" decode/encode is intended to turn any
# bytes that are illegal in cp1252 into legal ones, even if the
# resulting output is garbage. With FB_DEFAULT, undefined cp1252
# bytes are replaced by a substitution character; every other byte
# maps to itself, so legal input passes through unchanged.
Encode::from_to($intext, "cp1252", "cp1252", Encode::FB_DEFAULT);
# Must strip out any embedded PHP code *before* passing the text to
# Scrubber: Scrubber's processing does not strip PHP, but it does make
# the PHP tags subsequently unfindable, while preserving their
# potentially malicious contents in the text.
# (An unterminated "<?php" with no closing "?>" is left alone here;
# any stray "<" and ">" are dealt with further below.)
# TODO: also strip ASP and what other code?
$intext =~ s/<\?(?:php)?.*?\?>/ /sig;
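To illustrate what that substitution does and doesn't catch, here is a self-contained sketch (the sample strings are made up):

```perl
my $strip_php = sub {
    my ($text) = @_;
    $text =~ s/<\?(?:php)?.*?\?>/ /sg;
    return $text;
};

# A well-formed block is removed, even when it spans lines (/s flag):
print $strip_php->("before <?php evil();\n more(); ?> after"), "\n";
# An unterminated block is NOT removed; only its "<" would be touched
# by the later angle-bracket stripping:
print $strip_php->("before <?php evil(); after"), "\n";
# Note: the pattern also removes XML declarations like <?xml ... ?>,
# since the "php" part is optional.
```
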
my $scrubber = HTML::Scrubber->new;
$intext = $scrubber->scrub($intext);
# ---- DECODE ENTITIES.
# THE SCRIPT'S PROCESSING NEEDS THE ACTUAL CHARS, NOT "&#39;" ETC.
$intext = decode_entities($intext);
# CHANGE CONTROL (0-31) AND SPACE CHARS TO A SINGLE SPACE.
$intext =~ s/[[:space:][:cntrl:]]+/ /g;
# THIS ALTERNATIVE ALSO REMOVES ANY < AND > THAT SURVIVED SCRUBBING.
# IT MAKES THE SUBSTITUTION ABOVE REDUNDANT, AND I AM NOT SURE
# WHETHER THE EXTRA STRIPPING IS NECESSARY.
$intext =~ s/[[:space:][:cntrl:]<>]+/ /g;
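A quick self-contained check of the collapsing behavior (the sample string is made up):

```perl
# CR/LF/tab runs, control bytes, and stray angle brackets all collapse
# to single spaces; the readable words survive.
my $s = "line one\r\n\tline\x02two  <b>";
$s =~ s/[[:space:][:cntrl:]<>]+/ /g;
print "$s\n";   # -> "line one line two b "
```
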
# AT THIS POINT, $intext CONTAINS ONLY THE READABLE TEXT FROM THE
# WEB PAGE (IF THAT'S WHAT IT WAS), WITH ALL TAGS REMOVED.
In the remaining code, Perl processes the text and prints its summary output; PHP receives it and places the report on the page in an HTML textarea. Currently, PHP does not do any entity conversion; I haven't yet determined whether that's necessary for textarea use. PHP then deletes the temp file.
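One consideration on that entity question: if the report ever contains a literal "</textarea>", the browser ends the textarea element right there, so at minimum "&" and "<" probably need encoding before the report is embedded. A minimal hand-rolled sketch (the escape_for_textarea helper name is mine; encode_entities from HTML::Entities covers the same ground and more):

```perl
# Minimal escaping for text destined for an HTML textarea.
sub escape_for_textarea {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # must be first, or it re-escapes the others
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

print escape_for_textarea('</textarea><script>'), "\n";
# -> &lt;/textarea&gt;&lt;script&gt;
```
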
My main concern (at least the one I'm aware of) is whether it's possible for the unscrubbed text in the temporary file to contain any kind of exploit that could subvert or hijack the Perl <> operator while it reads the file, or subvert or corrupt HTML::Scrubber's processing of the text as it strips the tags.
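On the <> half of that question, my understanding from perlop is that the operator's only "magic" applies to the filenames in @ARGV (a name such as "somecmd|" is opened under 2-arg open rules and would run a command), not to the data being read; the file's contents come through as inert bytes. Since the temp-file name here is random and server-generated, the filename path shouldn't be reachable by an attacker. For what it's worth, the read loop could also be written as a one-step slurp; here's a standalone sketch (the in-memory filehandle and $sample string are just for demonstration):

```perl
my $sample = "First Line\nSECOND line\n";     # stands in for the temp file
open my $fh, '<', \$sample or die "open: $!"; # in-memory handle, demo only
my $intext = do { local $/; lc <$fh> };       # slurp the whole file, lower-cased
close $fh;
print $intext;
```

The 3-argument open used here never interprets its filename argument, which is another way to sidestep the magic-open issue entirely.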
Hopefully, if my code is any good, it can serve as an example for someone else. If it needs fixing, the corrections can help anyone who reads this thread. Thank you to anyone who's willing to look at it.