user input processing

         

Skier88

8:12 pm on May 18, 2010 (gmt 0)

10+ Year Member



I'm working on an interactive website with functionality similar to a forum or wiki in that the user can update pages only with plain text, but certain patterns are displayed with html. For example,
[ b ]text[ /b ]
(without the spaces inside the tags) is displayed bolded.

My current system does a simple search and replace. In the above example it would scan for all occurrences of
[ b ]
(without spaces) and replace them with <b>, then do the same for every other supported tag. The problem with this is that the user will sometimes write invalid markup, and I don't want that to translate into invalid HTML. Also, if I use a table layout I can't support tables, since a post of only
[/table]
would break the page layout.
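To make the problem concrete, here is a minimal sketch of the kind of search-and-replace converter described above. The function name naive_bbcode() and the tag map are hypothetical, and the htmlspecialchars() step is an assumption about how raw HTML is kept out of the page; neither comes from the actual site code.

```php
<?php
// Hypothetical search-and-replace converter, only to illustrate the problem.
function naive_bbcode(string $text): string
{
    // Assumed tag map; the real site presumably supports more tags.
    $map = [
        '[b]' => '<b>', '[/b]' => '</b>',
        '[i]' => '<i>', '[/i]' => '</i>',
    ];
    // Assumption: raw HTML in the post is escaped before the tags are converted.
    $safe = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
    return str_replace(array_keys($map), array_values($map), $safe);
}

echo naive_bbcode('[b]fine[/b]'), "\n"; // <b>fine</b>
echo naive_bbcode('oops[/b]'), "\n";    // oops</b>
```

The second call shows the failure: a stray closing tag goes straight into the output, because each tag is replaced without any knowledge of the others.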

Does anybody know how I can efficiently convert only valid markup into HTML? I have a feeling that a clever regex could do the trick, but I'm not nearly that good with regexes yet.

Thanks for reading.

[edited by: Skier88 at 8:32 pm (utc) on May 18, 2010]

Readie

8:17 pm on May 18, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I use the following two regular expressions for my "allowed HTML" function:

<open> and </closed>
/<([^\s\/>]+)([^>]+)?>(?m)(.*?)(?-m)<\/\\1>/is


<standalone>
/<\/?([^\s\/>]+)([^>]+)?>/is


I also run explode() on the second backreference, splitting on quotation marks, and count() the pieces to make sure the attributes contain an even number of quote marks, like so:
if (preg_match('/<([^\s\/>]+)([^>]+)?>(?m)(.*?)(?-m)<\/\\1>/is', $input, $out)) {
    $check = 0;
    // $allowed_html_closed is a pre-defined array of allowed tags generated from a database
    if (in_array($out[1], $allowed_html_closed)) {
        // Reject attributes we don't want to allow
        if (!preg_match('/(onclick|ondblclick|onmousedown|onmousemove|onmouseout|onmouseover|onmouseup|onkeydown|onkeypress|onkeyup|style)/is', $out[2])) {
            // An even number of quotation marks splits into an odd number of pieces
            if (count(explode('"', $out[2])) % 2) {
                $check = 1;
            }
        }
    }
    if ($check != 1) {
        // Remove the HTML but keep whatever was enclosed by it
        // (str_replace() returns the new string, so assign it back)
        $input = str_replace($out[0], $out[3], $input);
    }
    // Otherwise the HTML is allowed and $input is left unchanged
}
Note that I have a database table of "allowed HTML". I run the open/closed pattern before the standalone one, which is why the standalone pattern allows tags to start with </: it lets me remove invalid markup entirely.
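The snippet above checks only the first match that preg_match() finds. One way to apply the same checks to every paired tag is preg_replace_callback(). The following is a rough sketch under stated assumptions, not the actual "allowed HTML" function: the name filter_paired_html() is hypothetical, the attribute check is generalized to any on* handler plus style, and the (?m) toggles are dropped because they do not change how the dot matches.

```php
<?php
// Hypothetical helper: keeps allowed paired tags, unwraps everything else.
function filter_paired_html(string $input, array $allowedTags): string
{
    $pattern = '/<([^\s\/>]+)([^>]+)?>(.*?)<\/\1>/is';
    return preg_replace_callback(
        $pattern,
        function (array $m) use ($allowedTags): string {
            // Re-check anything nested inside this pair before deciding.
            $inner = filter_paired_html($m[3], $allowedTags);
            $ok = in_array(strtolower($m[1]), $allowedTags, true)
                // Assumption: block every on* event handler and style.
                && !preg_match('/\b(on\w+|style)\s*=/i', $m[2])
                // Attributes must contain an even number of quote marks.
                && substr_count($m[2], '"') % 2 === 0;
            return $ok ? "<{$m[1]}{$m[2]}>{$inner}</{$m[1]}>" : $inner;
        },
        $input
    );
}

echo filter_paired_html('<b onclick="x()">hi</b> <i>ok</i> <u>no</u>', ['b', 'i']), "\n";
```

This covers only the open/closed pass. Unpaired tags are untouched here and would still need the standalone pattern afterwards.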

rocknbil

1:10 am on May 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



1. Have a list somewhere of acceptable tags or tag equivalents. Example:

b
strong
em
i
etc. . . . .

2. When an instance is found, if it's not in your acceptable list, remove it. For acceptable ones, look for the closing element, if not found, remove them both.

3. Remove everything else that looks like an html/script/BBstyle code tag.

4. Complaints -> help files.
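The steps above could be sketched roughly like this for BBCode-style tags. The function name filter_tags() and the default tag list are hypothetical, and stripping raw HTML with a simple pattern is only one possible reading of step 3.

```php
<?php
// Hypothetical sketch of the whitelist approach for [tag]...[/tag] markup.
function filter_tags(string $text, array $allowed = ['b', 'strong', 'em', 'i']): string
{
    // Step 3 (first half): remove anything that looks like a raw HTML tag.
    $text = preg_replace('/<[^>]*>/', '', $text);

    // Steps 1 and 2: convert only whitelisted tags that have a closing partner.
    $names = implode('|', array_map('preg_quote', $allowed));
    $pattern = '/\[(' . $names . ')\](.*?)\[\/\1\]/is';
    do {
        $text = preg_replace($pattern, '<$1>$2</$1>', $text, -1, $n);
    } while ($n > 0);

    // Step 3 (second half): drop every bracket tag that is still left over.
    return preg_replace('/\[\/?[a-z][a-z0-9]*\]/i', '', $text);
}

echo filter_tags('[b]ok[/b] [i]open [u]x[/u]'), "\n"; // <b>ok</b> open x
```

Note that this sketch does not detect overlapping tags: an input such as [b][i]x[/b][/i] (each tag has a partner, but they are wrongly nested) is converted as-is, so the resulting HTML is still improperly nested.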

Alcoholico

8:09 pm on May 19, 2010 (gmt 0)

10+ Year Member



It's very difficult for a regex per se to produce clean code. In these cases I use tidy to fix the resulting html code after processing bbcode. Tidy will close improperly closed tags and correct other stuff. See tidy on php.net : [uk3.php.net...]
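A minimal sketch of that clean-up step, assuming the tidy extension is installed. The wrapper name repair_html_fragment() is hypothetical; tidy_repair_string() and the show-body-only and wrap options are part of tidy itself.

```php
<?php
// Hypothetical wrapper around tidy for repairing a converted post.
function repair_html_fragment(string $html): string
{
    if (!function_exists('tidy_repair_string')) {
        // Assumption for this sketch: without tidy, return the input unchanged.
        return $html;
    }
    $config = [
        'show-body-only' => true, // return just the fragment, not a full document
        'wrap'           => 0,    // do not re-wrap long lines
    ];
    $fixed = tidy_repair_string($html, $config, 'utf8');
    return $fixed === false ? $html : trim($fixed);
}

echo repair_html_fragment('<b>bold <i>both</b>'), "\n";
```

Because tidy repairs whatever it is given, it makes the output well-formed but does not decide which tags are allowed; that filtering still has to happen beforehand.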

Skier88

3:25 pm on May 21, 2010 (gmt 0)

10+ Year Member



Thanks for the replies. Alcoholico, I'll look into it, but it would be nice if I could figure this out without tidy.

rocknbil, thanks, but that won't work. I would need to check that the closing tag is the same as the opening (no overlapping tags), and also that there are no closing tags without opening tags. I wrote a similar program in python a while ago, so I could implement a stack and deal with it that way, but I was hoping there was a more efficient method.
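For reference, the stack idea could look something like the sketch below in PHP. It is a single pass over the tokens, so it should not be expensive. The function name bbcode_to_html() and the default tag list are hypothetical, and leaving invalid tags in place as literal text is just one possible policy.

```php
<?php
// Hypothetical stack-based converter: only properly nested pairs become HTML.
function bbcode_to_html(string $text, array $allowed = ['b', 'i', 'u']): string
{
    // Split into plain text and [tag] / [/tag] tokens, keeping the tokens.
    $tokens = preg_split('/(\[\/?[a-z]+\])/i', $text, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $out = [];   // output pieces, joined at the end
    $stack = []; // open tags: [name, index into $out, original token]
    foreach ($tokens as $tok) {
        if (preg_match('/^\[(\/?)([a-z]+)\]$/i', $tok, $m)
            && in_array(strtolower($m[2]), $allowed, true)) {
            $name = strtolower($m[2]);
            if ($m[1] === '') {
                // Opening tag: emit it and remember where it went.
                $stack[] = [$name, count($out), $tok];
                $out[] = "<$name>";
                continue;
            }
            if ($stack && end($stack)[0] === $name) {
                // Closing tag that matches the innermost open tag.
                array_pop($stack);
                $out[] = "</$name>";
                continue;
            }
            // Otherwise: stray or overlapping closing tag, kept as text below.
        }
        $out[] = htmlspecialchars($tok, ENT_QUOTES, 'UTF-8');
    }
    // Opening tags that were never closed revert to literal text.
    foreach ($stack as [$name, $i, $tok]) {
        $out[$i] = $tok;
    }
    return implode('', $out);
}

echo bbcode_to_html('[b]bold[/b] and [b][i]x[/b][/i]'), "\n";
```

A closing tag is accepted only when it matches the tag on top of the stack, which rules out both overlapping tags and closing tags without an opener.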

Readie, thank you for the code, especially the regular expressions. But I can't tell if it addresses the issues I raised about rocknbil's method. Also, quick question: why do you store acceptable tags in a database table? Wouldn't it be faster to hard-code them in an array?

[edited by: Skier88 at 3:28 pm (utc) on May 21, 2010]

Readie

4:26 pm on May 21, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have them in a database for several reasons:

- So I don't have to trawl through all my code every time I need to update the list.
- It's easier to change on the fly
- I can create an admin panel plug-in for someone else to be able to manage the list.