Forum Moderators: open

Message Too Old, No Replies

Another regex question. removing pasted style elements

         

csdude55

1:56 am on Jul 12, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Just to make things more complicated... I'm now trying to remove any problematic STYLE code that a user can copy and paste from another site.

I have an array of the styles that I think can cause problems:

var cssArr = [
'display',
'background',
'position',
'white-space',
'width',
'height',
'min-width',
'min-height',
'padding',
'margin',
'position',
'top',
'bottom',
'left',
'right'
]


I'm not worried about font sizes, formats, or colors; I'm mainly worried about stuff that can stretch out my page, or move things around to where they shouldn't be.

So I guess first, do you guys see anything that I should add to this list?

Second is the regex itself. The only way I can think to do it is with 2 loops, which seems excessive. Like so:

var a = $('#comment').html();

var pattern;
for (var i = 0; i < cssArr.length; i++) {
pattern = RegExp("style=(\"|')(\w+;)*\s*" + cssArr[i] + "[^:>\1]*:\w+?[;\1]", 'gi');

while (pattern.text(a)
a = a.replace(pattern, 'style=$1$2');
}

// remove any leftover empty style="" tags
a = a.replace(/style=("|')\s*\1/gi, '');


What do you guys think, is this the only real way to accomplish this? If so, any problems with the code I typed up?

lucy24

4:22 am on Jul 12, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Honestly, I'd think twice about permitting style=blahblah at all. Delete it across the board, and that will just leave the basic HTML tags like <i> and <b> and <a href = "http://example.spam/">. Uhm. Lemme think this through again. This isn't a forum for design professionals, is it, where it's utterly crucial to say This particular content absolutely needs to be displayed in 13 point olive-green Avant Garde or my point cannot be made.

keyplyr

4:28 am on Jul 12, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Or you could just use a text editor an create a separate file sans the style, and offer that.

With all these jQuerry, Ajax & JS scripts that need to download for each new visitor, how fast is you site?
[developers.google.com...]

Don't forget, we're now in the Mobile-first Index where speed is important.

csdude55

7:05 am on Jul 12, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Honestly, I'd think twice about permitting style=blahblah at all. Delete it across the board, and that will just leave the basic HTML tags like <i> and <b> and <a href = "http://example.spam/">

Tempting! But I opened that particular Pandora's box about 5 years ago, and it's too late to turn back now :-( Remember that I'm working with a contenteditable section, and I've always allowed font families, font sizes, colors, bold, italics, and underline... as well as allowing the user to copy articles from other sites. So at the very least I have to allow for those styles.

Stripping everything except for those formats will become a bit of a challenge. From my experience, one browser might say <span style="color: black">, while another might say <font style="color: rgb(0, 0, 0);"> (for the exact same copy and paste). And one might say <p style="font-weight: bold"> where another says <div style="font-weight: 900">. And so on.

So while I like the idea of stripping all CSS and then put back the ones that I would approve, I'm really not quite sure how to do that.

With all these jQuerry, Ajax & JS scripts that need to download for each new visitor, how fast is you site?

Believe me, speed has been one of my biggest concerns! From my experience, every fraction of a second that I can shave off of load time results in more pages per session.

I just tested my homepage using Chrome... DOMContentLoaded: 1.88s, Load: 2.67s

BTW, there are two errors in the sample code I posted:

// I'm pretty sure this should be "new RegExp", not just "RegExp"
pattern = new RegExp("style=(\"|')(\w+;)*\s*" + cssArr[i] + "[^:>\1]*:\w+?[;\1]", 'gi');

// two typos here: I said "text" instead of "test", and left off the last closing )
while (pattern.test(a))

robzilla

9:15 am on Jul 12, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I love regular expressions, but still avoid them whenever I reasonably can. So, what if you (1) iterate over all children of #comment, (2) check if they have the style attribute, (3) maybe check if they have CSS attributes from cssArr, (4) then instead of using replace() you simply reset the value using css(), effectively removing the attribute from the element's (in-line) styling. Just a thought.

robzilla

12:26 pm on Jul 12, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



By the way, you have "position" in there twice, and "background" and "margin" are like shorthand attributes, you could also see "background-color", "margin-left", etc. While you could account for that in the regex, perhaps it's easier to have an allow list than a deny list?

lucy24

5:07 pm on Jul 12, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you permit anything in the form foo-bar? I notice for example that you list both min-width and min-height, but what about max-eachofthose? If your array itself--that is, the stuff the array is used for--permitted Regular Expressions you could categorically wipe anything in the form "\w+-\w+".

Hm, OK, "text-transform" is probably safe.

csdude55

1:26 am on Jul 13, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I love regular expressions, but still avoid them whenever I reasonably can.

I typically agree, they're traditionally slow and heavy to process :-(

So, what if you (1) iterate over all children of #comment, (2) check if they have the style attribute, (3) maybe check if they have CSS attributes from cssArr, (4) then instead of using replace() you simply reset the value using css(), effectively removing the attribute from the element's (in-line) styling.

I'm not quite sure how to do this, but it sounds like a possible plan. Do you mind throwing together some pseudo-code to give me an idea of what you're thinking?

Do you permit anything in the form foo-bar? I notice for example that you list both min-width and min-height, but what about max-eachofthose?

In theory, I guess I would allow anything that's not going to mess up the end page. For example, the reader is going to see a section like this:

<div style="width: 600px">
<div style="width: 100%; height: auto; border: 1px solid #000000; background: #FEFEFE">
This is what the first person submitted
</div>

<div style="width: 100%; height: auto; border: 1px solid #000000; background: #FEFEFE">
This is what the second person submitted<br><div style="font-weight: bold">Kewl.</div>
</div>
</div>


I want to prevent something like:

<div style="width: 600px">
<div style="width: 100%; height: auto; border: 1px solid #000000; background: #FEFEFE">
<div style="width: 2000px; margin-bottom: 1000px">This is what the first person submitted</div>
</div>

<div style="width: 100%; height: auto; border: 1px solid #000000; background: #FEFEFE">
<div style="position: absolute; top: -50px; right: 1000px">This is what the second person submitted<br><div style="height: 5000px; font-weight: bold">Kewl.</div>
</div>
</div>


So in theory, min-height and min-width would be a problem because they could stretch the container beyond the 600px. But max-height and max-width wouldn't be a problem because they wouldn't really change anything.

But I could still happily eliminate them to make the database slightly smaller :-) I guess for that matter, I should also try to find a way to see if a class or id is being used that's not defined in my stylesheet (which would imply that it's being pasted from someone else's site and not mine), and then remove those, too.

I found a somewhat complete list of CSS styles, but it doesn't include -moz and -o formats. I should probably include those, too, so that I don't inadvertently mess up someone's page on a specific browser. I was originally thinking about removing all of these except for the font-styles, colors... and maybe line-height.

align-content
align-items
align-self
all
animation
animation-delay
animation-direction
animation-duration
animation-fill-mode
animation-iteration-count
animation-name
animation-play-state
animation-timing-function
backface-visibility
background
background-attachment
background-blend-mode
background-clip
background-color
background-image
background-origin
background-position
background-repeat
background-size
border
border-bottom
border-bottom-color
border-bottom-left-radius
border-bottom-right-radius
border-bottom-style
border-bottom-width
border-collapse
border-color
border-image
border-image-outset
border-image-repeat
border-image-slice
border-image-source
border-image-width
border-left
border-left-color
border-left-style
border-left-width
border-radius
border-right
border-right-color
border-right-style
border-right-width
border-spacing
border-style
border-top
border-top-color
border-top-left-radius
border-top-right-radius
border-top-style
border-top-width
border-width
bottom
box-decoration-break
box-shadow
box-sizing
caption-side
caret-color
@charset
clear
clip
color
column-count
column-fill
column-gap
column-rule
column-rule-color
column-rule-style
column-rule-width
column-span
column-width
columns
content
counter-increment
counter-reset
cursor
direction
display
empty-cells
filter
flex
flex-basis
flex-direction
flex-flow
flex-grow
flex-shrink
flex-wrap
float
font
@font-face
font-family
font-kerning
font-size
font-size-adjust
font-stretch
font-style
font-variant
font-weight
grid
grid-area
grid-auto-columns
grid-auto-flow
grid-auto-rows
grid-column
grid-column-end
grid-column-gap
grid-column-start
grid-gap
grid-row
grid-row-end
grid-row-gap
grid-row-start
grid-template
grid-template-areas
grid-template-columns
grid-template-rows
hanging-punctuation
height
hyphens
@import
isolation
justify-content
@keyframes
left
letter-spacing
line-height
list-style
list-style-image
list-style-position
list-style-type
margin
margin-bottom
margin-left
margin-right
margin-top
max-height
max-width
@media
min-height
min-width
mix-blend-mode
object-fit
object-position
opacity
order
outline
outline-color
outline-offset
outline-style
outline-width
overflow
overflow-x
overflow-y
padding
padding-bottom
padding-left
padding-right
padding-top
page-break-after
page-break-before
page-break-inside
perspective
perspective-origin
pointer-events
position
quotes
resize
right
tab-size
table-layout
text-align
text-align-last
text-decoration
text-decoration-color
text-decoration-line
text-decoration-style
text-indent
text-justify
text-overflow
text-shadow
text-transform
top
transform
transform-origin
transform-style
transition
transition-delay
transition-duration
transition-property
transition-timing-function
unicode-bidi
user-select
vertical-align
visibility
white-space
width
word-break
word-spacing
word-wrap
z-index

csdude55

1:51 am on Jul 13, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Update, I found this on W3 that appears to be a regularly-updating page:

[w3.org...]

and with a JSON file:

[w3.org...]

But it still doesn't include -moz, -o, -ms, or -webkit properties so I'm not sure how to handle those.

In theory, though, one could import all of these properties in PHP (maybe use a cron to save it to a MySQL table once a month or something), set them to an array, exclude the ones that are acceptable, and then I guess use the two loops to remove the rest?

lucy24

2:12 am on Jul 13, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



but it doesn't include -moz and -o formats
I shouldn't think -o- is much of an issue: doesn't that go back to when Opera had its own rendering engine, whereas now it's all webkit? And -ms- is even more fun, since it requires MSIE to admit that we do this particular thing differently from all other browsers, and I don't really see that happening. In any case you would be perfectly justified in wiping any and all vendor extensions, i.e. stuff beginning in a hyphen, unless you have a lot of users on elderly browsers who don't recognize things like “column-count” but require “-moz-column-count” instead--and then you might need to add extensions so it displays as intended on everyone's browser, not just the person who posted it.

Which brings us to ... just how complicated are the things your users are posting? I realize we're not allowed to name names or get too specific, but it may be helpful to explain a little about what your users are doing and how it fits into the overall page. It sounds like you've got something above and beyond the ordinary discussion forum where any given style is either recognized or not recognized, and if you see that you tried something and it didn't work*, you just go back and edit the post until it works as intended.

If you've got people trying to shove 2000-px-wide elements into something that's only 600px wide, make sure your overall CSS has the appropriate “overflow” setting. (I remember seeing some pretty appalling things in early versions of php/pp when people posted grossly oversized images, or long uninterrupted text strings, but later versions figured out how to deal with it.)


* If <tt> doesn't work, try it over again with <kbd>, and if that doesn't work try <pre>, and so on down the line.

csdude55

2:59 am on Jul 13, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Which brings us to ... just how complicated are the things your users are posting?

Well, that's where it gets interesting.

I have 4 or 5 message boards that use contenteditable. One of them focuses on sports, and another (while more general) has a lot of political posts.

What happens is that people commonly go to other websites (local newspaper sites, CNN, ESPN, the list goes on) and copies either part of an article or the whole thing to paste in to my forum.

Obviously, some of these sites are ancient and have code from the 90s, while others are ultra-modern and have code that only the newest browsers would recognize. There's no rhyme or reason to it.

Worse, the user's browser often interprets copied code in its own way, so I could paste something in Chrome then paste in Firefox, and the code that comes through would be completely different. Every once in awhile I still see someone post something with class="Apple-style-span!

Here's an example of a post someone made earlier today, copied from a county website:

<div style="position: absolute; left: 357.02px; top: 151.43px; font-size: 30px; font-family: serif; transform: scaleX(1.06183);">POSITION VACANCY</div><div style="position: absolute; left: 150px; top: 232.508px; font-size: 26.6px; font-family: serif; transform: scaleX(1.03406);">Tax Administrator</div>


Luckily I've already been removing variations of position:, or this would have messed up the page. But the end result still had left and top, so even though they didn't do anything they're still taking up space and "could" have been safely removed.


If you've got people trying to shove 2000-px-wide elements into something that's only 600px wide, make sure your overall CSS has the appropriate “overflow” setting.


I do, but then the pasted article is hard to read when the user has to move back and forth. And they inevitably blame me and my "poor designing skills" for the page looking bad! lol So I would rather just strip that, too.

robzilla

10:23 am on Jul 13, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ironically, this message board can't handle code very well, so I'll just point to this fiddle [jsfiddle.net].

When you press the button, it strips all forbidden style attributes from the HTML contents of the textarea.

There may very well be more efficient ways to achieve this, but... no regular expressions.

Strictly speaking, you can't rely on this from a security perspective, of course, since anyone can still manipulate the POST data.

robzilla

12:33 pm on Jul 13, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Reviewing the code now, I see at least one mistake, so I suppose it's more of an illustration of what I meant rather than production-ready code. The mistake is that I now only look for child elements containing the style attribute when the parent element (<div> in this case) also has the style attribute. If you have something like...
<div><span style="position:absolute;">example</span></div>
...then that position attribute won't be removed. For some reason, I couldn't get it to work with a single find().each() or filter().each() loop, which is what I meant to say where the comment reads "Find all children of this element and repeat the process (couldn't get it to work)".

Shouldn't be too hard to fix, hopefully you can do that yourself if you end up using it ;-)

csdude55

7:36 pm on Jul 13, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's awesome, Rob, thanks! Did you write that just for me, or is it a template / cookie cutter? I didn't mean for you to go to all that trouble for me, I really appreciate that!

robzilla

10:58 pm on Jul 13, 2018 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



No worries, it was a fun little exercise :-)