Forum Moderators: coopster


Cleaning up HTML code

And how to do it: attributes, values, etc.

         

BlackDex

10:22 am on Mar 5, 2005 (gmt 0)

10+ Year Member



Hello people,

Is there a way to tidy or clean up HTML code, for example by adding " around attribute values?
So, for instance:


<p class=MsoNormal><font size=3 face="Comic Sans MS"><span lang=NL
style='font-size:12.0pt;font-family:"Comic Sans MS"'>&nbsp;</span>

would become something like this:


<p class="MsoNormal"><font size="3" face="Comic Sans MS"><span lang="NL"
style="font-size:12.0pt;font-family: Comic Sans MS">&nbsp;</span>

This is because I need to make it easy for the people who want to update the site, which makes it harder for me :(.

Some other changes I can handle myself, but quoting the attribute values is getting a bit over my head.

Any help would be nice.
Thanks in advance.

dreamcatcher

11:20 am on Mar 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you want people to update the site, have you thought of looking into, or coding yourself a Content Management System (CMS)?

BlackDex

11:40 am on Mar 5, 2005 (gmt 0)

10+ Year Member



Well, we use a CMS at the moment.
But they don't know anything about HTML or whatever.
So they want to use Word and save it as HTML (not my idea, but okay).

This is possible, but that HTML is so non-standard that I want to change some of it.

Like the " that are missing from the attribute values, etc.

I tried to implement Tidy, but that isn't an option, because it changes too much of the HTML and the layout gets changed. So I want to at least make it more standard HTML by including those " and changing some other things.

But my problem is the " around the values: how do I add them around the values and within the tags? I know there should be a 'simple' regexp for that, but I can't figure it out (and I don't have a starting point either).

coopster

11:43 am on Mar 5, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Have a look at the open source project "htmlarea", you can find it on sourceforge. It has javascript that peels out the MS junk when a user pastes into your <textarea>. You could always retro-fit the routines into your current CMS.

BlackDex

11:47 am on Mar 5, 2005 (gmt 0)

10+ Year Member



@coopster: LOL, that is what I use now, and I had a hard time implementing it. And for some reason they can't control it or whatever. So this has to be the last option I can think of.

But if HTMLArea has an option to peel it out, maybe I can convert it to PHP and try to use that. As far as I can remember, though, it used specific JavaScript features, so I doubt that will be an option :( .

But it surely must be possible to parse that file/HTML through PHP.

If any of you have an example or a place to look, that would help me a lot :).

Thx..

ergophobe

9:48 pm on Mar 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Could you use a tags and attributes "whitelist" and strip everything that's not on your list? How much of the Word formatting are you required to preserve?

As for Tidy, are you using Tidy with the default settings? Have you played with tidylib and PHP?

The available functions and settings for Tidy differ depending on PHP version. You can find out which ones you have with

if (extension_loaded('tidy'))
{
    // List the Tidy functions this PHP build provides
    $x = get_extension_funcs('tidy');
    print_r($x);
}

Then you set tidy going:
tidy_parse_string($html);

Set all the options you want, one at a time:
tidy_setopt("output-xhtml", true);

Clean up the input:
tidy_clean_repair();

And assign it to a var:
$clean_html = tidy_get_output();

I don't think you can get Tidy to clean out everything, even with all the options for cleaning Word input, but it helps.
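
For readers on PHP 5 or later, where the Tidy functions take and return a tidy object, the same steps might look roughly like the sketch below. It is only an illustration under assumptions: tidyFragment() is a made-up helper name, the three options are standard HTML Tidy configuration names, and which options your build accepts depends on the libtidy version it was compiled against.

```php
<?php
// Sketch using the object-based Tidy functions (PHP 5 and later).
// tidyFragment() is a hypothetical helper name, not part of the extension.
function tidyFragment($html)
{
    if (!extension_loaded('tidy')) {
        return null; // Tidy is not available in this PHP build
    }

    $config = array(
        'output-xhtml'     => true,  // quote attribute values, close tags
        'show-body-only'   => true,  // return the fragment, not a whole document
        'drop-empty-paras' => false, // keep empty <p> elements
    );

    $tidy = tidy_parse_string($html, $config, 'utf8');
    if ($tidy === false) {
        return null; // parsing failed
    }

    tidy_clean_repair($tidy);
    return tidy_get_output($tidy);
}

$clean = tidyFragment('<p class=MsoNormal>Hello</p>');
```

Passing the options as an array to tidy_parse_string() keeps them in one place, which makes it easier to switch them on and off one at a time while watching what happens to the layout.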

BlackDex

11:05 pm on Mar 6, 2005 (gmt 0)

10+ Year Member



The only thing I need to have done is the following.

Change: <a href=http://www.test.info target=_blank>
To: <a href="http://www.test.info" target="_blank">

So add those " around the values. And I don't want to change, add, or remove tags the way Tidy adds and removes <span> etc. because it thinks they aren't supposed to be there. For some reason that changes the layout of the page, and that's not what I want to happen.

If someone has any idea how this can be done through Tidy, that's fine, but I tried several settings and it always changes the code/layout.

Don't get me wrong, my own sites are totally XHTML 1.x standard, but this is not my own site, and I don't need or want to maintain it every time. So this should be the next best thing.

But I'm going to try some stuff out; I'll post if it works.
But if anyone has an idea, that would be great :).

ironik

12:03 am on Mar 7, 2005 (gmt 0)

10+ Year Member



If it's only something simple you need, try this:

<?php
/**
 * Add " quote chars to html attribute values
 *
 * @param string $html html to parse
 * @param string $attr Attribute name
 * @return string The html with quotes added around the attribute's value
 */
function addAttributeQuotes($html, $attr)
{
    // The hyphen sits at the end of the character class so it is read
    // as a literal "-" and not as a range.
    $pattern = "/([\s]" . $attr . "=)([^\"][\w:;-]+[^\">])/i";
    return preg_replace($pattern, '$1"$2"', $html);
}

// Sample usage
$text = '<a href="http://whatever.com/" target=_blank>test link</a>';
echo addAttributeQuotes($text, 'target');
// prints <a href="http://whatever.com/" target="_blank">test link</a>
?>

It searches out attribute name/value pairs that don't have quotes and adds quotes to them. Beware though: if you have some plain text like target=this is a target, it'll produce some weird quoting results. This is because I haven't extended the regex pattern to test that the matches are contained within < and > tag delimiters.

So it's a very simple function; hopefully it'll cover your needs (I needed to make one myself in an attempt to make HTML XHTML 1.0 compliant).

ergophobe

8:27 am on Mar 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hmmm. I thought Tidy with the output-xhtml option set would actually quote attributes and close tags as required.

BlackDex

9:04 am on Mar 7, 2005 (gmt 0)

10+ Year Member



@ergophobe:
It does, but because it adds and/or removes some of the code (which parts I haven't figured out yet), it changes the layout of the page, and that's the problem :(.

It isn't a damn easy task, but I think I'm getting there :).
--

@ironik:
Thanks, I'm going to try it out :) and mold it to my usage.

BlackDex

4:27 pm on Mar 7, 2005 (gmt 0)

10+ Year Member



Well, I figured out why Tidy changes the layout.
It trims tags that Tidy thinks are empty, which are not empty.

And I can't find anything to fix this. No option or anything else to disable it.

ergophobe

7:39 pm on Mar 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Which tags does it trim? You can fix this for paragraph tags

[tidy.sourceforge.net...]

BlackDex

7:52 pm on Mar 7, 2005 (gmt 0)

10+ Year Member



As far as I can see, it is only the <span> tag.

And I tried that option also; it has no effect.

BlackDex

12:21 pm on Mar 9, 2005 (gmt 0)

10+ Year Member



Well, I have got the following working a little bit.


function addAttributeQuotes($html, $attr)
{
    $pattern = "/([\s]". $attr . "=)([^\"\'][\w\_\-\:\;]+[^\"\'>])/im";
    return preg_replace($pattern, '$1"$2"', $html);
}

function fixHTML($html)
{
    $attr_array = array ('class',
                         'lang',
                         'size',
                         'cellpadding',
                         'cellspacing',
                         'width',
                         'height',
                         'bgcolor',
                         'type',
                         );
    foreach ($attr_array as $attr)
    {
        $html = addAttributeQuotes($html, $attr);
    }
    return $html;
}

If, for example, I use this piece of HTML:


<p class=MsoNormal><font size=3 face="Comic Sans MS"><span lang=NL
style='font-size:12.0pt;font-family:"Comic Sans MS"'>&nbsp;</span></font></p>

It will change it to:


<p class="MsoNormal"><font size=3 face="Comic Sans MS"><span lang="NL
"style='font-size:12.0pt;font-family:"Comic Sans MS"'>&nbsp;</span></font></p>

Notice that the 'size' attribute doesn't get changed :(. This also happens for 'cellspacing', 'cellpadding', etc., so some of it works, but not everything :(.
What do I have wrong in that code that it will not work?

Thanks in advance
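
A likely explanation, offered as a sketch and not a definitive fix: the value part of the pattern is built from three pieces, one character that is not a quote, then one or more word characters, then one more character that is not a quote or >. That means a value needs at least three characters to match, so size=3 is skipped, and because the last piece also accepts whitespace, the line break after NL gets pulled inside the quotes. The variant below (addAttributeQuotesLoose() is a hypothetical name) simply takes every character up to the next whitespace, quote, or >. Like the original, it does not check that the match sits inside a tag.

```php
<?php
// Hypothetical variant of addAttributeQuotes(): quotes a bare value of any
// length and stops at the first whitespace, quote character or ">".
function addAttributeQuotesLoose($html, $attr)
{
    // Group 1: whitespace + attribute name + "="
    // Group 2: the unquoted value
    $pattern = '/(\s' . preg_quote($attr, '/') . '=)([^\s"\'>]+)/i';
    return preg_replace($pattern, '$1"$2"', $html);
}

echo addAttributeQuotesLoose('<font size=3 face="Comic Sans MS">', 'size');
// prints <font size="3" face="Comic Sans MS">
```

Values that are already quoted are left alone, because the value group cannot start with a quote character.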

BlackDex

9:11 am on Mar 11, 2005 (gmt 0)

10+ Year Member



Okay, I posted a question about this on the PHP newsgroup, and with some help I got this working :).


<?php
function tag_rep($tag)
{
return preg_replace('/(?<!\<)(\S+)\s*=\s*(?<![\'"])([^\s\'"]+)(?![\'"])/','\1="\2"',$tag);
}

$html="<p class=MsoNormal id=par><font size=3 face=\"Comic Sans
MS\"><span lang=NL style='font-size:12.0pt;font-family:\"Comic Sans
MS\"'><a
href=http://www.php.net/index.php>&nbsp;key=value&nbsp;</a></span></font></p>";

echo 'Normal HTML:<br><textarea cols="70" rows="10">';
echo $html;
echo "</textarea><br><br>";

$improved_html = preg_replace('/\<(.*)\>/Ueis','"<".tag_rep("\1").">"',$html);
echo 'Improved HTML:<br><textarea cols="70" rows="10">';
echo str_replace("\\'","'",$improved_html);
echo "</textarea>";
?>
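
The /e modifier used above was removed in PHP 7, so on current versions the same idea needs preg_replace_callback(). The following is a rough sketch under that assumption and not the newsgroup's solution: quoteBareAttributes() and fixTagAttributes() are made-up names, and the attribute pattern is a different one that steps over values that are already quoted. It assumes there is no < or > inside attribute values.

```php
<?php
// Sketch: quote bare attribute values, touching only text inside tags.
// quoteBareAttributes() and fixTagAttributes() are hypothetical names.

function quoteBareAttributes($tag)
{
    // Match name=value where the value is double-quoted, single-quoted or bare.
    // Only the bare alternative fills capture group 2.
    $attr = '/(\s+[\w:-]+)\s*=\s*(?:"[^"]*"|\'[^\']*\'|([^\s"\'>]+))/';

    return preg_replace_callback($attr, function ($m) {
        if (isset($m[2]) && $m[2] !== '') {
            return $m[1] . '="' . $m[2] . '"'; // bare value: add quotes
        }
        return $m[0];                          // already quoted: leave alone
    }, $tag);
}

function fixTagAttributes($html)
{
    // Opening tags only; text between tags is never passed to the callback.
    return preg_replace_callback('/<[a-z][^<>]*>/i', function ($m) {
        return quoteBareAttributes($m[0]);
    }, $html);
}

$html = '<p class=MsoNormal id=par><a href=http://www.php.net/index.php>key=value</a></p>';
echo fixTagAttributes($html);
// prints <p class="MsoNormal" id="par"><a href="http://www.php.net/index.php">key=value</a></p>
```

Because the callback receives the matched tag directly, the str_replace("\\'", "'", ...) step that /e made necessary is not needed here.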

ergophobe

4:32 pm on Mar 17, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for following up. I haven't tried it, but it looks handy.