
PHP Server Side Scripting Forum

    
Remove Special Characters
Completely remove, don't convert to HTML
ZakAltF4




msg:4417502
 5:19 pm on Feb 14, 2012 (gmt 0)

I am searching for a way to STRIP special HTML characters from my string. From what I understand HTMLEntities will replace said characters with their HTML-safe counterpart. I am dynamically building meta tags from content within our CMS, and there are 16,000 products, so going through every one and removing special characters by hand is unreasonable, and replacing them with HTML entities in a meta tag is not good either. So is there a way to eradicate the following special characters that show up when I build my tag?:



<meta name="description" content="...... feature a Sanitized® ....." />

<meta name="description" content="...... Flexible “wide-mouth” ........"/>




It's not a terrible problem, but I like things neat and tidy if you know what I mean ... Thanks!

-- Zak

 

penders




msg:4417542
 6:17 pm on Feb 14, 2012 (gmt 0)

If you are building meta tags from DB content which already has these 'special' characters then presumably your pages are already of a suitable encoding since you are presumably displaying these characters OK on the page? ...Then I don't see why you would need to strip these special characters? Although from your HTML dump it would seem you have a mismatch of encodings?

Aren't you in danger of messing up the text by simply stripping these characters?

...HTMLEntities will replace said characters with their HTML-safe counterpart.

Only if you specifically do this. Is there any harm if you do?

ZakAltF4




msg:4417597
 8:42 pm on Feb 14, 2012 (gmt 0)

I am using the Magento platform. The way that the encoding is displayed on the page is handled through Magento's built-in filters. We are building a Meta CRON job to run and grab part of the "short" description and the product name for the meta description ... The problem isn't "It's already encoded" .. The problem is that Magento already filters these for pagination. Since I don't wish to search through Magento's extremely large code base for their filters ... (and I probably wouldn't recognize them if I saw them anyway, since Magento uses its own codebase that is set up as classes and functions) - I need a straight PHP way to do this for the weekly CRON job.



Aren't you in danger of messing up the text by simply stripping these characters?

...HTMLEntities will replace said characters with their HTML-safe counterpart.

Only if you specifically do this. Is there any harm if you do?


Google doesn't recognize things like &copy;, I am pretty sure ... not any more than it would recognize “wide-mouth” ... Not to mention using valuable Meta space on a series of "HTML safe" encodings ... So there is no harm in filtering these just to create a Meta tag, not from my standpoint. Also, the original text in the database remains untouched; it's just used to generate this string.

-- Zak
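
For anyone following along, here is a rough sketch of the kind of plain-PHP helper described above: take text from the database, drop markup and entities, strip everything outside printable ASCII, and escape what is left for the attribute. The function name and the 155-character limit are assumptions made for illustration and are not part of Magento or of the original cron job.

```php
<?php
// Hypothetical helper: the name and the default length limit are assumptions.
function build_meta_description($text, $maxLength = 155)
{
    // Drop markup, then turn entities such as &copy; into real characters.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    // Treat line breaks and tabs as ordinary spaces.
    $text = str_replace(array("\r", "\n", "\t"), ' ', $text);
    // Remove every byte outside printable ASCII (drops the registered sign, curly quotes, etc.).
    $text = preg_replace('/[^\x20-\x7E]/', '', $text);
    // Collapse the runs of spaces left behind by the removals.
    $text = trim(preg_replace('/ {2,}/', ' ', $text));
    // Truncate, then escape for use inside an HTML attribute.
    return htmlspecialchars(substr($text, 0, $maxLength), ENT_QUOTES, 'UTF-8');
}

// "\xC2\xAE" is the UTF-8 byte sequence for the registered sign.
echo '<meta name="description" content="'
    . build_meta_description("feature a Sanitized\xC2\xAE <b>lining</b> &copy; 2012")
    . '" />', "\n";
```

The stored text is never modified; only the string built for the tag is filtered.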

httpwebwitch




msg:4417756
 5:37 am on Feb 15, 2012 (gmt 0)

I recall there's a way to do this using iconv() with the //IGNORE option. The syntax is weird and the PHP documentation isn't so good; the best advice for using iconv() will be found by sifting through the blogosphere.
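
A minimal sketch of the iconv() idea, assuming the input is valid UTF-8. The helper name is made up for this example. How //IGNORE behaves depends on the iconv implementation PHP was built against, and some builds return false instead of silently dropping characters, so the sketch falls back to a byte-level strip in that case.

```php
<?php
// Hypothetical helper name. //IGNORE behaviour differs between iconv
// implementations, so a failed conversion falls back to a regex strip.
function to_ascii($utf8)
{
    // @ suppresses the notice some builds raise for unconvertible characters.
    $ascii = @iconv('UTF-8', 'ASCII//IGNORE', $utf8);
    if ($ascii === false) {
        // Fallback: delete every byte outside the 7-bit ASCII range.
        $ascii = preg_replace('/[^\x00-\x7F]/', '', $utf8);
    }
    return $ascii;
}

// "\xE2\x80\x9C" and "\xE2\x80\x9D" are the UTF-8 bytes for curly double quotes.
echo to_ascii("Flexible \xE2\x80\x9Cwide-mouth\xE2\x80\x9D opening"), "\n";
```

Swapping //IGNORE for //TRANSLIT asks iconv to approximate characters rather than drop them, though the substitutions it picks are also platform-dependent.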

penders




msg:4417827
 8:13 am on Feb 15, 2012 (gmt 0)

Although from your HTML dump it would seem you have a mismatch of encodings?


Sorry, I was writing junk there! WebmasterWorld doesn't set a character encoding for pages - defaults to ISO-8859-1 for me (not UTF-8) - I see what 'special' characters you are talking about now!

4serendipity




msg:4418214
 12:33 am on Feb 16, 2012 (gmt 0)

You could try using a combination of str_replace() and chr()




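// Illustrative wrapper (the function name is not from the original post).
// Assumes a single-byte Windows-1252 string, where bytes 147, 148 and 151
// are the curly double quotes and the em dash.
function replace_special_chars($input)
{
    // Line feed, carriage return, left quote, right quote, em dash.
    $original = array(chr(10), chr(13), chr(147), chr(148), chr(151));
    $fixed = array(' ', ' ', '"', '"', '-');

    return str_replace($original, $fixed, $input);
}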
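
The chr() values above only line up with those characters in a single-byte Windows-1252 string. If the data is UTF-8, each of these characters is a multi-byte sequence, so a variant along the following lines may be closer to what is needed. This is a sketch with a made-up function name, and the list of characters is only an example.

```php
<?php
// Hypothetical helper: maps a few UTF-8 punctuation marks to ASCII.
function replace_utf8_punctuation($input)
{
    $map = array(
        "\r"           => ' ',
        "\n"           => ' ',
        "\xE2\x80\x9C" => '"',  // U+201C left double quotation mark
        "\xE2\x80\x9D" => '"',  // U+201D right double quotation mark
        "\xE2\x80\x94" => '-',  // U+2014 em dash
        "\xC2\xAE"     => '',   // U+00AE registered sign, removed entirely
    );
    // strtr() with an array replaces each key with its value in one pass.
    return strtr($input, $map);
}

echo replace_utf8_punctuation("Flexible \xE2\x80\x9Cwide-mouth\xE2\x80\x9D"), "\n";
// prints: Flexible "wide-mouth"
```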

GiantSquid




msg:4419012
 12:59 am on Feb 18, 2012 (gmt 0)

You could use something like this. If there are only a few characters you are concerned about, you could use a str_replace for each one e.g.

$metastring = str_replace("'", '', $metastring);
$metastring = str_replace("&", '', $metastring);


Otherwise use preg_replace with a regex like this, which removes all characters except letters, numbers, whitespace, periods and hyphens

$newmetastring = preg_replace('/[^A-Za-z0-9\s.\s-]/','',$metastring);

lucy24




msg:4419034
 2:38 am on Feb 18, 2012 (gmt 0)

Unitame

ZakAltF4




msg:4419633
 6:26 pm on Feb 20, 2012 (gmt 0)

Since it really doesn't matter all that much in the description ... I used the same ole

preg_replace('/[^A-Za-z0-9\s.\s-]/','',$meta_ds);

The use of commas and such in Meta is, in my opinion, moot anyway. Thanks all for your input!

penders




msg:4419646
 6:50 pm on Feb 20, 2012 (gmt 0)

The use of commas and such in Meta is, in my opinion, moot anyway.


Well maybe, but the description is intended to be a human readable description since it's not used by the search engines for indexing. I quite like httpwebwitch's suggestion above to use iconv()... something like...

$plainDescription = iconv('UTF-8', 'ASCII//IGNORE', $description);

rocknbil




msg:4420032
 4:48 pm on Feb 21, 2012 (gmt 0)

If you go with the regex, you may want to modify that a little. First, a dash in a character class normally represents a range (see your a-z, etc). A dash in last position is taken literally, but it is easy to break if the class is ever reordered. Escape it to be safe.

Same goes for the dot character. Outside a class, dot means "any character"; inside a class it is already a literal dot, but escaping it removes any doubt for the next person reading it.

There's no real reason to eliminate commas, is there?

Though A-Za-z works, you can achieve the same with the case-insensitive modifier.

0-9 and \d are equivalent. The only real reason to use a digit range is if you want a specific range, like 1-5.

Don't know why you have whitespace \s twice. If you intended to indicate "a space followed by a dot followed by another space", that's a different regex, written outside the class.

All together, letting the commas live,

preg_replace('/[^a-z\d\s\.,\-]/i','',$meta_ds);
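
A quick way to see how the two patterns differ is to run both on the same string. The sample text below is invented for the comparison. In PCRE a dot inside a character class and a hyphen at the end of one are already treated as literals, so the escapes are mostly there for readability; the visible difference in output is the comma.

```php
<?php
// Invented sample text, used only for this comparison.
$sample = "Flexible 'wide-mouth' top, 2.5 in. (new!)";

// Pattern used earlier in the thread: hyphen last, dot unescaped, commas not kept.
$loose = preg_replace('/[^A-Za-z0-9\s.\s-]/', '', $sample);
// Pattern just above: escaped dot and hyphen, comma allowed, case-insensitive.
$strict = preg_replace('/[^a-z\d\s\.,\-]/i', '', $sample);

echo $loose, "\n";   // Flexible wide-mouth top 2.5 in. new
echo $strict, "\n";  // Flexible wide-mouth top, 2.5 in. new
```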
