PHP Server Side Scripting Forum

    
Remove Special Characters
Completely remove, don't convert to HTML
ZakAltF4



 
Msg#: 4417500 posted 5:19 pm on Feb 14, 2012 (gmt 0)

I am searching for a way to STRIP special HTML characters from my string. From what I understand HTMLEntities will replace said characters with their HTML-safe counterpart. I am dynamically building meta tags from content within our CMS, and there are 16,000 products, so going through every one and removing special characters by hand is unreasonable, and replacing them with HTML entities in a meta tag is not good either. So is there a way to eradicate the following special characters that show up when I build my tag?



<meta name="description" content="...... feature a Sanitized® ....." />

<meta name="description" content="...... Flexible “wide-mouth” ........"/>




It's not a terrible problem, but I like things neat and tidy if you know what I mean ... Thanks!

-- Zak

 

penders




 
Msg#: 4417500 posted 6:17 pm on Feb 14, 2012 (gmt 0)

If you are building meta tags from DB content which already has these 'special' characters, then presumably your pages already use a suitable encoding, since you are displaying these characters OK on the page? Then I don't see why you would need to strip them. Although from your HTML dump it would seem you have a mismatch of encodings?

Aren't you in danger of messing up the text by simply stripping these characters?

...HTMLEntities will replace said characters with their HTML-safe counterpart.

Only if you specifically do this. Is there any harm if you do?

ZakAltF4



 
Msg#: 4417500 posted 8:42 pm on Feb 14, 2012 (gmt 0)

I am using the Magento platform. The way the encoding is displayed on the page is handled through Magento's built-in filters. We are building a meta CRON job to grab part of the "short" description and the product name for the meta description. The problem isn't "It's already encoded". The problem is that Magento already filters these for pagination. I don't wish to search through Magento's extremely large code base for their filters (and I probably wouldn't recognize them if I saw them anyway, since Magento uses its own codebase that is set up as classes and functions), so I need a straight PHP way to do this for the weekly CRON job.



Aren't you in danger of messing up the text by simply stripping these characters?

...HTMLEntities will replace said characters with their HTML-safe counterpart.

Only if you specifically do this. Is there any harm if you do?


Google doesn't recognize things like:
&copy; I am pretty sure ... Not any more than it would recognize “wide-mouth” ... Not to mention using valuable Meta space with a series of "HTML safe" encoding .. So there is no harm in filtering these just to create a Meta tag... Not from my standpoint. Also the original text in the database remains untouched, it's just used to generate this string.

-- Zak

httpwebwitch




 
Msg#: 4417500 posted 5:37 am on Feb 15, 2012 (gmt 0)

I recall there's a way to do this using iconv() with the //IGNORE option. The syntax is weird and the PHP documentation isn't so good; the best advice for using iconv() will be found by sifting through the blogosphere.
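
For anyone landing here later, a minimal sketch of the iconv() idea, assuming the input is UTF-8 and PHP 7+ (for the "\u{...}" escapes). The helper name strip_non_ascii() is made up for illustration. Some iconv builds raise a notice and return false for //IGNORE instead of skipping the character, and others substitute a placeholder, so the sketch falls back to a plain byte filter and makes no promise about the exact output.

```php
<?php
// Hypothetical helper (not from the thread): reduce UTF-8 text to 7-bit ASCII.
function strip_non_ascii(string $text): string
{
    // //IGNORE asks iconv to skip characters that have no ASCII equivalent.
    // The @ hides the notice that some iconv builds emit when they fail instead.
    $out = @iconv('UTF-8', 'ASCII//IGNORE', $text);
    if ($out === false) {
        // Fallback: remove every byte outside the ASCII range.
        $out = preg_replace('/[^\x00-\x7F]/', '', $text);
    }
    return $out;
}

// Sample string modelled on the meta description in the first post.
$meta = strip_non_ascii("feature a Sanitized\u{00AE} finish");
```

The original text is left untouched; only the copy used for the meta tag is filtered.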

penders




 
Msg#: 4417500 posted 8:13 am on Feb 15, 2012 (gmt 0)

Although from your HTML dump it would seem you have a mismatch of encodings?


Sorry, I was writing junk there! WebmasterWorld doesn't set a character encoding for pages - defaults to ISO-8859-1 for me (not UTF-8) - I see what 'special' characters you are talking about now!

4serendipity




 
Msg#: 4417500 posted 12:33 am on Feb 16, 2012 (gmt 0)

You could try using a combination of str_replace() and chr()




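// chr(147), chr(148) and chr(151) are the Windows-1252 curly quotes and em dash,
// so this assumes single-byte (Windows-1252) input, not UTF-8.
$original = array(chr(10),  // line feed
                  chr(13),  // carriage return
                  chr(147), // left curly double quote
                  chr(148), // right curly double quote
                  chr(151)); // em dash

// Replacements, in the same order as $original.
$fixed = array(" ",
               " ",
               '"',
               '"',
               '-');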

return str_replace($original, $fixed, $input);
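
If the strings are UTF-8, the curly quotes and dashes are multi-byte sequences, so single chr() values will not match them. A rough sketch of the same str_replace() approach for that case, assuming PHP 7+ for the "\u{...}" escapes; the function name plain_punctuation() and the exact character list are only illustrative.

```php
<?php
// Hypothetical helper: map common typographic characters (UTF-8) to plain ASCII.
function plain_punctuation(string $input): string
{
    $map = array(
        "\r"       => ' ',
        "\n"       => ' ',
        "\u{201C}" => '"',  // left curly double quote
        "\u{201D}" => '"',  // right curly double quote
        "\u{2018}" => "'",  // left curly single quote
        "\u{2019}" => "'",  // right curly single quote
        "\u{2013}" => '-',  // en dash
        "\u{2014}" => '-',  // em dash
        "\u{00AE}" => '',   // registered sign, dropped
        "\u{2122}" => '',   // trade mark sign, dropped
    );
    // str_replace() accepts parallel arrays of search and replacement strings.
    return str_replace(array_keys($map), array_values($map), $input);
}

$clean = plain_punctuation("Flexible \u{201C}wide-mouth\u{201D} Sanitized\u{00AE}");
```

Extend the map with whatever else turns up in the product descriptions.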

GiantSquid



 
Msg#: 4417500 posted 12:59 am on Feb 18, 2012 (gmt 0)

You could use something like this. If there are only a few characters you are concerned about, you could use a str_replace for each one e.g.

$metastring = str_replace("'", '', $metastring);
$metastring = str_replace("&", '', $metastring);


Otherwise use preg_replace() with a regex like this, which removes all characters except letters, numbers, whitespace, periods and hyphens:

$newmetastring = preg_replace('/[^A-Za-z0-9\s.\s-]/','',$metastring);

lucy24




 
Msg#: 4417500 posted 2:38 am on Feb 18, 2012 (gmt 0)

Unitame

ZakAltF4



 
Msg#: 4417500 posted 6:26 pm on Feb 20, 2012 (gmt 0)

Since it really doesn't matter all that much in the description ... I used the same ole

preg_replace('/[^A-Za-z0-9\s.\s-]/','',$meta_ds);

The use of commas and such in Meta is, in my opinion, moot anyway. Thanks All for your input!

penders




 
Msg#: 4417500 posted 6:50 pm on Feb 20, 2012 (gmt 0)

The use of commas and such in Meta is, in my opinion, moot anyway.


Well maybe, but the description is intended to be a human readable description since it's not used by the search engines for indexing. I quite like httpwebwitch's suggestion above to use iconv()... something like...

$plainDescription = iconv('UTF-8', 'ASCII//IGNORE', $description);

rocknbil




 
Msg#: 4417500 posted 4:48 pm on Feb 21, 2012 (gmt 0)

If you go with the regex, you may want to modify that a little. First, a dash inside a character class normally denotes a range (see your a-z, etc). PCRE treats a dash in the last position as a literal, but escaping it makes the intent explicit and keeps the pattern safe if more characters get appended later.

Same goes for the dot character. Outside a class, dot means "any character"; inside a class it is a literal dot, so the escape is optional, but it does no harm and removes any doubt for the next reader.
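
A quick way to check how the engine reads these, assuming the PCRE functions (preg_*) that ship with PHP. Inside a class the unescaped dot only matches a literal dot and a dash in last position is literal, so the escaped and unescaped patterns behave the same; escaping is about readability. The variable names are just for this sketch.

```php
<?php
// Inside a character class the dot is literal, not "any character".
$dotMatchesLetter = preg_match('/[.]/', 'a');   // 0
$dotMatchesDot    = preg_match('/[.]/', '.');   // 1

// A dash in the last position of a class is literal, not a range operator.
$dashMatchesDash  = preg_match('/[a-]/', '-');  // 1
$dashMatchesB     = preg_match('/[a-]/', 'b');  // 0

// Escaped and unescaped forms strip exactly the same characters.
$unescaped = preg_replace('/[^a-z.-]/i', '', 'a.b-c!');
$escaped   = preg_replace('/[^a-z\.\-]/i', '', 'a.b-c!');
```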

There's no real reason to eliminate commas, is there?

Though A-Za-z works, you can achieve the same with the case-insensitive modifier.

0-9 and \d are equivalent. The only real reason to use a digit range is if you want a specific range, like 1-5.

Don't know why you have whitespace \s twice; if you intended to indicate "a space followed by a dot followed by another space", that's a different regex outside the class.

All together, letting the commas live,

preg_replace('/[^a-z\d\s\.,\-]/i','',$meta_ds);
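
Wrapped up for the cron job, that might look something like the sketch below. The function name clean_meta_description() and the whitespace collapsing step are additions for illustration, not something from the thread; "\u{...}" needs PHP 7+.

```php
<?php
// Hypothetical helper name; the first pattern is the one given above.
function clean_meta_description(string $text): string
{
    // Remove everything except ASCII letters, digits, whitespace, dots, commas and hyphens.
    $text = preg_replace('/[^a-z\d\s\.,\-]/i', '', $text);
    // Collapse the whitespace runs left behind by removed characters, then trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$meta_ds = clean_meta_description("Flexible \u{201C}wide-mouth\u{201D} opening,\n2.5-inch (new!)");
```

Without the /u modifier the pattern works byte by byte, which is fine here because every byte of a multi-byte UTF-8 character falls outside the allowed set and gets removed.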


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved