Remove Special Characters

Completely remove, don't convert to HTML

     
5:19 pm on Feb 14, 2012 (gmt 0)



I am searching for a way to STRIP special HTML characters from my string. From what I understand HTMLEntities will replace said characters with their HTML-safe counterpart. I am dynamically building meta tags from content within our CMS, and there are 16,000 products, so going through every one and removing special characters by hand is unreasonable, and replacing them with HTML entities in a meta tag is not good either. Is there a way to eradicate the following special characters that show up when I build my tag?



<meta name="description" content="...... feature a Sanitized® ....." />

<meta name="description" content="...... Flexible “wide-mouth” ........"/>




It's not a terrible problem, but I like things neat and tidy if you know what I mean ... Thanks!

-- Zak
6:17 pm on Feb 14, 2012 (gmt 0)

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



If you are building meta tags from DB content which already has these 'special' characters, then presumably your pages already use a suitable encoding, since you are displaying these characters OK on the page? In that case I don't see why you would need to strip them. Although from your HTML dump it would seem you have a mismatch of encodings?

Aren't you in danger of messing up the text by simply stripping these characters?

...HTMLEntities will replace said characters with their HTML-safe counterpart.

Only if you specifically do this. Is there any harm if you do?
8:42 pm on Feb 14, 2012 (gmt 0)



I am using the Magento platform. The way the encoding is displayed on the page is handled through Magento's built-in filters. We are building a meta CRON job to run and grab part of the "short" description and the product name for the meta description. The problem isn't "It's already encoded". The problem is that Magento already filters these for pagination. I don't wish to search through Magento's extremely large code base for their filters (and I probably wouldn't recognize them if I saw them anyway, since Magento uses its own codebase that is set up as classes and functions), so I need a straight PHP way to do this for the weekly CRON job.



Aren't you in danger of messing up the text by simply stripping these characters?

...HTMLEntities will replace said characters with their HTML-safe counterpart.

Only if you specifically do this. Is there any harm if you do?


I am pretty sure Google doesn't recognize things like &copy; any more than it would recognize “wide-mouth”. Not to mention that a series of "HTML safe" encodings uses up valuable meta space. So from my standpoint there is no harm in filtering these out just to create a meta tag. Also, the original text in the database remains untouched; it's only used to generate this string.

-- Zak
5:37 am on Feb 15, 2012 (gmt 0)

WebmasterWorld Administrator httpwebwitch is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I recall there's a way to do this using iconv() with the //IGNORE option. The syntax is weird and the PHP documentation isn't so good, so the best advice for using iconv() will be found by sifting through the blogosphere.
8:13 am on Feb 15, 2012 (gmt 0)

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



Although from your HTML dump it would seem you have a mismatch of encodings?


Sorry, I was writing junk there! WebmasterWorld doesn't set a character encoding for pages - defaults to ISO-8859-1 for me (not UTF-8) - I see what 'special' characters you are talking about now!
12:33 am on Feb 16, 2012 (gmt 0)

10+ Year Member



You could try using a combination of str_replace() and chr()




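// Wrapped in a function (name is illustrative) so the return is valid.
// chr(147), chr(148) and chr(151) are the Windows-1252 bytes for the
// curly double quotes and the em dash, so this assumes a Windows-1252 string.
function replace_special_chars($input)
{
    $original = array(chr(10), chr(13), chr(147), chr(148), chr(151));
    $fixed    = array(' ', ' ', '"', '"', '-');

    return str_replace($original, $fixed, $input);
}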
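
The chr() values above are single bytes, so they only match in a single-byte encoding such as Windows-1252. If the CMS text is UTF-8 (an assumption here, the thread never states the encoding), the same idea can be sketched with the multi-byte characters themselves. The helper name normalizePunctuation() is made up for illustration, and the "\u{...}" escapes need PHP 7.0 or later.

```php
<?php
// Hypothetical helper: map common "smart" punctuation to plain ASCII.
// Assumes $input is UTF-8 and PHP 7.0+ (for the "\u{...}" escapes).
function normalizePunctuation(string $input): string
{
    $map = [
        "\r\n"     => ' ',  // Windows line ending becomes one space
        "\n"       => ' ',
        "\r"       => ' ',
        "\u{201C}" => '"',  // left double quotation mark
        "\u{201D}" => '"',  // right double quotation mark
        "\u{2018}" => "'",  // left single quotation mark
        "\u{2019}" => "'",  // right single quotation mark
        "\u{2013}" => '-',  // en dash
        "\u{2014}" => '-',  // em dash
    ];

    // strtr() with an array tries the longest keys first,
    // so "\r\n" is replaced once rather than twice.
    return strtr($input, $map);
}

echo normalizePunctuation("Flexible \u{201C}wide-mouth\u{201D} bottle"), "\n";
```

Since this leaves straight double quotes in the text, the result would still need htmlspecialchars() before being written into a content="..." attribute.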
12:59 am on Feb 18, 2012 (gmt 0)



You could use something like this. If there are only a few characters you are concerned about, you could use a str_replace for each one e.g.

$metastring = str_replace("'", '', $metastring);
$metastring = str_replace("&", '', $metastring);


Otherwise, use preg_replace() with a regex like this, which removes all characters except letters, numbers, whitespace, periods and hyphens:

$newmetastring = preg_replace('/[^A-Za-z0-9\s.\s-]/','',$metastring);
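
One caveat with both snippets above: if the text still contains entities such as &copy; or &amp;, removing only the "&" (or the punctuation around it) leaves the entity name behind as ordinary letters. A possible way around that, sketched here under the assumption that the text is UTF-8, is to decode entities first and strip afterwards. The function name stripForMeta() and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper; the whitelist pattern is the one posted above.
// Assumes $text is UTF-8 and may contain HTML entities.
function stripForMeta(string $text): string
{
    // Turn entities such as &reg; or &amp; back into real characters first.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Remove everything except letters, digits, whitespace, periods and hyphens.
    $text = preg_replace('/[^A-Za-z0-9\s.\s-]/', '', $text);

    // Collapse the gaps left behind by the removed characters.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo stripForMeta('feature a Sanitized&reg; &amp; odor-free lining'), "\n";
```

The final whitespace pass is optional, but without it every removed symbol that sat between two spaces leaves a double space in the description.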
2:38 am on Feb 18, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



Unitame
6:26 pm on Feb 20, 2012 (gmt 0)



Since it really doesn't matter all that much in the description ... I used the same ole

preg_replace('/[^A-Za-z0-9\s.\s-]/','',$meta_ds);

The use of commas and such in Meta is, in my opinion, moot anyway. Thanks All for your input!
6:50 pm on Feb 20, 2012 (gmt 0)

WebmasterWorld Senior Member penders is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



The use of commas and such in Meta is, in my opinion, moot anyway.


Well maybe, but the description is intended to be a human readable description since it's not used by the search engines for indexing. I quite like httpwebwitch's suggestion above to use iconv()... something like...

$plainDescription = iconv('UTF-8', 'ASCII//IGNORE', $description);
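
How iconv() treats //IGNORE can differ between iconv implementations, and on some setups the call may return false (with a notice) instead of the stripped string. The following is only a sketch of one way to guard against that, assuming UTF-8 input; the wrapper name toAscii() and the byte-stripping fallback are additions for illustration, not part of the suggestion above.

```php
<?php
// Hypothetical wrapper around the iconv() call above; assumes UTF-8 input.
function toAscii(string $description): string
{
    // @ suppresses the notice some iconv builds raise for unconvertible input.
    $plain = function_exists('iconv')
        ? @iconv('UTF-8', 'ASCII//IGNORE', $description)
        : false;

    if ($plain === false) {
        // Fallback: drop every byte outside the 7-bit ASCII range.
        $plain = preg_replace('/[^\x00-\x7F]/', '', $description);
    }

    return $plain;
}

echo toAscii("Flexible \u{201C}wide-mouth\u{201D} bottle"), "\n";
```

Either path yields an ASCII-only string, though what replaces the dropped characters (nothing, or a placeholder) depends on the iconv build, so it is worth checking the output on the server that runs the CRON job.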
4:48 pm on Feb 21, 2012 (gmt 0)

WebmasterWorld Senior Member rocknbil is a WebmasterWorld Top Contributor of All Time 10+ Year Member



If you go with the regex, you may want to modify that a little. First, a dash inside a character class normally represents a range (see your a-z, etc). As the last character in the class it is taken literally, so yours works, but escaping it makes the intent obvious and keeps it safe if someone later appends more characters.

The dot is similar. Outside a class, dot means "any character," but within a class it is always a literal dot. Escaping it is harmless and saves the next reader from wondering.

There's no real reason to eliminate commas, is there?

Though A-Za-z works, you can achieve the same with the case-insensitive modifier.

0-9 and \d are equivalent. The only real reason to use a digit range is if you want a specific range, like 1-5.

Don't know why you have whitespace \s twice. If you intended to indicate "a space followed by a dot followed by another space," that's a different regex, written outside the class.

All together, letting the commas live,

preg_replace('/[^a-z\d\s\.,\-]/i','',$meta_ds);
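
As a quick check of the points above, this sketch runs the pattern with and without the backslashes on a made-up sample string (assumed to be UTF-8) and compares the results. The variable names other than $meta_ds are illustrative.

```php
<?php
// Sample text is made up for illustration; assumes UTF-8 input and PHP 7.0+.
$meta_ds = "Flexible \u{201C}wide-mouth\u{201D} bottle, 1.5 L \u{2013} BPA-free!";

// Pattern from the post above: escaped dot and hyphen, commas kept.
$escaped = preg_replace('/[^a-z\d\s\.,\-]/i', '', $meta_ds);

// Same class with the dot and the trailing hyphen left unescaped.
$unescaped = preg_replace('/[^a-z\d\s.,-]/i', '', $meta_ds);

// Both patterns remove the same characters.
var_dump($escaped === $unescaped);
echo $escaped, "\n";
```

Note that preg_replace() returns the new string rather than changing $meta_ds in place, so the result has to be assigned to something.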
 
