Special characters in remote XML breaking function

         

csdude55

12:16 am on Apr 8, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I use cURL to read several remote XML feeds, mostly from big news sites, then use a simple function to convert those feeds to an array. I've recently discovered that some of them do weird things to my system.

Here's an exact example of one that breaks:

$contents = <<<EOF
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href='/pb/resources/xsl/rss.xsl'?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/" xmlns:wp="http://www.washingtonpost.com/wp-namespace" version="2.0">
<channel>
<title>Politics</title>
<atom:link href="http://www.washingtonpost.com/pb/politics/?resType=rss" rel="self" type="application/rss+xml"/>
<link>http://www.washingtonpost.com/pb/politics/</link>
<description>Post Politics from The Washington Post is the source for political news headlines, in-depth politics coverage and political opinion, plus breaking news on the Obama administration and White House, Congress, the Supreme Court, elections and more.</description>
<language>en-US</language>
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>

<item>
<title>Bannon wants a war on Washington. Now he’s part of one inside the White House.</title>
<link>https://www.washingtonpost.com/politics/bannon-wants-a-war-on-washington-now-hes-part-of-one-inside-the-white-house/2017/04/06/ec4a135a-1ada-11e7-9887-1a5314b56a08_story.html</link>
<dc:creator>Ashley Parker</dc:creator>
<description>The escalating fight pits the self-described nationalist against Trump’s son -in-law, Jared Kushner. &lt;media:content medium="image" type="image/jpeg" url="https://img.washingtonpost.com/rf/image_90w/2010-2019/WashingtonPost/2017/03/31/National-Politics/Images/Botsford170331Trump13470.JPG" width="90"/&gt; &lt;media:content medium="image" type="image/jpeg" url="https://img.washingtonpost.com/rf/image_606w/2010-2019/WashingtonPost/2017/03/31/National-Politics/Images/Botsford170331Trump13470.JPG" width="606"/&gt; &lt;media:content medium="image" type="image/jpeg" url="https://img.washingtonpost.com/rf/image_1024w/2010-2019/WashingtonPost/2017/03/31/National-Politics/Images/Botsford170331Trump13470.JPG" width="1024"/&gt; </description>
<media:thumbnail url="https://img.washingtonpost.com/rf/image_606w/2010-2019/WashingtonPost/2017/03/31/National-Politics/Images/Botsford170331Trump13470.JPG" width="606"/>
<media:group>
<media:content medium="image" type="image/jpeg" url="https://img.washingtonpost.com/rf/image_90w/2010-2019/WashingtonPost/2017/03/31/National-Politics/Images/Botsford170331Trump13470.JPG" width="90"/>
<media:content medium="image" type="image/jpeg" url="https://img.washingtonpost.com/rf/image_606w/2010-2019/WashingtonPost/2017/03/31/National-Politics/Images/Botsford170331Trump13470.JPG" width="606"/>
<media:content medium="image" type="image/jpeg" url="https://img.washingtonpost.com/rf/image_1024w/2010-2019/WashingtonPost/2017/03/31/National-Politics/Images/Botsford170331Trump13470.JPG" width="1024"/>
</media:group>
<guid>https://www.washingtonpost.com/politics/bannon-wants-a-war-on-washington-now-hes-part-of-one-inside-the-white-house/2017/04/06/ec4a135a-1ada-11e7-9887-1a5314b56a08_story.html</guid>
<wp:uuid>ec4a135a-1ada-11e7-9887-1a5314b56a08</wp:uuid>
</item>

<item>
<title>At Mar-a-Lago, Trump welcomes Chinas Xi in first summit</title>
<link>https://www.washingtonpost.com/politics/at-mar-a-lago-trump-to-welcome-chinas-xi-for-high-stakes-inaugural-summit/2017/04/06/0235cdd0-1ac2-11e7-bcc2-7d1a0973e7b2_story.html</link>
<dc:creator><![CDATA[David Nakamura]]></dc:creator>
<description><![CDATA[The meeting at the presidents winter estate will be dominated by talks on North Korea, trade, officials said.]]></description>
<media:thumbnail url="https://img.washingtonpost.com/rf/image_606w/2010-2019/WashingtonPost/2017/04/06/National-Politics/Images/Trump_US_China_15990-3ee27.jpg" width="606"/>
<media:group>
<media:content medium="image" type="image/jpeg" url="https://img.washingtonpost.com/rf/image_90w/2010-2019/WashingtonPost/2017/04/06/National-Politics/Images/Trump_US_China_15990-3ee27.jpg" width="90"/>
<media:content medium="image" type="image/jpeg" url="https://img.washingtonpost.com/rf/image_606w/2010-2019/WashingtonPost/2017/04/06/National-Politics/Images/Trump_US_China_15990-3ee27.jpg" width="606"/>
<media:content medium="image" type="image/jpeg" url="https://img.washingtonpost.com/rf/image_1024w/2010-2019/WashingtonPost/2017/04/06/National-Politics/Images/Trump_US_China_15990-3ee27.jpg" width="1024"/>
</media:group>
<guid><![CDATA[https://www.washingtonpost.com/politics/at-mar-a-lago-trump-to-welcome-chinas-xi-for-high-stakes-inaugural-summit/2017/04/06/0235cdd0-1ac2-11e7-bcc2-7d1a0973e7b2_story.html]]></guid>
<wp:uuid><![CDATA[0235cdd0-1ac2-11e7-bcc2-7d1a0973e7b2]]></wp:uuid>
</item>
</channel>
</rss>
EOF;

echo $contents;
// Yah, so far so good

$rss = xml2array($contents);

print_r($rss);
// prints a blank array?

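// XML -> array
function xml2array($contents) {
    if (!$contents) return array();

    // simplexml_load_string() returns false when the XML is not well-formed
    $xml = simplexml_load_string($contents);
    if ($xml === false) return array();

    // Round-trip through JSON to turn the SimpleXML object into nested arrays
    $json = json_encode($xml);
    $xml_array = json_decode($json, TRUE);

    return $xml_array;
}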


After digging and digging and digging, it turned out that the problem here was the apostrophe in "China's" (in the second item's title). With that in place I just get an empty array returned, but when I remove it everything parses fine.

What's particularly weird is, in the feed, the exact same character in the first <item> shows up as he’s (instead of he's), so for some reason it's coming from the source weird.

And it's not just the '; a feed from Fox News included a link with .html#&_whatever and the & broke the function, returning blank. I changed it to &amp; and it worked fine... but what's weird is the exact same feed had a separate & in it that didn't cause a problem!

Very weird and frustrating.

But anyway, after a few days of hunting and pecking, I've finally found that this almost fixed it:

$contents = preg_replace('#<!\[CDATA\[(.+?)\]\]>#s', '$1', $contents);
$contents = htmlspecialchars($contents, ENT_IGNORE, 'UTF-8');

$rss = xml2array($contents);


I say "almost" because it also escaped the < and > of the tags themselves to &lt; and &gt;, so <item> became &lt;item&gt;. So I had to add:

$contents = preg_replace('#<!\[CDATA\[(.+?)\]\]>#s', '$1', $contents);
$contents = htmlspecialchars($contents, ENT_IGNORE, 'UTF-8');

$contents = str_replace('&lt;', '<', $contents);
$contents = str_replace('&gt;', '>', $contents);

//// Is this faster or better than two str_replace() commands?
// $contents = str_replace(array('&lt;', '&gt;'), array('<', '>'), $contents);

$rss = xml2array($contents);


But this is a concern because, as you can see, the first <item> intentionally used &lt; and &gt; where the second one did not, so I'm worried that this might break something that I haven't seen yet.

AND, there's the fact that it just removed the offending character (in this case a '). So in the Fox News example with an & in the link, that might cause the link to break.

I THINK that I could change ENT_IGNORE to ENT_SUBSTITUTE and wouldn't need the two str_replace() commands, but I'm still using PHP 5.3.x and that wasn't introduced until 5.4, so I can't try it yet.

So until I get to that point, do you guys know of a better way to fix this ongoing problem, or anything I should be doing to prevent future, as-of-yet unknown, problems with remote XML feeds?
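One way to see why simplexml_load_string() gives up, rather than guessing at the character, is to have libxml collect its parse errors. This is only a minimal sketch: xml2array_debug() is a made-up name for illustration, and it relies on libxml_use_internal_errors() and libxml_get_errors(), which should already be available on PHP 5.3.

```php
<?php
// Hypothetical helper (not from the original post): parse the XML and,
// if parsing fails, report what libxml complained about.
function xml2array_debug($contents) {
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $xml = simplexml_load_string($contents);

    $errors = array();
    if ($xml === false) {
        foreach (libxml_get_errors() as $error) {
            $errors[] = sprintf(
                'line %d, column %d: %s',
                $error->line,
                $error->column,
                trim($error->message)
            );
        }
    }

    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $data = ($xml === false) ? array() : json_decode(json_encode($xml), true);

    return array('data' => $data, 'errors' => $errors);
}

// A bare & (as in the Fox News link) is not well-formed XML.
$result = xml2array_debug('<rss><link>a.html#&_x</link></rss>');
print_r($result['errors']);
```

The line and column in each message point at the spot in the feed that the parser rejected, which may be quicker than deleting characters one at a time.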

keyplyr

12:33 am on Apr 8, 2017 (gmt 0)




Have you tried escaping the ' ? Example: China\\s
I'm still using PHP 5.3.x and that wasn't introduced until 5.4, so I can't try it yet
I invite you to strongly consider upgrading to PHP 5.6 or, better yet, PHP 7.0. Any version earlier than 5.6 is deprecated and gets no security support.

csdude55

1:07 am on Apr 8, 2017 (gmt 0)




Did you mean to escape after the ', too? Or was that a typo? I tried escaping before it (and converting it from a right- or left-accent to a regular single-quote) but that just ended up with an empty array, too.

Then, just now, I had this from Google News:

<item>
<title>Trump confronts the contradictions of his foreign policy rhetoric - Washington Post</title>
<link>http://news.google.com/news/url?sa=t&amp;fd=R&amp;ct2=us&amp;usg=AFQjCNHVD0Mmv0ZKMl4--9psCuv2sK1ASg&amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;cid=52779449477367&amp;ei=KDboWMDAKYzM3gHZrYHoCw&amp;url=https://www.washingtonpost.com/politics/trump-confronts-the-contradictions-of-his-foreign-policy-rhetoric/2017/04/07/c1a32dfe-1bc4-11e7-855e-4824bbb5d748_story.html</link>
<guid isPermaLink="false">tag:news.google.com,2005:cluster=52779449477367</guid>
<category>Top Stories</category>
<pubDate>Sat, 08 Apr 2017 00:09:14 GMT</pubDate>
<description>&lt;table border=&quot;0&quot; cellpadding=&quot;2&quot; cellspacing=&quot;7&quot; style=&quot;vertical-align:top;&quot;&gt;&lt;tr&gt;&lt;td width=&quot;80&quot; align=&quot;center&quot; valign=&quot;top&quot;&gt;&lt;font style=&quot;font-size:85%;font-family:arial,sans-serif&quot;&gt;&lt;a href=&quot;http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=us&amp;amp;usg=AFQjCNHVD0Mmv0ZKMl4--9psCuv2sK1ASg&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779449477367&amp;amp;ei=KDboWMDAKYzM3gHZrYHoCw&amp;amp;url=https://www.washingtonpost.com/politics/trump-confronts-the-contradictions-of-his-foreign-policy-rhetoric/2017/04/07/c1a32dfe-1bc4-11e7-855e-4824bbb5d748_story.html&quot;&gt;&lt;img src=&quot;//t0.gstatic.com/images?q=tbn:ANd9GcSE2FX3L6llZ5Skqf03IYv4UMvkvL1WUCJahyDywUJAZLtetigyLwQs5wynLNd0GPicAhfEcgwp&quot; alt=&quot;&quot; border=&quot;1&quot; width=&quot;80&quot; height=&quot;80&quot;&gt;&lt;br&gt;&lt;font size=&quot;-2&quot;&gt;Washington Post&lt;/font&gt;&lt;/a&gt;&lt;/font&gt;&lt;/td&gt;&lt;td valign=&quot;top&quot; class=&quot;j&quot;&gt;&lt;font style=&quot;font-size:85%;font-family:arial,sans-serif&quot;&gt;&lt;br&gt;&lt;div style=&quot;padding-top:0.8em;&quot;&gt;&lt;img alt=&quot;&quot; height=&quot;1&quot; width=&quot;1&quot;&gt;&lt;/div&gt;&lt;div class=&quot;lh&quot;&gt;&lt;a href=&quot;http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=us&amp;amp;usg=AFQjCNHVD0Mmv0ZKMl4--9psCuv2sK1ASg&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779449477367&amp;amp;ei=KDboWMDAKYzM3gHZrYHoCw&amp;amp;url=https://www.washingtonpost.com/politics/trump-confronts-the-contradictions-of-his-foreign-policy-rhetoric/2017/04/07/c1a32dfe-1bc4-11e7-855e-4824bbb5d748_story.html&quot;&gt;&lt;b&gt;Trump confronts the contradictions of his foreign policy rhetoric&lt;/b&gt;&lt;/a&gt;&lt;br&gt;&lt;font size=&quot;-1&quot;&gt;&lt;b&gt;&lt;font color=&quot;#6f6f6f&quot;&gt;Washington 
Post&lt;/font&gt;&lt;/b&gt;&lt;/font&gt;&lt;br&gt;&lt;font size=&quot;-1&quot;&gt;President Trump found himself in unfamiliar territory Friday, generally praised by members of the political and foreign policy establishments but attacked from some quarters of Trump nation for seeming to betray the “America First” pledges that carried &lt;b&gt;...&lt;/b&gt;&lt;/font&gt;&lt;br&gt;&lt;font size=&quot;-1&quot;&gt;&lt;a href=&quot;http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=us&amp;amp;usg=AFQjCNEn2tZ9HTnHOtA4m2-mF3m2iIzK3A&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779449477367&amp;amp;ei=KDboWMDAKYzM3gHZrYHoCw&amp;amp;url=https://www.nytimes.com/2017/04/07/us/politics/syria-bombing-republicans-trump.html&quot;&gt;GOP Lawmakers, Once Skeptical of Obama Plan to Hit Syria, Back Trump&lt;/a&gt;&lt;font size=&quot;-1&quot; color=&quot;#6f6f6f&quot;&gt;&lt;nobr&gt;New York Times&lt;/nobr&gt;&lt;/font&gt;&lt;/font&gt;&lt;br&gt;&lt;font size=&quot;-1&quot;&gt;&lt;a href=&quot;http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=us&amp;amp;usg=AFQjCNHHWWBs7v-JtysRr6W89XdMiXvxgw&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779449477367&amp;amp;ei=KDboWMDAKYzM3gHZrYHoCw&amp;amp;url=http://www.reuters.com/article/us-mideast-crisis-syria-idUSKBN1782S0?il%3D0&quot;&gt;Russia warns of serious consequences from US strike in Syria&lt;/a&gt;&lt;font size=&quot;-1&quot; color=&quot;#6f6f6f&quot;&gt;&lt;nobr&gt;Reuters&lt;/nobr&gt;&lt;/font&gt;&lt;/font&gt;&lt;br&gt;&lt;font size=&quot;-1&quot;&gt;&lt;a href=&quot;http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=us&amp;amp;usg=AFQjCNGrX8kv5-VxMR2cgPlX-H9sM-0W6w&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779449477367&amp;amp;ei=KDboWMDAKYzM3gHZrYHoCw&amp;amp;url=http://www.politico.com/story/2017/04/trump-syria-strikes-debate-237025&quot;&gt;Inside Trump&amp;#39;s three days of debate on Syria&lt;/a&gt;&lt;font size=&quot;-1&quot; 
color=&quot;#6f6f6f&quot;&gt;&lt;nobr&gt;Politico&lt;/nobr&gt;&lt;/font&gt;&lt;/font&gt;&lt;br&gt;&lt;font size=&quot;-1&quot; class=&quot;p&quot;&gt;&lt;a href=&quot;http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=us&amp;amp;usg=AFQjCNHirqwm4oRjERX3T6sqIIM9nclc3g&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779449477367&amp;amp;ei=KDboWMDAKYzM3gHZrYHoCw&amp;amp;url=http://www.foxnews.com/us/2017/04/07/ghastly-mages-syrian-attack-led-to-trump-about-face.html&quot;&gt;&lt;nobr&gt;Fox News&lt;/nobr&gt;&lt;/a&gt;&amp;nbsp;-&lt;a href=&quot;http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=us&amp;amp;usg=AFQjCNF6l61PY93HX1-LuWqL7sP0n2eIuw&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779449477367&amp;amp;ei=KDboWMDAKYzM3gHZrYHoCw&amp;amp;url=http://www.huffingtonpost.com/entry/syria-assad-supporters_us_58e7ff42e4b05413bfe316df&quot;&gt;&lt;nobr&gt;Huffington Post&lt;/nobr&gt;&lt;/a&gt;&amp;nbsp;-&lt;a href=&quot;http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=us&amp;amp;usg=AFQjCNEJuRv6-lxPgwcU1Ip0x99SB3Xy9w&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779449477367&amp;amp;ei=KDboWMDAKYzM3gHZrYHoCw&amp;amp;url=http://www.nydailynews.com/news/national/assad-u-s-missile-strike-fuel-syria-civil-war-article-1.3029925&quot;&gt;&lt;nobr&gt;New York Daily News&lt;/nobr&gt;&lt;/a&gt;&amp;nbsp;-&lt;a href=&quot;http://news.google.com/news/url?sa=t&amp;amp;fd=R&amp;amp;ct2=us&amp;amp;usg=AFQjCNFWRWxiqiySOcf4fg0M2bB2LtB81A&amp;amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;amp;cid=52779449477367&amp;amp;ei=KDboWMDAKYzM3gHZrYHoCw&amp;amp;url=http://www.latimes.com/politics/la-fg-pol-syria-analysis-20170407-story.html&quot;&gt;&lt;nobr&gt;Los Angeles Times&lt;/nobr&gt;&lt;/a&gt;&lt;/font&gt;&lt;br&gt;&lt;font class=&quot;p&quot; size=&quot;-1&quot;&gt;&lt;a class=&quot;p&quot; 
href=&quot;http://news.google.com/news/more?ncl=d1izKv1bgpL3uTM_0aja7DTaWKFHM&amp;amp;authuser=0&amp;amp;ned=us&amp;amp;topic=h&quot;&gt;&lt;nobr&gt;&lt;b&gt;all 10,241 news articles&amp;nbsp;&amp;raquo;&lt;/b&gt;&lt;/nobr&gt;&lt;/a&gt;&lt;/font&gt;&lt;/div&gt;&lt;/font&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;</description>
</item>


That's the raw feed, not manipulated on my end. Using htmlspecialchars(), though, I end up with a mess like:

<description>&amp;lt;table border=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot; cellspacing=&amp;quot;7&amp;quot; style=&amp;quot;vertical-align:top;&amp;quot;&amp;gt;&amp;lt;tr&amp;gt;&amp;lt;td width=&amp;quot;80&amp;quot; align=&amp;quot;center&amp;quot; valign=&amp;quot;top&amp;quot;&amp;gt;&amp;lt;font style=&amp;quot;font-size:85%;...


So &lt; becomes &amp;lt;, and so on. I thought htmlspecialchars() was smart enough to recognize that in advance, but I guess not.
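For what it's worth, htmlspecialchars() does take a fourth argument, double_encode, that tells it to leave existing entities alone. A small sketch of the difference, using a made-up sample string; this would not by itself make a feed well-formed, it only avoids the &amp;lt; doubling:

```php
<?php
// Sample string with an existing entity pair, a bare ampersand and a raw tag.
$raw = '&lt;b&gt; Q&A <i>';

// Default behaviour: every & is escaped, so existing entities get doubled.
$doubled = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
echo $doubled, "\n";

// With double_encode set to false, entities that are already there are kept.
$single = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8', false);
echo $single, "\n";
```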

Untested, but maybe?

$contents = preg_replace('#<!\[CDATA\[(.+?)\]\]>#s', '$1', $contents);

$contents = htmlspecialchars_decode($contents);
$contents = htmlspecialchars($contents, ENT_IGNORE, 'UTF-8');

$contents = str_replace('&lt;', '<', $contents);
$contents = str_replace('&gt;', '>', $contents);


The whole thing has just become a huge pain >:-(


I invite you to strongly consider upgrading to PHP 5.6 or, better yet, PHP 7.0. Any version earlier than 5.6 is deprecated and gets no security support.


That's on the list of things-to-do, but since 5.6 stopped recognizing file_get_contents() it's turned into a complete rebuild. I have a separate thread on that, but no one has really told me whether my quick-fix would work:

[webmasterworld.com...]

lucy24

1:22 am on Apr 8, 2017 (gmt 0)




he’s (instead of he's), so for some reason it's coming from the source weird.

Or the source isn't reading its own database correctly, because what you've got there is UTF-8 being reinterpreted as 1252 (Windows-Latin1). The leading â is a dead giveaway that it's something in the Latin-1 family. That's assuming you meant he’s with curly apostrophe* &rsquo; not the typewriter apostrophe &#39;
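To illustrate the reinterpretation: a curly apostrophe is three bytes in UTF-8, and if those bytes are decoded as Windows-1252 each one turns into its own character, starting with â. A minimal sketch, assuming the mbstring extension is loaded (the variable names are just for illustration):

```php
<?php
// "he’s" with a curly apostrophe (U+2019), written out as raw UTF-8 bytes.
$utf8 = "he\xE2\x80\x99s";

// Misread those bytes as Windows-1252 and re-encode the result as UTF-8:
// the three bytes of the apostrophe become three separate characters.
$garbled = mb_convert_encoding($utf8, 'UTF-8', 'Windows-1252');
echo $garbled, "\n";

// Reversing the mistake recovers the original text.
$restored = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');
var_dump($restored === $utf8);
```

The round trip only works when all the misread bytes have a Windows-1252 meaning, which is why some garbled characters cannot be recovered this way.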

a feed from Fox News included a link with .html#&_whatever and the & broke the function, returning blank. I changed it to &amp; and it worked fine... but what's weird is the exact same feed had a separate & in it that didn't cause a problem!

By separate do you mean free-standing? That's correct HTML behavior: if the & is immediately followed by other stuff, it's interpreted as an entity; if it's sitting by itself, it remains a literal ampersand.


* Aaaand... now I have learned that &apos; is not an Official HTML Entity, although &quot; is. Who knew.

csdude55

1:40 am on Apr 8, 2017 (gmt 0)




By separate do you mean free-standing? That's correct HTML behavior: if the & is immediately followed by other stuff, it's interpreted as an entity; if it's sitting by itself, it remains a literal ampersand.


I'm afraid I don't recall, those showed up last night in a live feed and I didn't save them. But the feed right now includes &#8212; and that's going through without breaking the xml2array() function... although one <item> mistakenly says &amp;#8212;...

It's worth noting that Fox does start with:

<?xml version="1.0" encoding="iso-8859-1" ?>


while Washington Post says:

<?xml version="1.0" encoding="UTF-8"?>


My site is UTF-8 encoded, but oddly, Washington Post's feed includes "weird" characters like “deceptive”.

Google doesn't specify an encoding.

csdude55

2:25 am on Apr 8, 2017 (gmt 0)




Just to update... so far, this seems to be working:

$contents = preg_replace('#<!\[CDATA\[(.+?)\]\]>#s', '$1', $contents);

//// When I update to PHP 5.6.x
//$contents = htmlspecialchars($contents, ENT_SUBSTITUTE, 'UTF-8');

// For now
$contents = htmlspecialchars($contents, ENT_IGNORE, 'UTF-8');

$contents = str_replace('&lt;', '<', $contents);
$contents = str_replace('&gt;', '>', $contents);

$rss = xml2array($contents);

// Later in the script, when I'm reading the results of the XML feed
if (!empty($rss['channel']['item'])) {
    foreach ($rss['channel']['item'] as $item) {
        // Undo the htmlspecialchars() escaping that was applied before parsing
        $item['title'] = htmlspecialchars_decode($item['title']);
        $item['link'] = htmlspecialchars_decode($item['link']);
        $item['description'] = htmlspecialchars_decode($item['description']);

        // do whatever
    }
}


I'll have to keep an eye on it for a few days, I think, to make sure no bugs show through, but so far so good. If you guys have any suggestions on how to make it more functional or less bug-prone, I'd love to hear them! :D :D :D

lucy24

5:35 am on Apr 8, 2017 (gmt 0)




Google doesn't specify an encoding.

Encoding can be set at the server level, and then if you look at an individual page's HTML you won't know; you'd have to look at the response headers. An ordinary browser deals with this header information--but all the browser has to do is display text here-and-now, not shove it into a database where it may be jostling up against text that started out in other encodings. If you're collecting text in various encodings, and collapsing it all into a single database, anything can happen. So you need to pinpoint whether things were already garbled before they reached you. They can always be de-garbled, but you may not consider it worth the trouble.
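As a rough sketch of checking the response headers: with cURL, the Content-Type value is available from curl_getinfo($ch, CURLINFO_CONTENT_TYPE) after the request, and the charset can be picked out of it. The helper name charset_from_content_type() is made up for illustration, and the header strings below are sample values rather than anything captured from a real feed.

```php
<?php
// Hypothetical helper: pull the charset out of a Content-Type header value,
// e.g. the string returned by curl_getinfo($ch, CURLINFO_CONTENT_TYPE).
// Returns null when the header does not name a charset.
function charset_from_content_type($contentType) {
    if (preg_match('/charset\s*=\s*"?([A-Za-z0-9._:-]+)"?/i', (string) $contentType, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Sample header values (assumed, for illustration only).
var_dump(charset_from_content_type('text/xml; charset=ISO-8859-1'));
var_dump(charset_from_content_type('application/rss+xml'));
```

If the header and the encoding named in the XML declaration disagree, that would be a hint the text was garbled before it arrived.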

My site is UTF-8 encoded, but oddly, Washington Post's feed includes "weird" characters like “deceptive”.

Then it's not your site's fault but theirs ;) (Hm. The third byte in what I assume is a curly close-quote &rdquo; is a non-displaying character--even, apparently, in Windows-Latin-1, which is a superset of ISO-Latin-1. Oh well.)

It happens. Everyone has visited sites with huge headlines that say &mdash; and &rsquo; as loud as life. Something, somewhere is hypercorrecting. I've actually got one conversion routine that specifies
&(?!#|[mn]dash|nbsp|amp|shy)
>>
&amp;
so & don't get expanded unless I'm relatively sure they were meant to be literal ampersands. For most purposes, the list of exclusions would even be shorter, like
&(?![#\w])
What a good thing the abbreviation &c. has gone out of fashion.
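In PHP the short pattern above could be applied with preg_replace(). The sketch below shows it next to a stricter variant that is my own assumption rather than anything from the routine described: it escapes every & that does not start one of XML's five predefined entities or a numeric reference, which is closer to what an XML parser accepts. Both function names are made up for illustration.

```php
<?php
// The short pattern from the post: escape an & only when it is not
// followed by # or a word character.
function escape_freestanding_ampersands($text) {
    return preg_replace('/&(?![#\w])/', '&amp;', $text);
}

// Assumed stricter variant for XML: keep only the five predefined entities
// and numeric character references; escape every other &.
function escape_xml_ampersands($text) {
    return preg_replace(
        '/&(?!(?:#[0-9]+|#x[0-9a-fA-F]+|amp|lt|gt|quot|apos);)/',
        '&amp;',
        $text
    );
}

echo escape_freestanding_ampersands('fish & chips &mdash; a.html#&_x'), "\n";
echo escape_xml_ampersands('fish & chips &#8212; a.html#&_x'), "\n";
```

Note that the short pattern leaves .html#&_whatever alone, because _ counts as a word character, while the stricter one escapes it.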