[Realtime-feed-users] XML-based Atom Feed format problems with malformed UTF-8 characters

Mon Mar 13 10:46:39 UTC 2017

Recently sent a message using a web-based contact form regarding this 
issue.  If this arrives in duplicate, my apologies.  Just discovered 
this list and thought I'd mention it here as well.

The XML-based Atom feed (event us100087xb, 
https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_week.atom 
or try the monthly feed if the entry has moved off the weekly list) 
contains a malformed encoding of the character U+00E7 (lower-case 'c' 
with cedilla).  It is incorrectly included in the feed as a raw single 
byte with a value of 0xE7, which is not valid UTF-8.  It should instead 
be encoded as the two-byte UTF-8 sequence 0xC3 0xA7, or escaped as a 
numeric character reference such as &#x00E7; .
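
To make the three encodings concrete, here is a minimal Python sketch. 
The one-element document and the word "façade" are hypothetical 
stand-ins for the real feed entry, and Python's ElementTree (backed by 
expat) serves only as an example of a strict XML parser, not as a model 
of Firefox's parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document standing in for a feed entry title.
PREFIX = b'<?xml version="1.0" encoding="UTF-8"?><title>fa'
SUFFIX = b"ade</title>"

raw_single_byte = PREFIX + b"\xe7" + SUFFIX      # what the feed reportedly contains
utf8_sequence = PREFIX + b"\xc3\xa7" + SUFFIX    # U+00E7 encoded as UTF-8
char_reference = PREFIX + b"&#x00E7;" + SUFFIX   # numeric character reference


def parses(document: bytes) -> bool:
    """Return True if a strict XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


for label, doc in [
    ("raw 0xE7", raw_single_byte),
    ("UTF-8 0xC3 0xA7", utf8_sequence),
    ("&#x00E7;", char_reference),
]:
    print(label, "->", "parses" if parses(doc) else "not well-formed")
```

Only the raw single byte should be rejected; the other two forms are 
equivalent as far as an XML consumer is concerned.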

This causes all XMLHttpRequests from Firefox (v29-v55, inclusive) and 
possibly other browser families to fail with "not well-formed", such 
that the XHR returns a null object, killing the application's attempt 
to process any data.  The failure persists as long as the entry remains 
on the feed, breaking apps for hours, days or weeks, depending on the 
feed in use.

The effect is reproducible in Firefox's built-in feed reader, where the 
list of entries is truncated such that the last entry displayed is the 
one immediately preceding the entry with the malformed character. 
Fetching the file with wget also shows the raw 0xE7 character (when 
viewed in a text editor with a hex editing feature).
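
As an alternative to hunting for the byte in a hex editor, a short 
script can report the offset of every byte that breaks UTF-8 decoding. 
This is a rough sketch; the function name is made up for illustration, 
and it rescans the remainder of the data after each hit, which is fine 
for feed-sized files but not tuned for large inputs.

```python
from typing import List, Tuple


def find_invalid_utf8(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte value) for each byte that breaks UTF-8 decoding."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            bad = pos + err.start          # err.start is relative to the slice
            problems.append((bad, data[bad]))
            pos = bad + 1                  # resume just past the offending byte
    return problems


for offset, value in find_invalid_utf8(b"fa\xe7ade"):
    print("offset {}: 0x{:02X}".format(offset, value))
```

On a file saved by wget, the same function could be called on the 
result of open(path, "rb").read().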

For comparison, the GeoJSON version of the feed does have the character 
properly escaped as \u00e7 .  It is crucial that every non-ASCII 
character in an XML-based document either be encoded as a valid UTF-8 
multi-byte sequence or be escaped analogously, as this trivial error 
causes catastrophic failure of XMLHttpRequest.
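
For illustration, this is roughly what the escaping looks like on the 
producing side, using only Python's standard library. It is not the 
USGS generator (which I have not seen); the sample string is 
hypothetical.

```python
import json

place = "fa\u00e7ade"  # hypothetical text containing U+00E7

# JSON: by default json.dumps escapes non-ASCII characters as \uXXXX.
as_json = json.dumps(place)

# XML, option 1: numeric character references, giving ASCII-only bytes.
# (Markup characters such as & and < still need separate escaping.)
as_xml_refs = place.encode("ascii", errors="xmlcharrefreplace")

# XML, option 2: emit the character as a correct UTF-8 byte sequence.
as_xml_utf8 = place.encode("utf-8")

print(as_json)
print(as_xml_refs)
print(as_xml_utf8)
```

Note that xmlcharrefreplace produces a decimal reference (&#231;), which 
is equivalent to the hexadecimal form &#x00E7; .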

Arguably, browsers should not fail in this manner when the document 
structure itself is well-formed.  A single undecodable byte in a text 
node of a UTF-8 XML document could simply be replaced with a '?' 
character, with a hard failure reserved for problems with the structure 
of the XML tags, attributes, etc.  However, since this strict behavior 
has already shipped in the targeted browser versions for several years, 
following a decade+ history of tradition, it can't be changed there and 
likely won't be (though we can always try to file a bug report).

Although I may switch to the GeoJSON format as a direct result of all 
this hassle, I am still curious to have the issue properly resolved, for 
any and all client use cases where, for whatever reason, Atom retrieved 
by XHR is the only option.

As a sub-optimal client-side work-around, I've thought that perhaps the 
file could be retrieved with XMLHttpRequest.responseType = 'blob' to 
bypass the overly-strict XML parsing.  The code would then iterate 
through each byte looking for values outside the valid range, replace 
each with a properly escaped entity, save the result to disk, re-open it 
with a second XHR, and continue processing with the existing code.  The 
performance hit might not be worth it.
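
The repair step itself can be sketched outside the browser. The Python 
below decodes the raw bytes as UTF-8 and turns each undecodable byte 
into a numeric character reference, assuming (and this is only an 
assumption) that stray bytes are ISO-8859-1 characters, as 0xE7 is here. 
The handler name "latin1_charref" and the function names are invented 
for this example.

```python
import codecs
import xml.etree.ElementTree as ET


def _latin1_as_charref(err):
    """Codec error handler: replace each undecodable byte with a numeric
    character reference, treating the byte value as an ISO-8859-1 code point
    (an assumption about how the bad bytes got into the feed)."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    replacement = "".join("&#x{:04X};".format(b) for b in bad)
    return replacement, err.end  # resume decoding after the bad bytes


codecs.register_error("latin1_charref", _latin1_as_charref)


def repair_and_parse(raw: bytes) -> ET.Element:
    """Escape invalid bytes, then hand the result to a strict XML parser."""
    text = raw.decode("utf-8", errors="latin1_charref")
    return ET.fromstring(text.encode("utf-8"))


doc = b'<?xml version="1.0" encoding="UTF-8"?><title>fa\xe7ade</title>'
print(repair_and_parse(doc).text)
```

Since the repaired text is parsed directly from memory, the save-to-disk 
and second-request steps may be unnecessary; in a browser, the analogous 
move would presumably be to pass the repaired string to DOMParser. A 
reference inserted inside a tag name would still break the document, so 
this only helps when the bad bytes sit in text or attribute values.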

Another option might be to detect an XHR response status of "200 OK" in 
conjunction with a null XHR object, interpret that as a "not 
well-formed" failure, and fall back to a shorter list (month to week, 
week to day, day to hour) in the hope of picking up new entries with 
minimal downtime.  This would probably incur less of a performance hit, 
but would still expose users to errors and incomplete information, 
making the app look broken.
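
A rough sketch of that fallback logic, again in Python rather than 
browser code: try the longest feed first and step down whenever parsing 
fails. Only 4.5_week.atom is named in this message; the other file names 
are assumed to follow the same pattern, and the fetch function is 
supplied by the caller so the logic can be exercised without network 
access.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional, Sequence, Tuple

BASE = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/"

# Longest window first. Only 4.5_week.atom is named in this message; the
# other three file names are assumed to follow the same pattern.
FALLBACK_ORDER = ("4.5_month.atom", "4.5_week.atom", "4.5_day.atom", "4.5_hour.atom")


def first_parseable(
    fetch: Callable[[str], bytes],
    names: Sequence[str] = FALLBACK_ORDER,
) -> Optional[Tuple[str, ET.Element]]:
    """Return (feed name, parsed root) for the longest feed that parses.

    `fetch` maps a URL to the response body. A parse failure stands in for
    the "200 OK with a null object" condition described above.
    """
    for name in names:
        try:
            return name, ET.fromstring(fetch(BASE + name))
        except ET.ParseError:
            continue  # not well-formed: try the next, shorter feed
    return None
```

With the standard library, fetch could be something like 
lambda url: urllib.request.urlopen(url).read(). A None result means 
every feed in the list failed to parse.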

Perhaps a combination of these approaches would be optimal until the 
Atom feed is properly escaped.

I've looked through the GitHub repository to see whether the code that 
generates the static Atom list is available, but I couldn't find it.  
The closest thing I could find was earthquake-event-ws[1], but it didn't 
seem to have any code doing escaping, not even for GeoJSON.  So I was a 
bit confused.

[1] 
https://github.com/usgs/earthquake-event-ws/blob/master/src/lib/classes/fdsn/

-- 
Leif