[Realtime-feed-users] XML-based Atom Feed format problems with malformed UTF-8 characters
Leif AMO
leif.amo at gmail.com
Mon Mar 13 10:46:39 UTC 2017
Recently sent a message using a web-based contact form regarding this
issue. If this arrives in duplicate, my apologies. Just discovered
this list and thought I'd mention it here as well.
The XML-based Atom feed ( event us100087xb,
https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_week.atom
or perhaps try monthly if the entry has moved off the weekly list ) has
a malformed encoding of the character U+00E7 (lower-case 'c' with
cedilla). It is incorrectly included in the feed as a raw single byte
with a value of 0xE7, which is not valid UTF-8. It should be included
either as the two-byte UTF-8 sequence 0xC3 0xA7 or as an escaped
character reference, such as &#xE7; .
This causes every XMLHttpRequest for the feed from Firefox (v29 through
v55, and possibly other browser families) to fail with "not
well-formed", such that the XHR returns a null object, killing the
application's attempt to process any data. The failure persists as long
as the entry remains on the feed, breaking apps for hours, days or
weeks, depending on the feed in use.
The effect is reproducible in Firefox's built-in feed reader, where the
list of entries is truncated such that the last entry displayed is the
one immediately preceding the entry with the malformed character.
Fetching the file with wget also shows the raw 0xE7 character (when
viewed in a text editor with a hex editing feature).
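The same check can be done without a hex editor. Below is a minimal
sketch, not the feed's or the browser's actual validator: the function
name findInvalidUtf8 is made up for illustration, and the check is
simplified (it verifies lead and continuation bytes only, and does not
reject overlong or surrogate encodings).

```typescript
// Returns the byte offsets at which `bytes` is not structurally valid UTF-8.
// Simplified: checks lead/continuation byte patterns only.
function findInvalidUtf8(bytes: Uint8Array): number[] {
  const bad: number[] = [];
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    // How many continuation bytes the lead byte announces (-1 = invalid lead).
    let need = -1;
    if (lead < 0x80) need = 0;
    else if (lead >= 0xc2 && lead <= 0xdf) need = 1;
    else if (lead >= 0xe0 && lead <= 0xef) need = 2;
    else if (lead >= 0xf0 && lead <= 0xf4) need = 3;

    // The sequence must fit in the buffer and every follower must be 10xxxxxx.
    let ok = need >= 0 && i + need < bytes.length;
    for (let k = 1; ok && k <= need; k++) {
      ok = (bytes[i + k] & 0xc0) === 0x80;
    }

    if (ok) {
      i += need + 1;
    } else {
      bad.push(i);
      i += 1;
    }
  }
  return bad;
}
```

A lone 0xE7 is flagged because it announces a three-byte sequence that
never follows: for the bytes 0x61 0xE7 0x62 the function reports offset
1, while the correctly encoded 0x61 0xC3 0xA7 0x62 yields no offsets.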
For comparison, the GeoJSON version of the feed does have the character
properly escaped as \u00e7 . It is crucial that every non-ASCII
character in an XML-based document be either encoded as valid UTF-8 or
escaped in the analogous way, as this trivial error causes catastrophic
failure of XMLHttpRequest.
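For XML, the counterpart of JSON's \u00e7 is a numeric character
reference. Here is a rough sketch of what the generating side could do,
assuming it has the text as a decoded string; escapeNonAscii is a
hypothetical helper, not code from the USGS feed generator, and it
leaves markup characters such as & and < to whatever escaping is
already in place.

```typescript
// Replaces every non-ASCII character with an XML numeric character
// reference, so the output is pure ASCII whatever encoding is declared.
// Does NOT escape &, <, > or quotes; that is assumed to happen elsewhere.
function escapeNonAscii(text: string): string {
  let out = "";
  let i = 0;
  while (i < text.length) {
    const cp = text.codePointAt(i) as number;
    if (cp < 0x80) {
      out += text[i];
    } else {
      out += "&#x" + cp.toString(16).toUpperCase() + ";";
    }
    // Code points above U+FFFF occupy two UTF-16 code units in a JS string.
    i += cp > 0xffff ? 2 : 1;
  }
  return out;
}
```

With this, escapeNonAscii("a\u00e7b") gives "a&#xE7;b", which any XML
parser can read regardless of how the bytes are later served.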
Arguably, browsers should not fail in this manner when the entire
document structure is well-formed. A single unknown character in a text
node of an XML document in UTF-8 format should simply be replaced with
the '?' character, and if and only if there is a problem with the
structure of the XML tags, attributes, etc, should there be a hard
failure. However, since the browser versions I need to support have
behaved this way for several years - following a decade+ history of
tradition - it can't be changed and likely won't be (though we can
always try to file a bug report).
Although I may switch to the GeoJSON format as a direct result of all
this hassle, I am still curious to have the issue properly resolved, for
any and all client use cases where, for whatever reason, Atom retrieved
by XHR is the only option.
As a sub-optimal client-side work-around, I've thought that perhaps the
file could be retrieved with XMLHttpRequest.responseType = 'blob' to
bypass the overly strict XML parsing. The code would then iterate over
each byte looking for values outside the valid range, replace each one
with a properly escaped entity, save the result to disk, re-open it
with a second XHR, and continue processing with the existing code. The
performance hit might not be worth it.
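The byte-scanning step of that work-around might look roughly like the
sketch below. It is only an illustration under two assumptions: stray
bytes are treated as Latin-1 (which is what the 0xE7 case suggests, but
is a guess in general), and the UTF-8 check is simplified (overlong and
surrogate encodings are not rejected). The name repairToAsciiXml is
made up.

```typescript
// Converts raw feed bytes into a pure-ASCII string: valid UTF-8 sequences
// and stray single bytes alike become numeric character references.
// ASSUMPTION: a byte that is not part of a valid sequence is Latin-1.
function repairToAsciiXml(bytes: Uint8Array): string {
  let out = "";
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    if (lead < 0x80) {
      out += String.fromCharCode(lead); // plain ASCII passes through
      i += 1;
      continue;
    }
    // Work out how many continuation bytes the lead byte announces.
    let need = 0;
    let cp = 0;
    if (lead >= 0xc2 && lead <= 0xdf) { need = 1; cp = lead & 0x1f; }
    else if (lead >= 0xe0 && lead <= 0xef) { need = 2; cp = lead & 0x0f; }
    else if (lead >= 0xf0 && lead <= 0xf4) { need = 3; cp = lead & 0x07; }

    let ok = need > 0 && i + need < bytes.length;
    for (let k = 1; ok && k <= need; k++) {
      const next = bytes[i + k];
      ok = (next & 0xc0) === 0x80;
      cp = (cp << 6) | (next & 0x3f);
    }
    if (!ok) {
      cp = lead; // stray byte: fall back to its Latin-1 code point
      need = 0;
    }
    out += "&#x" + cp.toString(16).toUpperCase() + ";";
    i += need + 1;
  }
  return out;
}
```

In a browser the bytes could come from an XHR with responseType set to
'arraybuffer' (wrapped as new Uint8Array(xhr.response)), and the
returned string could be handed to DOMParser.parseFromString(text,
'application/xml'). That would avoid both the save to disk and the
second XHR, though I have not measured the cost on a full monthly feed.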
Another option might be to detect an XHR response status of "200 OK" in
conjunction with a null XHR object, interpret that as a "not
well-formed" failure, and fall back to a shorter list (month to week,
week to day, day to hour) and hope to start picking up new entries with
minimal downtime. This would probably suffer less of a performance hit,
but would still expose users to errors and incomplete information,
making the app look broken.
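The fall-back order could be computed from the feed URL itself. This is
a sketch only: it assumes the other feeds follow the same naming as
4.5_week.atom (that is, _month, _day and _hour variants at the same
path), and shorterFeeds is a made-up name.

```typescript
// Feed windows from longest to shortest.
// ASSUMPTION: every window uses the same "<prefix>_<window>.atom" naming.
const FEED_WINDOWS = ["month", "week", "day", "hour"];

// Given a feed URL, returns the URLs of the shorter feeds to try, in order.
// Returns an empty list if the URL does not match the assumed pattern
// or is already the shortest window.
function shorterFeeds(url: string): string[] {
  const pattern = /_(month|week|day|hour)\.atom$/;
  const match = pattern.exec(url);
  if (match === null) return [];
  const start = FEED_WINDOWS.indexOf(match[1]);
  return FEED_WINDOWS.slice(start + 1).map(
    (w) => url.replace(pattern, "_" + w + ".atom")
  );
}
```

For the weekly URL above this yields the _day and then the _hour
variant; the caller would request each in turn until one parses.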
Perhaps a combination of these approaches would be optimal until the
Atom feed is properly escaped.
I've looked through the GitHub repository to find out whether the code
that generates the static Atom list is available, but I couldn't find
it. The closest thing I could find was the earthquake-event-ws[1],
but it didn't seem to have any code doing escaping, not even for
GeoJSON. So I was a bit confused.
[1]
https://github.com/usgs/earthquake-event-ws/blob/master/src/lib/classes/fdsn/
--
Leif