[Realtime-feed-users] XML-based Atom Feed format problems with malformed UTF-8 characters

Mon Mar 13 14:01:36 UTC 2017

Hi Leif,

Thank you for your bug report. The project you reference is indeed the
source code for our web-based event feeds. I have logged your report as a
bug on the project here:
https://github.com/usgs/earthquake-event-ws/issues/224. We will work to
correct this issue in the future; however, we do *recommend* that any and
all programmatic access to our data use the GeoJSON format.

Regards,

Eric Martinez
U.S. Geological Survey
emartinez at usgs.gov

On Mon, Mar 13, 2017 at 4:46 AM, Leif AMO <leif.amo at gmail.com> wrote:

> Recently sent a message using a web-based contact form regarding this
> issue.  If this arrives in duplicate, my apologies.  Just discovered this
> list and thought I'd mention it here as well.
>
> The XML-based Atom feed (event us100087xb,
> https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/4.5_week.atom
> or perhaps try monthly if the entry has moved off the weekly list) has a
> malformed UTF-8 character, U+00E7 (lower-case 'c' with cedilla).  It is
> incorrectly included in the feed as a raw single byte with a value of
> 0xE7, whereas it should be included as an escaped "character entity",
> such as &#x00E7; .
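To check this locally: a lone 0xE7 byte is the lead byte of a three-byte
UTF-8 sequence, so it cannot stand on its own. Well-formed XML can carry
U+00E7 either as the two UTF-8 bytes 0xC3 0xA7 or as a numeric character
reference such as &#x00E7;. The Python sketch below compares the three
forms using the standard-library parser. The <title> element and its text
are made up for illustration and are not copied from the feed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: bytes) -> bool:
    """Return True if the standard-library XML parser accepts the bytes."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Hypothetical one-element documents; the real feed entry is not reproduced.
raw_byte = b"<title>Fran\xe7ois</title>"        # lone 0xE7: not valid UTF-8
utf8_bytes = b"<title>Fran\xc3\xa7ois</title>"  # U+00E7 encoded as UTF-8
char_ref = b"<title>Fran&#x00E7;ois</title>"    # numeric character reference

for label, doc in [("raw 0xE7", raw_byte),
                   ("UTF-8 bytes", utf8_bytes),
                   ("character reference", char_ref)]:
    verdict = "well-formed" if is_well_formed(doc) else "rejected"
    print(f"{label}: {verdict}")
```

Only the first form should be rejected; the other two parse to the same
text.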
>
> This causes all XMLHttpRequests from Firefox (v29-v55, inclusive, and
> possibly other browser families) to fail with "not well-formed", such
> that the XHR returns a null object, killing the application's attempt to
> process any data.  The failure persists as long as the entry remains on
> the feed, breaking apps for hours, days or weeks, depending on the feed
> in use.
>
> The effect is reproducible in Firefox's built-in feed reader, where the
> list of entries is truncated such that the last entry displayed is the one
> immediately preceding the entry with the malformed character. Fetching the
> file with wget also shows the raw 0xE7 character (when viewed in a text
> editor with a hex editing feature).
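As an alternative to a hex editor, a few lines of Python can list where a
saved copy of the feed stops being valid UTF-8. This is only a sketch:
the sample bytes are hypothetical, and the helper names are invented for
this example.

```python
from typing import List


def invalid_utf8_offsets(data: bytes) -> List[int]:
    """Return the byte offsets at which strict UTF-8 decoding fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start  # err.start is relative to the slice
            offsets.append(bad)
            pos = bad + 1          # resume just after the failing byte
    return offsets


def hex_context(data: bytes, offset: int, width: int = 4) -> str:
    """Hex dump of the bytes surrounding an offset (needs Python 3.8+)."""
    start = max(0, offset - width)
    return data[start:offset + width + 1].hex(" ")


sample = b"Fran\xe7ois"  # hypothetical text with a stray 0xE7 byte
for off in invalid_utf8_offsets(sample):
    print(off, hex_context(sample, off, width=2))
```

For a file fetched with wget, the same functions could be applied to the
result of open(path, "rb").read().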
>
> For comparison, the GeoJSON version of the feed does have the character
> properly escaped as \u00e7 .  It is crucial that the analogous operation be
> performed on all multi-byte UTF-8 characters in an XML-based document, as
> this trivial error causes catastrophic failure of XMLHttpRequest.
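The two escaping styles can be compared side by side. The sketch below
uses a hypothetical string and only standard-library calls. Note that XML
does not strictly require escaping: correctly encoded UTF-8 bytes are
also acceptable, and escaping is simply one way to keep the payload
ASCII-only.

```python
import json

place = "Fran\u00e7ois"  # hypothetical value containing U+00E7

# JSON: json.dumps escapes non-ASCII characters by default (ensure_ascii=True).
as_json = json.dumps(place)

# XML: the "xmlcharrefreplace" error handler emits decimal character
# references for anything that does not fit in ASCII.
as_xml = place.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(as_json)  # "Fran\u00e7ois"
print(as_xml)   # Fran&#231;ois
```

The decimal reference &#231; and the hexadecimal &#x00E7; name the same
code point.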
>
> Arguably, browsers should not fail in this manner when the entire
> document structure is well-formed.  A single unknown character in a text
> node of an XML document in UTF-8 format should simply be replaced with
> the '?' character, and if and only if there is a problem with the
> structure of the XML tags, attributes, etc., should there be a hard
> failure.  However, since the browser versions in the targeted support
> range have already been released for several years - following a decade+
> history of tradition - it can't be changed and likely won't be (though we
> can always try to file a bug report).
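For context, the XML 1.0 specification treats byte sequences that are
illegal in the document's encoding as a fatal error, so a conforming
parser is expected to stop rather than substitute a character. The
lenient behaviour described above is what a plain text decoder does when
asked to replace errors, as in this small Python illustration on
hypothetical bytes:

```python
raw = b"Fran\xe7ois"  # hypothetical bytes with a stray 0xE7

# Lenient: each undecodable byte becomes U+FFFD, the replacement character.
lenient = raw.decode("utf-8", errors="replace")

# Strict: the default mode raises instead of substituting.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(lenient), strict_failed)
```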
>
> Although I may switch to the GeoJSON format as a direct result of all this
> hassle, I am still curious to have the issue properly resolved, for any and
> all client use cases where, for whatever reason, Atom retrieved by XHR is
> the only option.
>
> As a sub-optimal client-side work-around, I've thought that perhaps the
> file can be retrieved with XMLHttpRequest.responseType = 'blob' to bypass
> the overly-strict XML parsing.  The code would then iterate through each
> character looking for values outside of range, replace each one with a
> properly escaped entity, save the result to disk, re-open it with a
> second XHR, and continue processing with existing code.  The performance
> hit might not be worth it.
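The repair step could look roughly like the Python sketch below, which
works on the raw bytes and parses the result in memory, so nothing is
saved to disk or fetched a second time. It assumes that any byte that is
not valid UTF-8 was meant as ISO-8859-1 (Latin-1), which matches 0xE7 for
c-cedilla but is a guess in general. The handler name "latin1charref" and
the sample document are invented for this example. The same byte-level
idea may carry over to a browser response read as raw bytes, but that is
not shown here.

```python
import codecs
import xml.etree.ElementTree as ET


def _latin1_charref(err):
    """Replace undecodable bytes with XML numeric character references."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Assumption: each stray byte is a Latin-1 code point (0xE7 -> U+00E7).
    return "".join(f"&#x{b:04X};" for b in bad), err.end


codecs.register_error("latin1charref", _latin1_charref)


def repair_feed(raw: bytes) -> bytes:
    """Return UTF-8 bytes in which invalid bytes have been escaped."""
    return raw.decode("utf-8", errors="latin1charref").encode("utf-8")


broken = b"<title>Fran\xe7ois</title>"  # hypothetical fragment
fixed = repair_feed(broken)
print(fixed)                      # b'<title>Fran&#x00E7;ois</title>'
print(ET.fromstring(fixed).text)  # François
```

Valid UTF-8 input passes through unchanged. A stray byte inside a tag
name or a CDATA section would not be repaired correctly by this approach,
since character references are not interpreted there.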
>
> Another option might be to detect the XHR response status of "200 OK" in
> conjunction with a null XHR object, interpret that as a "not well-formed"
> failure, and fall back to a shorter list (month to week, week to day, day
> to hour) and hope to start picking up new entries with minimal downtime.
> This would probably suffer less of a performance hit, but still exposes
> users to errors and incomplete information, making the app look broken.
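A rough Python sketch of this fallback follows: try each feed from the
longest window to the shortest and keep the first one that parses. The
function names are invented for the example, and the list of feed URLs is
left to the caller, since only the weekly URL appears in this thread. The
fetch function is passed in so the logic can be exercised without network
access.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, Optional, Tuple


def http_get(url: str) -> bytes:
    """Fetch a URL and return the raw response body."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def first_well_formed(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = http_get,
) -> Optional[Tuple[str, ET.Element]]:
    """Return (url, parsed root) for the first feed that parses, else None."""
    for url in urls:
        try:
            return url, ET.fromstring(fetch(url))
        except ET.ParseError:
            continue  # not well-formed: fall back to the next, shorter feed
    return None


# Offline demonstration with hypothetical feed bodies.
fake_feeds = {
    "week": b"<feed><entry>Fran\xe7ois</entry></feed>",  # stray 0xE7 byte
    "day": b"<feed><entry>ok</entry></feed>",
}
result = first_well_formed(["week", "day"], fetch=fake_feeds.__getitem__)
print(result[0] if result else "no usable feed")
```

Network errors are deliberately not caught here; a real client would
probably want to handle those separately from parse failures.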
>
> Perhaps a combination of these approaches would be optimal until the Atom
> feed is properly escaped.
>
> I've looked through the GitHub repository to find out whether the code
> that generates the static Atom list is available, but I couldn't seem to
> find it.  The closest thing I could find was earthquake-event-ws[1], but
> it didn't seem to have any code doing escaping, not even for GeoJSON.  So
> I was a bit confused.
>
> [1] https://github.com/usgs/earthquake-event-ws/blob/master/src/lib/classes/fdsn/
>
> --
> Leif
> _______________________________________________
> Realtime-feed-users mailing list
> Realtime-feed-users at geohazards.usgs.gov
> https://geohazards.usgs.gov/mailman/listinfo/realtime-feed-users
>