Opened 3 years ago

Closed 3 years ago

#2157 closed enhancement (fixed)

Investigate if we can implement UTF-8 with ISO-8859-1 fallback when parsing text files

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: BASE 3.15
Component: core
Version:
Keywords:
Cc:

Description

See also #2156.

The two most common encodings (in our lab) are UTF-8 and ISO-8859-1 (or Windows-1252), but the typical user has no idea which encoding is used. On the server side we always prefer UTF-8 since that will cause fewer problems in the long run. Then there is Microsoft Excel on Windows, which seems to have trouble saving text as UTF-8. It would be nice to have a parser that parses UTF-8 and, when it encounters invalid sequences, interprets them as ISO-8859-1 or Windows-1252.
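To make the idea concrete, here is a minimal sketch of such a decoder using only the standard java.nio.charset API. It decodes strictly as UTF-8 and, whenever the decoder reports a malformed sequence, decodes just the offending bytes with a single-byte fallback charset. The class and method names are made up for illustration, and this is not necessarily how any existing library does it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class Utf8FallbackSketch {

    // Decode bytes as UTF-8; bytes that are not valid UTF-8 are decoded
    // with the fallback charset instead. The fallback is assumed to be a
    // single-byte charset such as ISO-8859-1 or windows-1252.
    static String decode(byte[] bytes, Charset fallback) {
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // Neither UTF-8 nor a single-byte fallback produces more chars
        // than input bytes, so this capacity is sufficient.
        CharBuffer out = CharBuffer.allocate(bytes.length);
        while (true) {
            CoderResult result = utf8.decode(in, out, true);
            if (result.isError()) {
                // The buffer position is at the start of the bad bytes;
                // consume them and decode them with the fallback.
                byte[] bad = new byte[result.length()];
                in.get(bad);
                out.put(new String(bad, fallback));
            } else if (result.isOverflow()) {
                throw new IllegalStateException("Output buffer too small");
            } else {
                break; // underflow: all input has been consumed
            }
        }
        utf8.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // C3 A5 is a valid UTF-8 sequence; the lone E4 byte is not.
        byte[] mixed = {(byte) 0xC3, (byte) 0xA5, (byte) 0xE4};
        String text = decode(mixed, StandardCharsets.ISO_8859_1);
        System.out.println(text.length() + " characters decoded");
    }
}
```

One consequence of this design is that a file saved as ISO-8859-1 can still be misread if its non-ASCII bytes happen to form a valid UTF-8 sequence. That should be rare in ordinary text, but it is possible.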

For background information see: https://en.wikipedia.org/wiki/UTF-8

Change History (6)

comment:1 Changed 3 years ago by Nicklas Nordborg

Owner: changed from everyone to Nicklas Nordborg
Status: new → accepted

comment:2 Changed 3 years ago by Nicklas Nordborg

I have tested this: https://github.com/raek/utf8-with-fallback which has support for UTF-8 and fallback to either ISO-8859-1 or Windows-1252. It seems to work as promised, but there are a few things to consider:

  • It has no support for writing
  • It cannot be installed in a way that makes it easily usable, e.g. through the Charset.forName() method, which is what we use more or less all the time. Instead, it has to be created with new Utf8WithFallbackCharsetProvider().charsetForName().

This means that the fallback is not a usable setting for the defaultCharset in base.config since it will likely fail unless we change our code in a lot of places. It is also likely to affect extensions.

I think we should go easy to begin with and only include it in a few places in BASE, for example in importer plug-ins that read from text files. More specifically, if we add support for the fallback character sets to the FlatFileParser, a lot of plug-ins should be able to use it, but the option needs to be enabled in the GUI as well.

comment:3 Changed 3 years ago by Nicklas Nordborg

In 7627:

References #2157: Investigate if we can implement UTF-8 with ISO-8859-1 fallback when parsing text files

Added the "UTF-8 with fallback" provider to BASE. To use the fallback character sets, the CharsetUtil.getCharset() method must be called with one of these names:

  • X-UTF-8_with_ISO-8859-1_fallback
  • X-UTF-8_with_windows-1252_fallback

The CharsetUtil.getCharset() method can also look up all system character sets that are normally found by the ordinary Charset.forName() method.

The Config.getAllCharsets() has not been modified. It will continue to only return system-defined character sets. To get a list that includes the fallback character sets, use CharsetUtil.getAllCharsets().
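A lookup of this kind, one that knows a few extra names and otherwise delegates to Charset.forName(), could be sketched roughly as follows. This is not the actual CharsetUtil code: the class name, the register() method and the use of plain UTF-8 as a stand-in for the real fallback charset are all assumptions made to keep the example self-contained.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch, not the BASE implementation.
class CharsetLookupSketch {

    // Extra charsets that Charset.forName() does not know about,
    // keyed by name (case-insensitive, like charset names in Java).
    private final SortedMap<String, Charset> extras =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    void register(String name, Charset charset) {
        extras.put(name, charset);
    }

    // Look among the extra charsets first, then among the system ones.
    // Charset.forName() throws UnsupportedCharsetException if the name
    // is unknown there as well.
    Charset getCharset(String name) {
        Charset extra = extras.get(name);
        return extra != null ? extra : Charset.forName(name);
    }

    // System charsets plus the extra ones, sorted by name.
    SortedMap<String, Charset> getAllCharsets() {
        SortedMap<String, Charset> all =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        all.putAll(Charset.availableCharsets());
        all.putAll(extras);
        return all;
    }

    public static void main(String[] args) {
        CharsetLookupSketch lookup = new CharsetLookupSketch();
        // Stand-in charset: a real setup would register the fallback
        // charset created by the provider instead of plain UTF-8.
        lookup.register("X-UTF-8_with_ISO-8859-1_fallback", StandardCharsets.UTF_8);
        System.out.println(lookup.getCharset("X-UTF-8_with_ISO-8859-1_fallback"));
        System.out.println(lookup.getCharset("ISO-8859-1"));
    }
}
```

Code that keeps calling Charset.forName() directly never sees the registered names, which is why callers have to be switched over to the new lookup.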

The FlatFileParser implementation has been updated to support the fallback character sets, and so have all batch item and annotation importers.

Extensions that want to use the fallback character sets probably need to be updated.

comment:4 Changed 3 years ago by Nicklas Nordborg

In 7628:

References #2157: Investigate if we can implement UTF-8 with ISO-8859-1 fallback when parsing text files

Added support to the 'auto-detect' dialog as well.

comment:5 Changed 3 years ago by Nicklas Nordborg

In 7629:

References #2157: Investigate if we can implement UTF-8 with ISO-8859-1 fallback when parsing text files

Added a test case for the UTF-8 with fallback charsets.

comment:6 Changed 3 years ago by Nicklas Nordborg

Resolution: fixed
Status: accepted → closed