#2157 closed enhancement (fixed)

Investigate if we can implement UTF-8 with ISO-8859-1 fallback when parsing text files

Reported by:	Nicklas Nordborg	Owned by:	Nicklas Nordborg
Priority:	major	Milestone:	BASE 3.15
Component:	core	Version:
Keywords:		Cc:

Description

Change History (6)

comment:1 by Nicklas Nordborg, 6 years ago

Owner:	changed from everyone to Nicklas Nordborg
Status:	new → accepted

comment:2 by Nicklas Nordborg, 6 years ago

I have tested https://github.com/raek/utf8-with-fallback, which supports UTF-8 with fallback to either ISO-8859-1 or Windows-1252. It seems to work as promised, but there are a few things to consider:

It has no support for writing.
It cannot be installed in a way that makes it easily usable, e.g. by calling the Charset.forName() method, which is what we use more or less all the time. Instead, it has to be created with new Utf8WithFallbackCharsetProvider().charsetForName().

This means that the fallback is not a usable setting for the defaultCharset in base.config since it will likely fail unless we change our code in a lot of places. It is also likely to affect extensions.

I think we should go easy to begin with and only include it in a few places in BASE, for example importer plug-ins that read from text files. More specifically, if we add support for the fallback character sets to the FlatFileParser, a lot of plug-ins should be able to use it, but the option needs to be enabled in the GUI as well.
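
To make the idea of "UTF-8 with ISO-8859-1 fallback" concrete, here is a minimal sketch that uses only the JDK. It is not the code from the utf8-with-fallback library and not what BASE does. The class name Utf8FallbackSketch and the decode() method are hypothetical, and the sketch falls back for the whole input at once, which may differ from how the library handles individual invalid byte sequences.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of BASE or the library.
final class Utf8FallbackSketch {

    // Try strict UTF-8 first. If the bytes are not valid UTF-8,
    // decode the whole input as ISO-8859-1 instead (assumption:
    // whole-input fallback, chosen here for simplicity).
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            // Every byte maps to a character in ISO-8859-1, so this cannot fail.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // The same character (U+00E5) encoded in two different ways.
        byte[] utf8 = "\u00e5".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "\u00e5".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(utf8).equals(decode(latin1)));
    }
}
```

Strict decoding is what makes the fallback possible: with the default String constructor, invalid UTF-8 is silently replaced with U+FFFD and there is nothing to react to.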

comment:3 by Nicklas Nordborg, 6 years ago

In 7627:

References #2157: Investigate if we can implement UTF-8 with ISO-8859-1 fallback when parsing text files

Added the "UTF-8 with fallback" provider to BASE. To use the fallback character sets, the CharsetUtil.getCharset() method must be called with one of these names:

X-UTF-8_with_ISO-8859-1_fallback
X-UTF-8_with_windows-1252_fallback

The CharsetUtil.getCharset() method can also look up all system character sets that are usually found by the ordinary Charset.forName() method.

The Config.getAllCharsets() has not been modified. It will continue to only return system-defined character sets. To get a list that includes the fallback character sets, use CharsetUtil.getAllCharsets().
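
The lookup behaviour described above (extra names first, then the system character sets, plus a combined listing) could look roughly like the sketch below. This is not the actual CharsetUtil source. The class CharsetLookupSketch and its register() method are hypothetical, and StandardCharsets.UTF_8 is only a stand-in for the charset object that the fallback provider would supply.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch; the real CharsetUtil in BASE may be built differently.
final class CharsetLookupSketch {

    // Extra character sets that Charset.forName() does not know about.
    private final Map<String, Charset> extra =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    void register(String name, Charset charset) {
        extra.put(name, charset);
    }

    // Check the extra names first, then fall back to the system lookup.
    Charset getCharset(String name) {
        Charset cs = extra.get(name);
        return cs != null ? cs : Charset.forName(name);
    }

    // System character sets plus the extra ones, sorted by name.
    SortedMap<String, Charset> getAllCharsets() {
        SortedMap<String, Charset> all =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        all.putAll(Charset.availableCharsets());
        all.putAll(extra);
        return all;
    }

    public static void main(String[] args) {
        CharsetLookupSketch lookup = new CharsetLookupSketch();
        // Stand-in value: the real entry would be the fallback charset.
        lookup.register("X-UTF-8_with_ISO-8859-1_fallback", StandardCharsets.UTF_8);
        System.out.println(lookup.getCharset("X-UTF-8_with_ISO-8859-1_fallback"));
        System.out.println(lookup.getCharset("ISO-8859-1"));
    }
}
```

Keeping the extra names in a separate map mirrors the point made above: code that only calls Charset.forName() or Config.getAllCharsets() never sees the fallback character sets.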

The FlatFileParser implementation has been updated to support the fallback character sets, and so have all batch item and annotation importers.

Extensions that want to use the fallback character sets probably need to be updated.

comment:4 by Nicklas Nordborg, 6 years ago

In 7628:

References #2157: Investigate if we can implement UTF-8 with ISO-8859-1 fallback when parsing text files

Added support to the 'auto-detect' dialog as well.

comment:5 by Nicklas Nordborg, 6 years ago

In 7629:

References #2157: Investigate if we can implement UTF-8 with ISO-8859-1 fallback when parsing text files

Added test case for the UTF-8 with fallback charsets.

comment:6 by Nicklas Nordborg, 6 years ago

Resolution:	→ fixed
Status:	accepted → closed
