Opened 6 years ago
Closed 6 years ago
#2157 closed enhancement (fixed)
Investigate if we can implement UTF-8 with ISO-8859-1 fallback when parsing text files
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | major | Milestone: | BASE 3.15 |
Component: | core | Version: | |
Keywords: | Cc: |
Description
See also #2156.
The two most common encodings (in our lab) are UTF-8 and ISO-8859-1 (or Windows-1252), but the typical user has no idea which encoding is used. On the server side we always prefer UTF-8 since that will cause fewer problems in the long run. Then there is Microsoft Excel on Windows, which seems to be problematic when it comes to saving text as UTF-8. It would be nice to have a parser that is able to parse UTF-8 and, if it encounters invalid sequences, interprets those as ISO-8859-1 or Windows-1252.
For background information see: https://en.wikipedia.org/wiki/UTF-8
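As a rough illustration of the idea, here is a minimal sketch that uses only the standard `java.nio.charset` API. It decodes strictly as UTF-8 and, whenever the decoder reports a malformed sequence, decodes one offending byte with a single-byte fallback charset before resuming. The class name `Utf8FallbackDecoder` and its `decode` method are hypothetical names chosen for this example, and the byte-at-a-time fallback is an assumption about the desired behaviour, not something decided in this ticket.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of BASE.
final class Utf8FallbackDecoder {
    private Utf8FallbackDecoder() {}

    /**
     * Decodes bytes as UTF-8. Each byte that is not part of a valid UTF-8
     * sequence is decoded with the given fallback charset instead, which is
     * assumed to be a single-byte charset such as ISO-8859-1 or windows-1252.
     */
    public static String decode(byte[] bytes, Charset fallback) {
        // REPORT makes the decoder stop at invalid input instead of replacing it.
        CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        StringBuilder out = new StringBuilder(bytes.length);
        CharBuffer chunk = CharBuffer.allocate(Math.max(bytes.length, 1));

        while (true) {
            CoderResult result = utf8.decode(in, chunk, true);
            // Move whatever was decoded so far into the result.
            chunk.flip();
            out.append(chunk);
            chunk.clear();

            if (result.isUnderflow()) {
                break; // all input consumed
            }
            if (result.isError()) {
                // The input position is at the first offending byte:
                // decode that single byte with the fallback, then retry UTF-8.
                out.append(new String(new byte[] { in.get() }, fallback));
            }
            // On overflow the chunk has been drained, so simply loop again.
        }
        return out.toString();
    }
}
```

With this sketch, `Utf8FallbackDecoder.decode(bytes, StandardCharsets.ISO_8859_1)` returns the same text for "Malmö" whether the file was saved as UTF-8 or as ISO-8859-1. One limitation: ISO-8859-1 text that happens to form a valid UTF-8 sequence is still read as UTF-8, so the fallback is a heuristic, not a guarantee.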
Change History (6)
comment:1 by , 6 years ago
Owner: | changed from | to
---|---|
Status: | new → accepted |
comment:2 by , 6 years ago
comment:6 by , 6 years ago
Resolution: | → fixed |
---|---|
Status: | accepted → closed |
I have tested this: https://github.com/raek/utf8-with-fallback which has support for UTF-8 and fallback to either ISO-8859-1 or Windows-1252. It seems to work as promised, but there are a few things to consider:
- The charset cannot be looked up with the `Charset.forName()` method, which is what we use more or less all the time. Instead it has to be created with `new Utf8WithFallbackCharsetProvider().charsetForName()`.
- This means that the fallback is not a usable setting for the `defaultCharset` in `base.config`, since it will likely fail unless we change our code in a lot of places. It is also likely to affect extensions.

I think we should go easy to begin with and only include it in a few places in BASE, for example importer plug-ins that read from text files. More specifically, if we add support for the fallback character sets to the `FlatFileParser`, a lot of plug-ins should be able to use it, but the option needs to be enabled in the GUI as well.
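To illustrate the lookup issue with `Charset.forName()`, the following sketch shows one way a single helper could ask an explicitly constructed `CharsetProvider` first and only then fall back to `Charset.forName()`. The class `CharsetLookup` is a hypothetical name for this example and is not existing BASE code. The sketch assumes that the provider returns `null` for names it does not know, as the `CharsetProvider` contract describes.

```java
import java.nio.charset.Charset;
import java.nio.charset.spi.CharsetProvider;

// Hypothetical helper for illustration only; not part of BASE.
final class CharsetLookup {
    private final CharsetProvider extra;

    // 'extra' is a provider that is instantiated directly by the caller.
    public CharsetLookup(CharsetProvider extra) {
        this.extra = extra;
    }

    /**
     * Resolves a charset name by asking the extra provider first.
     * If the provider does not know the name (returns null), the
     * regular Charset.forName() lookup is used instead.
     */
    public Charset forName(String name) {
        Charset cs = extra.charsetForName(name);
        return cs != null ? cs : Charset.forName(name);
    }
}
```

A caller could construct it as `new CharsetLookup(new Utf8WithFallbackCharsetProvider())` and pass whatever charset name the library documents. Code that calls `Charset.forName()` directly would still need to be changed to go through such a helper, which is the cost mentioned above.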