Opened 5 years ago

Closed 5 years ago

#2156 closed enhancement (fixed)

Check UTF-8 for text files that are uploaded without a selected character set

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: major Milestone: BASE 3.15
Component: core Version:
Keywords: Cc:

Description

When uploading files to BASE, the "Character set" option is typically left at -n/a- unless the user manually selects an encoding, which is rare.

This can cause problems later on, since parsing the text file then has to rely on the defaultCharset setting in base.config, which has a default value of ISO-8859-1.

While it is not possible to check every possible character encoding, checking whether a file is UTF-8 is possible, and I think we should implement such a check in the file upload. If the file is detected to be UTF-8 compatible, we set that as its character set; otherwise we keep the character set empty.
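A minimal sketch of such a UTF-8 check, using only the standard java.nio.charset API. The class and method names (Utf8Check, isValidUtf8) are hypothetical and chosen for illustration; this is not the code that was committed for this ticket.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of BASE.
public class Utf8Check {

    /**
     * Returns true if the given bytes decode as UTF-8 without any
     * malformed or unmappable sequence, false otherwise.
     */
    public static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently
        // substituting U+FFFD for invalid byte sequences.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Note that a pure ASCII file also passes this check, since ASCII is a subset of UTF-8. A real upload would probably test the stream in chunks rather than load the whole file into memory.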

Change History (4)

comment:1 by Nicklas Nordborg, 5 years ago

Owner: changed from everyone to Nicklas Nordborg
Status: new → accepted

comment:2 by Nicklas Nordborg, 5 years ago

In 7623:

References #2156: Check UTF-8 for text files that are uploaded without a selected character set

Implemented a utility class, CharsetDetector, that can be used for simple testing of the encoding of text files. It works best with encodings that can be technically detected. For example, UTF-8 is very unlikely to be mixed up with other encodings, while any of the ISO-8859-x encodings can typically decode any file without error and therefore cannot be told apart by validity alone. The StringDetector is intended for discriminating between ISO-8859-x encodings, but it requires prior knowledge of text that is expected to be found in the file and that is unique to one encoding.
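The idea of discriminating between ISO-8859-x encodings by looking for expected text can be sketched as follows. This is only an illustration under assumptions: the class and method names (ExpectedTextCheck, firstMatching) are hypothetical and do not reflect the actual StringDetector API, and the sketch decodes the whole byte array per candidate.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not the StringDetector from BASE.
public class ExpectedTextCheck {

    /**
     * Decodes the data with each candidate charset in order and returns
     * the first one whose decoded text contains the expected string.
     */
    public static Optional<Charset> firstMatching(
            byte[] data, String expected, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            String decoded = new String(data, candidate);
            if (decoded.contains(expected)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }
}
```

For example, the byte 0xA4 is the euro sign in ISO-8859-15 but the generic currency sign in ISO-8859-1, so a file that is known to contain "\u20AC" would match only the former.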

The file upload functionality has been extended to check for UTF-8 text files. The check is enabled automatically when the MIME type is set to something in the 'text/*' subset and no character set has been explicitly specified.
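The condition for enabling the check could look roughly like this. The class and method names (UploadCharsetRule, shouldCheckUtf8) are hypothetical, and treating an empty string the same as a missing character set is an assumption rather than something stated in this ticket.

```java
import java.util.Locale;

// Hypothetical helper, not part of BASE.
public class UploadCharsetRule {

    /**
     * Returns true when the upload is a text file ('text/*' MIME type)
     * and the user has not selected a character set.
     */
    public static boolean shouldCheckUtf8(String mimeType, String charset) {
        boolean isText = mimeType != null
            && mimeType.toLowerCase(Locale.ROOT).startsWith("text/");
        // Assumption: null and "" both mean "no character set selected".
        boolean noCharset = charset == null || charset.isEmpty();
        return isText && noCharset;
    }
}
```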

comment:3 by Nicklas Nordborg, 5 years ago

In 7624:

References #2156: Check UTF-8 for text files that are uploaded without a selected character set

Added test code for testing the CharsetDetector.

comment:4 by Nicklas Nordborg, 5 years ago

Resolution: fixed
Status: accepted → closed