Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#1432 closed enhancement (fixed)

TableWriter should provide support for escaping data

Reported by: Nicklas Nordborg Owned by: everyone
Priority: critical Milestone: BASE 2.15
Component: core Version:
Keywords: Cc:

Description

The TableWriter class is used, for example, when creating BASEfile:s. As it turns out, the current implementation will break if the data contains 'forbidden' characters, since they are not escaped. For example, if the data contains a tab it will break the file format, since tab is used as the column delimiter. Newlines are escaped in some places but not in others, which also breaks the file.

Support for escaping data has to be added at various levels in the export code. We will also need corresponding functionality for unescaping the data in the importer code. The question is how BASE 1 handled this. If BASE 1 used some kind of escaping we should at least be compatible with it, since that is what BASE 1 plug-ins expect. If BASE 1 doesn't escape data either, we have more freedom.

Change History (5)

comment:1 by Nicklas Nordborg, 14 years ago

I have been investigating code from BASE 1, but I can't find any evidence that any form of escaping is handled. Here are a few of the more obvious places I would expect encoding to take place...

So it seems that BASEfile:s in BASE 1 assume that no tabs, newlines, etc. are present in the data. If there is that kind of data, the BASEfile is likely to break.

This means that we can choose an encoding scheme without having to worry about backwards compatibility. I can see two alternatives:

  1. One-way encoding: All 'bad' characters are replaced by, for example, a space. The benefit is that the BASEfile is guaranteed to have a correct format and that no special decoding is needed when reading. The drawback is that the data is changed and there is no way to go back to exactly the original data. This doesn't matter much for descriptive texts, but may be critical for names and ID:s.
  2. Two-way encoding: 'Bad' characters are encoded in a way that makes it possible to decode the data and get back the exact original strings. The following should be enough:
    • newline --> \n
    • carriage return --> \r
    • tab --> \t
    • backslash --> \\
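
The two-way scheme listed above can be sketched as follows. This is a minimal illustration only, not code from BASE: the class name `BackslashEscaper` and its methods are hypothetical, and it assumes the backslash itself is escaped as a double backslash so that a literal backslash followed by `n` in the data is not mistaken for an encoded newline.

```java
// Hypothetical helper for illustration; not part of the BASE code base.
final class BackslashEscaper {

    /** Replaces backslash, tab, newline and carriage return with two-character escapes. */
    static String encode(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '\t': out.append("\\t"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Reverses encode(). Unknown escapes and a trailing lone backslash are kept unchanged. */
    static String decode(String escaped) {
        StringBuilder out = new StringBuilder(escaped.length());
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c != '\\' || i + 1 == escaped.length()) {
                out.append(c);
                continue;
            }
            char next = escaped.charAt(++i);
            switch (next) {
                case '\\': out.append('\\'); break;
                case 't': out.append('\t'); break;
                case 'n': out.append('\n'); break;
                case 'r': out.append('\r'); break;
                default: out.append('\\').append(next);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "name\twith tab\nand newline";
        String encoded = encode(original);
        // The encoded form contains no raw tab or newline, so it is safe as a single cell.
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(original));
    }
}
```

Because every backslash in the input is doubled before writing, decoding can treat any backslash as the start of an escape, which is what makes the round trip exact.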

The second option is already partly implemented in some parts of BASEfile reading/writing in BASE 2. When writing, section names and header values have newlines and carriage returns encoded (but not tabs), but nothing is decoded when reading. The only exception is that the Base1PluginExecuter decodes \n to newline and \r to carriage return, but only for the 'descr' field. No other fields are decoded.

I guess the only reason that everything still works is that no data is using any of the 'bad' characters. The situation in BASE 2 is just as bad as in BASE 1.

comment:2 by Nicklas Nordborg, 14 years ago

(In [5188]) References #1432: TableWriter and related classes should provide support for escaping data

This should add support to the TableWriter. The default is to not encode anything to keep things backwards compatible. The actual encoding scheme can be plugged in by implementing the EncoderDecoder interface.
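
As a rough sketch of what a pluggable encoding scheme with a non-encoding default could look like: the names below (`SimpleEncoderDecoder`, `SimpleTableWriter`, `SpaceReplacingEncoder`) and all method signatures are assumptions made for illustration and are not taken from the actual TableWriter or EncoderDecoder code in [5188].

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

// Stand-in for an encoder/decoder plug-in point; method names are assumed.
interface SimpleEncoderDecoder {
    String encode(String value);
    String decode(String value);
}

// Hypothetical tab-separated writer; not the real TableWriter API.
final class SimpleTableWriter {
    private final Writer out;

    // Default: pass values through unchanged to stay backwards compatible.
    private SimpleEncoderDecoder encoder = new SimpleEncoderDecoder() {
        public String encode(String value) { return value; }
        public String decode(String value) { return value; }
    };

    SimpleTableWriter(Writer out) {
        this.out = out;
    }

    void setEncoder(SimpleEncoderDecoder encoder) {
        this.encoder = encoder;
    }

    /** Writes one row; each cell is passed through the encoder before the tab delimiter is added. */
    void writeRow(String... cells) {
        try {
            for (int i = 0; i < cells.length; i++) {
                if (i > 0) out.write('\t');
                out.write(encoder.encode(cells[i]));
            }
            out.write('\n');
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// One-way example: tab, newline and carriage return become a space; decode cannot restore them.
final class SpaceReplacingEncoder implements SimpleEncoderDecoder {
    public String encode(String value) { return value.replaceAll("[\\t\\n\\r]", " "); }
    public String decode(String value) { return value; }
}
```

With this shape, existing callers that never set an encoder get exactly the old output, while a caller that needs a well-formed file can opt in with something like `writer.setEncoder(new SpaceReplacingEncoder())`.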

Maybe the specific case of BASEfiles should have its own ticket. It's a bit more complicated, since it includes both reading and writing, and there is still some old code in Base1PluginExecuter that should be fixed.

comment:3 by Nicklas Nordborg, 14 years ago

Hmmm... I have checked out the code that we have for reading and writing BASEfile:s in various places... and I don't know if we should touch the BASEfile format as it is implemented now. I think it will break existing plug-ins if we start to encode things. There is a lot of inconsistent joining and splitting of strings that seems to work as long as the data itself doesn't include tabs, newlines or some other 'special' characters. There is inconsistent use of delimiters in headers and also in the data part. In some places a forward slash is used (bioassay ids), in some cases a comma (annotation values), in some places the string '\t' (used columns in plug-in definition) and in some places a real tab (used columns in data export).

So, unless someone is prepared to test, re-configure and fix program code for BASE 1 plug-ins, my suggestion is that we leave the Base1PluginExecuter and the BASEfile reading and writing as they are. Things that are working now will continue to do so as long as the data doesn't include any problematic characters.

Avoid tabs, newlines, backslash and comma in names of bioassays, reporters, formulas, annotation types, annotation values and extra values.

comment:4 by Nicklas Nordborg, 14 years ago

Resolution: fixed
Status: new → closed
Summary: TableWriter and related classes should provide support for escaping data → TableWriter should provide support for escaping data

The support that was added to the TableWriter class can be useful later on.

comment:5 by Jari Häkkinen, 14 years ago (in reply to: 3)

Replying to nicklas:

I agree, if it works don't try to fix it. Many of the BASE1 plug-ins have no maintainer.
