#1432 closed enhancement (fixed)
TableWriter should provide support for escaping data
Reported by: | Nicklas Nordborg | Owned by: | everyone |
---|---|---|---|
Priority: | critical | Milestone: | BASE 2.15 |
Component: | core | Version: | |
Keywords: | Cc: |
Description
The TableWriter class is used, for example, when creating BASEfile:s. As it turns out, the current implementation will break if the data contains 'forbidden' characters, since they are not escaped. For example, if the data contains a tab it will break the file format, since tab is used as the column delimiter. Newlines are escaped in some places but not in others, which also breaks the file.
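The failure mode can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from the BASE code; they only mimic a writer that joins fields with a tab and a reader that splits on it.

```java
/** Hypothetical illustration only; not part of the BASE code base. */
class TabBreakDemo
{
    /** Joins fields with a tab, without escaping anything (as the ticket describes). */
    static String joinRow(String... fields)
    {
        return String.join("\t", fields);
    }

    /** Splits a line on tabs; the -1 limit keeps trailing empty columns. */
    static String[] splitRow(String line)
    {
        return line.split("\t", -1);
    }

    public static void main(String[] args)
    {
        // Three fields are written, but the second one contains a tab...
        String line = joinRow("id1", "my\tname", "42");
        // ...so the reader no longer sees three columns.
        System.out.println(splitRow(line).length);
    }
}
```

A reader that expects three columns would either reject the row or shift every value after the embedded tab into the wrong column.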
Support for escaping data has to be added at various levels in the export code. We will also need corresponding functionality for unescaping the data in the importer code. The question is how BASE 1 handled this. If BASE 1 used some kind of escaping we should at least be compatible with that, since that is what BASE 1 plug-ins expect. If BASE 1 doesn't escape data either, we have more freedom.
Change History (5)
comment:1 by , 15 years ago
comment:2 by , 15 years ago
(In [5188]) References #1432: TableWriter and related classes should provide support for escaping data
This should add support to the TableWriter. The default is to not encode anything to keep things backwards compatible. The actual encoding scheme can be plugged in by implementing the EncoderDecoder interface.
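As a rough sketch of what a pluggable scheme could look like, the example below backslash-escapes the characters that break a tab-separated file. The `FieldCodec` interface is a hypothetical stand-in defined only for this example; the real EncoderDecoder interface may have different method names and signatures, and the choice of backslash escaping is an assumption, not the scheme decided on in this ticket.

```java
/** Hypothetical stand-in for a pluggable encoder; NOT the actual BASE interface. */
interface FieldCodec
{
    String encode(String raw);
    String decode(String encoded);
}

/** Backslash-escapes tab, newline, carriage return and backslash itself. */
class BackslashCodec implements FieldCodec
{
    @Override
    public String encode(String raw)
    {
        StringBuilder sb = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++)
        {
            char c = raw.charAt(i);
            switch (c)
            {
                // The backslash must be escaped too, or decoding becomes ambiguous.
                case '\\': sb.append("\\\\"); break;
                case '\t': sb.append("\\t"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    @Override
    public String decode(String encoded)
    {
        StringBuilder sb = new StringBuilder(encoded.length());
        for (int i = 0; i < encoded.length(); i++)
        {
            char c = encoded.charAt(i);
            if (c == '\\' && i + 1 < encoded.length())
            {
                char next = encoded.charAt(++i);
                switch (next)
                {
                    case 't': sb.append('\t'); break;
                    case 'n': sb.append('\n'); break;
                    case 'r': sb.append('\r'); break;
                    case '\\': sb.append('\\'); break;
                    // Unknown escape sequence: keep both characters unchanged.
                    default: sb.append('\\').append(next);
                }
            }
            else
            {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With a codec like this, an encoded value never contains a raw tab or newline, so it can be written as a single column, and `decode(encode(x))` returns the original value.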
Maybe the specific case of BASEfiles should have its own ticket. It's a bit more complicated and includes both reading and writing, and there is still some old code in Base1PluginExecuter that should be fixed.
follow-up: 5 comment:3 by , 15 years ago
Hmmm... I have checked out the code that we have for reading and writing BASEfile:s in various places... and I don't know if we should touch the BASEfile format as it is implemented now. I think it will break existing plug-ins if we start to encode things. There is a lot of inconsistent joining and splitting of strings that seems to work as long as the data itself doesn't include tabs, newlines or some other 'special' characters. There is inconsistent use of delimiters in headers and also in the data part. In some places a forward slash is used (bioassay ids), in some cases a comma (annotation values), in some places the string '\t' (used columns in plug-in definition) and in some places a real tab (used columns in data export).
So, unless someone is prepared to test, re-configure and fix program code for BASE 1 plug-ins, my suggestion is that we leave the Base1PluginExecuter and BASEfile reading and writing as it is. Things that are working now will continue to do so as long as the data doesn't include any problematic character.
Avoid tabs, newlines, backslash and comma in names of bioassays, reporters, formulas, annotation types, annotation values and extra values.
comment:4 by , 15 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Summary: | TableWriter and related classes should provide support for escaping data → TableWriter should provide support for escaping data |
The support that was added to the TableWriter class can be useful later on.
comment:5 by , 15 years ago
Replying to nicklas:
I agree, if it works don't try to fix it. Many of the BASE1 plug-ins have no maintainer.
I have been investigating code from BASE 1, but I can't find any evidence that any form of escaping is handled. Here are a few of the more obvious places I would expect encoding to take place...
So it seems like BASEfile:s in BASE 1 assumes that no tabs, newlines, etc. are present in the data. If there is that kind of data the BASEfile is likely to break.
This means that we can choose an encoding scheme without having to worry about backwards compatibility. I can see two alternatives:
The second option is already partly implemented in some parts of BASEfile reading/writing in BASE 2. When writing, section names and header values have newlines and carriage returns encoded (but not tabs), but nothing is decoded when reading. The only exception is that the Base1PluginExecuter decodes \n to newline and \r to carriage return, but only for the 'descr' field. No other fields are decoded. I guess the only reason that everything still works is that no data is using any of the 'bad' characters. The situation in BASE 2 is just as bad as in BASE 1.
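The asymmetry described above can be sketched as follows. The method names are hypothetical and the bodies only model the behaviour as described in this comment (newline and carriage return encoded on write, tabs left alone, nothing decoded on read); they are not copied from the BASE 2 source.

```java
/** Hypothetical model of the described behaviour; not actual BASE 2 code. */
class AsymmetricHeaderDemo
{
    /** Write side as described: newline and carriage return are encoded, tabs are not. */
    static String writeValue(String value)
    {
        return value.replace("\n", "\\n").replace("\r", "\\r");
    }

    /** Read side as described: the stored text is returned without any decoding. */
    static String readValue(String stored)
    {
        return stored;
    }

    public static void main(String[] args)
    {
        String original = "line1\nline2";
        String roundTripped = readValue(writeValue(original));
        // The value does not survive the round trip: the reader sees a
        // literal backslash followed by 'n' instead of a newline.
        System.out.println(original.equals(roundTripped));
    }
}
```

A value containing a newline comes back changed, and a value containing a tab is written unmodified and still splits into extra columns, which is why things only work while the data avoids these characters.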