Ticket #1440: bfs-generic-1.txt

File bfs-generic-1.txt, 6.2 KB (added by Nicklas Nordborg, 14 years ago)

Generic specification of various file types in BFS

This document discusses some generic rules and guidelines for
formatting and parsing files using the BFS format. Specific use
cases for BFS files are likely to define additional rules,
particularly with regard to the metadata file. The only current
use case we have in mind is to use BFS for passing data to and from
external plug-ins. In the future the BFS format may be used for other
use cases.

We define three different file types in BFS:

 * Metadata files
 * Annotation files
 * Data files


Common to all files
====================

All files are text-based and use the UTF-8 character encoding.

A newline character (\n) is used as a record separator and
a tab character (\t) is used as a column separator.

Escape sequences
----------------

Data that contains tabs or newlines needs to be escaped. We will use
a backslash (\) to indicate the start of an escape sequence. This means
that a backslash must also be escaped. Since some editors include a
carriage return in line breaks, we should also escape carriage
return (\r).

Here is the very simple escape table:
 <backslash>       --> \\
 <newline>         --> \n
 <carriage return> --> \r
 <tab>             --> \t

It is recommended that parsers are forgiving: if an invalid escape
sequence is found, e.g. a backslash followed by anything other than
\, n, r or t, the input is taken literally. Strict parsers may throw
exceptions or log warning messages.

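The escape table above can be sketched in Python as follows. This is a
minimal illustration and not part of the specification: the names
bfs_escape and bfs_unescape and the 'strict' flag are hypothetical. With
strict=False an invalid sequence is kept literally, as recommended above.

```python
# Hypothetical helpers illustrating the BFS escape table.
ESCAPES = {"\\": "\\\\", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
UNESCAPES = {"\\": "\\", "n": "\n", "r": "\r", "t": "\t"}


def bfs_escape(text: str) -> str:
    """Replace backslash, newline, carriage return and tab by escape sequences."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def bfs_unescape(text: str, strict: bool = False) -> str:
    """Reverse bfs_escape; invalid sequences are kept literally unless strict."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt in UNESCAPES:
                out.append(UNESCAPES[nxt])
                i += 2
                continue
            if strict:
                raise ValueError(f"invalid escape sequence at offset {i}")
            # Forgiving mode: fall through and keep the backslash as-is.
        out.append(ch)
        i += 1
    return "".join(out)
```

Escaping followed by unescaping returns the original string, so a writer
can escape every field before joining the fields with tabs.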

Numerical values
----------------

Numeric values should use dot (.) as decimal point. Scientific notation
is accepted. Null, NaN, Infinity, and other special values should all
be represented by empty string values. It is recommended that parsers
are forgiving if invalid numerical data is found.

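One possible reading of these rules in Python is sketched below. The names
parse_number and format_number are hypothetical, and treating invalid
input as a missing value is only one way of being "forgiving"; a specific
use case may prefer to log a warning instead.

```python
import math
from typing import Optional


def parse_number(field: str) -> Optional[float]:
    """Parse a numeric field; empty or invalid input yields None."""
    text = field.strip()
    if not text:
        return None  # empty string stands for null, NaN, Infinity, ...
    try:
        value = float(text)  # accepts '.' decimals and scientific notation
    except ValueError:
        return None  # assumption: forgiving parsers treat bad data as missing
    # Tokens such as 'nan' or 'inf' are not valid BFS values either.
    return value if math.isfinite(value) else None


def format_number(value: Optional[float]) -> str:
    """Write a number with '.' as decimal point; special values become ''."""
    if value is None or not math.isfinite(value):
        return ""
    return repr(float(value))
```

Python's float() and repr() do not depend on the locale, so the decimal
point is always a dot.
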

Comments, etc.
--------------
Lines starting with '#' are comment lines and should be ignored.

Empty lines (= lines with only white-space) should be ignored.

White-space: spaces, tabs and other characters that match '\s'
in regular expressions.

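A small Python sketch of these rules follows. The function name
significant_lines is hypothetical. It takes the rule literally: only a '#'
in the very first position marks a comment, so a line with white-space
before the '#' is kept.

```python
import re
from typing import Iterable, Iterator

_BLANK = re.compile(r"\s*")


def significant_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines that are neither comments nor empty, without the newline."""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            continue  # comment line
        if _BLANK.fullmatch(line):
            continue  # empty line or white-space only
        yield line
```

Note that annotation files and data files do not support comment lines, so
this filter is only meant for the metadata file.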


Metadata file
=============

The metadata file contains information about the other files
in the file-set. It can also contain information that is specific
to each use case. This file contains key-value pairs in multiple
sections.


Beginning-of-file (BOF) marker
------------------------------
A BFS metadata file should start with the string 'BFSformat',
optionally followed by a tab and a value. This must be the
first line in the file. The value is used as an indication of
the sub-type of the file.

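Checking the BOF marker could look like the Python sketch below. The
function name parse_bof is hypothetical, and the sub-type 'serial' used in
the usage example is made up for illustration; actual sub-types are
defined by specific use cases.

```python
from typing import Optional


def parse_bof(first_line: str) -> Optional[str]:
    """Return the sub-type from the BOF line, or None if no value is given."""
    line = first_line.rstrip("\r\n")
    marker, sep, value = line.partition("\t")
    if marker != "BFSformat":
        raise ValueError("not a BFS metadata file: missing 'BFSformat' marker")
    return value if sep else None
```

For example, parse_bof("BFSformat\tserial\n") returns 'serial', while
parse_bof("BFSformat\n") returns None.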

Sections
--------
A section is started by surrounding a value in brackets,
e.g. [my section]

There is no restriction on the name of the section as long as it is
escaped using the normal rules. Note that there is no need to escape
brackets in the name, e.g. [[a,b]] is a valid section with the name
'[a,b]'. Trailing white-space should be ignored.

Multiple sections may have the same name and the order of sections
should not matter. However, this may be restricted in specific use
cases, which may require that section names are unique or come in a
specific order.

Generic parsers are recommended to provide access to sections by name
and by ordinal number, starting at 0. Generic writers are recommended
to write sections in the order they are added.

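Recognizing a section line can be sketched in Python as shown below. The
function names are hypothetical. The sketch assumes that "trailing
white-space" refers to white-space after the closing bracket, and it
leaves out unescaping of the name for brevity.

```python
from typing import List, Optional


def parse_section_header(line: str) -> Optional[str]:
    """Return the section name if the line is a section header, else None."""
    text = line.rstrip()  # assumption: white-space after ']' is ignored
    if len(text) >= 2 and text.startswith("[") and text.endswith("]"):
        # Only the outermost brackets are removed, so '[[a,b]]' gives '[a,b]'.
        return text[1:-1]
    return None


def sections_named(names: List[str], wanted: str) -> List[int]:
    """Ordinal numbers (starting at 0) of all sections with the given name."""
    return [i for i, name in enumerate(names) if name == wanted]
```

Since section names may repeat, a lookup by name returns a list of
ordinals rather than a single section.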

Section entries
---------------
Each section contains data in the form of tab-separated key-value
pairs. Keys may not start with # or [, since this would interfere
with comments and sections. Otherwise, the normal escape rules are
used. Values should also use the normal escape rules, except that
non-escaped tab characters are allowed. This makes it possible to
use vector-type values.

A key doesn't have to be unique within a section, but this may be
limited by specific use cases globally or on a section-by-section
basis. The order of the keys is usually not important, but some use
cases may need to preserve the order.

Generic reader implementations are recommended to provide access to
keys by name and by ordinal number, starting at 0. Generic writer
implementations are recommended to write keys and values in the order
they are added to each section.

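A generic reader along these lines could be sketched in Python as follows.
All names (Section, parse_sections, unescape) are hypothetical. The sketch
expects the lines that follow the BOF marker, stores every value as a list
so that vector-type values need no special handling, and keeps sections
and entries in file order so both can be reached by ordinal number.

```python
import re
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple

_UNESCAPE = {"\\": "\\", "n": "\n", "r": "\r", "t": "\t"}


def unescape(text: str) -> str:
    """Forgiving unescape: unknown sequences are left untouched."""
    return re.sub(r"\\([\\nrt])", lambda m: _UNESCAPE[m.group(1)], text)


@dataclass
class Section:
    name: str
    # (key, values) pairs in file order; keys may repeat.
    entries: List[Tuple[str, List[str]]] = field(default_factory=list)

    def values(self, key: str) -> List[List[str]]:
        """All values stored under a key, in file order."""
        return [vals for k, vals in self.entries if k == key]


def parse_sections(lines: Iterable[str]) -> List[Section]:
    """Parse the lines after the BOF marker into a list of sections."""
    sections: List[Section] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # empty line or comment
        header = line.rstrip()
        if len(header) >= 2 and header.startswith("[") and header.endswith("]"):
            sections.append(Section(unescape(header[1:-1])))
            continue
        if not sections:
            raise ValueError("entry found before the first section")
        # Splitting on raw tabs first keeps escaped tabs (\t) inside a value.
        key, *values = line.split("\t")
        sections[-1].entries.append((unescape(key), [unescape(v) for v in values]))
    return sections
```

With this layout, sections[0].entries[0] gives the first entry of the
first section, and a scalar value is simply a list with one element.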

Pre-defined sections and keys
-----------------------------

If the file-set includes files in addition to the metadata file, a 'files'
section is required that specifies the other files. Keys may have any
name and it is recommended that each key is unique. The value is the
filename.

[files]
file-1 abc123.txt
file-2 def456.txt
file-3 ghi789.txt

The files are expected to be located in the same 'directory' as the current
metadata file. A directory may be a folder in the file system, a zip-file,
or a similar container. Metadata about the file types and file content is
not part of the generic specification. Specific use cases may define
additional sections for holding metadata about the file content.

Note! The files don't have to be BFS type files. They can be image files,
pdf files, etc.

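As a small illustration of the "same directory" rule, the Python sketch
below turns the entries of a 'files' section into paths next to the
metadata file. The function name resolve_files and the path
'bfs/metadata.txt' are hypothetical. PurePosixPath is used because it only
manipulates names and uses '/' as separator, which also suits entries
inside a zip-file.

```python
from pathlib import PurePosixPath
from typing import List, Tuple


def resolve_files(metadata_path: str,
                  entries: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Pair each key with the path of its file next to the metadata file."""
    base = PurePosixPath(metadata_path).parent
    return [(key, str(base / name)) for key, name in entries]
```

For example, resolve_files("bfs/metadata.txt", [("file-1", "abc123.txt")])
returns [("file-1", "bfs/abc123.txt")].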

Annotation files
================

The first line is a header line containing the column names for each column.
The first column is required and must be 'ID'. Other columns are optional,
but must have unique names. Column names are separated with tabs and
are encoded using the normal rules.

All other lines are data lines. Each line must have exactly the same number
of columns as the header line.

Comment lines are not supported.

The ID column holds a unique identifier used internally by BASE. A given ID
should only be used once and may not be repeated later in the file. The ID
is a positive integer value. Zero and negative values are not allowed.
There is no special ordering (unless a specific use case requires this). Note
that the ID values are not coordinates. They don't have to start at 1 and there
may be "holes" in the range of values used. Some use cases may use ID values
with some specific meaning, other use cases may simply enumerate the rows using
a counter.

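The rules for annotation files can be checked with a reader like the
Python sketch below. The function name read_annotations is hypothetical,
the reader is deliberately strict (any violation raises ValueError), and
unescaping of the fields is left out for brevity.

```python
from typing import Dict, Iterable, List, Tuple


def read_annotations(lines: Iterable[str]) -> Tuple[List[str], Dict[int, List[str]]]:
    """Return (column names, rows keyed by ID) or raise ValueError."""
    rows = iter(lines)
    first = next(rows, None)
    if first is None:
        raise ValueError("missing header line")
    header = first.rstrip("\n").split("\t")
    if header[0] != "ID":
        raise ValueError("the first column must be 'ID'")
    if len(set(header)) != len(header):
        raise ValueError("column names must be unique")

    records: Dict[int, List[str]] = {}
    for number, raw in enumerate(rows, start=2):
        fields = raw.rstrip("\n").split("\t")
        if len(fields) != len(header):
            raise ValueError(f"line {number}: expected {len(header)} columns")
        id_text = fields[0]
        # IDs are positive integers; zero and negative values are rejected.
        if not (id_text.isascii() and id_text.isdigit()) or int(id_text) == 0:
            raise ValueError(f"line {number}: invalid ID {id_text!r}")
        row_id = int(id_text)
        if row_id in records:
            raise ValueError(f"line {number}: duplicate ID {row_id}")
        records[row_id] = fields
    return header, records
```

The returned dictionary keeps the rows in file order, and no assumption is
made about the IDs being sorted or contiguous.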

Data files
==========

A single data file is a matrix containing one data value for each row-column
element.

Data starts on the first line. There is no header line.

All data lines should have the same number of columns. The number of rows and
columns and their order are defined by other, use-case specific, information in
the metadata file or in annotation file(s).

Comment lines are not supported.

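Reading a data file as a matrix can be sketched in Python as follows. The
function name read_matrix is hypothetical. The values are kept as strings,
so conversion to numbers (with empty strings as missing values) and
unescaping are left to the caller.

```python
from typing import Iterable, List


def read_matrix(lines: Iterable[str]) -> List[List[str]]:
    """Read tab-separated rows and check that all rows have equal width."""
    matrix: List[List[str]] = []
    for number, raw in enumerate(lines, start=1):
        row = raw.rstrip("\n").split("\t")
        if matrix and len(row) != len(matrix[0]):
            raise ValueError(
                f"line {number}: expected {len(matrix[0])} columns, got {len(row)}"
            )
        matrix.append(row)
    return matrix
```

Because there is no header line and no comment support, every line of the
file is treated as a data row.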