Ticket #1440: bfs-generic-1.txt

File bfs-generic-1.txt, 6.2 KB (added by Nicklas Nordborg, 14 years ago)
Generic specification of various file types in BFS

1	This document discusses some generic rules and guidelines for
2	formatting and parsing files using the BFS format. Specific use
3	cases for BFS files are likely to define additional rules,
4	particularly with regard to the metadata file. The only current
5	use case we have in mind is to use BFS for passing data to and from
6	external plug-ins. In the future the BFS format may be used for other
7	use cases.
8
9	We define three different file types in BFS:
10
11	* Metadata files
12	* Annotation files
13	* Data files
14
15
16	Common to all files
17	====================
18
19	All files are text-based and use the UTF-8 character encoding.
20
21	A newline character (\n) is used as a record separator and
22	a tab character (\t) is used as a column separator.
23
24	Escape sequences
25	----------------
26
27	Data that contains tabs or newlines needs to be escaped. We will use
28	a backslash (\) to indicate the start of an escape sequence. This means
29	that a backslash must also be escaped. Since some editors include a
30	carriage return in line breaks, we should also escape carriage
31	return (\r).
32
33	Here is the very simple escape table:
34	<backslash> --> \\
35	<newline> --> \n
36	<carriage return> --> \r
37	<tab> --> \t
38
39	It is recommended that parsers are forgiving: if an invalid escape
40	sequence is found, e.g. a backslash followed by anything other than
41	\, n, r or t, the input is taken literally. Strict parsers may throw
42	exceptions or log warning messages.
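
The escape table and the forgiving/strict behaviour described above can be sketched as follows. This is a minimal illustration, not a reference implementation; the function names `bfs_escape` and `bfs_unescape` and the `strict` flag are assumptions made for the example.

```python
# Illustrative sketch of the BFS escape table (names are hypothetical).
_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
_UNESCAPES = {"\\": "\\", "n": "\n", "r": "\r", "t": "\t"}


def bfs_escape(value: str) -> str:
    """Escape backslash, newline, carriage return and tab."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def bfs_unescape(value: str, strict: bool = False) -> str:
    """Reverse bfs_escape.

    An invalid sequence (a backslash followed by anything other than
    \\, n, r or t) is kept literally, unless strict is True, in which
    case a ValueError is raised.
    """
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\":
            nxt = value[i + 1] if i + 1 < len(value) else ""
            if nxt in _UNESCAPES:
                out.append(_UNESCAPES[nxt])
                i += 2
                continue
            if strict:
                raise ValueError(f"invalid escape sequence at index {i}")
        out.append(ch)
        i += 1
    return "".join(out)
```

A value should survive a round trip, i.e. `bfs_unescape(bfs_escape(s)) == s` for any string `s`.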
43
44
45	Numerical values
46	----------------
47
48	Numeric values should use dot (.) as decimal point. Scientific notation
49	is accepted. Null, NaN, Infinity, and other special values should all
50	be represented by empty string values. It is recommended that parsers
51	are forgiving if invalid numerical data is found.
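
One possible reading of the numeric rules, as a sketch: the names `parse_bfs_number` and `format_bfs_number` are hypothetical, and mapping invalid input to `None` is only one way of being "forgiving" (a parser could equally log a warning).

```python
import math
from typing import Optional


def parse_bfs_number(text: str) -> Optional[float]:
    """Parse a numeric field; empty or invalid input yields None."""
    text = text.strip()
    if not text:
        return None  # empty string stands for null, NaN, Infinity, ...
    try:
        value = float(text)  # accepts '.' as decimal point and 1e-3 notation
    except ValueError:
        return None  # forgiving: treat invalid data as missing (assumption)
    return value if math.isfinite(value) else None


def format_bfs_number(value: Optional[float]) -> str:
    """Write a numeric field; special values become an empty string."""
    if value is None or not math.isfinite(value):
        return ""
    return repr(float(value))
```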
52
53	Comments, etc.
54	--------------
55	Lines starting with '#' are comment lines and should be ignored.
56
57	Empty lines (=lines with only white-space) should be ignored.
58
59	White-space: space, tabs and other characters that match '\s'
60	in regular expressions.
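
A reader could filter out comment lines and empty lines before any further parsing, roughly as sketched below. The helper name `significant_lines` is an assumption for the example. Note that annotation files and data files do not support comment lines, so this filter is mainly of use for metadata files.

```python
from typing import Iterable, Iterator


def significant_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines that are neither comments nor white-space only."""
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop the record separator only
        if not line.strip():
            continue  # empty line, or only white-space
        if line.startswith("#"):
            continue  # comment line
        yield line
```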
61
62
63
64	Metadata file
65	=============
66
67	The metadata file contains information about the other files
68	in the file-set. It can also contain information that is specific
69	for each use case. This file contains key-value pairs in multiple
70	sections.
71
72
73	Beginning-of-file (BOF) marker
74	------------------------------
75	A BFS metadata file should start with the string 'BFSformat',
76	optionally followed by a tab and a value. This must be the
77	first line in the file. The value is used as an indication of
78	the sub-type of the file.
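
Checking the BOF marker might look like the following sketch. The function name `read_bof_subtype` is hypothetical, and raising `ValueError` on a missing marker is a choice made for the example only.

```python
from typing import Optional

BOF_MARKER = "BFSformat"


def read_bof_subtype(first_line: str) -> Optional[str]:
    """Return the sub-type given after the BOF marker, or None if absent.

    Raises ValueError if the line does not start with the marker.
    """
    line = first_line.rstrip("\r\n")
    marker, sep, value = line.partition("\t")
    if marker != BOF_MARKER:
        raise ValueError("not a BFS metadata file: missing BOF marker")
    return value if sep else None
```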
79
80
81	Sections
82	--------
83	A section is started by surrounding a value in brackets.
84	E.g. [my section]
85
86	There is no restriction on the name of the section as long as it is
87	escaped using the normal rules. Note that there is no need to escape
88	brackets in the name, e.g. [[a,b]] is a valid section with the name
89	'[a,b]'. Trailing white-space should be ignored.
90
91	Multiple sections may have the same name and the order of sections
92	should not matter. However, this may be restricted in specific use
93	cases, which may require that section names are unique or come in a
94	specific order.
95
96	Generic parsers are recommended to provide access to sections by name
97	and by ordinal number, starting at 0. Generic writers are recommended
98	to write sections in the order they are added.
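
Recognising a section line can be sketched as below. The name `parse_section_header` is an assumption; the returned name is still in escaped form, and applying the normal escape rules to it is left to the caller.

```python
from typing import Optional


def parse_section_header(line: str) -> Optional[str]:
    """Return the (still escaped) section name, or None if not a section."""
    stripped = line.rstrip()  # trailing white-space is ignored
    if len(stripped) >= 2 and stripped.startswith("[") and stripped.endswith("]"):
        # Only the outermost pair of brackets is removed, so inner
        # brackets become part of the name.
        return stripped[1:-1]
    return None
```

With this approach `[[a,b]]` yields the name `[a,b]`, as described above.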
99
100
101	Section entries
102	---------------
103	Each section contains data in the form of tab-separated key-value
104	pairs. Keys may not start with # or [, since this would interfere
105	with comments and sections. Otherwise, the normal escape rules are
106	used. Values should also use the normal escape rules, except that
107	non-escaped tab characters are allowed. This makes it possible to
108	use vector-type values.
109
110	A key doesn't have to be unique within a section. But this may be
111	limited by specific use cases globally or on a section-by-section
112	basis. The order of the keys is usually not important, but some use
113	cases may need to preserve the order.
114
115	Generic reader implementations are recommended to provide access to
116	keys by name and by ordinal number, starting at 0. Generic writer
117	implementations are recommended to write keys and values in the order
118	they are added to each section.
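
A generic reader could keep entries in the order they appear and offer access both by key and by ordinal number. The sketch below assumes a hypothetical `Section` class; keys and values are kept in escaped form for brevity.

```python
from typing import List, Tuple

Entry = Tuple[str, List[str]]  # (key, list of values)


def parse_entry(line: str) -> Entry:
    """Split an entry line into a key and zero or more values.

    Non-escaped tabs after the key separate the elements of a
    vector-type value.
    """
    key, *values = line.rstrip("\r\n").split("\t")
    return key, values


class Section:
    """Ordered collection of entries; keys need not be unique."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: List[Entry] = []

    def add(self, line: str) -> None:
        self.entries.append(parse_entry(line))

    def by_index(self, index: int) -> Entry:
        """Access by ordinal number, starting at 0."""
        return self.entries[index]

    def by_key(self, key: str) -> List[List[str]]:
        """Return the values of every entry with the given key, in order."""
        return [values for k, values in self.entries if k == key]
```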
119
120
121	Pre-defined sections and keys
122	-----------------------------
123
124	If the file-set includes more files than the metadata file, a 'files'
125	section is required that specifies the other files. Keys may have any
126	name and it is recommended that each key is unique. The value is the
127	filename.
128
129	[files]
130	file-1	abc123.txt
131	file-2	def456.txt
132	file-3	ghi789.txt
133
134	The files are expected to be located in the same 'directory' as the current
135	metadata file. A directory may be a folder in the file system, a zip-file,
136	or a similar container. Metadata about the file types and file content is
137	not part of the generic specification. Specific use cases may define
138	additional sections for holding metadata about the file content.
139
140	Note! The files don't have to be BFS type files. They can be image files,
141	pdf files, etc.
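
When the 'directory' is an ordinary folder in the file system, the entries of the 'files' section can be resolved relative to the metadata file, as in this sketch. The function name `resolve_files` is hypothetical; zip-files and other containers would need different handling, and duplicate keys would overwrite each other here.

```python
from pathlib import Path
from typing import Dict, Iterable, Tuple


def resolve_files(
    metadata_path: str, entries: Iterable[Tuple[str, str]]
) -> Dict[str, Path]:
    """Map each key in the 'files' section to a path next to the metadata file."""
    base = Path(metadata_path).parent
    return {key: base / filename for key, filename in entries}
```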
142
143
144	Annotation files
145	================
146
147	The first line is a header line containing the column names for each column.
148	The first column is required and must be 'ID'. Other columns are optional,
149	but must have unique names. Column names are separated with tabs and
150	are encoded using the normal rules.
151
152	All other lines are data lines. Each line must have exactly the same number
153	of columns as the header line.
154
155	Comment lines are not supported.
156
157	The ID column holds a unique identifier used internally by BASE. A given ID
158	should only be used once and may not be repeated later in the file. The ID
159	is a numeric positive integer value. Zero and negative values are not allowed.
160	There is no special ordering (unless a specific use-case requires this). Note
161	that the ID values are not coordinates. They don't have to start at 1 and there
162	may be "holes" in the range of values used. Some use-cases may use ID values
163	with some specific meaning, other use-cases may simply enumerate the rows using
164	a counter.
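
The rules for annotation files can be checked while reading, roughly as in the sketch below. The function name `parse_annotations` is an assumption, column names and values are kept in escaped form, and raising `ValueError` on every violation is the strict choice; a forgiving parser might log a warning instead.

```python
from typing import Dict, Iterable


def parse_annotations(lines: Iterable[str]) -> Dict[int, Dict[str, str]]:
    """Read an annotation file into {ID: {column name: value}}."""
    it = iter(lines)
    first = next(it, None)
    if first is None:
        raise ValueError("missing header line")
    header = first.rstrip("\r\n").split("\t")
    if header[0] != "ID":
        raise ValueError("first column must be 'ID'")
    if len(set(header)) != len(header):
        raise ValueError("column names must be unique")

    result: Dict[int, Dict[str, str]] = {}
    for number, line in enumerate(it, start=2):
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != len(header):
            raise ValueError(f"line {number}: wrong number of columns")
        try:
            row_id = int(fields[0])
        except ValueError:
            raise ValueError(f"line {number}: ID is not an integer") from None
        if row_id <= 0:
            raise ValueError(f"line {number}: ID must be positive")
        if row_id in result:
            raise ValueError(f"line {number}: duplicate ID {row_id}")
        result[row_id] = dict(zip(header[1:], fields[1:]))
    return result
```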
165
166
167	Data files
168	==========
169
170	A single data file is a matrix containing one data value for each row-column
171	element.
172
173	Data starts on the first line. There is no header line.
174
175	All data lines should have the same number of columns. The number of rows and
176	columns and their order are defined by other, use-case specific, information in
177	the metadata file or in annotation file(s).
178
179	Comment lines are not supported.
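
Reading a data file into a matrix might be sketched as follows. The function name `parse_data_matrix` and its `expected_columns` parameter are hypothetical, and the example assumes that all values are numeric, which the generic rules do not require. Empty values are read as `None`.

```python
from typing import Iterable, List, Optional


def parse_data_matrix(
    lines: Iterable[str], expected_columns: Optional[int] = None
) -> List[List[Optional[float]]]:
    """Read a data file; every line must have the same number of columns."""
    matrix: List[List[Optional[float]]] = []
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        if expected_columns is None:
            expected_columns = len(fields)  # taken from the first line
        if len(fields) != expected_columns:
            raise ValueError(f"line {number}: expected {expected_columns} columns")
        matrix.append([float(f) if f else None for f in fields])
    return matrix
```

In practice the expected number of columns would come from the metadata file or the annotation file(s) rather than from the first data line.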
180