The BASE File Set (BFS) format is a collection of file formats that can be used together to transport all kinds of data. Its main use is to send spot data to a plug-in for analysis and then to import the analyzed results. We have tried to keep the format generic and extensible, so the BFS format may well be used for other applications in the future.
The idea is to use simple, plain-text files with data organised into rows and columns. A single type of file may not be able to hold all kinds of data, so to begin with we have defined three types of files:
Metadata files: Hold information about the data that is found in the other files in the file set.
Annotation files: Column-based files that hold one record per line. The first line is a header line. The remaining lines are data lines identified by a unique positive ID value in the first column.
Data files: Pure matrix data files without header lines or ID columns. Data is usually identified by matching it line-by-line with data in annotation files, or with information in the metadata file.
All files are text-based and should use the UTF-8 character encoding.
A newline (`\n`) is used as a record separator and a tab (`\t`) is used as a column separator. Data that contains newline or tab characters needs to be escaped. A backslash (`\`) is used to indicate the start of an escape sequence. This means that the backslash character must also be escaped. Since some editors include a carriage return (`\r`) in line breaks, carriage returns should also be escaped.
Table M.1. Escaped characters in the BFS format
Character | Escape sequence |
---|---|
<backslash> | `\\` |
<newline> | `\n` |
<carriage return> | `\r` |
<tab> | `\t` |
It is recommended that parsers are forgiving: if an invalid escape sequence is found, e.g. a backslash followed by anything other than `\`, `n`, `r` or `t`, the input is taken literally. Strict parsers may throw exceptions or log warning messages.
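The escape rules in Table M.1 can be sketched as a pair of helper functions. This is a minimal illustration, not part of BASE; the names `bfs_escape` and `bfs_unescape` are hypothetical. The decoder follows the forgiving behaviour recommended above and leaves invalid sequences untouched.

```python
# Hypothetical helpers; the mappings mirror Table M.1.
_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
_UNESCAPES = {"\\": "\\", "n": "\n", "r": "\r", "t": "\t"}


def bfs_escape(text: str) -> str:
    """Escape backslash, newline, carriage return and tab."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def bfs_unescape(text: str) -> str:
    """Decode escape sequences; invalid sequences are taken literally."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in _UNESCAPES:
            out.append(_UNESCAPES[text[i + 1]])
            i += 2
        else:
            # Ordinary character, a trailing backslash, or an invalid
            # escape sequence: keep the input as it is.
            out.append(ch)
            i += 1
    return "".join(out)
```

A strict parser would raise an exception or log a warning in the `else` branch when `ch` is a backslash.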
Numeric values should use a dot (`.`) as the decimal point. Scientific notation is accepted. Null, NaN, Infinity, and other special values should all be represented by empty string values. It is recommended that parsers are forgiving and treat invalid numerical data as empty values.
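As an illustration of the numeric rules, the sketch below converts between BFS fields and Python floats. The function names are hypothetical, and `None` is used here as the in-memory stand-in for an empty value. Python's `float()` would accept spellings such as `nan` and `inf`, so non-finite results are explicitly mapped to the empty value.

```python
import math
from typing import Optional


def parse_bfs_float(field: str) -> Optional[float]:
    """Forgiving parse: empty, invalid or non-finite input becomes None."""
    field = field.strip()
    if not field:
        return None
    try:
        value = float(field)  # accepts "1.5" as well as "1.5e3"
    except ValueError:
        return None  # invalid numerical data is treated as empty
    return value if math.isfinite(value) else None


def format_bfs_float(value: Optional[float]) -> str:
    """Write None, NaN and infinities as an empty string."""
    if value is None or not math.isfinite(value):
        return ""
    return repr(value)
```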
Lines starting with `#` are comment lines and should be ignored. Empty lines should also be ignored. A line that contains only white-space is considered empty. White-space means spaces, tabs and any other characters that match `\s` in regular expressions.
Note | |
---|---|
Comments and empty lines can only be used in metadata files. Annotation files and data files do not allow comments or empty lines. |
A BASE File Set usually contains one metadata file. This file contains information about the other files that make up the file set. The metadata file can also hold information that is specific to a use case.
A metadata file always starts with the beginning-of-file (BOF) marker `BFSformat`, optionally followed by a tab and a value indicating the subtype of the file. This must be the first line of the file. Comments or empty lines are not allowed before the beginning-of-file marker.
All data in a metadata file must be inside a section. A section is started by enclosing a name in brackets on a line of its own, for example `[my section]`. There is no restriction on the name of the section as long as it is escaped using the normal rules. Note that there is no need to escape brackets in the name. For example, `[[a\\b]]` is a valid section with the name `[a\b]`. Trailing white-space after the closing bracket is ignored.
Multiple sections may have the same name, and the order of the sections is usually of no concern. However, specific use cases may restrict this if there is a need to, for example by requiring unique section names or enforcing a specific order. Parsers are recommended to provide access to sections by name and by ordinal number, starting at 0. Writers are recommended to write sections in the order they are added.
Each section contains data in the form of tab-separated key-value pairs. Keys may not start with `#` or `[` since this would interfere with comments and sections. Otherwise, the normal escape rules should be used for both keys and values. Values are allowed to use non-escaped tab characters, which makes it possible to use vector-type values.
A key doesn't have to be unique within a section, but specific use cases may require this globally or on a section-by-section basis. The order of the keys is usually not important, unless the use case requires it. Parser implementations are recommended to provide access to keys by name and by ordinal number, starting at 0. Generic writer implementations are recommended to write keys and values in the order they are added to each section.
If the file set includes more files than the metadata file, those files should be listed in the `[files]` section. Keys should be unique, but there are no other restrictions. The value is the file name without path information. The files are expected to be located in the same container as the current metadata file. A container could for example be a folder in the file system, a zip file, or any other logical item that groups files. Metadata about the files and file content is not part of the generic BFS specification. This is left to specific use cases.
Note | |
---|---|
Files do not have to be other BFS file types. They can be any type of file, such as PDF files, images, etc. |
Example M.1. Example BFS metadata file
    BFSformat	subtype
    # The 'BFSformat' must be on the first line, subtype is optional
    # A comment line starts with '#'. Empty lines are ignored

    # A section is started by enclosing the section name in brackets
    # Section entries are key/value pairs separated by tab
    # Vector-type values are allowed. Duplicate keys may or may
    # not be allowed depending on the use case.
    [settings]
    key-1	value1
    key-2	value2a	value2b

    # The 'files' section points to additional files in the file set
    # Keys should be unique
    [files]
    report	report.txt
    table	tabla-data.txt
    plot	plotted-data.png
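To make the metadata rules concrete, here is a minimal parser sketch. It is not the BASE implementation; `parse_bfs_metadata` and its return shape are assumptions chosen for illustration. Sections and entries are kept in ordered lists so that duplicates and access by ordinal number remain possible. Stripping a trailing carriage return from each line is an extra leniency assumed here, not something the specification requires.

```python
from typing import List, Optional, Tuple

Entry = Tuple[str, List[str]]      # (key, values)
Section = Tuple[str, List[Entry]]  # (name, entries)

_UNESCAPES = {"\\": "\\", "n": "\n", "r": "\r", "t": "\t"}


def _unescape(text: str) -> str:
    """Forgiving decoder: invalid escape sequences are kept literally."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] in _UNESCAPES:
            out.append(_UNESCAPES[text[i + 1]])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def parse_bfs_metadata(text: str) -> Tuple[Optional[str], List[Section]]:
    """Return (subtype, sections) for the content of a metadata file."""
    lines = [line.rstrip("\r") for line in text.split("\n")]
    header = lines[0].split("\t")
    if header[0] != "BFSformat":
        raise ValueError("first line must start with the BFSformat marker")
    subtype = _unescape(header[1]) if len(header) > 1 and header[1] else None

    sections: List[Section] = []
    for line in lines[1:]:
        if not line.strip() or line.startswith("#"):
            continue  # empty (white-space only) and comment lines are ignored
        if line.startswith("["):
            stripped = line.rstrip()  # trailing white-space is ignored
            if not stripped.endswith("]"):
                raise ValueError("malformed section line: " + line)
            sections.append((_unescape(stripped[1:-1]), []))
            continue
        if not sections:
            raise ValueError("all data must be inside a section")
        key, *values = line.split("\t")  # extra tabs give vector values
        sections[-1][1].append((_unescape(key), [_unescape(v) for v in values]))
    return subtype, sections
```

With this shape, `sections[0]` gives access by ordinal number, and a lookup by name is a simple filter over the list.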
The first line is a header line containing the column names for each column. The first column is required and must always be `ID`. Other columns are optional, but must have unique names. Column names are separated with tabs and are encoded using the normal rules. All other lines are data lines. Each line must have exactly the same number of columns as the header line. Comment lines and empty lines are not supported, but a column may have an empty value.
The ID column holds a unique identifier used internally by BASE. A given ID should only be used once and may not be repeated later in the file. The ID is a positive integer value. Zero, negative or empty values are not allowed. There is no special ordering (unless a specific use case requires this). Note that the ID values are not indexes. They don't have to start at 1 and there may be "holes" in the range of values used. Some use cases may use ID values with some specific meaning, while other use cases may simply enumerate the rows using a counter.
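The annotation file rules can be checked mechanically. The sketch below is a strict reader under a few assumptions: the name `read_annotations` is hypothetical, a single newline terminating the last line is tolerated, and fields are returned still escaped to keep the example short.

```python
import re
from typing import List, Tuple


def read_annotations(text: str) -> Tuple[List[str], List[List[str]]]:
    """Return (header, rows); raise ValueError if a rule is violated."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # assumption: a final line terminator is acceptable
    if not lines:
        raise ValueError("missing header line")

    header = lines[0].split("\t")
    if header[0] != "ID":
        raise ValueError("the first column must be ID")
    if len(set(header)) != len(header):
        raise ValueError("column names must be unique")

    seen = set()
    rows = []
    for number, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != len(header):
            raise ValueError("line %d: wrong number of columns" % number)
        # IDs are positive integers: no sign, not empty, not zero.
        if not re.fullmatch(r"[0-9]+", fields[0]) or int(fields[0]) == 0:
            raise ValueError("line %d: invalid ID" % number)
        row_id = int(fields[0])
        if row_id in seen:
            raise ValueError("line %d: duplicate ID %d" % (number, row_id))
        seen.add(row_id)
        rows.append(fields)
    return header, rows
```

Note that IDs are only checked for uniqueness; gaps and arbitrary ordering are accepted, as described above.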
A data file is a matrix containing one data value for each row-column element. Data starts on the first line. There is no header line. All data lines must have the same number of columns. The number of rows and columns and their order are defined by other, use-case-specific information in the metadata file or in the annotation file(s). Comment lines and empty lines are not supported, but a column may hold an empty value.
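Reading a data file amounts to splitting on newlines and tabs and checking that the result is rectangular. A minimal sketch, assuming a hypothetical `read_data_matrix` helper that tolerates one final line terminator and leaves values as strings:

```python
from typing import List


def read_data_matrix(text: str) -> List[List[str]]:
    """Split a data file into rows and columns and check the shape."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # assumption: a final line terminator is acceptable
    rows = [line.split("\t") for line in lines]
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        raise ValueError("all data lines must have the same number of columns")
    return rows
```

The expected number of rows and columns is not known to this function; a caller would compare the shape against the annotation files or the metadata file.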
The use case is to use BFS to transport data to and from an external analysis plug-in. The general outline is:
Export bioassay set data to BFS.
Execute the external plug-in, which processes the data and generates a new BFS.
Import the transformed data to BASE.
The export will generate at least two files: one metadata file and one data file. It is also possible to export reporter and assay annotations if the plug-in needs them. Note that reporter and assay annotation files are always needed if new spot data is going to be imported, so in most cases at least four files will be created.
There are two subtypes:
serial: One data file is required for each assay. The columns in the data files represent different spot data values, e.g. first column = Ch 1, second column = Ch 2, etc.
matrix: One data file is required for each spot data value. The columns in the data files represent assays.
For both subtypes the `[files]` section is used to name the files holding data and annotations. The following entries should be used:
`rdata`: The filename of the file containing reporter annotations.
`pdata`: The filename of the file containing assay annotations.
`sdata1`, `sdata2`, ..., `sdataN`: N entries, numbered from 1 to N, with the filenames of the files containing spot data. If the serial subtype is used there should be one file for each assay in the bioassay set. If the matrix subtype is used, there should be one file for each entry in the `[sdata]` section.
Other files may be included if they use `x-` as a prefix.
Example:
    BFSformat	serial
    [files]
    rdata	reporters.txt
    pdata	assays.txt
    sdata1	Assay 1.txt
    sdata2	Assay 2.txt
    x-custom	custom.txt
The `[sdata]` section contains information about the spot data that is found in the `sdataX` files. The key of each entry is the name or title of the data that is exported. The value describes the data type and can be either `text`, `float` or `int`.
The order in this section is important. If the matrix subtype is used, the entries in this section must match the `sdataX` entries in the `[files]` section. E.g. the data that corresponds to the first entry in this section is found in the `sdata1` file. The number of entries in this section must be the same as the number of `sdataX` entries in the `[files]` section.
If the serial subtype is used, the entries in this section must match the column order in each of the `sdataX` files. E.g. the data that corresponds to the first entry in this section is found in the first column in all `sdataX` files. The number of entries in this section must match the number of columns in the `sdataX` files.
Example:
    [sdata]
    Ch 1	float
    Ch 2	float
    Weight	float
    Flag	int
The `[parameters]` section contains extra parameters needed by the plug-in. Keys and values are defined by the plug-in and/or job configuration. Duplicate keys are not allowed, and order is not important. Multiple values for the same parameter are separated with a tab character.
Example:
    [parameters]
    beta	0.5
    length	100
    vector	10	10.3	23
    median	true
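The `[files]`, `[sdata]` and `[parameters]` sections described above could be produced by a generic writer along the following lines. This is only a sketch; `write_metadata` and its argument layout are assumptions made for illustration. Each key and each individual value is escaped, while the tabs between the values of a vector are written as plain separators.

```python
from typing import List, Optional, Sequence, Tuple

_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def _escape(text: str) -> str:
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def write_metadata(
    subtype: Optional[str],
    sections: Sequence[Tuple[str, Sequence[Tuple[str, List[str]]]]],
) -> str:
    """Serialise sections and entries in the order they are given."""
    first = "BFSformat" if subtype is None else "BFSformat\t" + _escape(subtype)
    lines = [first]
    for name, entries in sections:
        lines.append("[" + _escape(name) + "]")
        for key, values in entries:
            if key.startswith(("#", "[")):
                raise ValueError("keys may not start with '#' or '['")
            lines.append("\t".join(_escape(v) for v in [key, *values]))
    return "\n".join(lines) + "\n"
```

For example, a serial export with one vector parameter:

```python
text = write_metadata("serial", [
    ("files", [("rdata", ["reporters.txt"]), ("sdata1", ["Assay 1.txt"])]),
    ("parameters", [("vector", ["10", "10.3", "23"])]),
])
```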
The file used for reporter annotations is given by the `rdata` entry in the `[files]` section. This file is optional when exporting but required when importing. The only required column is the `ID` column, which holds the internal spot position values. All `sdataX` files must have the same number of rows as this file (not counting the header line) and data should be sorted in the same order. Additional columns may be included in the export.
Note that the same underlying reporter may be assigned to more than one position. If the plug-in needs to operate on merged-per-reporter data the export should include either the internal or external reporter id in an additional column so that the plug-in can use this information to determine what should be merged. The exporter has no support for exporting merged data.
The file used for assay annotations is given by the `pdata` entry in the `[files]` section. This file is optional when exporting but required when importing. The only required column is the `ID` column, which holds the internal bioassay id values.
If the matrix subtype is used, the columns in the `sdataX` files must be in the same order as the assays appear in this file. The number of columns in the data files must be the same as the number of rows in this file (not counting the header line).
If the serial subtype is used, the `sdata1` file has data for the assay that is described in the first data line in this file, the `sdata2` file has data for the second assay, etc. The number of data files must match the number of data lines in this file.
Data files contain data in matrix format. More than one data file may be required. The organisation of the data depends on the BFS subtype. In both subtypes the number and order of the rows must match the number and order of rows in the reporter annotations file.
If the matrix subtype is used, the columns in the data files correspond to assays. The number of columns and their order must match the lines in the assay annotations file. The number of data files and their content is defined by the entries in the `[sdata]` section.
If the serial subtype is used, the number of columns and their order must match the entries in the `[sdata]` section. Each data file has data from one assay. The number of `sdataX` files in the `[files]` section must match the number of lines in the assay annotations file.
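The difference between the two subtypes comes down to which count determines the number of files and which determines the number of columns. The hypothetical helper below summarises the rules; the counts exclude header lines.

```python
from typing import Tuple


def expected_data_layout(
    subtype: str, n_reporter_rows: int, n_assays: int, n_sdata_entries: int
) -> Tuple[int, int, int]:
    """Return (number of data files, rows per file, columns per file)."""
    if subtype == "serial":
        # One file per assay; one column per [sdata] entry.
        return (n_assays, n_reporter_rows, n_sdata_entries)
    if subtype == "matrix":
        # One file per [sdata] entry; one column per assay.
        return (n_sdata_entries, n_reporter_rows, n_assays)
    raise ValueError("unknown subtype: " + subtype)
```

For a bioassay set with 1000 positions, 2 assays and 4 `[sdata]` entries, the serial subtype gives 2 files of 1000 x 4 values and the matrix subtype gives 4 files of 1000 x 2 values.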
The above information is mostly true for both export and import, but there are a few additional things that a plug-in should know about when generating data that is going to be imported. The most important thing is that both reporter and assay annotation files are required for importing spot data. If the program only generates extra files, the `[sdata]` section should not be included and no data or annotation files are needed.
All files are specified in the `[files]` section in the same way as for the export. File entries starting with `x-` will be uploaded to BASE and linked with the new bioassay set.
Note | |
---|---|
The importer currently supports importing spot data intensity values and extra files. Position/reporter mapping and child/parent assay mapping may remain the same or they may be changed. The importer can also upload additional files generated by the plug-in, for example plots. The importer has no support for importing extra values, reporter lists or annotations. |
In the metadata file, a `[settings]` section may be included to control certain aspects of the import. The following entries can be used:
`new-data-cube`: If this is set, the data is imported into a new data cube. A new data cube is needed whenever the position/reporter mappings have changed or when parent assays have been merged. This setting requires that the reporter annotations file contains information about the new mapping. It needs to include either an `Internal ID` or an `External ID` column so that the importer can map the new position to the correct reporter. The reporter must already exist in the database. The position values have no relation to the position values in the old bioassay set. We recommend that a plug-in simply enumerates the lines, starting at 1.
`multi-assay-parents`: If this is set, a child assay may have more than one parent assay (for example, due to a merge). A new data cube is needed and this setting is ignored unless `new-data-cube` is also set. This setting requires that the assay annotations file has a `Parent ID` column which holds a comma-separated list with the IDs of the parent assays.
`transform`: If not specified, the child spot data is assumed to use the same intensity transform as the parent data. To force a specific intensity transform for the child bioassay set, include this setting and choose one of the values: none, log2, log10.
In the metadata file, the presence of an `[sdata]` section indicates that spot data should be imported. If this section is not present, only extra files are uploaded to BASE and they are attached to the transformation instead of a child bioassay set. If the `[sdata]` section is present it must include one entry for each channel with names like `Ch 1`, `Ch 2`, and so on. The value is always `float`. All other entries in this section are ignored.
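The import rules above can be summarised as a small decision function. This is an illustrative sketch only and not how the BASE importer is implemented. It rests on two assumptions that the text does not settle: a setting counts as "set" when its key is present in `[settings]`, and channel entries are recognised by the pattern `Ch` followed by a space and a number. The name `plan_import` and the returned dictionary are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

VALID_TRANSFORMS = {"none", "log2", "log10"}


def plan_import(
    files: Dict[str, str],
    settings: Dict[str, str],
    sdata: Optional[List[Tuple[str, str]]],
) -> dict:
    """Decide what an import would do; sdata is None if [sdata] is absent."""
    plan = {"extra_files": [files[k] for k in files if k.startswith("x-")]}
    if sdata is None:
        # No [sdata] section: only extra files are uploaded.
        plan["mode"] = "extra-files-only"
        return plan

    missing = [key for key in ("rdata", "pdata") if key not in files]
    if missing:
        raise ValueError("spot data import requires: " + ", ".join(missing))

    # Assumption: channel entries look like "Ch 1"; others are ignored.
    channels = [name for name, _ in sdata if re.fullmatch(r"Ch [0-9]+", name)]
    if not channels:
        raise ValueError("[sdata] has no channel entries")

    transform = settings.get("transform")  # None: same as the parent data
    if transform is not None and transform not in VALID_TRANSFORMS:
        raise ValueError("unknown transform: " + transform)

    # Assumption: a setting is "set" when its key is present.
    new_cube = "new-data-cube" in settings
    plan.update(
        mode="spot-data",
        channels=channels,
        new_data_cube=new_cube,
        # Ignored unless new-data-cube is also set.
        multi_assay_parents=new_cube and "multi-assay-parents" in settings,
        transform=transform,
    )
    return plan
```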
In the reporter annotations file, the `ID` column should hold the position values. Values must be positive integers and duplicates are not allowed. The order of the values doesn't matter. If importing data to a new data cube, the reporter annotations file also needs either an `Internal ID` or an `External ID` column.
In the assay annotations file, the `ID` column usually holds the internal assay id of the parent assay. The exception is if the `multi-assay-parents` option has been enabled. In this case the id values have no special meaning, but the `Parent ID` column must have a comma-separated list with id values instead. The assay annotations file may optionally have a `Name` column. If present, the values in this column are used as names for the child assays. Otherwise, they are given default names (usually the same name as the parent assay).