M.1. The BFS (BASE File Set) format


The BASE File Set (BFS) format is a collection of file formats that can be used together to transport all kinds of data. Its main use is to send spot data to a plug-in for analysis and then to import the analyzed results. We have tried to keep the format generic and extensible, so it may well be usable for other applications in the future.

M.1.1. The basics of BFS

The idea is to use simple, plain-text files with data organised into rows and columns. A single type of file may not be able to hold all kinds of data, so to begin with we have defined three types of files:

Metadata files: Hold information about the data that is found in the other files in the file set.
Annotation files: Column-based files that hold one record per line. The first line is a header line. The remaining lines are data lines identified by a unique positive ID value in the first column.
Data files: Pure matrix data files without header lines or ID columns. Data is usually identified by matching it line-by-line with data in annotation files, or with information in the metadata file.

Character encoding

All files are text-based and should use the UTF-8 character encoding. A newline (\n) is used as a record separator and a tab (\t) is used as a column separator. Data that contains newline or tab characters needs to be escaped. A backslash (\) is used to indicate the start of an escape sequence. This means that the backslash character must also be escaped. Since some editors include a carriage return (\r) in line breaks, carriage returns should also be escaped.

Table M.1. Escaped characters in the BFS format

Character	Escape sequence
`<backslash>`	`\\`
`<newline>`	`\n`
`<carriage return>`	`\r`
`<tab>`	`\t`

It is recommended that parsers are forgiving: if an invalid escape sequence is found, e.g. a backslash followed by anything other than \, n, r or t, the input is taken literally. Strict parsers may throw exceptions or log warning messages.
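
The escape rules in Table M.1 and the forgiving behaviour described above can be sketched as follows. This is only an illustration in Python, not code taken from BASE; the function names bfs_escape and bfs_unescape are hypothetical.

```python
# Escape sequences from Table M.1: raw character -> escaped form.
_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
# Reverse lookup, keyed on the character that follows the backslash.
_UNESCAPES = {"\\": "\\", "n": "\n", "r": "\r", "t": "\t"}


def bfs_escape(value: str) -> str:
    """Escape backslash, newline, carriage return and tab."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def bfs_unescape(value: str) -> str:
    """Forgiving unescape: invalid escape sequences are kept literally."""
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        nxt = value[i + 1] if i + 1 < len(value) else None
        if ch == "\\" and nxt in _UNESCAPES:
            out.append(_UNESCAPES[nxt])
            i += 2
        else:
            # Ordinary character, or a backslash that does not start
            # a valid escape sequence.
            out.append(ch)
            i += 1
    return "".join(out)
```

A strict parser would raise an error in the else branch when ch is a backslash instead of copying it through.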

Numerical values

Numeric values should use dot (.) as decimal point. Scientific notation is accepted. Null, NaN, Infinity, and other special values should all be represented by empty string values. It is recommended that parsers are forgiving and treat invalid numerical data as empty values.
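
As a minimal sketch of these rules, a forgiving reader and a matching writer could look like this in Python. The function names are hypothetical, and mapping empty values to None is an assumption of this example rather than something the format prescribes.

```python
import math
from typing import Optional


def parse_bfs_number(text: str) -> Optional[float]:
    """Return a float, or None for empty, special or invalid values."""
    text = text.strip()
    if not text:
        return None
    try:
        number = float(text)  # accepts a dot decimal point and scientific notation
    except ValueError:
        return None  # forgiving: invalid data is treated as an empty value
    # float() also accepts "nan" and "inf"; BFS represents those as empty values.
    return number if math.isfinite(number) else None


def format_bfs_number(number: Optional[float]) -> str:
    """Write None, NaN and infinities as an empty string."""
    if number is None or not math.isfinite(number):
        return ""
    return repr(float(number))  # repr() always uses a dot as decimal point
```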

Comments and white-space

Lines starting with # are comment lines and should be ignored. Empty lines should also be ignored. A line that contains only white-space is considered empty. White-space means spaces, tabs and any other characters that match \s in regular expressions.

	Note
	Comments and empty lines can only be used in metadata files. Annotation files and data files don't allow them.

Metadata files

A BASE File Set usually contains one metadata file. This file contains information about the other files that make up the file set. The metadata file can also hold information that is specific to a use case.

A metadata file always starts with the beginning-of-file (BOF) marker BFSformat, optionally followed by a tab and a value indicating the subtype of the file. This must be the first line of the file. Comments or empty lines are not allowed before the beginning-of-file marker.

All data in a metadata file must be inside a section. A section is started by enclosing a name in brackets on a line of its own, for example, [my section]. There is no restriction on the name of the section as long as it is escaped using the normal rules. Note that there is no need to escape brackets in the name. For example, [[a\\b]] is a valid section with the name [a\b]. Trailing white-space after the closing bracket is ignored.

Multiple sections may have the same name, and the order of the sections is usually of no concern. However, specific use cases may restrict this, for example by requiring unique section names or enforcing a specific order. Parsers are recommended to provide access to sections by name and by ordinal number, starting at 0, and writers are recommended to write sections in the order they are added.

Each section contains data in the form of tab-separated key-value pairs. Keys may not start with # or [ since this would interfere with comments and sections. Otherwise, the normal escape rules should be used for both keys and values. Values are allowed to use non-escaped tab characters, which makes it possible to use vector-type values.

A key doesn't have to be unique within a section, but specific use cases may require this globally or on a section-by-section basis. The order of the keys is usually not important, unless the use case requires it. Parser implementations are recommended to provide access to keys by name and by ordinal number, starting at 0. Generic writer implementations are recommended to write keys and values in the order they are added to each section.

If the file set includes more files than the metadata file, those files should be listed in the [files] section. Keys should be unique, but there are no other restrictions. The value is the file name without path information. The files are expected to be located in the same container as the current metadata file. A container could for example be a folder in the file system, a zip file, or any other logical item that groups files. Metadata about the files and file content is not part of the generic BFS specification. This is left to specific use cases.

	Note
	The files don't have to be other BFS file types. They can be any type of file, such as PDF files, images, etc.

Example M.1. Example BFS metadata file

BFSformat	subtype
# The 'BFSformat' marker must be on the first line, subtype is optional
# A comment line starts with '#'. Empty lines are ignored

# A section is started by enclosing the section name in brackets
# Section entries are key/value pairs separated by tab
# Vector-type values are allowed. Duplicate keys may or may
# not be allowed depending on the use case.
[settings]
key-1	value1
key-2	value2a	value2b

# The 'files' section points to additional files in the file set
# Keys should be unique
[files]
report	report.txt
table	tabla-data.txt
plot	plotted-data.png
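
A file like the one in Example M.1 could be read with a small parser along the following lines. This is a sketch in Python under stated assumptions, not the parser used by BASE: the names parse_metadata and unescape are hypothetical, sections are returned as an ordered list so that duplicate names and ordinal access both work, and tolerating a trailing carriage return on each line is an assumption of this example.

```python
import re
from typing import List, Tuple

# One section: (name, [(key, [value, ...]), ...]) in file order.
Section = Tuple[str, List[Tuple[str, List[str]]]]

_UNESCAPES = {"\\": "\\", "n": "\n", "r": "\r", "t": "\t"}


def unescape(value: str) -> str:
    # Forgiving: only valid escape sequences are replaced, the rest is kept.
    return re.sub(r"\\([\\nrt])", lambda m: _UNESCAPES[m.group(1)], value)


def parse_metadata(text: str) -> Tuple[str, List[Section]]:
    """Return (subtype, sections) for the content of a BFS metadata file."""
    lines = text.split("\n")
    first = lines[0].rstrip("\r").split("\t")
    if first[0] != "BFSformat":
        raise ValueError("first line must start with the BFSformat marker")
    subtype = unescape(first[1]) if len(first) > 1 else ""
    sections: List[Section] = []
    for line in lines[1:]:
        line = line.rstrip("\r")  # assumption: tolerate CRLF line endings
        if not line.strip() or line.startswith("#"):
            continue  # empty line or comment line
        if line.startswith("["):
            stripped = line.rstrip()  # trailing white-space is ignored
            if not stripped.endswith("]"):
                raise ValueError("malformed section line: %r" % line)
            sections.append((unescape(stripped[1:-1]), []))
            continue
        if not sections:
            raise ValueError("entry found outside of a section")
        key, *values = line.split("\t")
        sections[-1][1].append((unescape(key), [unescape(v) for v in values]))
    return subtype, sections
```

With this layout, sections[0] gives access by ordinal number, and a lookup by name is a simple filter over the list.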

Annotation files

The first line is a header line containing the column names for each column. The first column is required and must always be ID. Other columns are optional, but must have unique names. Column names are separated with tabs and are encoded using the normal rules. All other lines are data lines. Each line must have exactly the same number of columns as the header line. Comment lines and empty lines are not supported, but a column may have an empty value.

The ID column holds a unique identifier used internally by BASE. A given ID should only be used once and may not be repeated later in the file. The ID is a positive integer value. Zero, negative or empty values are not allowed. There is no special ordering (unless a specific use case requires this). Note that the ID values are not indexes. They don't have to start at 1 and there may be "holes" in the range of values used. Some use cases may use ID values with some specific meaning, other use cases may simply enumerate the rows using a counter.
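
The rules for annotation files can be checked while reading, as in the following Python sketch. The function name read_annotations is hypothetical, and unescaping of the cell values is left out to keep the example short.

```python
import re
from typing import Dict, List, Tuple


def read_annotations(text: str) -> Tuple[List[str], Dict[int, List[str]]]:
    """Return (header, rows) where rows maps each ID to its remaining cells."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a final newline terminates the last record
    if not lines:
        raise ValueError("missing header line")
    header = lines[0].split("\t")
    if header[0] != "ID":
        raise ValueError("the first column must be named ID")
    if len(set(header)) != len(header):
        raise ValueError("column names must be unique")
    rows: Dict[int, List[str]] = {}
    for number, line in enumerate(lines[1:], start=2):
        cells = line.split("\t")
        if len(cells) != len(header):
            raise ValueError("line %d: expected %d columns" % (number, len(header)))
        if not re.fullmatch(r"[0-9]+", cells[0]) or int(cells[0]) == 0:
            raise ValueError("line %d: ID must be a positive integer" % number)
        row_id = int(cells[0])
        if row_id in rows:
            raise ValueError("line %d: duplicate ID %d" % (number, row_id))
        rows[row_id] = cells[1:]  # empty cell values are allowed
    return header, rows
```

The returned dictionary keeps the file order of the IDs, which matters when data files are matched against it line by line.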

Data files

A data file is a matrix containing one data value for each row-column element. Data starts on the first line. There is no header line. All data lines must have the same number of columns. The number of rows and columns and their order are defined by other, use-case specific, information in the metadata file or in annotation file(s). Comment lines and empty lines are not supported, but a column may hold an empty value.
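
Since a data file carries no identifiers of its own, a reader typically pairs its rows with the IDs from an annotation file, line by line. A minimal Python sketch, with hypothetical function names and cell values kept as unparsed strings:

```python
from typing import Dict, List


def read_data_matrix(text: str) -> List[List[str]]:
    """Split a data file into rows of cells and check that it is rectangular."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a final newline terminates the last record
    rows = [line.split("\t") for line in lines]
    if any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("all data lines must have the same number of columns")
    return rows


def rows_by_id(ids: List[int], matrix: List[List[str]]) -> Dict[int, List[str]]:
    """Match data rows with annotation IDs given in the same line order."""
    if len(ids) != len(matrix):
        raise ValueError("the data file and the annotation file differ in length")
    return dict(zip(ids, matrix))
```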

M.1.2. Using BFS for spot data to and from external plug-ins

The use case is to use BFS to transport data to and from an external analysis plug-in. The general outline is:

Export bioassay set data to BFS.
Execute the external plug-in, which processes the data and generates a new BFS.
Import the transformed data to BASE.

The export will generate at least two files: one metadata file and one data file. It is also possible to export reporter and assay annotations if the plug-in needs them. Note that reporter and assay annotation files are always needed if new spot data is going to be imported, so in most cases at least four files will be created.

The metadata file

There are two subtypes:

serial: One data file is required for each assay. The columns in the data files represent different spot data values, e.g. first column = Ch 1, second column = Ch 2, etc.
matrix: One data file is required for each spot data value. The columns in the data files represent assays.

For both subtypes the [files] section is used to name the files holding data and annotations. The following entries should be used:

rdata: The filename of the file containing reporter annotations
pdata: The filename of the file containing assay annotations
sdata1, sdata2, ..., sdataN: N entries, numbered from 1 to N, with the filenames of the files containing spot data. If the serial subtype is used, there should be one file for each assay in the bioassay set. If the matrix subtype is used, there should be one file for each entry in the [sdata] section.

Other files may be included if they use x- as a prefix.

Example:

BFSformat	serial
[files]
rdata	reporters.txt
pdata	assays.txt
sdata1	Assay 1.txt
sdata2	Assay 2.txt
x-custom	custom.txt

The [sdata] section contains information about the spot data that is found in the sdataX files. The key of each entry is the name or title of the data that is exported. The value describes the data type and can be either text, float or int.

The order in this section is important. If the matrix subtype is used, the entries in this section must match the sdataX entries in the [files] section. For example, the data that corresponds to the first entry in this section is found in the sdata1 file. The number of entries in this section must be the same as the number of sdataX entries in the [files] section.

If the serial subtype is used, the entries in this section must match the column order in each of the sdataX files. For example, the data that corresponds to the first entry in this section is found in the first column in all sdataX files. The number of entries in this section must match the number of columns in the sdataX files.

Example:

[sdata]
Ch 1	float
Ch 2	float
Weight	float
Flag	int

The [parameters] section contains extra parameters needed by the plug-in. Keys and values are defined by the plug-in and/or job configuration. Duplicate keys are not allowed, and order is not important. Multiple values for the same parameter are separated with a tab character.

Example:

[parameters]
beta	0.5
length	100
vector	10	10.3	23
median	true

Reporter and assay annotations

The file used for reporter annotations is given by the rdata entry in the [files] section. This file is optional when exporting but required when importing. The only required column is the ID column, which holds the internal spot position values. All sdataX files must have the same number of rows as this file (not counting the header line) and data should be sorted in the same order. Additional columns may be included in the export.

Note that the same underlying reporter may be assigned to more than one position. If the plug-in needs to operate on merged-per-reporter data the export should include either the internal or external reporter id in an additional column so that the plug-in can use this information to determine what should be merged. The exporter has no support for exporting merged data.

The file used for assay annotations is given by the pdata entry in the [files] section. This file is optional when exporting but required when importing. The only required column is the ID column, which holds the internal bioassay id values. If the matrix subtype is used, the columns in the sdataX files must be in the same order as the assays appear in this file. The number of columns in the data files must be the same as the number of rows in this file (not counting the header line).

If the serial subtype is used, the sdata1 file has data for the assay that is described in the first line in this file, the sdata2 file has data for the second assay, etc. The number of data files must match the number of lines in this file.

Data files

Data files contain data in matrix format. More than one data file may be required. The organisation of the data depends on the BFS subtype. In both subtypes the number and order of the rows must match the number and order of rows in the reporter annotations file.

If the matrix subtype is used, the columns in the data files correspond to assays. The number of columns and their order must match the lines in the assay annotations file. The number of data files and their content is defined by the entries in the [sdata] section.

If the serial subtype is used, the number of columns and their order must match the entries in the [sdata] section. Each data file has data from one assay. The number of sdata files in the [files] section must match the number of lines in the assay annotations file.
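
The difference between the two subtypes comes down to which dimension selects the file and which selects the column. The following Python sketch summarises the rules above; the function names are hypothetical and all indexes passed in are zero-based.

```python
from typing import Tuple


def expected_layout(subtype: str, reporters: int, assays: int,
                    sdata_entries: int) -> Tuple[int, int, int]:
    """Return (number of data files, rows per file, columns per file)."""
    if subtype == "serial":
        # One file per assay, one column per [sdata] entry.
        return assays, reporters, sdata_entries
    if subtype == "matrix":
        # One file per [sdata] entry, one column per assay.
        return sdata_entries, reporters, assays
    raise ValueError("unknown subtype: %r" % subtype)


def locate(subtype: str, assay_index: int, sdata_index: int) -> Tuple[str, int]:
    """Return the [files] key and the column index holding one value series."""
    if subtype == "serial":
        return "sdata%d" % (assay_index + 1), sdata_index
    if subtype == "matrix":
        return "sdata%d" % (sdata_index + 1), assay_index
    raise ValueError("unknown subtype: %r" % subtype)
```

For example, with the [sdata] section shown earlier (Ch 1, Ch 2, Weight, Flag), the Ch 1 values of the second assay are in the first column of sdata2 for the serial subtype, and in the second column of sdata1 for the matrix subtype.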

Importing spot data

The above information is mostly true for both export and import, but there are a few additional things that a plug-in should know about when generating data that is going to be imported. The most important thing is that both reporter and assay annotation files are required for importing spot data. If the program only generates extra files, the [sdata] section should not be included and no data or annotation files are needed. All files are specified in the [files] section in the same way as for the export. File entries starting with x- will be uploaded to BASE and linked with the new bioassay set.

	Note
	The importer currently supports importing spot data intensity values and extra files. Position/reporter mapping and child/parent assay mapping may remain the same or they may be changed. The importer can also upload additional files generated by the plug-in, for example plots. The importer has no support for importing extra values, reporter lists or annotations.

In the metadata file, a [settings] section may be included to control certain aspects of the import. The following entries can be used:

new-data-cube: If this is set, the data is imported into a new data cube. A new data cube is needed whenever the position/reporter mappings have changed or when parent assays have been merged. This setting requires that the reporter annotations file contains information about the new mapping. It needs to include either an Internal ID or an External ID column so that the importer can map the new position to the correct reporter. The reporter must already exist in the database. The position values have no relation to the position values in the old bioassay set. We recommend that a plug-in simply enumerates the lines, starting at 1.
multi-assay-parents: If this is set, a child assay may have more than one parent assay (for example, due to a merge). A new data cube is needed, and this setting is ignored unless new-data-cube is also set. This setting requires that the assay annotations file has a Parent ID column which holds a comma-separated list with the IDs of the parent assays.
transform: If not specified, the child spot data is assumed to use the same intensity transform as the parent data. To force a specific intensity transform for the child bioassay set, include this setting and choose one of the values: none, log2, log10.

In the metadata file, the presence of an [sdata] section indicates that spot data should be imported. If this section is not present, only extra files are uploaded to BASE and they are attached to the transformation instead of a child bioassay set. If the [sdata] section is present, it must include one entry for each channel with names like Ch 1, Ch 2, and so on. The value is always float. All other entries in this section are ignored.

In the reporter annotations file, the ID column should hold the position values. Values must be positive integers and duplicates are not allowed. The order of the values doesn't matter. If importing data to a new data cube the reporter annotations file also needs either Internal ID or External ID columns.

In the assay annotations file, the ID column usually holds the internal assay id of the parent assay. The exception is if the multi-assay-parents option has been enabled. In this case the id values have no special meaning, but the Parent ID column must have a comma-separated list with id values instead.

The assay annotations file may optionally have a Name column. If present, the values in this column are used as names for the child assays. Otherwise, they are given default names (usually the same name as the parent assay).
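
As an illustration of what a plug-in might write for the import step, the following Python sketch builds a metadata file with [files] and [sdata] sections and an optional transform setting. The function name and its parameters are hypothetical. The new-data-cube and multi-assay-parents settings are left out because the text above does not specify which values they take.

```python
from typing import Dict, List, Optional


def escape(value: str) -> str:
    # Backslash must be replaced first so later replacements are not doubled.
    return (value.replace("\\", "\\\\").replace("\n", "\\n")
            .replace("\r", "\\r").replace("\t", "\\t"))


def build_import_metadata(subtype: str, reporters: str, assays: str,
                          data_files: List[str], channels: int,
                          extra_files: Optional[Dict[str, str]] = None,
                          transform: Optional[str] = None) -> str:
    """Return the text of a metadata file for importing spot data."""
    lines = ["BFSformat\t" + subtype]
    if transform is not None:
        if transform not in ("none", "log2", "log10"):
            raise ValueError("transform must be none, log2 or log10")
        lines += ["[settings]", "transform\t" + transform]
    lines.append("[files]")
    lines.append("rdata\t" + escape(reporters))
    lines.append("pdata\t" + escape(assays))
    for number, name in enumerate(data_files, start=1):
        lines.append("sdata%d\t%s" % (number, escape(name)))
    for key, name in (extra_files or {}).items():
        if not key.startswith("x-"):
            raise ValueError("extra file keys must use the x- prefix")
        lines.append(escape(key) + "\t" + escape(name))
    # One float entry per channel, named Ch 1, Ch 2, and so on.
    lines.append("[sdata]")
    for channel in range(1, channels + 1):
        lines.append("Ch %d\tfloat" % channel)
    return "\n".join(lines) + "\n"
```

The sketch does not check that the number of data files is consistent with the subtype; a real plug-in would also need to write the annotation and data files themselves.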