Ticket #1440: bfs-spotdata-import-1.txt

File bfs-spotdata-import-1.txt, 7.1 KB (added by nicklas, 2 years ago)

Specification for BFS with spotdata that is imported to BASE

This document describes how the BFS format is used with bioassay spot
data when communicating with plug-ins.

A typical plug-in execution sequence is:
 1. Export current data to BFS
 2. Execute the plug-in which processes the data
 3. Import the transformed data to BASE

This document discusses the import part of the procedure.

The import takes place after a plug-in has processed the exported data
and generated one or more output files. BASE can import the following
types of information:

 * Intensity values (logged or non-logged). One value for each channel is
   required.
 * Extra values. As many as the plug-in generates.
 * Reporter lists. As many as the plug-in generates.

The number of files needed and where to place the information depend on
the subtype (matrix or serial) that is used. A plug-in should output the
same subtype as it received as input.

A plug-in can also generate any other type of file, for example, images,
PDF files, etc. These files are simply uploaded to BASE and attached to the
new bioassay set.

The plug-in must create a metadata file so that the importer knows what
to look for.

The metadata file (import)
==========================

There are two BFS subtypes:

 * matrix: One data file is required for each value/formula to
   import. The columns in the data files represent assays.

 * serial: One data file is required for each assay. The columns
   in the data files represent values/formulas.

Files
-----

The [files] section is used to name the data files. The following
entries are recognised and required:

 * rdata: The filename of the file containing new reporter information. The
   ID column is always the position number, which must be a unique positive
   integer. Additional columns may be required depending on the import
   settings.
 * pdata: The filename of the file containing new assay information.
   The ID column is in most cases the ID of the parent assay, but
   if the 'multi-assay-parents' setting has been enabled, the ID can be any
   unique positive integer, and the Parent ID column holds a list of
   parent IDs.
 * sdata1,...,sdataN: N entries numbered from 1 to N with the filenames
   of the files containing spot data. If the 'serial' subtype is used there
   should be one file for each assay in the bioassay set. If the 'matrix'
   subtype is used there should be one file for each entry in the [sdata]
   section.

Additionally, all entries starting with 'x-' are considered to be extra files
that should be uploaded to BASE and attached to the new bioassay set.
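As an illustration of the key naming described above, the following Python
sketch assembles the entries of a [files] section. The helper name
build_files_entries and all filenames are hypothetical. How the entries are
serialised into the metadata file is not covered here, since this document
does not define the separator between key and value.

```python
def build_files_entries(rdata, pdata, sdata_files, extra_files=None):
    """Return ordered (key, filename) pairs for a [files] section.

    Hypothetical helper: only the key names follow the specification.
    """
    entries = [("rdata", rdata), ("pdata", pdata)]
    # Spot data files are numbered consecutively: sdata1, sdata2, ..., sdataN.
    for n, filename in enumerate(sdata_files, start=1):
        entries.append((f"sdata{n}", filename))
    # Keys starting with 'x-' name extra files that are only uploaded
    # and attached to the new bioassay set.
    for name, filename in (extra_files or {}).items():
        entries.append((f"x-{name}", filename))
    return entries


# Example with made-up filenames: two spot data files and one extra file.
entries = build_files_entries(
    "rdata.txt",
    "pdata.txt",
    ["sdata-a.txt", "sdata-b.txt"],
    {"plot": "plot.png"},
)
for key, filename in entries:
    print(key, filename)
```

With the 'serial' subtype the list of spot data files would hold one file per
assay; with the 'matrix' subtype it would hold one file per [sdata] entry.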

Settings
--------

The [settings] section is used to control some aspects of the import. The
following settings have been defined:

 * new-data-cube: If a value of '1' is specified, the data is imported into a
   new data cube. A new data cube is needed whenever the position/reporter
   mapping has been changed or when parent assays have been merged. When a
   new data cube is used, the 'rdata' file needs at least one of the
   'Internal ID' or 'External ID' columns so that the importer can map each
   position to a reporter.
 * multi-assay-parents: If a value of '1' is specified, it indicates that child
   assays may have more than one parent assay (e.g. due to a merge). A new
   data cube is needed, so this setting is ignored unless the
   'new-data-cube' setting has also been enabled. The 'pdata' file must have
   a 'Parent ID' column that holds a comma-separated list with the IDs of the
   parent assays.
 * transform: The intensity transform used by the child spot data. The values
   to choose from are: NONE, LOG2, LOG10. If not specified, the child spot
   data is assumed to use the same intensity transform as the parent data.
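The interplay between these settings can be checked before the metadata file
is written. The Python sketch below is one possible check, not part of BASE.
It assumes the settings have already been collected into a dict of strings,
and the function name check_settings is hypothetical.

```python
ALLOWED_TRANSFORMS = {"NONE", "LOG2", "LOG10"}


def check_settings(settings):
    """Return a list of problems found in a [settings] dict (hypothetical helper)."""
    problems = []
    new_cube = settings.get("new-data-cube") == "1"
    multi_parents = settings.get("multi-assay-parents") == "1"
    # multi-assay-parents has no effect unless a new data cube is requested.
    if multi_parents and not new_cube:
        problems.append("multi-assay-parents is ignored without new-data-cube")
    # transform is optional; when present it must be one of the listed values.
    transform = settings.get("transform")
    if transform is not None and transform not in ALLOWED_TRANSFORMS:
        problems.append(f"unknown transform: {transform}")
    return problems


print(check_settings({"multi-assay-parents": "1", "transform": "LN"}))
```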

Spot data
---------

The [sdata] section contains metadata about the spot data (intensity values
and spot extra values) that the plug-in generated. The order in this section
is important.

If the 'matrix' subtype is used, the order must correspond to the 'sdataX'
entries in the [files] section. E.g. the file named by the key 'sdata1' holds
data for the first entry in this section.

If the 'serial' subtype is used, the order must correspond to the column
order inside each of the 'sdataX' files. E.g. the first column holds data for
the first entry in this section.

Entries with keys like 'Ch 1', 'Ch 2', etc. are reserved and correspond to
channel intensities. There must be exactly one entry for each channel in the
experiment.

Data values are always float values but they may be logged. This is controlled
by the 'transform' setting. All intensities must use the same intensity
transform.

Entries starting with 'x-' are extra values. The values are either in separate
files (matrix subtype) or in their own columns (serial subtype). The value of
the entry is the data type of the extra value. Allowed values are: 'text',
'float' and 'int'. The part of the key after 'x-' should be the name or
external ID of an already existing extra value type.

Example:

[sdata]
ch1 float
ch2 float
x-abc float
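The rules for this section can be expressed as a small check. The Python
sketch below is an assumption-laden illustration: the function name
check_sdata is hypothetical, the entries are assumed to be available as an
ordered list of (key, type) pairs, and the exact spelling of the channel keys
is passed in by the caller instead of being hard-coded.

```python
ALLOWED_EXTRA_TYPES = {"text", "float", "int"}


def check_sdata(entries, channel_keys):
    """Return a list of problems found in ordered [sdata] entries.

    entries: list of (key, type) pairs in file order.
    channel_keys: the channel keys expected for the experiment.
    """
    problems = []
    keys = [key for key, _ in entries]
    # Exactly one entry is needed for each channel.
    for channel in channel_keys:
        if keys.count(channel) != 1:
            problems.append(f"expected exactly one entry for {channel!r}")
    for key, value_type in entries:
        if key in channel_keys and value_type != "float":
            # Intensity values are always floats.
            problems.append(f"{key!r} must have type 'float'")
        elif key.startswith("x-") and value_type not in ALLOWED_EXTRA_TYPES:
            problems.append(f"{key!r} has unsupported type {value_type!r}")
    return problems


example = [("ch1", "float"), ("ch2", "float"), ("x-abc", "float")]
print(check_sdata(example, ["ch1", "ch2"]))
```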

Reporter annotation file (import)
=================================

This file is used to link spot data with the correct positions in the bioassay
set. The required columns depend on whether the data is imported to the same
data cube as the parent or not.

 * ID: The position numbers. This column is always needed. Values must be
   positive integers and duplicates are not allowed. The order doesn't matter.
   Since the position number has no specific meaning, we recommend that
   plug-ins that generate data for a new data cube simply start at 1 and then
   increment the value for each line.
 * Internal ID or External ID: Either the internal or external IDs of the
   reporter that is assigned to the given position. At least one of these
   columns is needed when importing data to a new data cube. The same reporter
   may be assigned to more than one position, and the reporter must already
   exist in BASE.

All sdata files should have the same number of rows (not counting the header
line) as this file.
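A plug-in author may want to verify the ID column before handing the file to
the importer. The following Python sketch shows one way to do that. The
function name check_position_ids is hypothetical, and the ID values and
column names are assumed to have been read from the file already.

```python
def check_position_ids(ids, columns, new_data_cube):
    """Return a list of problems with a reporter annotation file.

    ids: the raw strings from the ID column.
    columns: the column names present in the file.
    new_data_cube: True if the data is imported into a new data cube.
    """
    problems = []
    seen = set()
    for raw in ids:
        try:
            position = int(raw)
        except ValueError:
            problems.append(f"ID {raw!r} is not an integer")
            continue
        if position < 1:
            problems.append(f"ID {raw!r} is not positive")
        elif position in seen:
            problems.append(f"ID {raw!r} is duplicated")
        seen.add(position)
    # A new data cube needs a reporter reference for every position.
    if new_data_cube and not {"Internal ID", "External ID"} & set(columns):
        problems.append("need an 'Internal ID' or 'External ID' column")
    return problems


print(check_position_ids(["1", "2", "2"], ["ID"], new_data_cube=True))
```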

Assay annotation file (import)
==============================

This file is used to link spot data with the correct child assay. This file
should have one entry for each child bioassay that should be created.

 * ID: Either the ID of a parent assay or a unique positive integer. This
   column is always needed. If the 'multi-assay-parents' setting is enabled
   there is no special meaning to the value, otherwise the ID must be the
   ID of the parent assay.
 * Name: An optional column. If present, the child assay is given the
   specified name. Otherwise a name is automatically generated, typically
   the same as that of the parent assay.
 * Parent ID: Required if 'multi-assay-parents' is enabled. The value is a
   comma-separated list of parent assay IDs.

If the 'serial' subtype is used, the number of lines in this file should match
the number of 'sdataX' entries in the [files] section. Data for the assay on
the first line is found in the file specified by sdata1 and so on.

If the 'matrix' subtype is used, the number of lines in this file should match
the number of columns in each of the 'sdataX' files. Data for the assay on the
first line is found in the first column in each data file and so on.

Data files (import)
===================

Data files should follow the same rules as exported data files.