Ticket #1440: bfs-spotdata-import-1.txt

File bfs-spotdata-import-1.txt, 7.1 KB (added by Nicklas Nordborg, 14 years ago)

Specification for BFS with spotdata that is imported to BASE

This document describes how the BFS format is used with bioassay spot
data when communicating with plug-ins.

A typical plug-in execution sequence is:
 1. Export current data to BFS
 2. Execute the plug-in which processes the data
 3. Import the transformed data to BASE


This document discusses the import part of the procedure.

The import takes place after a plug-in has taken some kind of action on
the exported data and generated one or more output files. BASE can import
the following types of information:

 * Intensity values (logged or non-logged). One value for each channel is
   required.
 * Extra values. As many as the plug-in generates.
 * Reporter lists. As many as the plug-in generates.

The number of files needed and where to place the information depend on
the subtype (matrix or serial) that is used. A plug-in should output the
same subtype as it received as input.

A plug-in can also generate any other type of file, for example, images,
PDF files, etc. These files are only uploaded to BASE and attached to the
new bioassay set.

The plug-in must create a metadata file so that the importer knows what
to look for.

The metadata file (import)
==========================

There are two BFS subtypes:

* matrix: One data file is required for each value/formula to
  import. The columns in the data files represent assays.

* serial: One data file is required for each assay. The columns
  in the data files represent values/formulas.

Files
-----

The [files] section is used to name the data files. The following
entries are recognised and required:

 * rdata: The filename of a file containing new reporter information. The ID
   column is always the position number which must be a unique positive
   integer. Additional columns may be required depending on the import
   settings.
 * pdata: The filename of the file containing new assay information.
   The ID column is in most cases the ID of the parent assay, but
   if the 'multi-assay-parents' setting has been enabled, the ID can be any
   positive unique integer, and the Parent ID column holds a list of
   parent IDs.
 * sdata1,...,sdataN: N entries numbered from 1 to N with the filenames
   of the files containing spot data. If the 'serial' subtype is used there
   should be one file for each assay in the bioassay set. If the 'matrix'
   subtype is used there should be one file for each entry in the [sdata]
   section.

Additionally, all entries starting with 'x-' are considered to be extra files
that should be uploaded to BASE and attached to the new bioassay set.

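The rules for the [files] section can be sketched as a small consistency
check. This is only an illustration and not the importer used by BASE: the
function name, the dict representation of the section and the two count
parameters are assumptions made for the example.

```python
import re

# Entries that the text above lists as required in the [files] section.
REQUIRED_KEYS = ("rdata", "pdata")


def check_files_section(files, subtype, n_assays, n_sdata_entries):
    """Check a [files] section given as a dict of key -> filename.

    Hypothetical helper. n_assays is the number of assays in the bioassay
    set, n_sdata_entries the number of entries in the [sdata] section.
    Returns (sdata_files, extra_files) or raises ValueError.
    """
    if subtype not in ("serial", "matrix"):
        raise ValueError(f"unknown subtype: {subtype!r}")

    missing = [key for key in REQUIRED_KEYS if key not in files]
    if missing:
        raise ValueError(f"missing required entries: {missing}")

    # Collect the sdataN entries and require numbering 1..N without gaps.
    numbered = {}
    for key, filename in files.items():
        match = re.fullmatch(r"sdata(\d+)", key)
        if match:
            numbered[int(match.group(1))] = filename
    if sorted(numbered) != list(range(1, len(numbered) + 1)):
        raise ValueError("sdata entries must be numbered 1..N without gaps")

    # serial: one file per assay; matrix: one file per [sdata] entry.
    expected = n_assays if subtype == "serial" else n_sdata_entries
    if len(numbered) != expected:
        raise ValueError(f"expected {expected} sdata files, got {len(numbered)}")

    sdata_files = [numbered[i] for i in range(1, len(numbered) + 1)]
    # Entries starting with 'x-' are extra files to upload and attach.
    extra_files = {k: v for k, v in files.items() if k.startswith("x-")}
    return sdata_files, extra_files
```

For a 'matrix' import with two [sdata] entries, a section naming 'sdata1' and
'sdata2' passes the check, while the same section fails for a 'serial' import
of four assays.
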
Settings
--------

The [settings] section is used to control some aspects of the import. The
following settings have been defined:

 * new-data-cube: If a value of '1' is specified the data is imported into a
   new data cube. A new data cube is needed whenever the position/reporter
   mapping has been changed or when parent assays have been merged. When a
   new data cube is used the 'rdata' file needs one of the 'Internal ID' or
   'External ID' columns so that the importer can map each position to a
   reporter.
 * multi-assay-parents: If a value of '1' is specified, it indicates that child
   assays may have more than one parent assay (e.g. due to a merge). This
   requires a new data cube, so the setting is ignored unless the
   'new-data-cube' setting has also been enabled. The 'pdata' file must have
   a 'Parent ID' column that holds a comma-separated list with the IDs of the
   parent assays.
 * transform: If not specified, the child spot data is assumed to use the same
   intensity transform as the parent data. The values to choose from are: NONE,
   LOG2, LOG10.

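How the three settings interact can be sketched as follows. This is a minimal
illustration under assumptions: the settings are taken to arrive as a dict of
strings, and the function name and the returned keys are made up for the
example.

```python
ALLOWED_TRANSFORMS = ("NONE", "LOG2", "LOG10")


def effective_settings(settings, parent_transform):
    """Resolve a [settings] section (dict of str -> str) to effective values.

    Hypothetical helper; parent_transform is the intensity transform of the
    parent data, which is used when 'transform' is not specified.
    """
    new_data_cube = settings.get("new-data-cube") == "1"

    # 'multi-assay-parents' is ignored unless 'new-data-cube' is also enabled.
    multi_parents = settings.get("multi-assay-parents") == "1" and new_data_cube

    transform = settings.get("transform", parent_transform)
    if transform not in ALLOWED_TRANSFORMS:
        raise ValueError(f"unknown transform: {transform!r}")

    return {
        "new_data_cube": new_data_cube,
        "multi_assay_parents": multi_parents,
        "transform": transform,
    }
```

With only 'multi-assay-parents' set to '1', the result reports both flags as
disabled, since the setting has no effect without a new data cube.
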
Spot data
---------

The [sdata] section contains metadata about the spot data (intensity values
and spot extra values) that the plug-in generated. The order in this section
is important.

If the 'matrix' subtype is used the order must correspond to the 'sdataX'
entries in the [files] section. E.g. the file named for key 'sdata1' is data
for the first entry in this section.

If the 'serial' subtype is used the order must correspond to the column
order inside each of the 'sdataX' files. E.g. the first column is data for
the first entry in this section.

Entries with keys like 'Ch 1', 'Ch 2', etc. are reserved and correspond to
channel intensities. There must be exactly one entry for each channel in the
experiment.

Data values are always float values but they may be logged. This is controlled
by the 'transform' setting. All intensities must use the same intensity
transform.

Entries starting with 'x-' are extra values. The values are either in separate
files (matrix subtype) or in their own columns (serial subtype). The value of
the entry is the data type of the extra value. Allowed values are: 'text',
'float' and 'int'. The part of the key after 'x-' should be the name or
external id of an already existing extra value type.

Example:

[sdata]
ch1 float
ch2 float
x-abc float

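As an illustration, the entries of an [sdata] section could be classified like
this. The sketch rests on assumptions: the entries are already parsed into
ordered (key, value) pairs, channel keys are accepted in both the 'Ch 1' and
the 'ch1' spelling because this document uses both, and keys of any other
kind are rejected, which the specification does not state. All names are
hypothetical.

```python
import re

EXTRA_TYPES = ("text", "float", "int")
# Assumption: accept 'Ch 1' as well as 'ch1' for channel entries.
CHANNEL_KEY = re.compile(r"ch\s*(\d+)", re.IGNORECASE)


def classify_sdata(entries, n_channels):
    """Split ordered (key, value) pairs into channels and extra values.

    Returns (channels, extras): channels maps channel number -> position in
    the section, extras is a list of (position, name, data_type) tuples.
    The position identifies the sdataX file (matrix) or the column (serial).
    """
    channels = {}
    extras = []
    for position, (key, value) in enumerate(entries):
        match = CHANNEL_KEY.fullmatch(key)
        if match:
            channel = int(match.group(1))
            if channel in channels:
                raise ValueError(f"duplicate entry for channel {channel}")
            channels[channel] = position
        elif key.startswith("x-"):
            if value not in EXTRA_TYPES:
                raise ValueError(f"invalid type {value!r} for {key!r}")
            extras.append((position, key[2:], value))
        else:
            # Assumption: anything else is treated as an error in this sketch.
            raise ValueError(f"unrecognised entry: {key!r}")

    # Exactly one entry is needed for each channel in the experiment.
    if sorted(channels) != list(range(1, n_channels + 1)):
        raise ValueError(f"expected one entry for each of {n_channels} channels")
    return channels, extras
```
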

Reporter annotation file (import)
=================================

This file is used to link spot data with the correct positions in the bioassay
set. The required columns depend on whether data is imported to the same data
cube as the parent or not.

 * ID: The position numbers. This column is always needed. Values must be
   positive integers and duplicates are not allowed. The order doesn't matter.
   Since the position number has no specific meaning, we recommend that
   plug-ins that generate data for a new data cube simply start at 1 and then
   increment the value for each line.
 * Internal ID or External ID: The internal or external ID of the
   reporter that is assigned to the given position. At least one of these
   columns is needed when importing data to a new data cube. The same reporter
   may be assigned to more than one position and the reporter must already
   exist in BASE.

All sdata files should have the same number of rows (not counting the header
line) as this file.

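The column and ID rules above can be sketched as a check on an already parsed
reporter annotation file. The representation (a list of column names plus one
dict per data row) and the function name are assumptions for the example.

```python
def check_reporter_rows(columns, rows, new_data_cube):
    """Validate a parsed reporter annotation file.

    columns: list of column names from the header line.
    rows: list of dicts keyed by column name, one per data line.
    Returns the number of data rows, which every sdata file should match.
    Hypothetical helper, not part of BASE.
    """
    if "ID" not in columns:
        raise ValueError("the ID column is always needed")
    if new_data_cube and not {"Internal ID", "External ID"} & set(columns):
        raise ValueError("a new data cube needs 'Internal ID' or 'External ID'")

    seen = set()
    for row in rows:
        position = int(row["ID"])  # raises ValueError for non-integer values
        if position <= 0:
            raise ValueError(f"position must be positive: {position}")
        if position in seen:
            raise ValueError(f"duplicate position: {position}")
        seen.add(position)
    return len(rows)
```

The order of the positions is not checked, since the text above states that
it does not matter.
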
Assay annotation file (import)
==============================

This file is used to link spot data with the correct child assay. This file
should have one entry for each child bioassay that should be created.

 * ID: Either the ID of a parent assay or a unique positive integer. This
   column is always needed. If the 'multi-assay-parents' option is enabled
   there is no special meaning to the value, otherwise the ID must be the
   ID of the parent assay.
 * Name: An optional column. If present, the child assay will be given the
   specified name. Otherwise a name is automatically generated, typically
   the same as the name of the parent assay.
 * Parent ID: Required if 'multi-assay-parents' is enabled. The value is a
   comma-separated list of parent assay IDs.

If the 'serial' subtype is used, the number of lines in this file should match
the number of 'sdataX' entries in the [files] section. Data for the assay on
the first line is found in the file specified by sdata1 and so on.

If the 'matrix' subtype is used, the number of lines in this file should match
the number of columns in each of the 'sdataX' files. Data for the assay on the
first line is found in the first column in each data file and so on.

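The mapping from a line in the assay annotation file to its spot data can be
sketched as below. This is an illustration under assumptions: assay lines are
counted from zero and exclude the header line, and the column index for the
'matrix' subtype counts data columns only. Both function names are
hypothetical.

```python
def parse_parent_ids(value):
    """Parse a 'Parent ID' value such as '12,15' into a list of integers."""
    return [int(part) for part in value.split(",") if part.strip()]


def locate_assay_data(subtype, assay_index, sdata_files):
    """Return (filename, column_index) pairs holding data for one assay.

    assay_index: zero-based line index in the assay annotation file.
    sdata_files: filenames for sdata1..sdataN, in that order.
    A column_index of None means the whole file belongs to the assay.
    """
    if subtype == "serial":
        # One file per assay: line i corresponds to the file of sdata(i+1).
        return [(sdata_files[assay_index], None)]
    if subtype == "matrix":
        # One column per assay: line i corresponds to column i in every file.
        return [(filename, assay_index) for filename in sdata_files]
    raise ValueError(f"unknown subtype: {subtype!r}")
```
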
Data files (import)
===================

Data files should follow the same rules as exported data files.