Ticket #1028: batchimport-3.txt

File batchimport-3.txt, 12.1 KB (added by Nicklas Nordborg, 16 years ago)

3rd version of developers view

=====================
Generic item importer
=====================

This is a description of an import plug-in that can be used to import almost any
kind of item from a tab-separated text file (or compatible). The plug-in should
be able to create new items or update existing items. In both cases it should be
able to set values for:

 * Simple properties, e.g. string values, numeric values, dates, etc.
 * Single-item references, e.g. protocol, label, software, owner, etc.
 * Multi-item references, e.g. the labeled extracts of a hybridization,
   pooled samples, etc. In some cases a multi-item reference is bundled
   with simple values, e.g. the used quantity of a source biomaterial, the array
   index a labeled extract is used on, etc. Multi-item references are never
   removed by the importer, only added or updated. Removing an item from a
   multi-item reference is a manual procedure to be done using the web
   interface.

The importer should not be able to set values for annotations since this is handled
by the already existing annotation importer plug-in. The annotation importer
and item importer should have similar behaviour and functionality to minimize
the learning cost for users.

Key features
------------

The item importer is only expected to work on a single type of item at each use
and should read data from a single file.

The importer should be able to work in 'dry-run' mode, i.e. everything is performed
as if a real import is taking place, but the work (transaction) is not committed
to the database.
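
A minimal sketch of how dry-run mode could be handled, assuming the import work
runs inside a transaction object with commit and rollback operations. The
Transaction interface and the DryRunSketch class are hypothetical names used for
illustration; they are not the BASE API.

```java
// Hypothetical stand-in for a database transaction (assumption, not the BASE API).
interface Transaction {
    void commit();
    void rollback();
}

final class DryRunSketch {
    /**
     * Runs the import work and commits only when dryRun is false.
     * Returns true if the work was committed, false if it was rolled back.
     */
    static boolean finish(Transaction tx, Runnable importWork, boolean dryRun) {
        try {
            importWork.run();
        } catch (RuntimeException e) {
            // A failed import is always rolled back, dry-run or not.
            tx.rollback();
            throw e;
        }
        if (dryRun) {
            // Everything was performed, but nothing is kept.
            tx.rollback();
            return false;
        }
        tx.commit();
        return true;
    }
}
```

The point of the design is that dry-run and real mode share the same code path
up to the final commit, so a dry-run reports the same errors as a real import.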

The importer should implement user-controlled error handling. Summary results,
e.g. the number of imported items, the number of failed items, etc., should be
tracked and reported in the job status message. More details about the error
handling options can be found later in this document.

The plug-in should support logging more detailed error messages, e.g. reasons for
failed items, line numbers, etc., to a separate file (in the BASE file system).
[IMPLEMENTATION NOTE] This file needs to be handled in a separate transaction,
otherwise a complete failure will erase the file as well when the transaction
is rolled back.

The plug-in should be configurable and be able to store file parsing settings,
including column mappings, etc. as plug-in configurations.

File format
-----------

The input file should be organised into columns separated by a specified character,
e.g. tab, comma, etc. Fixed-width columns are not supported. In other words, the
file must be one that can be parsed with the FlatFileParser class.

The first data line contains the column headers, which define the contents of each
column. The column headers can be mapped to item properties at each use of the
plug-in or by saving predefined settings as a plug-in configuration. This also
includes the separator character and other information that is needed to parse the
file. Saved configurations should implement auto-detection functionality.

Data for a single item may be split onto multiple lines. The first line contains
simple properties and single-item references, and the first multi-item reference.
If there are more multi-item references they should be on the following lines with
empty values in all other columns, except for the column holding the item
identifier, which must have the same value on all lines. If the following
lines contain other data, this should be ignored, or it may be considered an
error condition. It may be caused by giving two items the same name by accident.
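
The multi-line layout can be sketched as a grouping step that runs before any
item is created or updated. This is only an illustration under two assumptions:
the lines have already been split into columns, and all lines for one item are
adjacent in the file. ItemLineGrouper and the idColumn parameter are
hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ItemLineGrouper {
    /**
     * Groups consecutive rows that share the same value in the identifier
     * column. The first row of each group carries the simple properties;
     * any further rows carry additional multi-item references.
     */
    static List<List<String[]>> group(List<String[]> rows, int idColumn) {
        List<List<String[]>> groups = new ArrayList<>();
        String currentId = null;
        for (String[] row : rows) {
            String id = row[idColumn];
            if (currentId == null || !currentId.equals(id)) {
                // A new identifier starts a new item.
                groups.add(new ArrayList<>());
                currentId = id;
            }
            groups.get(groups.size() - 1).add(row);
        }
        return groups;
    }
}
```

A stricter variant could also check that the continuation rows are empty in all
columns except the identifier and the multi-item reference columns, and report
an error otherwise.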

When reading data for an item the plug-in needs to know if it should create a new
item or update an existing item. First, we need to know the method for identifying
items. Depending on the item type there are two or three options:

 * Using the internal 'id'. This is always unique.
 * Using the 'name'. This may or may not be unique.
 * Some items have an 'externalId'. This may or may not be unique.
 * Array slides may have a 'barcode' which is similar to the externalId.

Other items may have other properties that may be used for identification. It
would be good to implement the item lookup part in a way that makes it easy to
add new lookup methods when the need arises.

The plug-in should ask the user which method to use. The user must also tell the
plug-in among which items it should look for an item with a given 'name' or
'externalId'. There are four options, and the user may select one or several
of them:

 * Owned by the logged in user
 * Shared to the logged in user
 * In the current project
 * Owned by other users (only available if the logged in user has enough
   permissions, e.g. generic read permission for the item type)

If the 'id' method is used, the above options are not used. When the plug-in
is looking for an item there are three possible outcomes:

 * No item is found. This can be handled in different ways:
    - An error condition which aborts the plug-in
    - The line is ignored
    - A new item is created
 * One item is found. This is the item that is going to be updated.
 * More than one item is found. This can be handled in different ways:
    - An error condition which aborts the plug-in
    - The line is ignored
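
The three outcomes and their user-selected handling can be written down as a
small decision function. This is a sketch only; the enum and class names below
are hypothetical and chosen for this example.

```java
// What the user selected for the "no item found" case.
enum MissingItemPolicy { FAIL, SKIP_LINE, CREATE }

// What the user selected for the "more than one item found" case.
enum DuplicateItemPolicy { FAIL, SKIP_LINE }

// What the importer should do with the current line.
enum ImportAction { CREATE_ITEM, UPDATE_ITEM, SKIP_LINE, ABORT }

final class ItemLookupDecision {
    /** Maps the number of items found by the lookup to an action. */
    static ImportAction decide(int matchCount,
                               MissingItemPolicy missing,
                               DuplicateItemPolicy duplicate) {
        if (matchCount == 0) {
            switch (missing) {
                case CREATE:
                    return ImportAction.CREATE_ITEM;
                case SKIP_LINE:
                    return ImportAction.SKIP_LINE;
                default:
                    return ImportAction.ABORT;
            }
        }
        if (matchCount == 1) {
            // Exactly one match: this is the item to update.
            return ImportAction.UPDATE_ITEM;
        }
        return duplicate == DuplicateItemPolicy.SKIP_LINE
                ? ImportAction.SKIP_LINE
                : ImportAction.ABORT;
    }
}
```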


Parsing the data
================

Simple properties
-----------------
We need to know the data type of each property as a Type object.
The string values can then be parsed with Type.parseString().
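
To illustrate type-driven parsing, here is a self-contained sketch. The
SketchType enum is a hypothetical stand-in for the Type object mentioned above.
Its parse() method, and the decimal separator and date format arguments, are
assumptions made for this example and do not describe the actual signature of
Type.parseString().

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

// Hypothetical stand-in for a property data type (not the BASE Type class).
enum SketchType {
    STRING, INT, FLOAT, DATE;

    /** Converts a column value to an object of this type; empty values become null. */
    Object parse(String value, char decimalSeparator, String dateFormat) {
        if (value == null || value.isEmpty()) {
            return null;
        }
        switch (this) {
            case INT:
                return Integer.valueOf(value);
            case FLOAT:
                // Normalise the file's decimal separator before parsing.
                return Float.valueOf(value.replace(decimalSeparator, '.'));
            case DATE: {
                SimpleDateFormat format = new SimpleDateFormat(dateFormat);
                format.setLenient(false);
                try {
                    return format.parse(value);
                } catch (ParseException e) {
                    throw new IllegalArgumentException("Invalid date: " + value, e);
                }
            }
            default:
                return value;
        }
    }
}
```

Invalid values surface as an IllegalArgumentException (NumberFormatException is
a subclass), which gives the error handling options one place to hook in.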

Single-item references
----------------------

This is either the 'id', 'name', 'externalId' or another natural identifier
of the item. The plug-in should support those cases by a single column
mapping and an option to select which method to use.

[NOTE]
This creates two input parameters for each column, which may be too many...
Alternative options are:

 * A global option for all item references. This doesn't give the user
   any chance to use different methods for different items.
 * A global option with an 'auto' alternative that uses the 'id' method
   for numeric values, otherwise first the 'name' and if no item is
   found the 'externalId'.
 * No options at all. The "best" method is selected by the plug-in depending
   on the item that is going to be looked up and users are required to
   follow this.

[JH: I think one of the alternatives would be better than a plethora
of parameters ... I like the auto idea but will it create tough
conditions on names used on items?]
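
A sketch of the 'auto' alternative, to make the trade-off concrete. The three
lookup functions are passed in as parameters because the real query code is
outside the scope of this example; AutoLookup and its parameter names are
hypothetical.

```java
import java.util.List;
import java.util.function.Function;

final class AutoLookup {
    /**
     * Uses the 'id' method for numeric values. For other values it tries
     * the 'name' first and falls back to the 'externalId' if nothing is found.
     */
    static <T> List<T> find(String value,
                            Function<Integer, List<T>> byId,
                            Function<String, List<T>> byName,
                            Function<String, List<T>> byExternalId) {
        // Up to nine digits always fits in an int.
        if (value.matches("\\d{1,9}")) {
            return byId.apply(Integer.parseInt(value));
        }
        List<T> named = byName.apply(value);
        return named.isEmpty() ? byExternalId.apply(value) : named;
    }
}
```

With this rule an item whose name consists only of digits can never be found by
name, which is one concrete form of the naming concern raised in the comment
above.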

------

When looking for item references the plug-in doesn't have to use the same
setting for 'owned by', 'shared to', etc. as when looking for the main items.
In fact, this is not desired since many of those items are owned by the root
user or a system administrator, e.g. labels, software, hardware, etc. I don't
think it is practical to have another option for selecting this for each type
of item reference, so the default should be to look among all items the user
has access to (with use permission)...

There are three outcomes:

 * No item is found.
    - This can be an error condition that aborts the plug-in. If it is a
      required property this will always happen.
    - The link is ignored. No call is made to the setABC() method. Note that this
      case is different from having an empty column in which case
      setABC(null) would be called.
 * One item is found. This is the item we link to.
 * Multiple items are found.
    - This can be an error condition that aborts the plug-in.
    - The link is ignored.
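
The difference between an ignored link and an empty column can be sketched as
follows. ReferenceLinker is a hypothetical helper; the setter argument plays the
role of the setABC() method, and the single ignoreProblems flag is a
simplification of the separate 'not found' and 'multiple found' options.

```java
import java.util.List;
import java.util.function.Consumer;

final class ReferenceLinker {
    /**
     * Applies a single-item reference. Returns true if the setter was called.
     *
     * @param cell    the raw column value
     * @param matches the items found when looking up the cell value
     * @param setter  stands in for the item's setABC() method
     */
    static <T> boolean link(String cell, List<T> matches,
                            Consumer<T> setter, boolean ignoreProblems) {
        if (cell == null || cell.isEmpty()) {
            // Empty column: the link is explicitly cleared.
            setter.accept(null);
            return true;
        }
        if (matches.size() == 1) {
            setter.accept(matches.get(0));
            return true;
        }
        if (ignoreProblems) {
            // Not found or ambiguous: leave the existing link untouched.
            return false;
        }
        throw new IllegalStateException(
                matches.isEmpty() ? "Referenced item not found: " + cell
                                  : "Multiple referenced items found: " + cell);
    }
}
```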


Multi-item references
---------------------

This should work in the same way as single-item references.


Using the plug-in
=================

Configuring the plug-in is done with the usual wizard. There will be plenty of
parameters so it is probably a good idea to use a multi-step wizard. This may have
to be tried out by actual users before we make any final decisions.

Step 1
------

The user selects a file and enters values for the regular expressions and other
options for parsing the file. Column mappings are also specified in this
step. The "Test with file" function should be supported. Parameters that are
needed:

 * A file to parse
 * Data header: Regular expression for finding the start of data
 * Data splitter: Regular expression that splits data lines into columns
 * Remove quotes: Boolean option that removes "quotes" around values
 * Ignore: Regular expression that matches lines to be ignored
 * Data footer: Regular expression for finding the end of data
 * Min/max data columns: The number of columns a data line must have, otherwise
   it is ignored
 * Character set: The character set (e.g. iso-8859-1, utf-8, etc.) used in the file
 * Decimal separator: Whether dot or comma is used as the decimal separator for numeric values
 * Date format: The date format used in the file

The above parameters are the same as those found in many of the existing import
plug-ins.
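
As a rough illustration of how the data splitter, remove quotes and min/max
data columns parameters interact on a single data line, consider the sketch
below. It uses only java.util.regex and is not the FlatFileParser
implementation; LineSplitter is a hypothetical name.

```java
import java.util.regex.Pattern;

final class LineSplitter {
    /**
     * Splits a data line into columns. Returns null if the number of
     * columns is outside [minColumns, maxColumns], in which case the
     * line should be ignored.
     */
    static String[] split(String line, Pattern splitter, boolean removeQuotes,
                          int minColumns, int maxColumns) {
        // The limit -1 keeps trailing empty columns.
        String[] columns = splitter.split(line, -1);
        if (columns.length < minColumns || columns.length > maxColumns) {
            return null;
        }
        if (removeQuotes) {
            for (int i = 0; i < columns.length; i++) {
                String c = columns[i];
                if (c.length() >= 2 && c.startsWith("\"") && c.endsWith("\"")) {
                    columns[i] = c.substring(1, c.length() - 1);
                }
            }
        }
        return columns;
    }
}
```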

Since each type of item has different properties, column mapping parameters vary
from case to case. Column mapping parameters may need to be divided into
subsections for clarity.

For the ID property we need one column mapping parameter, one enum parameter
to select which identification method to use and boolean parameters for selecting
which items to search.

For simple properties we need a single column mapping parameter.

For single-item references we need one column mapping parameter
[and one enum parameter to select which identification method to use].

All options (except the file to parse) in this step should also be available
to store as a plug-in configuration.


Step 2
------

This step is mainly about error handling options. Default values are marked with
*stars*.

 * Default error handling: *fail*, skip line
 * Item not found: fail, *create*, skip line
 * Multiple items found: *fail*, skip line
 * Referenced item not found: *fail*, ignore, skip line
 * Multiple referenced items found: *fail*, ignore, skip line
 * Missing a required property: *fail*, skip line
 * String too long: *fail*, crop, ignore
 * Invalid numeric value: *fail*, null, ignore
 * Numeric value out of range: *fail*, ignore
 * A log file for detailed error messages
 * A boolean parameter for selecting 'dry-run'

If there are multi-item references the 'skip line' option above means that we
should skip all lines that are related to the same item.
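
One way to represent these options in code is an option object that knows its
default and its allowed alternatives. The sketch below covers only two of the
options from the list; Handling, ErrorOption and ImportErrorOptions are
hypothetical names.

```java
import java.util.EnumSet;

// All handling alternatives that appear in the list above.
enum Handling { FAIL, SKIP_LINE, CREATE, IGNORE, CROP, SET_NULL }

final class ErrorOption {
    final String name;
    final EnumSet<Handling> allowed;
    private Handling value;

    /** The first handling is the default; the others are the allowed alternatives. */
    ErrorOption(String name, Handling defaultValue, Handling... others) {
        this.name = name;
        this.allowed = EnumSet.of(defaultValue, others);
        this.value = defaultValue;
    }

    Handling get() {
        return value;
    }

    void set(Handling handling) {
        if (!allowed.contains(handling)) {
            throw new IllegalArgumentException(handling + " is not valid for '" + name + "'");
        }
        value = handling;
    }
}

final class ImportErrorOptions {
    // Defaults follow the starred values in the list above.
    final ErrorOption itemNotFound =
            new ErrorOption("Item not found", Handling.CREATE, Handling.FAIL, Handling.SKIP_LINE);
    final ErrorOption stringTooLong =
            new ErrorOption("String too long", Handling.FAIL, Handling.CROP, Handling.IGNORE);
}
```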

Implementation details
======================

We need some kind of basic, generic functionality that handles the file parsing,
property mapping, item lookup, error handling, logging, etc.

We also need functionality that is specific for each type of item the plug-in
should support. We need to know which properties exist on the items. For each
property we need to know the data type or if the property is a single-item or a
multi-item reference. We need factory methods for creating new items, etc.

If possible, it should also be relatively easy to extend the item importer with
support for other item types in the future.

One possible approach is to use an abstract base class for the common functionality.
This class defines some abstract methods that must be implemented by subclasses
where each subclass handles a single type of item. This approach makes it relatively
easy to add support for other item types just by creating a new subclass. This
approach creates a separate plug-in for each item type, e.g. SampleImporter,
ExtractImporter, etc.
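
The abstract base class approach could look roughly like the sketch below. All
class and method names are hypothetical, the items are plain maps instead of
real BASE items, and error handling is reduced to a single exception. The
'tissue' property is invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

abstract class AbstractItemImporterSketch<T> {
    /** Item-specific: find existing items matching the identifier. */
    protected abstract List<T> findItems(String identifier);

    /** Item-specific: factory method for creating a new item. */
    protected abstract T createItem(String identifier);

    /** Item-specific: copy the column values to the item. */
    protected abstract void updateItem(T item, String[] columns);

    /** Common functionality: look up, create if missing, then update. */
    final T importLine(String identifier, String[] columns) {
        List<T> found = findItems(identifier);
        if (found.size() > 1) {
            throw new IllegalStateException("Multiple items found: " + identifier);
        }
        T item = found.isEmpty() ? createItem(identifier) : found.get(0);
        updateItem(item, columns);
        return item;
    }
}

// Toy subclass that keeps "samples" in memory, keyed by name.
final class InMemorySampleImporter extends AbstractItemImporterSketch<Map<String, String>> {
    final Map<String, Map<String, String>> store = new LinkedHashMap<>();

    @Override
    protected List<Map<String, String>> findItems(String identifier) {
        List<Map<String, String>> result = new ArrayList<>();
        if (store.containsKey(identifier)) {
            result.add(store.get(identifier));
        }
        return result;
    }

    @Override
    protected Map<String, String> createItem(String identifier) {
        Map<String, String> sample = new HashMap<>();
        sample.put("name", identifier);
        store.put(identifier, sample);
        return sample;
    }

    @Override
    protected void updateItem(Map<String, String> item, String[] columns) {
        item.put("tissue", columns[1]);
    }
}
```

Importing the same identifier twice updates the existing item instead of
creating a second one, which is the create-or-update behaviour described
earlier in this document.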


Item specific functionality that is needed
------------------------------------------

 * Which item lookup methods are supported by the item. Name and id
   will be supported by all items, some have external id, etc.

 * List of properties that can be imported. For each property we must know:
    - if it is a simple value, a single-item reference or a multi-item reference
    - the data type, e.g. string, float, int, SAMPLE, PROTOCOL, etc.
    - if the property is required or not
    - name and description and other details for better user experience

 * A factory method for creating new items. Some items can be created without
   any parameters (e.g. Sample.getNew()), some require one or more parameters
   (e.g. LabeledExtract.getNew(Label)).

 * Find an item. This functionality is already implemented by the annotation importer,
   but it is not very flexible since it uses reflection to find the getQuery() method.
   The annotation importer only works with items where the getQuery() method doesn't
   require any parameters.
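
The per-property information listed above could be captured in a small
descriptor class. This is a sketch with hypothetical names (PropertyKind,
PropertyInfo, PropertyCheck); the data type is kept as a plain string to avoid
guessing at the real type classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

enum PropertyKind { SIMPLE, SINGLE_ITEM_REFERENCE, MULTI_ITEM_REFERENCE }

final class PropertyInfo {
    final String name;
    final String description;
    final PropertyKind kind;
    final String dataType;   // e.g. "string", "float", "SAMPLE"
    final boolean required;

    PropertyInfo(String name, String description, PropertyKind kind,
                 String dataType, boolean required) {
        this.name = name;
        this.description = description;
        this.kind = kind;
        this.dataType = dataType;
        this.required = required;
    }
}

final class PropertyCheck {
    /** Returns the names of required properties that have no column mapping. */
    static List<String> missingRequired(List<PropertyInfo> properties,
                                        Set<String> mappedNames) {
        List<String> missing = new ArrayList<>();
        for (PropertyInfo p : properties) {
            if (p.required && !mappedNames.contains(p.name)) {
                missing.add(p.name);
            }
        }
        return missing;
    }
}
```

A list of such descriptors per item type would be enough to generate the column
mapping parameters in the wizard and to detect a missing required property
before any data line is read.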

Supported item types
====================

See batchimport_userperspective.txt