Ticket #1028: batchimport-4.txt

File batchimport-4.txt, 11.7 KB (added by Nicklas Nordborg, 16 years ago)

4th version of developers view

=====================
Generic item importer
=====================

This is a description of an import plug-in that can be used to import almost any
kind of item from a tab-separated text file (or compatible). The plug-in should
be able to create new items or update existing items. In both cases it should be
able to set values for:

 * Simple properties, e.g. string values, numeric values, dates, etc.
 * Single-item references, e.g. protocol, label, software, etc.
 * Multi-item references, e.g. the labeled extracts of a hybridization,
   pooled samples, etc. In some cases a multi-item reference is bundled
   with simple values, e.g. used quantity of a source biomaterial, the array
   index a labeled extract is used on, etc. Multi-item references are never
   removed by the importer, only added or updated. Removing an item from a
   multi-item reference is a manual procedure to be done using the web
   interface.

The importer should not be able to set values for annotations since this is handled
by the already existing annotation importer plug-in. The annotation importer
and item importer should have similar behaviour and functionality to minimize
the learning cost for users.

Key features
------------

The item importer is only expected to work on a single type of item at each use
and should read data from a single file.

The user should be able to select if new items should be created and/or if existing
items should be updated.

The importer should be able to work in 'dry-run' mode, i.e. everything is performed
as if a real import is taking place, but the work (transaction) is not committed
to the database.
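
A minimal sketch of the dry-run idea, written in Python against an in-memory
SQLite database rather than the BASE API. The function name run_import and the
table 'sample' are hypothetical and only serve the illustration: all work is
done inside one transaction, and the only difference between a real import and
a dry run is whether that transaction is committed or rolled back.

```python
import sqlite3


def run_import(conn, names, dry_run):
    """Insert all names in one transaction.

    In dry-run mode the same work is performed, but the transaction is
    rolled back instead of committed. Returns the number of handled items.
    """
    cur = conn.cursor()
    try:
        for name in names:
            cur.execute("INSERT INTO sample (name) VALUES (?)", (name,))
    except sqlite3.Error:
        # A failure aborts the whole import, dry-run or not.
        conn.rollback()
        raise
    if dry_run:
        conn.rollback()
    else:
        conn.commit()
    return len(names)
```

The summary (number of items) is still available after a dry run, which is what
makes the mode useful for checking a file before importing it for real.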

The importer should implement user-controlled error handling. Summary results,
e.g. number of items imported, number of failed items, etc., should be tracked
and reported as the job status message. More details about the error handling
options can be found later in this document.

The plug-in should support logging more detailed error messages, e.g. reasons for
failed items, line numbers, etc., to a separate file (in the BASE file system).
[IMPLEMENTATION NOTE] This file needs to be handled in a separate transaction,
otherwise a complete failure will erase the file as well when the transaction
is rolled back.

The plug-in should be configurable and be able to store file parsing settings,
including column mappings, etc., as plug-in configurations.

File format
-----------

The input file should be organised into columns separated by a specified character,
e.g. tab, comma, etc. Fixed-width columns are not supported. In other words, the
file must be one that can be parsed with the FlatFileParser class.

The first data line contains the column headers, which define the contents of each
column. The column headers can be mapped to item properties at each use of the
plug-in or by saving predefined settings as a plug-in configuration. This also
includes the separator character and other information that is needed to parse the
file. Saved configurations should implement auto-detection functionality.

Data for a single item may be split onto multiple lines. The first line contains
simple properties, single-item references, and the first multi-item reference.
If there are more multi-item references they should be on the following lines and
the identifier column must have exactly the same value. Data in the columns
for simple properties and single-item references is ignored on these continuation
lines. The multi-line entry ends as soon as a line with a different identifier is
found or when the end of the file is reached.
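
The multi-line rule can be sketched as follows. This is an illustration in
Python, not the FlatFileParser API; the function group_item_lines and the
representation of a parsed line as a dict are assumptions. Consecutive lines
with the same identifier are grouped, the first line of each group carries the
simple properties, and the remaining lines only contribute multi-item
references.

```python
from itertools import groupby


def group_item_lines(rows, id_column):
    """Yield (identifier, first_row, continuation_rows) for each run of
    consecutive rows that share the same value in the identifier column."""
    for identifier, run in groupby(rows, key=lambda row: row[id_column]):
        first, *rest = run
        yield identifier, first, rest
```

Note that only consecutive lines are grouped. If the same identifier appears
again later in the file it starts a new entry, which matches the rule that an
entry ends as soon as a different identifier is found.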

When reading data for an item the plug-in needs to know if it should create a new
item or update an existing item. First, we need to know the method for identifying
items. Depending on the item type there are two or three options:

 * Using the internal 'id'. This is always unique.
 * Using the 'name'. This may or may not be unique.
 * Some items have an 'externalId'. This may or may not be unique.
 * Array slides may have a 'barcode' which is similar to the externalId.

Other items may have other properties that may be used for identification. It
would be good to implement the item lookup part in a way that makes it easy to
add new lookup methods when the need arises.

The plug-in should ask the user which method to use. The user must also tell the
plug-in among which items it should look for an item with a given 'name' or
'externalId'. There are four options, and the user may select one or several
of them:

 * Owned by the logged-in user
 * Shared to the logged-in user
 * In the current project
 * Owned by other users (only available if the logged-in user has enough
   permissions, e.g. generic read permission for the item type)

If the 'id' method is used, the above options are not used. In all cases, the
plug-in will only consider items for which the logged-in user has write
permission. When the plug-in is looking for an item there are three possible
outcomes:

 * No item is found. This can be handled in different ways:
    - An error condition which aborts the plug-in
    - The line is ignored
    - A new item is created
 * One item is found. This is the item that is going to be updated.
 * More than one item is found. This can be handled in different ways:
    - An error condition which aborts the plug-in
    - The line is ignored
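
The three outcomes and their handling options could be expressed roughly like
this. It is a Python sketch under assumptions: the names Action, ImportAbort
and resolve_item are hypothetical, and 'matches' stands for whatever list the
item lookup returned.

```python
from enum import Enum


class Action(Enum):
    FAIL = "fail"
    SKIP = "skip line"
    CREATE = "create"


class ImportAbort(Exception):
    """Raised when the selected handling is to abort the plug-in."""


def resolve_item(matches, not_found=Action.CREATE, multiple=Action.FAIL):
    """Return ("update", item), ("create", None) or ("skip", None)."""
    if len(matches) == 1:
        return "update", matches[0]
    if not matches:
        if not_found is Action.CREATE:
            return "create", None
        if not_found is Action.SKIP:
            return "skip", None
        raise ImportAbort("no item found")
    # More than one item: creating a new item is not an option here.
    if multiple is Action.SKIP:
        return "skip", None
    raise ImportAbort(f"{len(matches)} items found")
```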


Parsing the data
================

Simple properties
-----------------

Converting the string values from the file is the responsibility of some
item-specific code that knows what kind of values to expect in each data
column.

Single-item references
----------------------

This is either the 'id', 'name', 'externalId' or another natural identifier
of the item. The plug-in selects a "best" option for each kind of item.
Typically, this means that a lookup by 'name' is tried first and, if
no item is found, a lookup by 'externalId' is tried next. As a last resort, and
only if the value is numerical, a lookup by the internal 'id' is used.
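
The lookup order described above might look like this in outline. The sketch
is in Python and uses a plain list of dicts as a stand-in for a database query;
the function name find_reference is an assumption, not part of any existing
API.

```python
def find_reference(value, items):
    """Look up a referenced item by 'name', then 'externalId', and finally,
    for numeric values only, by the internal 'id'.

    Returns the matches from the first method that finds anything, or an
    empty list if no method does.
    """
    for prop in ("name", "externalId"):
        matches = [item for item in items if item.get(prop) == value]
        if matches:
            return matches
    if value.isdigit():
        return [item for item in items if item.get("id") == int(value)]
    return []
```

The function returns a list rather than a single item so that the caller can
still tell "not found" and "multiple items found" apart.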

When looking for item references the plug-in doesn't have to use the same
setting for 'owned by', 'shared to', etc. as when looking for the main items.
In fact, this is not desired since many of those items are owned by the root
user or a system administrator, e.g. labels, software, hardware, etc. I don't
think it is practical to have another option for selecting this for each type
of item reference, so the default should be to look among all items the user
has access to (with use permission).

There are three outcomes:

 * No item is found.
    - This can be an error condition that aborts the plug-in. If it is a
      required property this will always happen.
    - The link is ignored. No call is made to the setABC() method. Note that
      this case is different from having an empty column, in which case
      setABC(null) would be called.
 * One item is found. This is the item we link to.
 * Multiple items are found.
    - This can be an error condition that aborts the plug-in.
    - The link is ignored.
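
The difference between "ignore the link" and "empty column" is easy to lose in
an implementation, so here is a small Python sketch of one way to keep them
apart. The sentinel SKIP and the functions reference_value and apply_reference
are hypothetical; a dict assignment stands in for the setABC() call.

```python
SKIP = object()  # sentinel meaning "do not call the setter at all"


def reference_value(cell, matches, required=False,
                    on_missing="fail", on_multiple="fail"):
    """Map a file cell and its lookup result to an item, None or SKIP."""
    if cell == "":
        return None  # empty column: corresponds to setABC(null)
    if len(matches) == 1:
        return matches[0]
    if not matches and required:
        # A missing required reference always aborts.
        raise ValueError(f"required reference not found: {cell!r}")
    policy = on_multiple if matches else on_missing
    if policy == "ignore":
        return SKIP
    raise ValueError(f"{len(matches)} items found for {cell!r}")


def apply_reference(item, prop, value):
    """Set the property unless the value is the SKIP sentinel."""
    if value is not SKIP:
        item[prop] = value
```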


Multi-item references
---------------------

This should work in the same way as single-item references.


Using the plug-in
=================

Configuring the plug-in is done with the usual wizard. There will be plenty of
parameters so it is probably a good idea to use a multi-step wizard. This may have
to be tried out by actual users before we make any final decisions.

Step 1
------

The user selects a file and enters values for the regular expressions and other
options for parsing the file. Column mappings are also specified in this
step. The "Test with file" function should be supported. Parameters that are
needed:

 * A file to parse
 * Mode to use: create and/or update
 * Data header: Regular expression for finding the start of data
 * Data splitter: Regular expression that splits data lines into columns
 * Remove quotes: boolean option that removes "quotes" around values
 * Ignore: Regular expression that matches lines to be ignored
 * Data footer: Regular expression for finding the end of data
 * Min/max data columns: The number of columns a data line must have, otherwise
   it is ignored
 * Character set: The character set (e.g. iso-8859-1, utf-8, etc.) used in the file
 * Decimal separator: whether dot or comma is used as the decimal separator for
   numeric values
 * Date format: The date format used in the file

The above parameters are the same as those found in many of the existing import
plug-ins.
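
To show how the regular-expression parameters interact, here is a simplified
Python sketch. It is not the FlatFileParser implementation; the function
parse_lines and its parameter names are assumptions, and it assumes that the
line matching the data header expression is the one holding the column headers.

```python
import re


def parse_lines(lines, header, splitter, ignore=None, footer=None,
                remove_quotes=True, min_cols=0, max_cols=None):
    """Return (column_headers, data_rows) for an iterable of text lines."""
    split = re.compile(splitter)
    headers, rows = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if headers is None:
            # Everything before the data header is skipped.
            if re.search(header, line):
                headers = split.split(line)
            continue
        if footer and re.search(footer, line):
            break
        if ignore and re.search(ignore, line):
            continue
        cols = split.split(line)
        if len(cols) < min_cols or (max_cols is not None and len(cols) > max_cols):
            continue  # wrong number of columns: the line is ignored
        if remove_quotes:
            cols = [c[1:-1] if len(c) >= 2 and c[0] == c[-1] == '"' else c
                    for c in cols]
        rows.append(cols)
    return headers, rows
```

A call such as parse_lines(f, header=r"^Name\t", splitter=r"\t", ignore=r"^#")
would treat the first line starting with "Name" and a tab as the header line
and skip comment lines starting with "#".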

For item identification we need an enum parameter to select which identification
method to use and boolean parameters for selecting which items to search.

Since each type of item has different properties, column mapping parameters vary
from case to case. Column mapping parameters may need to be divided into
subsections for clarity.

 * For simple properties we need a single column mapping parameter.

 * For single-item references we need one column mapping parameter.

All options related to the file parsing should also be available to store as a
plug-in configuration.


Step 2
------

This step is mainly about error handling options. Default values are marked with
*stars*.

 * Default error handling: *fail*, skip line
 * Item not found: fail, *create*, skip line
 * Multiple items found: *fail*, skip line
 * Referenced item not found: *fail*, ignore, skip line
 * Multiple referenced items found: *fail*, ignore, skip line
 * Missing a required property: *fail*, skip line
 * String too long: *fail*, crop, ignore
 * Invalid numeric value: *fail*, null, ignore
 * Numeric value out of range: *fail*, ignore
 * A log file for detailed error messages
 * A boolean parameter for selecting 'dry-run'

If there are multi-item references the 'skip line' option above means that we
should skip all lines that are related to the same item.
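
Two of the value-level options, sketched in Python. The function names
check_string and parse_number are hypothetical. Both return a pair
(store, value), where store is False when the value should be ignored, i.e.
the property is left untouched.

```python
def check_string(value, max_length, on_too_long="fail"):
    """Apply the 'String too long' option: fail, crop or ignore."""
    if len(value) <= max_length:
        return True, value
    if on_too_long == "crop":
        return True, value[:max_length]
    if on_too_long == "ignore":
        return False, None
    raise ValueError(f"string longer than {max_length} characters")


def parse_number(text, decimal_separator=".", on_invalid="fail"):
    """Apply the 'Invalid numeric value' option: fail, null or ignore."""
    try:
        return True, float(text.replace(decimal_separator, "."))
    except ValueError:
        if on_invalid == "null":
            return True, None
        if on_invalid == "ignore":
            return False, None
        raise
```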

Implementation details
======================

We need some kind of basic, generic functionality that handles the file parsing,
property mapping, item lookup, error handling, logging, etc.

We also need functionality that is specific for each type of item the plug-in
should support. We need to know which properties exist on the items. For each
property we need to know the data type or if the property is a single-item or a
multi-item reference. We need factory methods for creating new items, etc.

If possible, it should also be relatively easy to extend the item importer with
support for other item types in the future.

One possible approach is to use an abstract base class for the common functionality.
This class defines some abstract methods that must be implemented by subclasses
where each subclass handles a single type of item. This approach makes it relatively
easy to add support for other item types just by creating a new subclass. This
approach creates a separate plug-in for each item type, e.g. SampleImporter,
ExtractImporter, etc.
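
A rough outline of the abstract base class approach, in Python for brevity.
The class AbstractItemImporter and its method names are assumptions made for
this sketch, the SampleImporter shown here keeps its items in a plain list
instead of a database, and error handling is reduced to a single exception.

```python
from abc import ABC, abstractmethod


class AbstractItemImporter(ABC):
    """Common import loop; subclasses supply the item-specific parts."""

    @abstractmethod
    def find_items(self, identifier):
        """Return a list of existing items matching the identifier."""

    @abstractmethod
    def create_item(self, row):
        """Factory method: create a new item for the given row."""

    @abstractmethod
    def update_item(self, item, row):
        """Copy property values from the row to the item."""

    def import_rows(self, rows, id_column):
        created = updated = 0
        for row in rows:
            matches = self.find_items(row[id_column])
            if len(matches) > 1:
                raise ValueError(f"multiple items found: {row[id_column]!r}")
            if matches:
                item = matches[0]
                updated += 1
            else:
                item = self.create_item(row)
                created += 1
            self.update_item(item, row)
        return created, updated


class SampleImporter(AbstractItemImporter):
    """Hypothetical subclass handling a single item type."""

    def __init__(self):
        self.samples = []

    def find_items(self, identifier):
        return [s for s in self.samples if s["name"] == identifier]

    def create_item(self, row):
        sample = {"name": row["name"]}
        self.samples.append(sample)
        return sample

    def update_item(self, item, row):
        item["description"] = row.get("description")
```

Adding support for another item type then only requires a new subclass with
the three item-specific methods.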


Item-specific functionality that is needed
------------------------------------------

 * Which item lookup methods are supported by the item. Name and id
   will be supported by all items, some have external id, etc.

 * List of properties that can be imported. For each property we must know:
    - if it is a simple value, a single-item reference or a multi-item reference
    - the data type, e.g. string, float, int, SAMPLE, PROTOCOL, etc.
    - if the property is required or not
    - name and description and other details for better user experience

 * A factory method for creating new items. Some items can be created without
   any parameters (e.g. Sample.getNew()), some require one or more parameters
   (e.g. LabeledExtract.getNew(Label)).

 * Find an item. This functionality is already implemented by the annotation importer,
   but it is not very flexible since it uses reflection to find the getQuery() method.
   The annotation importer only works with items where the getQuery() method doesn't
   require any parameters.
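
The per-property information listed above could be captured in a small
descriptor type. The following Python sketch is only an illustration; the names
Kind, PropertyInfo and missing_required, as well as the example property list,
are hypothetical and not taken from the BASE API.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    SIMPLE = "simple value"
    SINGLE_REF = "single-item reference"
    MULTI_REF = "multi-item reference"


@dataclass(frozen=True)
class PropertyInfo:
    name: str
    kind: Kind
    data_type: str          # e.g. "string", "float", "PROTOCOL"
    required: bool = False
    description: str = ""


# Hypothetical property list for one item type.
EXAMPLE_PROPERTIES = [
    PropertyInfo("name", Kind.SIMPLE, "string", required=True),
    PropertyInfo("originalQuantity", Kind.SIMPLE, "float"),
    PropertyInfo("protocol", Kind.SINGLE_REF, "PROTOCOL"),
]


def missing_required(properties, mapped_columns):
    """Return the names of required properties without a column mapping."""
    return [p.name for p in properties
            if p.required and p.name not in mapped_columns]
```

With such a list the generic code can build the column mapping parameters and
check required properties without knowing anything else about the item type.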

Supported item types
====================

See batchimport_userperspective.txt