=====================
Generic item importer
=====================

This is a description of an import plug-in that can be used to import almost any
kind of item from a tab-separated text file (or compatible). The plug-in should
be able to create new items or update existing items. In both cases it should be
able to set values for:

* Simple properties, e.g. string values, numeric values, dates, etc.
* Single-item references, e.g. protocol, label, software, owner, etc.
* Multi-item references, e.g. the labeled extracts of a hybridization,
  pooled samples, etc. In some cases a multi-item reference is bundled
  with simple values, e.g. the used quantity of a source biomaterial, the array
  index a labeled extract is used on, etc. Multi-item references are never
  removed by the importer, only added or updated. Removing an item from a
  multi-item reference is a manual procedure to be done using the web
  interface.

The importer should not be able to set values for annotations since this is handled
by the already existing annotation importer plug-in. The annotation importer
and item importer should have similar behaviour and functionality to minimize
the learning cost for users.

Key features
------------

The item importer is only expected to work on a single type of item at each use
and should read data from a single file.

The importer should be able to work in 'dry-run' mode, i.e. everything is performed
as if a real import is taking place, but the work (transaction) is not committed
to the database.

The importer should implement user-controlled error handling. Summary results,
e.g. number of items imported, number of failed items, etc. should be tracked
and reported as the job status message. More details about the error handling
options can be found later in this document.

The plug-in should support logging more detailed error messages, e.g. reasons for
failed items, line numbers, etc. to a separate file (in the BASE file system).
[IMPLEMENTATION NOTE] This file needs to be handled in a separate transaction,
otherwise a complete failure will erase the file as well when the transaction
is rolled back.

The plug-in should be configurable and be able to store file parsing settings,
including column mappings, etc. as plug-in configurations.

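The dry-run and log file requirements above can be sketched as follows. This is
only an illustration: SketchTransaction, RecordingTransaction and DryRunSketch
are hypothetical names invented for this example and are not part of the BASE
API. The point is that the log is committed in its own transaction, so it
survives both a dry-run and a failed import.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a database transaction; this is not the BASE API.
interface SketchTransaction {
    void commit();
    void rollback();
}

// Records the calls so the behaviour can be inspected without a database.
class RecordingTransaction implements SketchTransaction {
    final List<String> calls = new ArrayList<>();
    public void commit() { calls.add("commit"); }
    public void rollback() { calls.add("rollback"); }
}

class DryRunSketch {
    /**
     * Runs the import work, then commits or rolls back the main transaction.
     * The log transaction is always committed, so the log file survives a
     * dry-run or a complete failure of the import itself.
     */
    static String run(Runnable importWork, SketchTransaction mainTx,
            SketchTransaction logTx, boolean dryRun) {
        String status;
        try {
            importWork.run();
            if (dryRun) {
                mainTx.rollback();
                status = "Dry-run completed; nothing was committed";
            } else {
                mainTx.commit();
                status = "Import committed";
            }
        } catch (RuntimeException ex) {
            mainTx.rollback();
            status = "Import failed: " + ex.getMessage();
        }
        logTx.commit();
        return status;
    }
}
```

The returned string plays the role of the job status message. A real
implementation would also include the summary counts in it.
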
File format
-----------

The input file should be organised into columns separated by a specified character,
e.g. tab, comma, etc. Fixed-width columns are not supported. In other words, the
file is a file that can be parsed with the FlatFileParser class.

The first data line contains the column headers, which define the contents of each
column. The column headers can be mapped to item properties at each use of the
plug-in or by saving predefined settings as a plug-in configuration. A saved
configuration also includes the separator character and other information that
is needed to parse the file. Saved configurations should implement auto-detection
functionality.

Data for a single item may be split onto multiple lines. The first line contains
simple properties and single-item references, and the first multi-item reference.
If there are more multi-item references they should be on the following lines with
empty values in all other columns, except for the column holding the item
identifier, which must have the same value on all lines. If the following
lines contain other data, this should be ignored, or it may be considered an
error condition. It may be caused by giving two items the same name by accident.

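A minimal sketch of how lines belonging to the same item could be collected,
assuming that the lines for one item are consecutive and that every line has at
least as many columns as the identifier column index. ItemLineGrouper is a
hypothetical helper written for this example; it is not the FlatFileParser
class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ItemLineGrouper {
    /**
     * Groups consecutive data lines that have the same value in the
     * identifier column. Each inner list holds the split lines of one item.
     */
    static List<List<String[]>> group(List<String> dataLines, String separator, int idColumn) {
        Pattern splitter = Pattern.compile(Pattern.quote(separator));
        List<List<String[]>> items = new ArrayList<>();
        List<String[]> current = null;
        String currentId = null;
        for (String line : dataLines) {
            // A limit of -1 keeps trailing empty columns.
            String[] columns = splitter.split(line, -1);
            String id = columns[idColumn];
            if (current == null || !id.equals(currentId)) {
                // A new identifier starts a new item.
                current = new ArrayList<>();
                items.add(current);
                currentId = id;
            }
            current.add(columns);
        }
        return items;
    }
}
```

The first array in each group carries the simple properties and single-item
references. The remaining arrays should only contribute multi-item references.
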
When reading data for an item the plug-in needs to know if it should create a new
item or update an existing item. First, we need to know the method for identifying
items. Depending on the item type there are two or three options:

* Using the internal 'id'. This is always unique.
* Using the 'name'. This may or may not be unique.
* Some items have an 'externalId'. This may or may not be unique.
* Array slides may have a 'barcode' which is similar to the externalId.

Other items may have other properties that may be used for identification. It
would be good to implement the item lookup part in a way that makes it easy to
add new lookup methods when the need arises.

The plug-in should ask the user which method to use. The user must also tell the
plug-in among which items it should look for an item with a given 'name' or
'externalId'. There are four options, and the user may select one or several
of them:

* Owned by the logged in user
* Shared to the logged in user
* In the current project
* Owned by other users (only available if the logged in user has enough
  permissions, e.g. generic read permission for the item type)

If the 'id' method is used, the above options are not used. When the plug-in
is looking for an item there are three possible outcomes.

* No item is found. This can be handled in different ways:
  - An error condition which aborts the plug-in
  - The line is ignored
  - A new item is created
* One item is found. This is the item that is going to be updated.
* More than one item is found. This can be handled in different ways:
  - An error condition which aborts the plug-in
  - The line is ignored


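The three outcomes above could be turned into a decision roughly like this. The
class and enum names are made up for this sketch, and how the matching items are
actually queried is left out.

```java
import java.util.List;

class LookupOutcomeSketch {
    // User-selected handling when no item matches the identifier.
    enum NotFound { FAIL, SKIP, CREATE }

    // User-selected handling when more than one item matches.
    enum MultipleFound { FAIL, SKIP }

    // What the importer should do with the current line.
    enum Decision { ABORT, SKIP_LINE, CREATE_NEW, UPDATE_EXISTING }

    static Decision decide(List<?> matches, NotFound notFound, MultipleFound multipleFound) {
        if (matches.isEmpty()) {
            switch (notFound) {
                case SKIP:
                    return Decision.SKIP_LINE;
                case CREATE:
                    return Decision.CREATE_NEW;
                default:
                    return Decision.ABORT;
            }
        }
        if (matches.size() == 1) {
            // Exactly one match: this is the item to update.
            return Decision.UPDATE_EXISTING;
        }
        return multipleFound == MultipleFound.SKIP ? Decision.SKIP_LINE : Decision.ABORT;
    }
}
```
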
Parsing the data
================

Simple properties
-----------------

We need to know the data type of each property as a Type object.
The string values can then be parsed with Type.parseString().

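To illustrate the idea of type-driven parsing, here is a small stand-alone
sketch. SketchType is a hypothetical enum written for this example; it only
mimics the role of the Type object and makes no claim about how
Type.parseString() behaves in BASE. Treating an empty string as null is also
an assumption.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

enum SketchType {
    STRING, INT, FLOAT, DATE;

    /**
     * Converts a column value to an object of this type. Empty values become
     * null. Invalid values raise an IllegalArgumentException (for numbers,
     * its subclass NumberFormatException).
     */
    Object parse(String value, String datePattern) {
        if (value == null || value.isEmpty()) {
            return null;
        }
        switch (this) {
            case INT:
                return Integer.valueOf(value);
            case FLOAT:
                return Float.valueOf(value);
            case DATE:
                try {
                    SimpleDateFormat format = new SimpleDateFormat(datePattern);
                    // Reject impossible dates instead of rolling them over.
                    format.setLenient(false);
                    return format.parse(value);
                } catch (ParseException ex) {
                    throw new IllegalArgumentException("Invalid date: " + value, ex);
                }
            default:
                return value;
        }
    }
}
```
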
Single-item references
----------------------

This is either the 'id', 'name', 'externalId' or another natural identifier
of the item. The plug-in should support those cases by a single column
mapping and an option to select which method to use.

[NOTE]
This creates two input parameters for each column, which may be too many...
Alternative options are:

* A global option for all item references. This doesn't give the user
  any chance to use a different method for different items.
* A global option with an 'auto' alternative that uses the 'id' method
  for numeric values, otherwise first the 'name' and if no item is
  found the 'externalId'.
* No options at all. The "best" method is selected by the plug-in depending
  on the item that is going to be looked up and users are required to
  follow this.

[JH: I think one of the alternatives would be better than a plethora
of parameters ... I like the auto idea but will it create tough
conditions on names used on items?]

------

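The 'auto' alternative from the note above might look something like this. The
three maps are hypothetical in-memory stand-ins for database queries, and
uniqueness of names and external ids is simply assumed here.

```java
import java.util.HashMap;
import java.util.Map;

class AutoLookupSketch {
    // Hypothetical indexes standing in for real item queries.
    final Map<Integer, String> byId = new HashMap<>();
    final Map<String, String> byName = new HashMap<>();
    final Map<String, String> byExternalId = new HashMap<>();

    /** Returns the matching item, or null if nothing matches. */
    String find(String value) {
        if (value.matches("\\d+")) {
            // Numeric values are treated as internal ids and nothing else.
            try {
                return byId.get(Integer.valueOf(value));
            } catch (NumberFormatException ex) {
                return null; // too many digits to be an id
            }
        }
        // Otherwise try the name first, then fall back to the external id.
        String item = byName.get(value);
        return item != null ? item : byExternalId.get(value);
    }
}
```

Note that in this sketch an item whose name consists only of digits can never be
found by name, which is the kind of condition on item names that the comment
above asks about.
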
When looking for item references the plug-in doesn't have to use the same
settings for 'owned by', 'shared to', etc. as when looking for the main items.
In fact, this is not desired since many of those items are owned by the root
user or a system administrator, e.g. labels, software, hardware, etc. I don't
think it is practical to have another option for selecting this for each type
of item reference, so the default should be to look among all items the user
has access to (with use permission)...

There are three outcomes:

* No item is found.
  - This can be an error condition that aborts the plug-in. If it is a
    required property this will always happen.
  - The link is ignored. No call is made to the setABC() method. Note that this
    case is different from having an empty column, in which case
    setABC(null) would be called.
* One item is found. This is the item we link to.
* Multiple items are found.
  - This can be an error condition that aborts the plug-in.
  - The link is ignored.


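The difference between an ignored link and an empty column can be shown with a
small sketch. Target and setProtocol() are hypothetical placeholders for an item
and its setABC() method, and the boolean flags stand in for the user's error
handling choices.

```java
import java.util.List;

class ReferenceLinkSketch {
    // Hypothetical item with one single-item reference.
    static class Target {
        String protocol = "unchanged";
        void setProtocol(String protocol) { this.protocol = protocol; }
    }

    static void link(Target target, String cell, List<String> matches,
            boolean required, boolean ignoreNotFound, boolean ignoreMultiple) {
        if (cell == null || cell.isEmpty()) {
            // Empty column: the reference is explicitly cleared.
            target.setProtocol(null);
            return;
        }
        if (matches.size() == 1) {
            target.setProtocol(matches.get(0));
        } else if (matches.isEmpty()) {
            // A required property can never be ignored.
            if (required || !ignoreNotFound) {
                throw new IllegalStateException("No item found for: " + cell);
            }
        } else if (!ignoreMultiple) {
            throw new IllegalStateException("Multiple items found for: " + cell);
        }
        // When a problem is ignored the setter is not called at all,
        // so the existing value is kept.
    }
}
```
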
Multi-item references
---------------------

This should work in the same way as single-item references.


Using the plug-in
=================

Configuring the plug-in is done with the usual wizard. There will be plenty of
parameters so it is probably a good idea to use a multi-step wizard. This may have
to be tried out by actual users before we make any final decisions.

Step 1
------

The user selects a file and enters values for the regular expressions and other
options for parsing the file. Column mappings are also specified in this
step. The "Test with file" function should be supported. Parameters that are
needed:

* A file to parse
* Data header: Regular expression for finding the start of data
* Data splitter: Regular expression that splits data lines into columns
* Remove quotes: boolean option that removes "quotes" around values
* Ignore: Regular expression that matches lines to be ignored
* Data footer: Regular expression for finding the end of data
* Min/max data columns: The number of columns a data line must have, otherwise
  it is ignored
* Character set: The character set (e.g. iso-8859-1, utf-8, etc.) used in the file
* Decimal separator: whether dot or comma is used as the decimal separator for
  numeric values
* Date format: The date format used in the file

The above parameters are the same as those found in many of the existing import
plug-ins.

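As a rough illustration of how the regular expression parameters above interact,
the following sketch parses a list of lines with java.util.regex. It is not the
FlatFileParser implementation. In particular, it assumes that the data header
expression matches the column header line, that this line is consumed rather
than returned, and that expressions are matched with find().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class FlatFileSketch {
    static List<String[]> parse(List<String> lines, Pattern dataHeader, Pattern dataSplitter,
            Pattern ignore, Pattern dataFooter, int minColumns, int maxColumns,
            boolean removeQuotes) {
        List<String[]> rows = new ArrayList<>();
        boolean inData = false;
        for (String line : lines) {
            if (!inData) {
                // Everything up to and including the header line is skipped.
                inData = dataHeader.matcher(line).find();
                continue;
            }
            if (dataFooter != null && dataFooter.matcher(line).find()) {
                break; // end of data
            }
            if (ignore != null && ignore.matcher(line).find()) {
                continue;
            }
            String[] columns = dataSplitter.split(line, -1);
            if (columns.length < minColumns || columns.length > maxColumns) {
                continue; // wrong number of columns: the line is ignored
            }
            if (removeQuotes) {
                for (int i = 0; i < columns.length; i++) {
                    String c = columns[i];
                    if (c.length() >= 2 && c.startsWith("\"") && c.endsWith("\"")) {
                        columns[i] = c.substring(1, c.length() - 1);
                    }
                }
            }
            rows.add(columns);
        }
        return rows;
    }
}
```

Character set, decimal separator and date format are left out here since they
affect how the file is read and how values are converted, not how lines are
split.
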
Since each type of item has different properties, column mapping parameters vary
from case to case. Column mapping parameters may need to be divided into
subsections for clarity.

For the ID property we need one column mapping parameter, one enum parameter
to select which identification method to use and boolean parameters for selecting
which items to search.

For simple properties we need a single column mapping parameter.

For single-item references we need one column mapping parameter
[and one enum parameter to select which identification method to use].

All options (except the file to parse) in this step should also be available
to store as a plug-in configuration.


Step 2
------

This step is mainly about error handling options. Default values are marked with
*stars*.

* Default error handling: *fail*, skip line
* Item not found: fail, *create*, skip line
* Multiple items found: *fail*, skip line
* Referenced item not found: *fail*, ignore, skip line
* Multiple referenced items found: *fail*, ignore, skip line
* Missing a required property: *fail*, skip line
* String too long: *fail*, crop, ignore
* Invalid numeric value: *fail*, null, ignore
* Numeric value out of range: *fail*, ignore
* A log file for detailed error messages
* A boolean parameter for selecting 'dry-run'

If there are multi-item references the 'skip line' option above means that we
should skip all lines that are related to the same item.

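One way to keep the options and defaults listed above in a single place is an
enum, sketched below. The names ErrorAction and ErrorCondition are invented for
this example; the allowed actions and the defaults are copied from the list.

```java
import java.util.EnumSet;
import java.util.Set;

enum ErrorAction { FAIL, SKIP_LINE, CREATE, IGNORE, CROP, NULL }

enum ErrorCondition {
    // The first argument is the default, the rest are the alternatives.
    DEFAULT(ErrorAction.FAIL, ErrorAction.SKIP_LINE),
    ITEM_NOT_FOUND(ErrorAction.CREATE, ErrorAction.FAIL, ErrorAction.SKIP_LINE),
    MULTIPLE_ITEMS_FOUND(ErrorAction.FAIL, ErrorAction.SKIP_LINE),
    REFERENCED_ITEM_NOT_FOUND(ErrorAction.FAIL, ErrorAction.IGNORE, ErrorAction.SKIP_LINE),
    MULTIPLE_REFERENCED_ITEMS_FOUND(ErrorAction.FAIL, ErrorAction.IGNORE, ErrorAction.SKIP_LINE),
    MISSING_REQUIRED_PROPERTY(ErrorAction.FAIL, ErrorAction.SKIP_LINE),
    STRING_TOO_LONG(ErrorAction.FAIL, ErrorAction.CROP, ErrorAction.IGNORE),
    INVALID_NUMERIC_VALUE(ErrorAction.FAIL, ErrorAction.NULL, ErrorAction.IGNORE),
    NUMERIC_VALUE_OUT_OF_RANGE(ErrorAction.FAIL, ErrorAction.IGNORE);

    final ErrorAction defaultAction;
    final Set<ErrorAction> allowed;

    ErrorCondition(ErrorAction defaultAction, ErrorAction... alternatives) {
        this.defaultAction = defaultAction;
        this.allowed = EnumSet.of(defaultAction, alternatives);
    }

    /** Tells whether the user may select the given action for this condition. */
    boolean allows(ErrorAction action) {
        return allowed.contains(action);
    }
}
```

The wizard could build its parameter list by looping over
ErrorCondition.values(), so adding a new condition would not require any other
changes.
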
Implementation details
======================

We need some kind of basic, generic functionality that handles the file parsing,
property mapping, item lookup, error handling, logging, etc.

We also need functionality that is specific for each type of item the plug-in
should support. We need to know which properties exist on the items. For each
property we need to know the data type or if the property is a single-item or a
multi-item reference. We need factory methods for creating new items etc.

If possible, it should also be relatively easy to extend the item importer with
support for other item types in the future.

One possible approach is to use an abstract base class for the common functionality.
This class defines some abstract methods that must be implemented by subclasses
where each subclass handles a single type of item. This approach makes it relatively
easy to add support for other item types just by creating a new subclass. This
approach creates a separate plug-in for each item type, e.g. SampleImporter,
ExtractImporter, etc.


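A bare-bones sketch of the abstract base class approach is shown below. All
class and method names are hypothetical, SampleStub is a plain object standing
in for a real item class, and the map replaces the database. It only shows the
division of labour between the base class and a subclass.

```java
import java.util.HashMap;
import java.util.Map;

abstract class AbstractItemImporterSketch<I> {
    int created;
    int updated;

    // Item-specific parts that each subclass must provide.
    protected abstract I findItem(String identifier);
    protected abstract I createItem(String identifier);
    protected abstract void setProperty(I item, String property, String value);

    // Common part: find or create the item, then set the property.
    final I importLine(String identifier, String property, String value) {
        I item = findItem(identifier);
        if (item == null) {
            item = createItem(identifier);
            created++;
        } else {
            updated++;
        }
        setProperty(item, property, value);
        return item;
    }
}

// Plain stand-in for a real item class.
class SampleStub {
    final String name;
    String description;
    SampleStub(String name) { this.name = name; }
}

class SampleImporterSketch extends AbstractItemImporterSketch<SampleStub> {
    private final Map<String, SampleStub> samples = new HashMap<>();

    protected SampleStub findItem(String identifier) {
        return samples.get(identifier);
    }

    protected SampleStub createItem(String identifier) {
        SampleStub sample = new SampleStub(identifier);
        samples.put(identifier, sample);
        return sample;
    }

    protected void setProperty(SampleStub sample, String property, String value) {
        if ("description".equals(property)) {
            sample.description = value;
        } else {
            throw new IllegalArgumentException("Unknown property: " + property);
        }
    }
}
```

Supporting another item type would then mean writing one more subclass, while
file parsing, error handling and logging stay in the base class.
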
Item specific functionality that is needed
------------------------------------------

* Which item lookup methods are supported by the item. Name and id
  will be supported by all items, some have external id, etc.

* List of properties that can be imported. For each property we must know:
  - if it is a simple value, a single-item reference or a multi-item reference
  - the data type, e.g. string, float, int, SAMPLE, PROTOCOL, etc.
  - if the property is required or not
  - name and description and other details for better user experience

* A factory method for creating new items. Some items can be created without
  any parameters (e.g. Sample.getNew()), some require one or more parameters
  (e.g. LabeledExtract.getNew(Label)).

* Find an item. This functionality is already implemented by the annotation importer,
  but it is not very flexible since it uses reflection to find the getQuery() method.
  The annotation importer only works with items where the getQuery() method doesn't
  require any parameters.

Supported item types
====================

See batchimport_userperspective.txt