=====================
Generic item importer
=====================

This is a description of an import plug-in that can be used to import almost any
kind of item from a tab-separated text file (or compatible). The plug-in should
be able to create new items or update existing items. In both cases it should be
able to set values for:

* Simple properties, e.g. string values, numeric values, dates, etc.
* Single-item references, e.g. protocol, label, software, owner, etc.
* Multi-item references, e.g. the labeled extracts of a hybridization,
  pooled samples, etc. In some cases a multi-item reference is bundled
  with simple values, e.g. the used quantity of a source biomaterial, the array
  index a labeled extract is used on, etc. Multi-item references are never
  removed by the importer, only added or updated. Removing an item from a
  multi-item reference is a manual procedure to be done using the web
  interface.

The importer should not be able to set values for annotations since this is handled
by the already existing annotation importer plug-in. The annotation importer
and item importer should have similar behaviour and functionality to minimize
the learning cost for users.

Key features
------------

The item importer is only expected to work on a single type of item at each use
and should read data from a single file.

The importer should be able to work in 'dry-run' mode, i.e. everything is performed
as if a real import is taking place, but the work (transaction) is not committed
to the database.

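The dry-run idea can be illustrated with a minimal sketch. The ImportTransaction
interface and the other names below are hypothetical stand-ins made up for this
example; they are not the BASE API. The only point shown is that the same work is
performed in both modes and only the final step differs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a database transaction; NOT the BASE API.
interface ImportTransaction {
    void commit();
    void rollback();
}

// Records the calls it receives so the sketch can run without a database.
class RecordingTransaction implements ImportTransaction {
    final List<String> calls = new ArrayList<>();
    @Override public void commit() { calls.add("commit"); }
    @Override public void rollback() { calls.add("rollback"); }
}

class DryRunSketch {
    // Performs the import work, then commits only if this is not a dry run.
    static String finish(ImportTransaction tx, Runnable importWork, boolean dryRun) {
        try {
            importWork.run();
        } catch (RuntimeException e) {
            // A failure during the import always discards the work.
            tx.rollback();
            throw e;
        }
        if (dryRun) {
            // Everything was performed, but nothing is saved.
            tx.rollback();
            return "Dry run: nothing was saved";
        }
        tx.commit();
        return "Import committed";
    }
}
```

Calling DryRunSketch.finish(tx, work, true) runs the work and then rolls back, so
validation errors still surface while the database stays unchanged.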
The importer should implement user-controlled error handling. Summary results,
e.g. number of items imported, number of failed items, etc. should be tracked
and reported as the job status message. More details about the error handling
options can be found later in this document.

The plug-in should support logging more detailed error messages, e.g. reasons for
failed items, line numbers, etc. to a separate file (in the BASE file system).
[IMPLEMENTATION NOTE] This file needs to be handled in a separate transaction,
otherwise a complete failure will erase the file as well when the transaction
is rolled back.

The plug-in should be configurable and be able to store file parsing settings,
including column mappings, etc. as plug-in configurations.

File format
-----------

The input file should be organised into columns separated by a specified character,
e.g. tab, comma, etc. Fixed-width columns are not supported. In other words, the
file must be one that can be parsed with the FlatFileParser class.

The first data line contains the column headers, which define the contents of each
column. The column headers can be mapped to item properties at each use of the
plug-in or by saving predefined settings as a plug-in configuration. This also
includes the separator character and other information that is needed to parse the
file. Saved configurations should implement auto-detection functionality.

Data for a single item may be split onto multiple lines. The first line contains
simple properties, single-item references and the first multi-item reference.
If there are more multi-item references they should be on the following lines with
empty values in all other columns, except for the column holding the item
identifier, which must have the same value on all lines. If the following
lines contain other data, this should be ignored, or it may be considered an
error condition. It may be caused by giving two items the same name by accident.

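As an illustration of the multi-line layout, the sketch below groups tab-separated
data lines by the value in the identifier column and checks whether a continuation
line carries data outside the allowed columns. The class and method names are made
up for this example, and plain String.split() is used instead of the
FlatFileParser class, so this is only an approximation of the intended behaviour.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for this sketch; not part of BASE.
class ItemLineGrouper {
    // Groups data lines by the value in the identifier column.
    // Assumption: all lines with the same identifier belong to the same item.
    // The LinkedHashMap keeps items in the order they first appear in the file.
    static Map<String, List<String[]>> group(List<String> dataLines, int idColumn) {
        Map<String, List<String[]>> grouped = new LinkedHashMap<>();
        for (String line : dataLines) {
            // The limit -1 keeps trailing empty columns.
            String[] columns = line.split("\t", -1);
            grouped.computeIfAbsent(columns[idColumn], k -> new ArrayList<>()).add(columns);
        }
        return grouped;
    }

    // Returns true if a continuation line has a value in any column other
    // than the identifier column and the multi-item reference column.
    static boolean hasUnexpectedData(String[] columns, int idColumn, int multiRefColumn) {
        for (int i = 0; i < columns.length; i++) {
            if (i != idColumn && i != multiRefColumn && !columns[i].isEmpty()) {
                return true;
            }
        }
        return false;
    }
}
```

Whether unexpected data on a continuation line is ignored or treated as an error
would then be decided by the error handling options.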
When reading data for an item the plug-in needs to know if it should create a new
item or update an existing item. First, we need to know the method for identifying
items. Depending on the item type there are two or three options:

* Using the internal 'id'. This is always unique.
* Using the 'name'. This may or may not be unique.
* Some items have an 'externalId'. This may or may not be unique.

The plug-in should ask the user which method to use. The user must also tell the
plug-in among which items it should look for an item with a given 'name' or
'externalId'. There are four options, and the user may select one or several
of them:

* Owned by the logged in user
* Shared to the logged in user
* In the current project
* Owned by other users (only available if the logged in user has enough
  permissions, e.g. generic read permission for the item type)

If the 'id' method is used, the above options are not used. When the plug-in
is looking for an item there are three possible outcomes:

* No item is found. This can be handled in different ways:
  - An error condition which aborts the plug-in
  - The line is ignored
  - A new item is created
* One item is found. This is the item that is going to be updated.
* More than one item is found. This can be handled in different ways:
  - An error condition which aborts the plug-in
  - The line is ignored

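The three lookup outcomes listed above could be mapped to a decision roughly as in
the following sketch. All enum and class names are hypothetical and chosen only
for this example; aborting the plug-in is represented by throwing an exception.

```java
import java.util.List;

// Hypothetical sketch of the create/update decision; not part of BASE.
class ItemLookupSketch {
    enum NotFoundAction { FAIL, SKIP_LINE, CREATE }
    enum MultipleFoundAction { FAIL, SKIP_LINE }
    enum Decision { CREATE, UPDATE, SKIP_LINE }

    // Decides what to do with a line given the items matching its identifier.
    static Decision decide(List<?> matches, NotFoundAction notFound,
                           MultipleFoundAction multipleFound) {
        if (matches.isEmpty()) {
            switch (notFound) {
                case CREATE: return Decision.CREATE;
                case SKIP_LINE: return Decision.SKIP_LINE;
                default: throw new IllegalStateException("Item not found");
            }
        }
        if (matches.size() == 1) {
            // Exactly one match: this is the item to update.
            return Decision.UPDATE;
        }
        if (multipleFound == MultipleFoundAction.SKIP_LINE) {
            return Decision.SKIP_LINE;
        }
        throw new IllegalStateException(
            "Found " + matches.size() + " items; expected at most one");
    }
}
```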
Parsing the data
================

Simple properties
-----------------

We need to know the data type of each property as a Type object.
The string values can then be parsed with Type.parseString().

Single-item references
----------------------

This is either the 'id', 'name' or 'externalId' of the item. The plug-in should
support those cases by a single column mapping and an option to select which
method to use.

[NOTE]
This creates two input parameters for each column, which may be too many.
Alternative options are:

* A global option for all item references. This doesn't give the user
  any chance to use different methods for different items.
* A global option with an 'auto' alternative that uses the 'id' method
  for numeric values, otherwise first the 'name' and if no item is
  found the 'externalId'.
------

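The 'auto' alternative from the note above might look something like the sketch
below. The three lookup functions are placeholders supplied by the caller, not
BASE queries, and the sketch assumes that each lookup returns at most one item.
Handling of multiple matches is left out.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the 'auto' identification method; not part of BASE.
class AutoReferenceResolver {
    // Numeric values are looked up by id; other values are looked up by
    // name first and then by externalId if no item has that name.
    static <T> Optional<T> resolve(String value,
            Function<Integer, Optional<T>> byId,
            Function<String, Optional<T>> byName,
            Function<String, Optional<T>> byExternalId) {
        // At most 9 digits, so Integer.parseInt() cannot overflow.
        // Longer digit strings are treated as names in this sketch.
        if (value.matches("\\d{1,9}")) {
            return byId.apply(Integer.parseInt(value));
        }
        Optional<T> found = byName.apply(value);
        return found.isPresent() ? found : byExternalId.apply(value);
    }
}
```

One consequence of this rule is that an item whose name consists only of digits
can never be found by name, which may be a reason to prefer an explicit option.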
When looking for item references the plug-in doesn't have to use the same
setting for 'owned by', 'shared to', etc. as when looking for the main items.
In fact, this is not desired since many of those items are owned by the root
user or a system administrator, e.g. labels, software, hardware, etc. It is
probably not practical to have another option for selecting this for each type
of item reference. The code may be prepared for it, but the default should be to
look among all items the user has access to.

In any case, there are three outcomes:

* No item is found.
  - This can be an error condition that aborts the plug-in. If it is a
    required property this will always happen.
  - The link is ignored. No call is made to the setABC() method. Note that this
    case is different from having an empty column, in which case
    setABC(null) would be called.
* One item is found. This is the item we link to.
* Multiple items are found.
  - This can be an error condition that aborts the plug-in.
  - The link is ignored.

Multi-item references
---------------------

This should work in the same way as single-item references.

Using the plug-in
=================

Configuring the plug-in is done with the usual wizard. There will be plenty of
parameters so it is probably a good idea to use a multi-step wizard. This may have
to be tried out by actual users before we make any final decisions.

Step 1
------

The user selects a file and enters values for the regular expressions and other
options for parsing the file. Column mappings are also specified in this
step. The "Test with file" function should be supported. Parameters that are
needed:

* A file to parse
* Data header: Regular expression for finding the start of data
* Data splitter: Regular expression that splits data lines into columns
* Remove quotes: Boolean option that removes "quotes" around values
* Ignore: Regular expression that matches lines to be ignored
* Data footer: Regular expression for finding the end of data
* Min/max data columns: The number of columns a data line must have, otherwise
  it is ignored
* Character set: The character set (e.g. iso-8859-1, utf-8, etc.) used in the file
* Decimal separator: Whether dot or comma is used as a decimal separator for numeric values

The above parameters are the same as those found in many of the existing import
plug-ins.

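To make a few of these parameters concrete, here is a small sketch of how the
'Data splitter', 'Remove quotes', 'Min/max data columns' and 'Decimal separator'
options could act on a single data line. It uses only java.util.regex and is not
the FlatFileParser implementation; the class and method names are invented for
this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of some parsing options; not the FlatFileParser class.
class LineSplitterSketch {
    // Splits a data line into columns with the 'data splitter' expression
    // and optionally removes a pair of surrounding double quotes.
    static List<String> split(String line, Pattern dataSplitter, boolean removeQuotes) {
        List<String> values = new ArrayList<>();
        // The limit -1 keeps trailing empty columns.
        for (String raw : dataSplitter.split(line, -1)) {
            String value = raw;
            if (removeQuotes && value.length() >= 2
                    && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            values.add(value);
        }
        return values;
    }

    // A line outside the min/max column range is ignored by the importer.
    static boolean hasValidColumnCount(List<String> columns, int min, int max) {
        return columns.size() >= min && columns.size() <= max;
    }

    // Parses a numeric value that uses either dot or comma as decimal separator.
    static double parseDecimal(String value, char decimalSeparator) {
        String normalized = decimalSeparator == ',' ? value.replace(',', '.') : value;
        return Double.parseDouble(normalized);
    }
}
```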
Since each type of item has different properties, column mapping parameters vary
from case to case. Column mapping parameters may need to be divided into
subsections for clarity.

For the ID property we need one column mapping parameter, one enum parameter
to select which identification method to use and one enum parameter for selecting
which items to search (multi-choice).

For simple properties we need a single column mapping parameter.

For single-item references we need one column mapping parameter and one enum
parameter to select which identification method to use.

All options (except the file to parse) in this step should also be available
to store as a plug-in configuration. But in this case we need a 'step 0' which
asks which type of item the configuration is to be used with. Otherwise
we don't know which properties, etc. to provide column mappings for.

Step 2
------

This step is mainly about error handling options. Default values are marked with
*stars*.

* Default error handling: *fail*, skip line
* Item not found: fail, *create*, skip line
* Multiple items found: *fail*, skip line
* Referenced item not found: *fail*, ignore, skip line
* Multiple referenced items found: *fail*, ignore, skip line
* Missing a required property: *fail*, ignore if updating, skip line
* String too long: *fail*, crop, ignore
* Invalid numeric value: *fail*, ignore
* Numeric value out of range: *fail*, ignore
* A log file for detailed error messages

If there are multi-item references the 'skip line' option above means that we
should skip all lines that are related to the same item.

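As an example of how one of the options above could be applied, the sketch below
handles the 'String too long' case with its three alternatives. The names are
hypothetical, and an empty Optional is used here to mean "do not set the
property", which is an assumption about how 'ignore' should behave.

```java
import java.util.Optional;

// Hypothetical sketch of the 'String too long' option; not part of BASE.
class StringLengthPolicy {
    enum TooLongAction { FAIL, CROP, IGNORE }

    // Returns the value to set on the item, or an empty Optional if the
    // value should be ignored. FAIL is represented by an exception.
    static Optional<String> apply(String value, int maxLength, TooLongAction action) {
        if (value.length() <= maxLength) {
            return Optional.of(value);
        }
        switch (action) {
            case CROP:
                return Optional.of(value.substring(0, maxLength));
            case IGNORE:
                return Optional.empty();
            default:
                throw new IllegalArgumentException(
                    "String too long: " + value.length() + " > " + maxLength);
        }
    }
}
```

The numeric options ('Invalid numeric value', 'Numeric value out of range') could
follow the same pattern with their own enums.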
Implementation details
======================

We need some kind of basic, generic functionality that handles the file parsing,
property mapping, item lookup, error handling, logging, etc.

We also need functionality that is specific for each type of item the plug-in
should support. We need to know which properties exist on the items. For each
property we need to know the data type or if the property is a single-item or
multi-item reference. We need factory methods for creating new items, etc.

If possible, it should also be relatively easy to extend the item importer with
support for other item types in the future.

One possible approach is to use an abstract base class for the common functionality.
This class defines some abstract methods that must be implemented by subclasses
where each subclass handles a single type of item. This approach makes it relatively
easy to add support for other item types just by creating a new subclass. This
approach creates a separate plug-in for each item type, e.g. SampleImporter,
ExtractImporter, etc.

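A rough sketch of the abstract base class approach is shown below. SketchExtract
is a made-up item class used only so the example is self-contained, and the
method names are assumptions rather than a proposed API. For brevity the sketch
always creates a new item and skips empty columns, which is a simplification of
the behaviour described earlier in this document.

```java
// Hypothetical item type used only for this sketch; not a BASE class.
class SketchExtract {
    String name;
    String description;
}

// Common functionality lives in the base class.
abstract class ItemImporterSketch<I> {
    // Item-specific: create a new, empty item.
    protected abstract I createItem();

    // Item-specific: set one property from its string value.
    protected abstract void setProperty(I item, String property, String value);

    // Generic: create an item and set every mapped column that has a value.
    public final I importLine(String[] headers, String[] values) {
        I item = createItem();
        for (int i = 0; i < headers.length && i < values.length; i++) {
            if (!values[i].isEmpty()) {
                setProperty(item, headers[i], values[i]);
            }
        }
        return item;
    }
}

// One subclass per item type.
class ExtractImporterSketch extends ItemImporterSketch<SketchExtract> {
    @Override
    protected SketchExtract createItem() {
        return new SketchExtract();
    }

    @Override
    protected void setProperty(SketchExtract item, String property, String value) {
        switch (property) {
            case "name": item.name = value; break;
            case "description": item.description = value; break;
            default: throw new IllegalArgumentException("Unknown property: " + property);
        }
    }
}
```

Adding support for another item type would then mean writing one more subclass
that implements the two abstract methods.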
Item specific functionality that is needed
------------------------------------------

* List of properties that can be imported. For each property we must know:
  - if it is a simple value, a single-item reference or a multi-item reference
  - the data type, e.g. string, float, int, SAMPLE, PROTOCOL, etc.
  - if the property is required or not
  - name and description and other details for better user experience
  This may be implemented as a 'Property' interface, with concrete implementations
  for each type. The implementation should also know how to set a value on items,
  e.g. NameProperty.setValue(item, theName) --> item.setName(theName). Note! Some
  properties may require multiple parameters, e.g. setting the source item and
  used quantity: BioMaterialEvent.addSource(source, usedQuantity).

* A factory method for creating new items. Some items can be created without
  any parameters (e.g. Sample.getNew()), some require one or more parameters
  (e.g. LabeledExtract.getNew(Label)).

* Find an item. This functionality is already implemented by the annotation importer,
  but it is not very flexible since it uses reflection to find the getQuery() method.
  The annotation importer only works with items where the getQuery() method doesn't
  require any parameters.

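The 'Property' idea from the first bullet above could be sketched as follows.
SketchSample stands in for a real item class, and the interface and method names
are assumptions made for this example only. Properties that need more than one
parameter (such as a source item plus used quantity) are not covered.

```java
// Hypothetical item type used only for this sketch; not a BASE class.
class SketchSample {
    private String name;
    String getName() { return name; }
    void setName(String name) { this.name = name; }
}

// Describes one importable property of item type I with value type V.
interface ImportProperty<I, V> {
    String getTitle();
    boolean isRequired();
    V parse(String raw);
    void setValue(I item, V value);

    // Parses the raw column value and sets it on the item. A missing value
    // is an error for required properties and is skipped otherwise.
    default void importValue(I item, String raw) {
        if (raw == null || raw.isEmpty()) {
            if (isRequired()) {
                throw new IllegalArgumentException("Missing required property: " + getTitle());
            }
            return;
        }
        setValue(item, parse(raw));
    }
}

// Concrete property: setValue(item, theName) --> item.setName(theName)
class SampleNameProperty implements ImportProperty<SketchSample, String> {
    @Override public String getTitle() { return "Name"; }
    @Override public boolean isRequired() { return true; }
    @Override public String parse(String raw) { return raw; }
    @Override public void setValue(SketchSample item, String value) { item.setName(value); }
}
```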
Supported item types
====================

TO BE DONE.