=====================
Generic item importer
=====================

This is a description of an import plug-in that can be used to import almost any
kind of item from a tab-separated text file (or compatible). The plug-in should
be able to create new items or update existing items. In both cases it should be
able to set values for:

* Simple properties, e.g. string values, numeric values, dates, etc.
* Single-item references, e.g. protocol, label, software, etc.
* Multi-item references, e.g. the labeled extracts of a hybridization,
  pooled samples, etc. In some cases a multi-item reference is bundled
  with simple values, e.g. the used quantity of a source biomaterial, the array
  index a labeled extract is used on, etc. Multi-item references are never
  removed by the importer, only added or updated. Removing an item from a
  multi-item reference is a manual procedure to be done using the web
  interface.

The importer should not be able to set values for annotations since this is handled
by the already existing annotation importer plug-in. The annotation importer
and item importer should have similar behaviour and functionality to minimize
the learning cost for users.

Key features
------------

The item importer is only expected to work on a single type of item at each use
and should read data from a single file.

The user should be able to select if new items should be created and/or if existing
items should be updated.

The importer should be able to work in 'dry-run' mode, i.e. everything is performed
as if a real import is taking place, but the work (transaction) is not committed
to the database.
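The dry-run idea can be sketched as follows. This is only an illustration under stated assumptions: `FakeTransaction` and `DryRunDemo` are hypothetical names that stand in for a real database transaction and are not part of the BASE API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a database transaction; not the BASE API. */
final class FakeTransaction {
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed;

    FakeTransaction(List<String> committed) {
        this.committed = committed;
    }

    void save(String item) { pending.add(item); }

    /** Makes all pending work permanent. */
    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    /** Discards all pending work. */
    void rollback() { pending.clear(); }
}

final class DryRunDemo {
    /**
     * Performs the complete import, but commits only when dryRun is false.
     * Returns the number of items that were processed.
     */
    static int runImport(List<String> items, List<String> database, boolean dryRun) {
        FakeTransaction tx = new FakeTransaction(database);
        int processed = 0;
        for (String item : items) {
            tx.save(item);
            processed++;
        }
        if (dryRun) {
            tx.rollback();
        } else {
            tx.commit();
        }
        return processed;
    }
}
```

The point of the sketch is that the same code path runs in both modes, so the summary reported after a dry run reflects what a real import would do.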

The importer should implement user-controlled error handling. Summary results,
e.g. number of items imported, number of failed items, etc. should be tracked
and reported as the job status message. More details about the error handling
options can be found later in this document.

The plug-in should support logging more detailed error messages, e.g. reasons for
failed items, line numbers, etc. to a separate file (in the BASE file system).
[IMPLEMENTATION NOTE] This file needs to be handled in a separate transaction,
otherwise a complete failure will erase the file as well when the transaction
is rolled back.

The plug-in should be configurable and be able to store file parsing settings,
including column mappings, etc. as plug-in configurations.

File format
-----------

The input file should be organised into columns separated by a specified character,
e.g. tab, comma, etc. Fixed-width columns are not supported. In other words, the
file is one that can be parsed with the FlatFileParser class.

The first data line contains the column headers, which define the contents of each
column. The column headers can be mapped to item properties at each use of the
plug-in or by saving predefined settings as a plug-in configuration. This also
includes the separator character and other information that is needed to parse the
file. Saved configurations should implement auto-detection functionality.

Data for a single item may be split onto multiple lines. The first line contains
simple properties, single-item references, and the first multi-item reference.
If there are more multi-item references they should be on the following lines and
the identifier column must have exactly the same value. Data in the columns
for simple properties and single-item references is ignored on the continuation
lines. The multi-line entry ends as soon as a line with a different identifier is
found or when the end of the file is reached.
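The multi-line rule can be sketched as a grouping of consecutive rows. This is a minimal illustration, assuming the lines have already been split into columns; `MultiLineGrouper` and the `idColumn` parameter are hypothetical names, not part of the plug-in.

```java
import java.util.ArrayList;
import java.util.List;

final class MultiLineGrouper {
    /**
     * Groups consecutive rows that have exactly the same value in the
     * identifier column. The first row of each group carries the simple
     * properties; the following rows only add multi-item references.
     */
    static List<List<String[]>> group(List<String[]> rows, int idColumn) {
        List<List<String[]>> groups = new ArrayList<>();
        List<String[]> current = null;
        String currentId = null;
        for (String[] row : rows) {
            String id = row[idColumn];
            // A different identifier ends the current multi-line entry.
            if (current == null || !id.equals(currentId)) {
                current = new ArrayList<>();
                groups.add(current);
                currentId = id;
            }
            current.add(row);
        }
        return groups;
    }
}
```

Note that only adjacent lines are merged in this sketch: if the same identifier appears again later in the file, it starts a new group.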

When reading data for an item the plug-in needs to know if it should create a new
item or update an existing item. First, we need to know the method for identifying
items. Depending on the item type there are two or more options:

* Using the internal 'id'. This is always unique.
* Using the 'name'. This may or may not be unique.
* Some items have an 'externalId'. This may or may not be unique.
* Array slides may have a 'barcode' which is similar to the externalId.

Other items may have other properties that may be used for identification. It
would be good to implement the item lookup part in a way that makes it easy to
add new lookup methods when the need arises.

The plug-in should ask the user which method to use. The user must also tell the
plug-in among which items it should look for an item with a given 'name' or
'externalId'. There are four options, and the user may select one or several
of them:

* Owned by the logged in user
* Shared to the logged in user
* In the current project
* Owned by other users (only available if the logged in user has enough
  permissions, e.g. generic read permission for the item type)

If the 'id' method is used, the above options are not used. In all cases, the
plug-in will only consider items for which the logged in user has write
permission. When the plug-in is looking for an item there are three possible
outcomes:

* No item is found. This can be handled in different ways:
  - An error condition which aborts the plug-in
  - The line is ignored
  - A new item is created
* One item is found. This is the item that is going to be updated.
* More than one item is found. This can be handled in different ways:
  - An error condition which aborts the plug-in
  - The line is ignored
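The three outcomes and their handling options can be sketched as a small decision function. The class and enum names below are hypothetical and chosen only for illustration.

```java
final class LookupOutcome {
    /** What the importer does with the current line. */
    enum Action { CREATE, UPDATE, SKIP_LINE, FAIL }

    /** User-selected handling when no item is found. */
    enum NotFound { FAIL, SKIP_LINE, CREATE }

    /** User-selected handling when more than one item is found. */
    enum MultipleFound { FAIL, SKIP_LINE }

    /** Maps the number of matching items to an action. */
    static Action decide(int matches, NotFound notFound, MultipleFound multiple) {
        if (matches == 0) {
            switch (notFound) {
                case CREATE:
                    return Action.CREATE;
                case SKIP_LINE:
                    return Action.SKIP_LINE;
                default:
                    return Action.FAIL;
            }
        }
        if (matches == 1) {
            // Exactly one match: this is the item to update.
            return Action.UPDATE;
        }
        return multiple == MultipleFound.SKIP_LINE ? Action.SKIP_LINE : Action.FAIL;
    }
}
```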


Parsing the data
================

Simple properties
-----------------

Converting the string values from the file is the responsibility of some
item-specific code that knows what kind of values to expect in each data
column.

Single-item references
----------------------

The value in the file is either the 'id', 'name', 'externalId' or another
natural identifier of the referenced item. The plug-in selects a "best" option
for each kind of item. Typically, this means that a lookup by 'name' is tried
first and, if no item is found, a lookup by 'externalId' is tried. As a last
resort, and if the value is numerical, a lookup by the internal 'id' is used.
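The fallback order described above can be sketched like this. It is an in-memory illustration only: `ReferenceLookup` and its nested `Item` class are hypothetical and do not correspond to BASE classes or queries.

```java
import java.util.ArrayList;
import java.util.List;

final class ReferenceLookup {
    /** Hypothetical minimal item; not a BASE class. */
    static final class Item {
        final int id;
        final String name;
        final String externalId;

        Item(int id, String name, String externalId) {
            this.id = id;
            this.name = name;
            this.externalId = externalId;
        }
    }

    /** Tries 'name' first, then 'externalId', then the numeric 'id'. */
    static List<Item> find(List<Item> candidates, String value) {
        List<Item> byName = new ArrayList<>();
        for (Item item : candidates) {
            if (value.equals(item.name)) byName.add(item);
        }
        if (!byName.isEmpty()) return byName;

        List<Item> byExternalId = new ArrayList<>();
        for (Item item : candidates) {
            if (value.equals(item.externalId)) byExternalId.add(item);
        }
        if (!byExternalId.isEmpty()) return byExternalId;

        // Last resort: only attempted when the value is numerical.
        List<Item> byId = new ArrayList<>();
        if (value.matches("\\d+")) {
            try {
                int id = Integer.parseInt(value);
                for (Item item : candidates) {
                    if (item.id == id) byId.add(item);
                }
            } catch (NumberFormatException e) {
                // Too large for an int: treated as "no item found".
            }
        }
        return byId;
    }
}
```

The returned list may be empty, hold one item, or hold several, which corresponds to the three outcomes the importer has to handle.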

When looking for item references the plug-in doesn't have to use the same
setting for 'owned by', 'shared to', etc. as when looking for the main items.
In fact, this is not desired since many of those items are owned by the root
user or a system administrator, e.g. labels, software, hardware, etc. I don't
think it is practical to have another option for selecting this for each type
of item reference, so the default should be to look among all items the user
has access to (with use permission).

There are three outcomes:

* No item is found.
  - This can be an error condition that aborts the plug-in. If it is a
    required property this will always happen.
  - The link is ignored. No call is made to the setABC() method. Note that this
    case is different from having an empty column, in which case
    setABC(null) would be called.
* One item is found. This is the item we link to.
* Multiple items are found.
  - This can be an error condition that aborts the plug-in.
  - The link is ignored.


Multi-item references
---------------------

This should work in the same way as single-item references.


Using the plug-in
=================

Configuring the plug-in is done with the usual wizard. There will be plenty of
parameters so it is probably a good idea to use a multi-step wizard. This may have
to be tried out by actual users before we make any final decisions.

Step 1
------

The user selects a file and enters values for the regular expressions and other
options for parsing the file. Column mappings are also specified in this
step. The "Test with file" function should be supported. Parameters that are
needed:

* A file to parse
* Mode to use: create and/or update
* Data header: Regular expression for finding the start of data
* Data splitter: Regular expression that splits data lines into columns
* Remove quotes: Boolean option that removes "quotes" around values
* Ignore: Regular expression that matches lines to be ignored
* Data footer: Regular expression for finding the end of data
* Min/max data columns: The number of columns a data line must have, otherwise
  it is ignored
* Character set: The character set (e.g. iso-8859-1, utf-8, etc.) used in the file
* Decimal separator: Whether dot or comma is used as the decimal separator for
  numeric values
* Date format: The date format used in the file
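To illustrate how the data splitter, remove quotes and min/max data columns parameters interact, here is a minimal sketch using only `java.util.regex`. It is not the FlatFileParser implementation; `LineSplitter` and its parameter names are assumptions made for this example.

```java
import java.util.regex.Pattern;

final class LineSplitter {
    /**
     * Splits a data line into columns with the given regular expression.
     * Returns null when the number of columns is outside [minCols, maxCols],
     * which means that the line should be ignored.
     */
    static String[] split(String line, Pattern splitter, boolean removeQuotes,
                          int minCols, int maxCols) {
        // A negative limit keeps trailing empty columns.
        String[] cols = splitter.split(line, -1);
        if (cols.length < minCols || cols.length > maxCols) {
            return null;
        }
        if (removeQuotes) {
            for (int i = 0; i < cols.length; i++) {
                String c = cols[i];
                if (c.length() >= 2 && c.startsWith("\"") && c.endsWith("\"")) {
                    cols[i] = c.substring(1, c.length() - 1);
                }
            }
        }
        return cols;
    }
}
```

For a tab-separated file the splitter would be `Pattern.compile("\\t")`; a line with too few or too many columns yields `null` and is skipped by the caller.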

The above parameters are the same as those found in many of the existing import
plug-ins.

For item identification we need an enum parameter to select which identification
method to use and boolean parameters for selecting which items to search.

Since each type of item has different properties, column mapping parameters vary
from case to case. Column mapping parameters may need to be divided into
subsections for clarity.

* For simple properties we need a single column mapping parameter.

* For single-item references we need one column mapping parameter.

All options related to the file parsing should also be available to store as a
plug-in configuration.


Step 2
------

This step is mainly about error handling options. Default values are marked with
*stars*.

* Default error handling: *fail*, skip line
* Item not found: fail, *create*, skip line
* Multiple items found: *fail*, skip line
* Referenced item not found: *fail*, ignore, skip line
* Multiple referenced items found: *fail*, ignore, skip line
* Missing a required property: *fail*, skip line
* String too long: *fail*, crop, ignore
* Invalid numeric value: *fail*, null, ignore
* Numeric value out of range: *fail*, ignore
* A log file for detailed error messages
* A boolean parameter for selecting 'dry-run'

If there are multi-item references the 'skip line' option above means that we
should skip all lines that are related to the same item.
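As an example of how one of these options might be applied, the sketch below handles the 'String too long' case. The class name, the enum and the convention of returning null for 'ignore' are assumptions made for illustration.

```java
final class StringTooLongPolicy {
    /** The three choices listed for 'String too long'. */
    enum Mode { FAIL, CROP, IGNORE }

    /**
     * Returns the value to set on the item, or null when the value should
     * be ignored (in which case the caller makes no setter call).
     * Throws IllegalArgumentException in FAIL mode.
     */
    static String apply(String value, int maxLength, Mode mode) {
        if (value.length() <= maxLength) {
            return value;
        }
        switch (mode) {
            case CROP:
                return value.substring(0, maxLength);
            case IGNORE:
                return null;
            default:
                throw new IllegalArgumentException(
                    "String too long: " + value.length() + " > " + maxLength);
        }
    }
}
```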

Implementation details
======================

We need some kind of basic, generic functionality that handles the file parsing,
property mapping, item lookup, error handling, logging, etc.

We also need functionality that is specific for each type of item the plug-in
should support. We need to know which properties exist on the items. For each
property we need to know the data type or if the property is a single-item or a
multi-item reference. We need factory methods for creating new items, etc.

If possible, it should also be relatively easy to extend the item importer with
support for other item types in the future.

One possible approach is to use an abstract base class for the common functionality.
This class defines some abstract methods that must be implemented by subclasses,
where each subclass handles a single type of item. This approach makes it relatively
easy to add support for other item types just by creating a new subclass. This
approach creates a separate plug-in for each item type, e.g. SampleImporter,
ExtractImporter, etc.
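A rough sketch of the abstract base class approach is shown below. Only the name SampleImporter comes from this document; `AbstractItemImporter`, `DemoSample` and the method names are hypothetical, and the common functionality is reduced to a simple loop.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Common functionality shared by all item importers (hypothetical sketch). */
abstract class AbstractItemImporter<T> {
    /** Factory method: each subclass knows how to create its own item type. */
    protected abstract T createItem(String[] row);

    /** Item-specific list of properties that can be imported. */
    protected abstract List<String> propertyNames();

    /** Generic part: here it only creates one item per row. */
    final List<T> importRows(List<String[]> rows) {
        List<T> result = new ArrayList<>();
        for (String[] row : rows) {
            result.add(createItem(row));
        }
        return result;
    }
}

/** Hypothetical item class that stands in for a real BASE item. */
final class DemoSample {
    final String name;

    DemoSample(String name) {
        this.name = name;
    }
}

/** One subclass per item type. */
final class SampleImporter extends AbstractItemImporter<DemoSample> {
    @Override
    protected DemoSample createItem(String[] row) {
        // Assumes that the name is in the first column.
        return new DemoSample(row[0]);
    }

    @Override
    protected List<String> propertyNames() {
        return Collections.singletonList("name");
    }
}
```

Adding support for another item type would then mean writing one more subclass, while parsing, lookup and error handling stay in the base class.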


Item-specific functionality that is needed
------------------------------------------

* Which item lookup methods are supported by the item. Name and id
  will be supported by all items, some have external id, etc.

* List of properties that can be imported. For each property we must know:
  - if it is a simple value, a single-item reference or a multi-item reference
  - the data type, e.g. string, float, int, SAMPLE, PROTOCOL, etc.
  - if the property is required or not
  - name and description and other details for better user experience

* A factory method for creating new items. Some items can be created without
  any parameters (e.g. Sample.getNew()), some require one or more parameters
  (e.g. LabeledExtract.getNew(Label)).

* Find an item. This functionality is already implemented by the annotation importer,
  but it is not very flexible since it uses reflection to find the getQuery() method.
  The annotation importer only works with items where the getQuery() method doesn't
  require any parameters.

Supported item types
====================

See batchimport_userperspective.txt