=====================
Generic item importer
=====================

This is a description of an import plug-in that can be used to import almost any
kind of item from a tab-separated text file (or compatible). The plug-in should
be able to create new items or update existing items. In both cases it should be
able to set values for:

* Simple properties, e.g. string values, numeric values, dates, etc.
* Single-item references, e.g. protocol, label, software, etc.
* Multi-item references, e.g. the labeled extracts of a hybridization,
  pooled samples, etc. In some cases a multi-item reference is bundled
  with simple values, e.g. the used quantity of a source biomaterial, the array
  index a labeled extract is used on, etc. Multi-item references are never
  removed by the importer, only added or updated. Removing an item from a
  multi-item reference is a manual procedure to be done using the web
  interface.

The importer should not be able to set values for annotations since this is handled
by the already existing annotation importer plug-in. The annotation importer
and item importer should have similar behaviour and functionality to minimize
the learning cost for users.

Key features
------------

The item importer is only expected to work on a single type of item at each use
and should read data from a single file.

The user should be able to select if new items should be created and/or if existing
items should be updated.

The importer should be able to work in 'dry-run' mode, i.e. everything is performed
as if a real import is taking place, but the work (transaction) is not committed
to the database.

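The dry-run behaviour can be illustrated with a minimal sketch. The
ImportTransaction interface and the DryRunSupport class below are hypothetical
names made up for this example and are not part of the BASE API. The sketch
only shows the idea that the same work is performed in both modes and that the
final commit is the only difference.

```java
/** Hypothetical stand-in for a database transaction; this is not the BASE API. */
interface ImportTransaction {
    void commit();

    void rollback();
}

final class DryRunSupport {
    private DryRunSupport() {
    }

    /**
     * Runs the import work inside the given transaction. The transaction is
     * committed only if the work succeeds and dryRun is false; otherwise it
     * is rolled back. Returns true if the transaction was committed.
     */
    static boolean runImport(ImportTransaction tx, Runnable importWork, boolean dryRun) {
        try {
            importWork.run();
        } catch (RuntimeException e) {
            tx.rollback(); // a failed import never leaves partial changes
            throw e;
        }
        if (dryRun) {
            tx.rollback(); // everything was performed, but nothing is kept
            return false;
        }
        tx.commit();
        return true;
    }
}
```
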
The importer should implement user-controlled error handling. Summary results,
e.g. the number of imported items, the number of failed items, etc., should be
tracked and reported as the job status message. More details about the error
handling options can be found later in this document.

The plug-in should support logging of more detailed error messages, e.g. reasons
for failed items, line numbers, etc., to a separate file (in the BASE file system).
[IMPLEMENTATION NOTE] This file needs to be handled in a separate transaction,
otherwise a complete failure will erase the file as well when the transaction
is rolled back.

The plug-in should be configurable and be able to store file parsing settings,
including column mappings, etc. as plug-in configurations.

File format
-----------

The input file should be organised into columns separated by a specified character,
e.g. tab, comma, etc. Fixed-width columns are not supported. In other words, the
file must be one that can be parsed with the FlatFileParser class.

The first data line contains the column headers, which define the contents of each
column. The column headers can be mapped to item properties at each use of the
plug-in or by saving predefined settings as a plug-in configuration. A saved
configuration also includes the separator character and other information that
is needed to parse the file. Saved configurations should implement auto-detection
functionality.

Data for a single item may be split onto multiple lines. The first line contains
simple properties, single-item references, and the first multi-item reference.
If there are more multi-item references they should be on the following lines and
the identifier column must have exactly the same value. Data in the columns
for simple properties and single-item references is ignored on the continuation
lines. The multi-line entry ends as soon as a line with a different identifier is
found or when the end of the file is reached.

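The multi-line rule can be sketched as a grouping of consecutive lines on the
identifier column. The class and method names below are assumptions made for
this illustration only. Each row is an array of already split column values.

```java
import java.util.ArrayList;
import java.util.List;

final class MultiLineGrouper {
    private MultiLineGrouper() {
    }

    /**
     * Groups consecutive rows that have exactly the same value in the
     * identifier column. A row with a different identifier starts a new
     * entry, so the same identifier appearing again later is a new entry.
     */
    static List<List<String[]>> group(List<String[]> rows, int idColumn) {
        List<List<String[]>> entries = new ArrayList<>();
        List<String[]> current = null;
        String currentId = null;
        for (String[] row : rows) {
            String id = row[idColumn];
            if (current == null || !id.equals(currentId)) {
                current = new ArrayList<>();
                entries.add(current);
                currentId = id;
            }
            current.add(row);
        }
        return entries;
    }
}
```

The first row of each entry carries the simple properties and single-item
references. The remaining rows of the entry only contribute their multi-item
reference columns.
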
When reading data for an item the plug-in needs to know if it should create a new
item or update an existing item. First, we need to know the method for identifying
items. Depending on the item type there are two or three options:

* Using the internal 'id'. This is always unique.
* Using the 'name'. This may or may not be unique.
* Some items have an 'externalId'. This may or may not be unique.
* Array slides may have a 'barcode' which is similar to the externalId.

Other items may have other properties that may be used for identification. It
would be good to implement the item lookup part in a way that makes it easy to
add new lookup methods when the need arises.

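One way to keep the lookup part extensible is a small registry of named lookup
methods. This is only a sketch under assumptions: LookupItem and LookupMethods
are hypothetical names, and the search runs over an in-memory list, whereas a
real implementation would presumably run a database query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical item used only for this illustration; not a BASE class. */
final class LookupItem {
    final int id;
    final String name;
    final String externalId;

    LookupItem(int id, String name, String externalId) {
        this.id = id;
        this.name = name;
        this.externalId = externalId;
    }
}

/** Registry of lookup methods; a new method is added with one register() call. */
final class LookupMethods {
    private final Map<String, Function<LookupItem, String>> extractors = new LinkedHashMap<>();

    void register(String methodName, Function<LookupItem, String> extractor) {
        extractors.put(methodName, extractor);
    }

    /** Returns all candidates whose identifier, for the given method, equals the value. */
    List<LookupItem> find(String methodName, String value, List<LookupItem> candidates) {
        Function<LookupItem, String> extractor = extractors.get(methodName);
        if (extractor == null) {
            throw new IllegalArgumentException("Unknown lookup method: " + methodName);
        }
        List<LookupItem> matches = new ArrayList<>();
        for (LookupItem item : candidates) {
            if (value.equals(extractor.apply(item))) {
                matches.add(item);
            }
        }
        return matches;
    }
}
```

With this layout, supporting 'barcode' for array slides would mean registering
one more extractor, without touching the code that calls find().
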
The plug-in should ask the user which method to use. The user must also tell the
plug-in among which items it should look for an item with a given 'name' or
'externalId'. There are four options, and the user may select one or several
of them:

* Owned by the logged in user
* Shared to the logged in user
* In the current project
* Owned by other users (only available if the logged in user has enough
  permissions, e.g. generic read permission for the item type)

If the 'id' method is used, the above options are not used. In all cases, the
plug-in will only consider items for which the logged in user has write
permission. When the plug-in is looking for an item there are three possible
outcomes:

* No item is found. This can be handled in different ways:
  - An error condition which aborts the plug-in
  - The line is ignored
  - A new item is created
* One item is found. This is the item that is going to be updated.
* More than one item is found. This can be handled in different ways:
  - An error condition which aborts the plug-in
  - The line is ignored


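The three lookup outcomes and the ways to handle them can be summarised as a
small decision function. The enum and class names are hypothetical and chosen
for this sketch only. An aborting error is represented by an exception.

```java
enum NotFoundAction { FAIL, SKIP, CREATE }

enum MultipleFoundAction { FAIL, SKIP }

enum ImportDecision { CREATE, UPDATE, SKIP }

final class LookupOutcome {
    private LookupOutcome() {
    }

    /**
     * Maps the number of items found to what the importer should do with
     * the current line. Throws IllegalStateException for the "abort" cases.
     */
    static ImportDecision decide(int matchCount, NotFoundAction onNone, MultipleFoundAction onMany) {
        if (matchCount < 0) {
            throw new IllegalArgumentException("matchCount must not be negative");
        }
        if (matchCount == 1) {
            return ImportDecision.UPDATE; // exactly one item: update it
        }
        if (matchCount == 0) {
            switch (onNone) {
                case CREATE:
                    return ImportDecision.CREATE;
                case SKIP:
                    return ImportDecision.SKIP;
                default:
                    throw new IllegalStateException("Item not found");
            }
        }
        if (onMany == MultipleFoundAction.SKIP) {
            return ImportDecision.SKIP;
        }
        throw new IllegalStateException("Multiple items found: " + matchCount);
    }
}
```

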
Parsing the data
================

Simple properties
-----------------
Converting the string values from the file is the responsibility of
item-specific code that knows what kind of values to expect in each data
column.

Single-item references
----------------------

The value in the file is either the 'id', 'name', 'externalId' or another natural
identifier of the referenced item. The plug-in selects a "best" option for each
kind of item. Typically, this means that a lookup by 'name' is tried first, and if
no item is found, the 'externalId' is tried. As a last resort, and if the value
is numerical, a lookup by the internal 'id' is used.

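The typical fallback order can be sketched as follows. RefItem and
ReferenceResolver are hypothetical names, and the list of accessible items
stands in for whatever query a real implementation would run.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical referenced item (e.g. a protocol or label); not a BASE class. */
final class RefItem {
    final int id;
    final String name;
    final String externalId;

    RefItem(int id, String name, String externalId) {
        this.id = id;
        this.name = name;
        this.externalId = externalId;
    }
}

final class ReferenceResolver {
    private ReferenceResolver() {
    }

    /** Tries 'name' first, then 'externalId', then the internal id for numeric values. */
    static List<RefItem> resolve(String value, List<RefItem> accessible) {
        List<RefItem> byName = new ArrayList<>();
        List<RefItem> byExternalId = new ArrayList<>();
        for (RefItem item : accessible) {
            if (value.equals(item.name)) {
                byName.add(item);
            }
            if (value.equals(item.externalId)) {
                byExternalId.add(item);
            }
        }
        if (!byName.isEmpty()) {
            return byName;
        }
        if (!byExternalId.isEmpty()) {
            return byExternalId;
        }
        List<RefItem> byId = new ArrayList<>();
        if (value.matches("\\d+")) { // last resort, numerical values only
            try {
                int id = Integer.parseInt(value);
                for (RefItem item : accessible) {
                    if (item.id == id) {
                        byId.add(item);
                    }
                }
            } catch (NumberFormatException e) {
                // the number does not fit in an int, so it cannot match an id
            }
        }
        return byId;
    }
}
```

The returned list may hold zero, one or several items, which the caller then
has to handle according to the error handling options.
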
When looking for item references the plug-in doesn't have to use the same
settings for 'owned by', 'shared to', etc. as when looking for the main items.
In fact, this is not desired since many of those items, e.g. labels, software,
hardware, etc., are owned by the root user or a system administrator. I don't
think it is practical to have another option for selecting this for each type
of item reference, so the default should be to look among all items the user
has access to (with use permission).

There are three outcomes:

* No item is found.
  - This can be an error condition that aborts the plug-in. If it is a
    required property this will always happen.
  - The link is ignored. No call is made to the setABC() method. Note that this
    case is different from having an empty column, in which case
    setABC(null) would be called.
* One item is found. This is the item we link to.
* Multiple items are found.
  - This can be an error condition that aborts the plug-in.
  - The link is ignored.


Multi-item references
---------------------

This should work in the same way as single-item references.


Using the plug-in
=================

Configuring the plug-in is done with the usual wizard. There will be plenty of
parameters so it is probably a good idea to use a multi-step wizard. This may have
to be tried out by actual users before we make any final decisions.

Step 1
------

The user selects a file and enters values for the regular expressions and other
options for parsing the file. Column mappings are also specified in this
step. The "Test with file" function should be supported. Parameters that are
needed:

* A file to parse
* Mode to use: create and/or update
* Data header: Regular expression for finding the start of data
* Data splitter: Regular expression that splits data lines into columns
* Remove quotes: Boolean option that removes "quotes" around values
* Ignore: Regular expression that matches lines to be ignored
* Data footer: Regular expression for finding the end of data
* Min/max data columns: The number of columns a data line must have, otherwise
  it is ignored
* Character set: The character set (e.g. iso-8859-1, utf-8, etc.) used in the file
* Decimal separator: Whether dot or comma is used as the decimal separator for
  numeric values
* Date format: The date format used in the file

The above parameters are the same as those found in many of the existing import
plug-ins.

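To show how some of these parameters interact, here is a minimal sketch of
splitting one data line. It uses only java.util.regex and is not the
FlatFileParser implementation. The LineSplitter class is a hypothetical name,
and the choice to test the ignore expression with find() instead of a full
match is an assumption.

```java
import java.util.regex.Pattern;

final class LineSplitter {
    private final Pattern splitter;
    private final Pattern ignore; // null if no lines should be ignored
    private final boolean removeQuotes;
    private final int minColumns;
    private final int maxColumns;

    LineSplitter(String splitterRegex, String ignoreRegex, boolean removeQuotes,
            int minColumns, int maxColumns) {
        this.splitter = Pattern.compile(splitterRegex);
        this.ignore = ignoreRegex == null ? null : Pattern.compile(ignoreRegex);
        this.removeQuotes = removeQuotes;
        this.minColumns = minColumns;
        this.maxColumns = maxColumns;
    }

    /** Returns the column values, or null if the line should be ignored. */
    String[] split(String line) {
        if (ignore != null && ignore.matcher(line).find()) {
            return null;
        }
        // The limit -1 keeps trailing empty columns.
        String[] columns = splitter.split(line, -1);
        if (columns.length < minColumns || columns.length > maxColumns) {
            return null;
        }
        if (removeQuotes) {
            for (int i = 0; i < columns.length; i++) {
                String value = columns[i];
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    columns[i] = value.substring(1, value.length() - 1);
                }
            }
        }
        return columns;
    }
}
```
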
For item identification we need an enum parameter to select which identification
method to use and boolean parameters for selecting which items to search.

Since each type of item has different properties, column mapping parameters vary
from case to case. Column mapping parameters may need to be divided into
subsections for clarity.

* For simple properties we need a single column mapping parameter.

* For single-item references we need one column mapping parameter.

All options related to the file parsing should also be available to store as a
plug-in configuration.


Step 2
------

This step is mainly about error handling options. Default values are marked with
*stars*.

* Default error handling: *fail*, skip line
* Item not found: fail, *create*, skip line
* Multiple items found: *fail*, skip line
* Referenced item not found: *fail*, ignore, skip line
* Multiple referenced items found: *fail*, ignore, skip line
* Missing a required property: *fail*, skip line
* String too long: *fail*, crop, ignore
* Invalid numeric value: *fail*, null, ignore
* Numeric value out of range: *fail*, ignore
* A log file for detailed error messages
* A boolean parameter for selecting 'dry-run'

If there are multi-item references the 'skip line' option above means that we
should skip all lines that are related to the same item.

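As an example of one of these options, the 'String too long' handling could
look roughly like the sketch below. The names are hypothetical. The sketch
assumes that 'ignore' means that the value is not set at all, which is
represented by an empty Optional.

```java
import java.util.Optional;

enum StringTooLongAction { FAIL, CROP, IGNORE }

final class StringTooLongPolicy {
    private StringTooLongPolicy() {
    }

    /**
     * Returns the value to set on the item, or an empty Optional if the
     * value should not be set. Throws IllegalArgumentException on FAIL.
     */
    static Optional<String> apply(String value, int maxLength, StringTooLongAction action) {
        if (value.length() <= maxLength) {
            return Optional.of(value);
        }
        switch (action) {
            case CROP:
                return Optional.of(value.substring(0, maxLength));
            case IGNORE:
                return Optional.empty();
            default:
                throw new IllegalArgumentException(
                    "String too long: " + value.length() + " > " + maxLength);
        }
    }
}
```
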
Implementation details
======================

We need some kind of basic, generic functionality that handles the file parsing,
property mapping, item lookup, error handling, logging, etc.

We also need functionality that is specific for each type of item the plug-in
should support. We need to know which properties exist on the items. For each
property we need to know the data type or if the property is a single-item or a
multi-item reference. We need factory methods for creating new items, etc.

If possible, it should also be relatively easy to extend the item importer with
support for other item types in the future.

One possible approach is to use an abstract base class for the common functionality.
This class defines some abstract methods that must be implemented by subclasses,
where each subclass handles a single type of item. This approach makes it relatively
easy to add support for other item types just by creating a new subclass. This
approach creates a separate plug-in for each item type, e.g. SampleImporter,
ExtractImporter, etc.


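A rough sketch of the base class approach is given below. All class and method
names are hypothetical, and the in-memory subclass only stands in for a real
item type such as samples. It shows where the common import logic and the
item-specific parts would live, nothing more.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Common import logic; the item-specific parts are left to subclasses. */
abstract class AbstractItemImporter<I> {
    private int created;
    private int updated;

    protected abstract List<I> findItems(String identifier);

    protected abstract I createItem(String identifier);

    protected abstract void setProperties(I item, Map<String, String> values);

    /** Creates or updates one item from already mapped column values. */
    final void importItem(String identifier, Map<String, String> values) {
        List<I> found = findItems(identifier);
        if (found.size() > 1) {
            throw new IllegalStateException("Multiple items found for: " + identifier);
        }
        I item;
        if (found.isEmpty()) {
            item = createItem(identifier);
            created++;
        } else {
            item = found.get(0);
            updated++;
        }
        setProperties(item, values);
    }

    int getCreated() {
        return created;
    }

    int getUpdated() {
        return updated;
    }
}

/** Hypothetical subclass that keeps its "samples" as maps in memory. */
final class InMemorySampleImporter extends AbstractItemImporter<Map<String, String>> {
    private final Map<String, Map<String, String>> samples = new LinkedHashMap<>();

    @Override
    protected List<Map<String, String>> findItems(String identifier) {
        Map<String, String> sample = samples.get(identifier);
        if (sample == null) {
            return Collections.emptyList();
        }
        return Collections.singletonList(sample);
    }

    @Override
    protected Map<String, String> createItem(String identifier) {
        Map<String, String> sample = new LinkedHashMap<>();
        sample.put("name", identifier);
        samples.put(identifier, sample);
        return sample;
    }

    @Override
    protected void setProperties(Map<String, String> item, Map<String, String> values) {
        item.putAll(values);
    }

    Map<String, String> getSample(String identifier) {
        return samples.get(identifier);
    }
}
```

The created and updated counters are the kind of summary results that would be
reported in the job status message.

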
Item specific functionality that is needed
------------------------------------------

* Which item lookup methods are supported by the item. Name and id
  will be supported by all items, some have external id, etc.

* List of properties that can be imported. For each property we must know:
  - if it is a simple value, a single-item reference or a multi-item reference
  - the data type, e.g. string, float, int, SAMPLE, PROTOCOL, etc.
  - if the property is required or not
  - name and description and other details for better user experience

* A factory method for creating new items. Some items can be created without
  any parameters (e.g. Sample.getNew()), some require one or more parameters
  (e.g. LabeledExtract.getNew(Label)).

* Find an item. This functionality is already implemented by the annotation importer,
  but it is not very flexible since it uses reflection to find the getQuery() method.
  The annotation importer only works with items where the getQuery() method doesn't
  require any parameters.

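The per-property information in the list above could be captured in a small
descriptor class. The sketch below is an assumption about how that might look.
ImportableProperty, PropertyKind and RequiredPropertyCheck are hypothetical
names, and the data type is kept as a plain string for simplicity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

enum PropertyKind { SIMPLE, SINGLE_ITEM_REFERENCE, MULTI_ITEM_REFERENCE }

/** Describes one property that can be imported for an item type. */
final class ImportableProperty {
    final String name;
    final String description;
    final PropertyKind kind;
    final String dataType;
    final boolean required;

    ImportableProperty(String name, String description, PropertyKind kind,
            String dataType, boolean required) {
        this.name = name;
        this.description = description;
        this.kind = kind;
        this.dataType = dataType;
        this.required = required;
    }
}

final class RequiredPropertyCheck {
    private RequiredPropertyCheck() {
    }

    /** Returns the names of required properties that have no value in the row. */
    static List<String> findMissing(List<ImportableProperty> properties, Map<String, String> row) {
        List<String> missing = new ArrayList<>();
        for (ImportableProperty property : properties) {
            String value = row.get(property.name);
            if (property.required && (value == null || value.isEmpty())) {
                missing.add(property.name);
            }
        }
        return missing;
    }
}
```

A non-empty result corresponds to the 'Missing a required property' error
handling option.
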
Supported item types
====================

See batchimport_userperspective.txt