Ticket #1028: batchimport-2.txt

File batchimport-2.txt, 12.5 KB (added by Jari Häkkinen, 16 years ago)

1	=====================
2	Generic item importer
3	=====================
4
5	This is a description of an import plug-in that can be used to import almost any
6	kind of item from a tab-separated text file (or compatible). The plug-in should
7	be able to create new items or update existing items. In both cases it should be
8	able to set values for:
9
10	* Simple properties. Eg. string values, numeric values, dates, etc.
11	* Single-item references: Eg. protocol, label, software, owner, etc.
12	* Multi-item references: Eg. the labeled extracts of a hybridization,
13	pooled samples, etc. In some cases a multi-item reference is bundled
14	with simple values. Eg. used quantity of a source biomaterial, the array
15	index a labeled extract is used on, etc. Multi-item references are never
16	removed by the importer, only added or updated. Removing an item from a
17	multi-item reference is a manual procedure to be done using the web
18	interface.
19
20	The importer should not be able to set values for annotations since this is handled
21	by the already existing annotation importer plug-in. The annotation importer
22	and item importer should have similar behaviour and functionality to minimize
23	the learning cost for users.
24
25	Key features
26	------------
27
28	The item importer is only expected to work on a single type of item at each use
29	and should read data from a single file.
30
31	The importer should be able to work in 'dry-run' mode. That is, everything is
32	performed as if a real import is taking place, but the work (transaction) is not
33	committed to the database.
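
A minimal sketch of the dry-run idea. The `Transaction` interface and the other names below are hypothetical stand-ins, not the BASE API. The point is that the import work is identical in both modes and only the final step differs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a database transaction; not the BASE API.
interface Transaction {
    void commit();
    void rollback();
}

// Records which final action was taken so the behaviour can be inspected.
class RecordingTransaction implements Transaction {
    final List<String> actions = new ArrayList<>();

    @Override
    public void commit() { actions.add("commit"); }

    @Override
    public void rollback() { actions.add("rollback"); }
}

class DryRunImporter {
    // Runs the same import work in both modes; only the last step differs.
    static int run(List<String> lines, Transaction tx, boolean dryRun) {
        int imported = 0;
        for (String line : lines) {
            if (!line.isEmpty()) {
                imported++; // placeholder for creating or updating one item
            }
        }
        if (dryRun) {
            tx.rollback(); // nothing is stored, but the summary is still produced
        } else {
            tx.commit();
        }
        return imported;
    }
}
```

The returned count can be reported in the job status message in either mode.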
34
35	The importer should implement user-controlled error handling. Summary results,
36	eg. number of items imported, number of failed items, etc. should be kept track of
37	and reported as the job status message. More details about the error handling
38	options can be found later in this document.
39
40	The plug-in should support logging more detailed error messages, eg. reasons for
41	failed items, line numbers, etc., to a separate file (in the BASE file system).
42	[IMPLEMENTATION NOTE] This file needs to be handled in a separate transaction,
43	otherwise a complete failure will erase the file as well when the transaction
44	is rolled back.
45
46	The plug-in should be configurable and be able to store file parsing settings,
47	including column mappings, etc. as plug-in configurations.
48
49	File format
50	-----------
51
52	The input file should be organised into columns separated by a specified character,
53	eg. tab, comma, etc. Fixed-width columns are not supported. In other words, the file
54	must be one that can be parsed with the FlatFileParser class.
55
56	The first data line contains the column headers, which define the contents of each
57	column. The column headers can be mapped to item properties at each use of the
58	plug-in or by saving predefined settings as a plug-in configuration. This also
59	includes separator character and other information that is needed to parse the
60	file. Saved configurations should implement auto-detection functionality.
61
62	Data for a single item may be split onto multiple lines. The first line contains
63	simple properties and single-item references, and the first multi-item reference.
64	If there are more multi-item references they should be on the following lines with
65	empty values in all other columns, except for the column holding the item
66	identifier, which must have the same value on all lines. If the following
67	lines contain other data, this should be ignored, or it may be considered an
68	error condition. It may be caused by giving two items the same name by accident.
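
The multi-line layout can be sketched as follows. This is only an illustration under stated assumptions: rows are already split into columns, `idCol` and `multiCol` are hypothetical column indexes for the item identifier and the multi-item reference, and all class names are made up for the example. Follow-up lines that carry other data are recorded so the caller can either ignore them or treat them as errors.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// All lines that belong to one item (hypothetical helper class).
class ItemLines {
    final String[] firstLine;                              // simple properties and single-item references
    final List<String> multiRefs = new ArrayList<>();      // one entry per non-empty multi-item reference
    final List<Integer> suspectRows = new ArrayList<>();   // 0-based rows with unexpected extra data

    ItemLines(String[] firstLine) {
        this.firstLine = firstLine;
    }
}

class LineGrouper {
    // Groups rows by the value in the identifier column, keeping file order.
    static Map<String, ItemLines> group(List<String[]> rows, int idCol, int multiCol) {
        Map<String, ItemLines> items = new LinkedHashMap<>();
        for (int rowNo = 0; rowNo < rows.size(); rowNo++) {
            String[] row = rows.get(rowNo);
            ItemLines item = items.get(row[idCol]);
            if (item == null) {
                item = new ItemLines(row);
                items.put(row[idCol], item);
            } else if (hasOtherData(row, idCol, multiCol)) {
                // Possibly two items given the same name by accident.
                item.suspectRows.add(rowNo);
            }
            if (!row[multiCol].isEmpty()) {
                item.multiRefs.add(row[multiCol]);
            }
        }
        return items;
    }

    // True if any column other than the identifier and multi-item column has a value.
    static boolean hasOtherData(String[] row, int idCol, int multiCol) {
        for (int i = 0; i < row.length; i++) {
            if (i != idCol && i != multiCol && !row[i].isEmpty()) {
                return true;
            }
        }
        return false;
    }
}
```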
69
70	When reading data for an item the plug-in needs to know if it should create a new
71	item or update an existing item. First, we need to know the method for identifying
72	items. Depending on the item type there are two or three options:
73
74	* Using the internal 'id'. This is always unique.
75	* Using the 'name'. This may or may not be unique.
76	* Some items have an 'externalId'. This may or may not be unique.
77
78	The plug-in should ask the user which method to use. The user must also tell the
79	plug-in among which items it should look for an item with a given 'name' or
80	'externalId'. There are four options, and the user may select one or several
81	of them:
82
83	* Owned by the logged in user
84	* Shared to the logged in user
85	* In the current project
86	* Owned by other users (only available if the logged in user has enough
87	permissions, eg. generic read permission for the item type)
88
89	If the 'id' method is used, the above options are not used. When the plug-in
90	is looking for an item there are three possible outcomes.
91
92	* No item is found. This can be handled in different ways:
93	  - An error condition which aborts the plug-in
94	  - The line is ignored
95	  - A new item is created
96	* One item is found. This is the item that is going to be updated.
97	* More than one item is found. This can be handled in different ways:
98	  - An error condition which aborts the plug-in
99	  - The line is ignored
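
The three outcomes and their handling options can be written as a small decision function. This is a sketch only; the enum and class names are hypothetical and the actual query against the database is left out.

```java
import java.util.List;

// User-selected handling when no item matches (hypothetical names).
enum NotFoundAction { FAIL, SKIP_LINE, CREATE }

// User-selected handling when more than one item matches.
enum MultipleFoundAction { FAIL, SKIP_LINE }

// What the importer should do with the current line.
enum Decision { ABORT, SKIP_LINE, CREATE_NEW, UPDATE_EXISTING }

class LookupPolicy {
    // Maps the number of matching items and the user's options to a decision.
    static Decision decide(List<?> matches, NotFoundAction notFound, MultipleFoundAction multiple) {
        if (matches.isEmpty()) {
            switch (notFound) {
                case CREATE:
                    return Decision.CREATE_NEW;
                case SKIP_LINE:
                    return Decision.SKIP_LINE;
                default:
                    return Decision.ABORT;
            }
        }
        if (matches.size() == 1) {
            return Decision.UPDATE_EXISTING;
        }
        return multiple == MultipleFoundAction.SKIP_LINE ? Decision.SKIP_LINE : Decision.ABORT;
    }
}
```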
100
101
102	Parsing the data.
103	=================
104
105	Simple properties
106	-----------------
107	We need to know the data type of each property as a Type object.
108	The string values can then be parsed with Type.parseString().
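
As an illustration of type-driven parsing, here is a sketch with a hypothetical `SimpleType` enum that stands in for the Type object. It is not the BASE class. It assumes dates are in ISO format (yyyy-MM-dd), that a dot is the decimal separator, and that an empty column maps to null.

```java
import java.time.LocalDate;

// Hypothetical stand-in for the Type object mentioned above; not the BASE class.
enum SimpleType {
    STRING {
        @Override
        Object parse(String s) { return s; }
    },
    INT {
        @Override
        Object parse(String s) { return Integer.valueOf(s.trim()); }
    },
    FLOAT {
        @Override
        Object parse(String s) { return Float.valueOf(s.trim()); }
    },
    DATE {
        @Override
        Object parse(String s) { return LocalDate.parse(s.trim()); }
    };

    // Converts a non-empty column value to an object of this type.
    abstract Object parse(String s);

    // Assumption: an empty column means "no value" and is returned as null.
    Object parseOrNull(String s) {
        return (s == null || s.trim().isEmpty()) ? null : parse(s);
    }
}
```

An invalid number makes `parse` throw a NumberFormatException, which is where the 'Invalid numeric value' error handling option would apply.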
109
110	Single-item references
111	-----------------------
112
113	This is either the 'id', 'name' or 'externalId' of the item. The plug-in should
114	support those cases by a single column mapping and an option to select which
115	method to use.
116
117	[NOTE]
118	This creates two input parameters for each column, which may be too many...
119	Alternative options are:
120
121	* A global option for all item references. This doesn't give the user
122	any chance to use different methods for different items.
123	* A global option with an 'auto' alternative that uses the 'id' method
124	for numeric values, otherwise first the 'name' and if no item is
125	found the 'externalId'.
126
127	[JH: I think one of the alternatives would be better than a plethora
128	of parameters ... I like the auto idea but will it create tough
129	conditions on names used on items?]
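
To make the 'auto' alternative concrete, here is a sketch under stated assumptions: `RefItem` and `AutoLookup` are hypothetical names, the candidate list stands in for a database query, and values of up to nine digits are treated as ids so that they fit in an int.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical item with the three identifying properties.
class RefItem {
    final int id;
    final String name;
    final String externalId;

    RefItem(int id, String name, String externalId) {
        this.id = id;
        this.name = name;
        this.externalId = externalId;
    }
}

class AutoLookup {
    // 'auto' method: id for numeric values, otherwise name, then externalId.
    static List<RefItem> find(List<RefItem> candidates, String value) {
        List<RefItem> found = new ArrayList<>();
        if (value.matches("\\d{1,9}")) {
            int id = Integer.parseInt(value);
            for (RefItem c : candidates) {
                if (c.id == id) {
                    found.add(c);
                }
            }
            return found;
        }
        for (RefItem c : candidates) {
            if (value.equals(c.name)) {
                found.add(c);
            }
        }
        if (found.isEmpty()) {
            for (RefItem c : candidates) {
                if (value.equals(c.externalId)) {
                    found.add(c);
                }
            }
        }
        return found;
    }
}
```

With this rule a value such as `2` is always treated as an id, so an item whose name consists only of digits can never be referenced by name. That is one condition on item names that the 'auto' alternative would introduce.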
130
131	------
132
133	When looking for item references the plug-in doesn't have to use the same
134	setting for 'owned by', 'shared to', etc as when looking for the main items.
135	In fact, this is not desired since many of those items are owned by the root
136	user or a system administrator. Eg. labels, software, hardware, etc. I don't
137	think it is practical to have another option for selecting this for each type
138	of item reference.
139	Maybe the code can be prepared for it, but the default should be to
140	look among all items the user has access to.
141
142	[JH: I think we should settle for all items the user has access to and
143	ignore the setting of 'owned by', 'shared to', etc. As I understand it,
144	the filtering is only used to find items to be updated
145	(or created).]
146
147	In any case, there are three outcomes:
148
149	* No item is found.
150	  - This can be an error condition that aborts the plug-in. If it is a
151	    required property this will always happen.
152	  - The link is ignored. No call is made to the setABC() method. Note that this
153	    case is different from having an empty column, in which case
154	    setABC(null) would be called.
155	* One item is found. This is the item we link to.
156	* Multiple items are found.
157	  - This can be an error condition that aborts the plug-in.
158	  - The link is ignored.
159
160
161	Multi-item references
162	---------------------
163
164	This should work in the same way as single-item references.
165
166
167	Using the plug-in
168	=================
169
170	Configuring the plug-in is done with the usual wizard. There will be plenty of
171	parameters so it is probably a good idea to use a multi-step wizard. This may have
172	to be tried out by actual users before we make any final decisions.
173
174	Step 1
175	------
176
177	The user selects a file and enters values for the regular expressions and other
178	options for parsing the file. Column mappings are also specified in this
179	step. The "Test with file" function should be supported. Parameters that are
180	needed:
181
182	* A file to parse
183	* Data header: Regular expression for finding the start of data
184	* Data splitter: Regular expression that splits data lines into columns
185	* Remove quotes: boolean option that removes "quotes" around values
186	* Ignore: Regular expression that matches lines to be ignored
187	* Data footer: Regular expression for finding the end of data
188	* Min/max data columns: The number of columns a data line must have, otherwise
189	it is ignored
190	* Character set: The character set (eg. iso-8859-1, utf-8, etc.) used in the file
191	* Decimal separator: whether dot or comma is used as the decimal separator for numeric values
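
How the parsing parameters above interact can be sketched with java.util.regex. This is not the FlatFileParser class. `SimpleFlatFileReader` and its example patterns are hypothetical, and the sketch assumes that the line matching the data header is the column header line, with data starting on the next line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical reader that applies the parameters listed above; not the BASE parser.
class SimpleFlatFileReader {
    final Pattern dataHeader = Pattern.compile("Name\\tDescription.*"); // start of data
    final Pattern dataSplitter = Pattern.compile("\\t");                // column separator
    final Pattern ignore = Pattern.compile("#.*");                      // lines to skip
    final Pattern dataFooter = Pattern.compile("END");                  // end of data
    final int minColumns = 2;
    final int maxColumns = 3;
    final boolean removeQuotes = true;

    List<String[]> parse(List<String> lines) {
        List<String[]> data = new ArrayList<>();
        boolean inData = false;
        for (String line : lines) {
            if (!inData) {
                // Everything up to and including the header line is not data.
                inData = dataHeader.matcher(line).matches();
                continue;
            }
            if (dataFooter.matcher(line).matches()) {
                break;
            }
            if (ignore.matcher(line).matches()) {
                continue;
            }
            // Limit -1 keeps trailing empty columns.
            String[] cols = dataSplitter.split(line, -1);
            if (cols.length < minColumns || cols.length > maxColumns) {
                continue; // wrong number of columns: the line is ignored
            }
            if (removeQuotes) {
                for (int i = 0; i < cols.length; i++) {
                    cols[i] = stripQuotes(cols[i]);
                }
            }
            data.add(cols);
        }
        return data;
    }

    // Removes one pair of surrounding double quotes, if present.
    static String stripQuotes(String s) {
        if (s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"")) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }
}
```

Character set and decimal separator are left out here; they would apply when the file is opened and when numeric columns are converted.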
192
193	The above parameters are the same as those found in many of the existing import
194	plug-ins.
195
196	Since each type of item has different properties, column mapping parameters vary
197	from case to case. Column mapping parameters may need to be divided into
198	subsections for clarity.
199
200	For the ID property we need one column mapping parameter, one enum parameter
201	to select which identification method to use and one enum parameter for selecting
202	which items to search (multi-choice).
203
204	For simple properties we need a single column mapping parameter.
205
206	For single-item references we need one column mapping parameter and one enum
207	parameter to select which identification method to use.
208
209	All options (except the file to parse) in this step should also be available
210	to store as a plug-in configuration. But in this case we need a 'step 0' which
211	asks us about which type of item the configuration is to be used with. Otherwise
212	we don't know which properties, etc. to provide column mappings for.
213
214	[JH: Isn't it possible to deduce the item type from the context in the
215	GUI? I mean, the user selects to use the batchimporter from some
216	context and then maybe this information could be passed on to the core.]
217
218	Step 2
219	------
220
221	This step is mainly about error handling options. Default values are marked with
222	stars.
223
224	* Default error handling: fail, skip line
225	* Item not found: fail, create, skip line
226	* Multiple items found: fail, skip line
227	* Referenced item not found: fail, ignore, skip line
228	* Multiple referenced items found: fail, ignore, skip line
229	* Missing a required property: fail, ignore if updating, skip line
230	* String too long: fail, crop, ignore
231	* Invalid numeric value: fail, ignore
232	* Numeric value out of range: fail, ignore
233	* A log file for detailed error messages
234
235	If there are multi-item references the 'skip line' option above means that we
236	should skip all lines that are related to the same item.
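
One way to hold the error handling options is a map from error type to action with the default error handling as a fallback. The sketch below makes that assumption explicit. The enum and class names are hypothetical, only some of the error types are listed, and it does not check that an action is allowed for a given error type.

```java
import java.util.EnumMap;
import java.util.Map;

// Some of the error conditions listed above (hypothetical names).
enum ImportError { ITEM_NOT_FOUND, MULTIPLE_ITEMS_FOUND, REFERENCE_NOT_FOUND, STRING_TOO_LONG, INVALID_NUMBER }

// The union of the actions listed above.
enum ErrorAction { FAIL, SKIP_LINE, IGNORE, CREATE, CROP }

class ErrorOptions {
    private final ErrorAction defaultAction;
    private final Map<ImportError, ErrorAction> specific = new EnumMap<>(ImportError.class);

    ErrorOptions(ErrorAction defaultAction) {
        this.defaultAction = defaultAction;
    }

    // Sets the action for one error type; returns this to allow chaining.
    ErrorOptions set(ImportError error, ErrorAction action) {
        specific.put(error, action);
        return this;
    }

    // Assumption: the default error handling applies when no specific option is set.
    ErrorAction actionFor(ImportError error) {
        return specific.getOrDefault(error, defaultAction);
    }
}
```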
237
238	Implementation details
239	======================
240
241	We need some kind of basic, generic functionality that handles the file parsing,
242	property mapping, item lookup, error handling, logging, etc.
243
244	We also need functionality that is specific for each type of item the plug-in
245	should support. We need to know which properties exist on the items. For each
246	property we need to know the data type or if the property is a single-item or
247	multi-item reference. We need factory methods for creating new items, etc.
248
249	If possible, it should also be relatively easy to extend the item importer with
250	support for other item types in the future.
251
252	One possible approach is to use an abstract base class for the common functionality.
253	This class defines some abstract methods that must be implemented by subclasses
254	where each subclass handles a single type of item. This approach makes it relatively
255	easy to add support for other item types just by creating a new subclass. This
256	approach creates a separate plug-in for each item type. Eg. SampleImporter,
257	ExtractImporter, etc.
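
A minimal sketch of the abstract base class approach. All names are hypothetical and `DemoSample` is not the BASE Sample class. The in-memory map stands in for the database, and error handling is left out. The base class owns the loop over rows, while each subclass supplies lookup, creation and property setting for one item type.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Common functionality shared by all item importers (hypothetical sketch).
abstract class AbstractItemImporter<I> {
    // Returns the existing item, or null when no item exists.
    protected abstract I findItem(String identifier);

    // Factory method for a new item of the subclass's type.
    protected abstract I createItem(String identifier);

    // Copies the mapped column values onto the item.
    protected abstract void setProperties(I item, Map<String, String> row);

    // Creates or updates one item per row and returns the number of rows handled.
    final int importRows(List<Map<String, String>> rows, String idColumn) {
        int count = 0;
        for (Map<String, String> row : rows) {
            String identifier = row.get(idColumn);
            I item = findItem(identifier);
            if (item == null) {
                item = createItem(identifier);
            }
            setProperties(item, row);
            count++;
        }
        return count;
    }
}

// Hypothetical item type used only for this sketch.
class DemoSample {
    String name;
    String description;
}

// One subclass per item type; the map stands in for the database.
class DemoSampleImporter extends AbstractItemImporter<DemoSample> {
    final Map<String, DemoSample> store = new HashMap<>();

    @Override
    protected DemoSample findItem(String identifier) {
        return store.get(identifier);
    }

    @Override
    protected DemoSample createItem(String identifier) {
        DemoSample sample = new DemoSample();
        sample.name = identifier;
        store.put(identifier, sample);
        return sample;
    }

    @Override
    protected void setProperties(DemoSample sample, Map<String, String> row) {
        sample.description = row.get("Description");
    }
}
```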
258
259
260	Item specific functionality that is needed
261	------------------------------------------
262
263	* List of properties that can be imported. For each property we must know:
264	  - if it is a simple value, a single-item reference or a multi-item reference
265	  - the data type, eg. string, float, int, SAMPLE, PROTOCOL, etc.
266	  - if the property is required or not
267	  - name and description and other details for better user experience
268	  This may be implemented as a 'Property' interface, with concrete implementations
269	  for each type. The implementation should also know how to set a value on items.
270	  Eg. NameProperty.setValue(item, theName) --> item.setName(theName). Note! Some
271	  properties may require multiple parameters. Eg. setting the source item and
272	  used quantity: BioMaterialEvent.addSource(source, usedQuantity).
273
274	* A factory method for creating new items. Some items can be created without
275	  any parameters (eg. Sample.getNew()), some require one or more parameters
276	  (eg. LabeledExtract.getNew(Label)).
277
278	* Find an item. This functionality is already implemented by the annotation importer,
279	  but it is not very flexible since it uses reflection to find the getQuery() method.
280	  The annotation importer only works with items where the getQuery() method doesn't
281	  require any parameters.
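
The 'Property' interface from the first point above might look like the sketch below. The method names, `DemoItem` and `PropertyApplier` are assumptions made for the illustration; only the NameProperty.setValue(item, theName) --> item.setName(theName) idea comes from the description. Properties that need several parameters, such as a source item with a used quantity, are not covered.

```java
// Hypothetical item class used only for this sketch.
class DemoItem {
    private String name;

    void setName(String name) { this.name = name; }

    String getName() { return name; }
}

// One importable property of items of type I, with values of type V.
interface Property<I, V> {
    String title();                   // shown to the user in the column mapping step
    boolean isRequired();
    V parse(String text);             // converts the column text to a value
    void setValue(I item, V value);   // knows which setter to call on the item
}

class NameProperty implements Property<DemoItem, String> {
    @Override
    public String title() { return "Name"; }

    @Override
    public boolean isRequired() { return true; }

    @Override
    public String parse(String text) { return text.trim(); }

    @Override
    public void setValue(DemoItem item, String value) { item.setName(value); }
}

class PropertyApplier {
    // Parses the column text and sets it on the item in one type-safe step.
    static <I, V> void apply(Property<I, V> property, I item, String text) {
        property.setValue(item, property.parse(text));
    }
}
```

Keeping parsing and setting on the same interface means the generic importer code never needs to know the value type of a property.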
282
283	Supported item types
284	====================
285
286	TO BE DONE.
