=====================
Generic item importer
=====================

This is a description of an import plug-in that can be used to import almost any 
kind of item from a tab-separated text file (or compatible). The plug-in should 
be able to create new items or update existing items. In both cases it should be 
able to set values for:

 * Simple properties. Eg. string values, numeric values, dates, etc.
 * Single-item references: Eg. protocol, label, software, etc.
 * Multi-item references: Eg. the labeled extracts of a hybridization, 
   pooled samples, etc. In some cases a multi-item reference is bundled
   with simple values. Eg. used quantity of a source biomaterial, the array 
   index a labeled extract is used on, etc. Multi-item references are never
   removed by the importer, only added or updated. Removing an item from a 
   multi-item reference is a manual procedure to be done using the web
   interface.

The importer should not be able to set values for annotations since this is handled
by the already existing annotation importer plug-in. The annotation importer
and item importer should have similar behaviour and functionality to minimize
the learning cost for users.

Key features
------------

The item importer is only expected to work on a single type of item at each use 
and should read data from a single file.

The user should be able to select if new items should be created and/or if existing
items should be updated.

The importer should be able to work in 'dry-run' mode. That is, everything is
performed as if a real import is taking place, but the work (transaction) is not
committed to the database.
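
The dry-run behaviour can be illustrated with a minimal, language-neutral sketch
in Python. It assumes an in-memory SQLite database as a stand-in for the BASE
database; the ``samples`` table and the ``run_import`` function are hypothetical
names used only for this example.

```python
import sqlite3

def run_import(conn, names, dry_run=False):
    """Insert all names in one transaction; roll back instead of
    committing when dry_run is set. Returns the number of rows handled."""
    try:
        for name in names:
            conn.execute("INSERT INTO samples(name) VALUES (?)", (name,))
    except Exception:
        conn.rollback()  # a real failure undoes everything as well
        raise
    if dry_run:
        conn.rollback()  # same work was done, but nothing is kept
    else:
        conn.commit()
    return len(names)

# Hypothetical stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples(name TEXT)")
conn.commit()

n_dry = run_import(conn, ["S1", "S2"], dry_run=True)
count_after_dry = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]

n_real = run_import(conn, ["S1", "S2"])
count_after_real = conn.execute("SELECT COUNT(*) FROM samples").fetchone()[0]
```

The summary count is the same in both modes, so the job status message can be
produced from a dry-run exactly as from a real import.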

The importer should implement user-controlled error handling. Summary results,
eg. number of items imported, number of failed items, etc. should be kept track of
and reported as the job status message. More details about the error handling
options can be found later in this document.
 
The plug-in should support logging more detailed error messages, eg. reasons for
failed items, line numbers, etc., to a separate file (in the BASE file system).
[IMPLEMENTATION NOTE] This file needs to be handled in a separate transaction,
otherwise a complete failure will erase the file as well when the transaction
is rolled back.

The plug-in should be configurable and be able to store file parsing settings, 
including column mappings, etc. as plug-in configurations.

File format
-----------

The input file should be organised into columns separated by a specified character,
eg. tab, comma, etc. Fixed-width columns are not supported. In other words, the
file must be one that can be parsed with the FlatFileParser class.

The first data line contains the column headers which define the contents of each
column. The column headers can be mapped to item properties at each use of the 
plug-in or by saving predefined settings as a plug-in configuration. This also 
includes separator character and other information that is needed to parse the 
file. Saved configurations should implement auto-detection functionality.
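
How auto-detection should work is not specified here. One possible approach,
sketched below in Python as an assumption only, is to test the header regular
expression of each saved configuration against the first lines of the file and
pick the first configuration that matches. The configuration names, the column
headers and the ``detect_configuration`` function are hypothetical.

```python
import re

# Hypothetical saved configurations; only the parts needed for detection.
CONFIGS = [
    {"name": "Sample sheet (tab)", "header": r"^Name\tExternal ID", "splitter": r"\t"},
    {"name": "Extract sheet (comma)", "header": r"^Name,Label", "splitter": r","},
]

def detect_configuration(lines, configs, max_lines=20):
    """Return the first configuration whose header pattern matches one of
    the first max_lines lines, or None if no configuration matches."""
    head = lines[:max_lines]
    for cfg in configs:
        pattern = re.compile(cfg["header"])
        if any(pattern.search(line) for line in head):
            return cfg
    return None

detected = detect_configuration(["# comment", "Name,Label,Description"], CONFIGS)
undetected = detect_configuration(["Something else"], CONFIGS)
```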

Data for a single item may be split onto multiple lines. The first line contains
simple properties, single-item references, and the first multi-item reference. 
If there are more multi-item references they should be on the following lines and
the identifier column must have exactly the same value. Data in the columns
for simple properties and single-item references is ignored on these continuation
lines. The multi-line entry ends as soon as a line with a different identifier is
found or when the end of the file is reached.
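
The multi-line rule can be sketched as grouping consecutive lines by the value
of the identifier column. The example below is a Python illustration only; the
column names (``name``, ``scans``, ``extract``) and the ``group_item_lines``
function are hypothetical.

```python
from itertools import groupby

def group_item_lines(rows, id_column):
    """Yield (identifier, first_row, all_rows) for each run of consecutive
    rows that have exactly the same value in the identifier column."""
    for ident, run in groupby(rows, key=lambda row: row[id_column]):
        run = list(run)
        yield ident, run[0], run

rows = [
    {"name": "Hyb1", "scans": "2", "extract": "LE1"},
    {"name": "Hyb1", "scans": "", "extract": "LE2"},   # continuation line
    {"name": "Hyb2", "scans": "1", "extract": "LE3"},
]

# Simple properties come from the first line only; the multi-item
# reference column is collected from every line of the group.
grouped = [
    (ident, first["scans"], [row["extract"] for row in run])
    for ident, first, run in group_item_lines(rows, "name")
]
```

Because only consecutive lines are grouped, a later line that reuses an earlier
identifier starts a new entry, which matches the rule that an entry ends as soon
as a different identifier is found.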

When reading data for an item the plug-in needs to know if it should create a new 
item or update an existing item. First, we need to know the method for identifying
items. Depending on the item type, some or all of the following options apply:

 * Using the internal 'id'. This is always unique.
 * Using the 'name'. This may or may not be unique.
 * Some items have an 'externalId'. This may or may not be unique.
 * Array slides may have a 'barcode' which is similar to the externalId.

Other items may have other properties that may be used for identification. It 
would be good to implement the item lookup part in a way that makes it easy to
add new lookup methods when the need arises.
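
One way to keep the lookup part extensible is a registry that maps a method
name to a lookup function, so a new method is added by registering one more
function. The Python sketch below is an illustration under that assumption; the
``ITEMS`` list stands in for the database and all function names are
hypothetical.

```python
# Hypothetical in-memory stand-in for the items in the database.
ITEMS = [
    {"id": 1, "name": "Sample A", "externalId": "X-1"},
    {"id": 2, "name": "Sample A", "externalId": "X-2"},
]

LOOKUP_METHODS = {}

def lookup_method(key):
    """Decorator that registers a lookup function under the given key."""
    def register(fn):
        LOOKUP_METHODS[key] = fn
        return fn
    return register

@lookup_method("id")
def by_id(value):
    return [item for item in ITEMS if str(item["id"]) == value]

@lookup_method("name")
def by_name(value):
    return [item for item in ITEMS if item["name"] == value]

@lookup_method("externalId")
def by_external_id(value):
    return [item for item in ITEMS if item["externalId"] == value]

def find_items(method, value):
    """Look up items with the method the user selected."""
    return LOOKUP_METHODS[method](value)
```

A 'barcode' lookup for array slides would then be one more decorated function,
with no change to ``find_items``.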
 
The plug-in should ask the user which method to use. The user must also tell the 
plug-in among which items it should look for an item with a given 'name' or 
'externalId'. There are four options, and the user may select one or several
of them:

 * Owned by the logged in user
 * Shared to the logged in user
 * In the current project
 * Owned by other users (only available if the logged in user has enough
   permissions, eg. generic read permission for the item type)

If the 'id' method is used, the above options are not used. In all cases, the
plug-in will only consider items for which the logged in user has write 
permission. When the plug-in is looking for an item there are three possible 
outcomes.

 * No item is found. This can be handled in different ways:
   - An error condition which aborts the plug-in
   - The line is ignored
   - A new item is created
 * One item is found. This is the item that is going to be updated.
 * More than one item is found. This can be handled in different ways:
   - An error condition which aborts the plug-in
   - The line is ignored
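
The three outcomes and their handling options can be summarised as a small
decision function. This is a Python sketch only; the enum and exception names
are hypothetical and the real plug-in may structure this differently.

```python
from enum import Enum

class OnMissing(Enum):
    FAIL = "fail"
    SKIP = "skip line"
    CREATE = "create"

class OnMultiple(Enum):
    FAIL = "fail"
    SKIP = "skip line"

class ImportAborted(Exception):
    """Raised when an error condition should abort the plug-in."""

def decide(matches, on_missing, on_multiple):
    """Return ('update', item), ('create', None) or ('skip', None)."""
    if len(matches) == 1:
        return ("update", matches[0])
    if not matches:
        if on_missing is OnMissing.CREATE:
            return ("create", None)
        if on_missing is OnMissing.SKIP:
            return ("skip", None)
        raise ImportAborted("no item found")
    if on_multiple is OnMultiple.SKIP:
        return ("skip", None)
    raise ImportAborted(f"{len(matches)} items found")
```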


Parsing the data
================

Simple properties
-----------------
Converting the string values from the file is the responsibility of some
item-specific code that knows what kind of values to expect in each data
column.

Single-item references
-----------------------

This is either the 'id', 'name', 'externalId' or another natural identifier 
of the item. The plug-in selects a "best" option for each kind of item. 
Typically, this means that a lookup by 'name' is tried first, and if
no item is found, a lookup by 'externalId' is tried. As a last resort, and only
if the value is numerical, a lookup by the internal 'id' is used.
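
The typical lookup order can be sketched as a fallback chain. The Python
example below is an illustration only; the ``LABELS`` list and the
``find_reference`` function are hypothetical, and items without an 'externalId'
would simply skip that step.

```python
# Hypothetical stand-in for the labels the user has access to.
LABELS = [
    {"id": 7, "name": "cy3", "externalId": "L-3"},
    {"id": 8, "name": "cy5", "externalId": "L-5"},
]

def find_reference(value, items):
    """Try 'name', then 'externalId', and finally the internal 'id'
    (only when the value is numerical). Returns a list of matches."""
    for key in ("name", "externalId"):
        found = [item for item in items if item.get(key) == value]
        if found:
            return found
    if value.isdecimal():
        return [item for item in items if item["id"] == int(value)]
    return []
```

The function returns a list rather than a single item so that the caller can
still distinguish the cases of no match, one match and multiple matches.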

When looking for item references the plug-in doesn't have to use the same
setting for 'owned by', 'shared to', etc as when looking for the main items. 
In fact, this is not desired since many of those items are owned by the root 
user or a system administrator. Eg. labels, software, hardware, etc. I don't 
think it is practical to have another option for selecting this for each type 
of item reference, so the default should be to look among all items the user 
has access to (with use permission).

There are three outcomes:

 * No item is found. 
   - This can be an error condition that aborts the plug-in. If it is a 
     required property this will always happen.
   - The link is ignored. No call is made to the setABC() method. Note that this
     case is different from having an empty column in which case
     setABC(null) would be called.
 * One item is found. This is the item we link to.
 * Multiple items are found.
   - This can be an error condition that aborts the plug-in.
   - The link is ignored.
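
The difference between an ignored link and an empty column can be made concrete
with a short Python sketch. The ``Extract`` class, its ``set_protocol`` method
(standing in for setABC()) and the ``apply_reference`` function are hypothetical.

```python
class Extract:
    """Hypothetical item with one single-item reference."""
    def __init__(self):
        self.protocol = "P-old"

    def set_protocol(self, protocol):
        self.protocol = protocol

def apply_reference(item, cell, lookup):
    """Apply one reference column to the item. An empty cell clears the
    link; an unresolved value leaves the current link untouched."""
    if cell == "":
        item.set_protocol(None)  # corresponds to setABC(null)
        return "cleared"
    matches = lookup(cell)
    if len(matches) == 1:
        item.set_protocol(matches[0])
        return "linked"
    return "ignored"  # no setter call at all

known = {"P-new": ["P-new"]}
lookup = lambda value: known.get(value, [])

ignored_item = Extract()
ignored_result = apply_reference(ignored_item, "P-unknown", lookup)

cleared_item = Extract()
cleared_result = apply_reference(cleared_item, "", lookup)
```

This sketch only covers the 'ignore' option; with the 'fail' option the last
branch would abort the plug-in instead.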

    
Multi-item references
---------------------

This should work in the same way as single-item references. 


Using the plug-in
=================

Configuring the plug-in is done with the usual wizard. There will be plenty of
parameters so it is probably a good idea to use a multi-step wizard. This may have
to be tried out by actual users before we make any final decisions.

Step 1
------

The user selects a file and enters values for the regular expressions and other 
options for parsing the file. Column mappings are also specified in this
step. The "Test with file" function should be supported. Parameters that are
needed:

 * A file to parse
 * Mode to use: create and/or update
 * Data header: Regular expression for finding the start of data
 * Data splitter: Regular expression that splits data lines into columns
 * Remove quotes: boolean option that removes "quotes" around values
 * Ignore: Regular expression that matches lines to be ignored
 * Data footer: Regular expression for finding the end of data
 * Min/max data columns: The number of columns a data line must have, otherwise
   it is ignored
 * Character set: The character set (eg. iso-8859-1, utf-8, etc.) used in the file
 * Decimal separator: if dot or comma is used as a decimal separator for numeric values
 * Date format: The date format used in the file

The above parameters are the same as those found in many of the existing import
plug-ins.
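
To show how the parsing parameters interact, here is a simplified Python sketch.
It is not the FlatFileParser implementation; the ``parse_lines`` function, its
parameter names and the order in which the checks are applied are assumptions
made for this example.

```python
import re

def _unquote(cell):
    """Remove one pair of surrounding double quotes, if present."""
    if len(cell) >= 2 and cell[0] == cell[-1] == '"':
        return cell[1:-1]
    return cell

def parse_lines(lines, header, splitter, ignore=None, footer=None,
                min_cols=0, max_cols=None, remove_quotes=True):
    """Return the data rows as dictionaries keyed by column header."""
    header_re, split_re = re.compile(header), re.compile(splitter)
    ignore_re = re.compile(ignore) if ignore else None
    footer_re = re.compile(footer) if footer else None
    columns, rows = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if columns is None:
            # Everything before the data header is skipped.
            if header_re.search(line):
                columns = [_unquote(c) for c in split_re.split(line)]
            continue
        if footer_re and footer_re.search(line):
            break  # end of data
        if ignore_re and ignore_re.search(line):
            continue
        cells = split_re.split(line)
        if len(cells) < min_cols or (max_cols is not None and len(cells) > max_cols):
            continue  # wrong number of columns: line is ignored
        if remove_quotes:
            cells = [_unquote(c) for c in cells]
        rows.append(dict(zip(columns, cells)))
    return rows

lines = ["# exported file", "Name\tQuantity", '"S1"\t1.5', "# comment",
         "broken", "S2\t2", "## end", "S3\t3"]
parsed = parse_lines(lines, header=r"^Name\t", splitter=r"\t",
                     ignore=r"^#", footer=r"^## end", min_cols=2)
```

The footer is tested before the ignore pattern here so that a footer line that
also matches the ignore pattern still ends the data section.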
 
For item identification we need an enum parameter to select which identification 
method to use and boolean parameters for selecting which items to search. 

Since each type of item has different properties, column mapping parameters vary
from case to case. Column mapping parameters may need to be divided into 
subsections for clarity.

 * For simple properties we need a single column mapping parameter.

 * For single-item references we need one column mapping parameter.
 
All options related to the file parsing should also be available to store as a 
plug-in configuration. 


Step 2
------

This step is mainly about error handling options. Default values are marked with
*stars*.

 * Default error handling: *fail*, skip line
 * Item not found: fail, *create*, skip line
 * Multiple items found: *fail*, skip line
 * Referenced item not found: *fail*, ignore, skip line
 * Multiple referenced items found: *fail*, ignore, skip line
 * Missing a required property: *fail*, skip line
 * String too long: *fail*, crop, ignore
 * Invalid numeric value: *fail*, null, ignore
 * Numeric value out of range: *fail*, ignore
 * A log file for detailed error messages
 * A boolean parameter for selecting 'dry-run'

If there are multi-item references the 'skip line' option above means that we 
should skip all lines that are related to the same item.
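
A few of the options above can be sketched as a settings object with the
starred values as defaults. The Python example below is an illustration only:
the class, field and function names are hypothetical, and it assumes that
'ignore' means that the property is left untouched.

```python
from dataclasses import dataclass

@dataclass
class ErrorOptions:
    """Subset of the error handling options, with the default values."""
    default: str = "fail"
    item_not_found: str = "create"
    multiple_items_found: str = "fail"
    string_too_long: str = "fail"
    invalid_numeric: str = "fail"

# Marker meaning "do not set this property" (assumed meaning of 'ignore').
SKIP_VALUE = object()

def check_string(value, max_len, opts):
    """Apply the 'String too long' option to a string value."""
    if len(value) <= max_len:
        return value
    if opts.string_too_long == "crop":
        return value[:max_len]
    if opts.string_too_long == "ignore":
        return SKIP_VALUE
    raise ValueError(f"string longer than {max_len} characters")

def parse_float(text, opts):
    """Apply the 'Invalid numeric value' option to a numeric column."""
    try:
        return float(text)
    except ValueError:
        if opts.invalid_numeric == "null":
            return None
        if opts.invalid_numeric == "ignore":
            return SKIP_VALUE
        raise

lenient = ErrorOptions(string_too_long="crop", invalid_numeric="null")
cropped = check_string("abcdef", 4, lenient)
nulled = parse_float("n/a", lenient)
```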
 
Implementation details
======================

We need some kind of basic, generic functionality that handles the file parsing,
property mapping, item lookup, error handling, logging, etc.

We also need functionality that is specific for each type of item the plug-in 
should support. We need to know which properties exist on the items. For each
property we need to know the data type or if the property is a single-item or a
multi-item reference. We need factory methods for creating new items etc.

If possible, it should also be relatively easy to extend the item importer with 
support for other item types in the future.

One possible approach is to use an abstract base class for the common functionality.
This class defines some abstract methods that must be implemented by subclasses 
where each subclass handles a single type of item. This approach makes it relatively
easy to add support for other item types just by creating a new subclass. This 
approach creates a separate plug-in for each item type. Eg. SampleImporter, 
ExtractImporter, etc. 
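
The base class approach can be sketched as follows. Python is used here only as
a compact illustration of the structure; the method names are hypothetical, the
item is a plain dictionary, and the handling of multiple matches and of errors
is left out to keep the example short.

```python
from abc import ABC, abstractmethod

class AbstractItemImporter(ABC):
    """Common functionality; subclasses supply the item-specific parts."""

    @abstractmethod
    def id_methods(self):
        """Names of the lookup methods supported by this item type."""

    @abstractmethod
    def create_item(self, row):
        """Factory method for a new item."""

    @abstractmethod
    def update_item(self, item, row):
        """Copy values from a parsed row to the item."""

    def import_rows(self, rows, find):
        """Generic loop: update the item if found, otherwise create it."""
        created = updated = 0
        for row in rows:
            matches = find(row)
            if matches:
                self.update_item(matches[0], row)
                updated += 1
            else:
                self.update_item(self.create_item(row), row)
                created += 1
        return created, updated

class SampleImporter(AbstractItemImporter):
    def __init__(self):
        self.store = []  # hypothetical stand-in for the database

    def id_methods(self):
        return ["id", "name", "externalId"]

    def create_item(self, row):
        item = {"type": "sample"}
        self.store.append(item)
        return item

    def update_item(self, item, row):
        item["name"] = row["name"]
        item["description"] = row.get("description")

importer = SampleImporter()
find = lambda row: [i for i in importer.store if i.get("name") == row["name"]]
summary = importer.import_rows(
    [{"name": "S1", "description": "first"},
     {"name": "S1", "description": "second"}],
    find,
)
```

Supporting another item type then means writing one more subclass, while the
generic loop in the base class stays unchanged.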


Item specific functionality that is needed
------------------------------------------
 
 * Which item lookup methods are supported by the item. Name and id
   will be supported by all items, some have external id, etc.

 * List of properties that can be imported. For each property we must know: 
   - if it is a simple value, a single-item reference or a multi-item reference
   - the data type, eg. string, float, int, SAMPLE, PROTOCOL, etc.
   - if the property is required or not
   - name and description and other details for better user experience
 
 * A factory method for creating new items. Some items can be created without
   any parameters (eg. Sample.getNew()), some require one or more parameters
   (eg. LabeledExtract.getNew(Label)).

 * Find an item. This functionality is already implemented by the annotation importer,
   but it is not very flexible since it uses reflection to find the getQuery() method.
   The annotation importer only works with items where the getQuery() method doesn't
   require any parameters.
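
The property information listed above could be captured in a small descriptor
type. The Python sketch below is an illustration only; the ``PropertyDef`` class
is hypothetical and the property list is an example, not the actual set of
properties of a labeled extract in BASE.

```python
from dataclasses import dataclass
from enum import Enum

class Kind(Enum):
    SIMPLE = "simple value"
    SINGLE_REF = "single-item reference"
    MULTI_REF = "multi-item reference"

@dataclass(frozen=True)
class PropertyDef:
    name: str
    kind: Kind
    data_type: str          # eg. "string", "float", "LABEL", "PROTOCOL"
    required: bool = False
    description: str = ""

# Example property list (illustrative only).
LABELED_EXTRACT_PROPERTIES = [
    PropertyDef("name", Kind.SIMPLE, "string", required=True,
                description="Name of the labeled extract"),
    PropertyDef("label", Kind.SINGLE_REF, "LABEL", required=True,
                description="Label used for the extract"),
    PropertyDef("protocol", Kind.SINGLE_REF, "PROTOCOL"),
    PropertyDef("pooled", Kind.MULTI_REF, "LABELEDEXTRACT"),
]

def missing_required(props, mapped_columns):
    """Names of required properties that have no column mapping."""
    return [p.name for p in props if p.required and p.name not in mapped_columns]

unmapped = missing_required(LABELED_EXTRACT_PROPERTIES, {"name", "protocol"})
```

With such descriptors the generic code can build the column mapping parameters
and check the 'Missing a required property' condition without knowing the item
type.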

Supported item types 
====================

See batchimport_userperspective.txt