Generic multi-importer for raw data
===================================

What it is
----------

A plug-in that can import data to multiple raw bioassays in one go.
The plug-in doesn't do the data import itself, but uses already
existing plug-ins that import raw data to a single raw bioassay at a
time.

The plug-in can only import data of a single type at a time. This
means that the actual plug-in/file format that performs the data
import must be the same for all raw bioassays that are imported in a
single session. We believe that this limitation is not a very severe
one, since in the typical use case there is only a limited number of
data types/file formats present in an experiment. Most experiments
probably only have one, and a dye-swap experiment two.

In the text below we will call the actual import plug-in the
"worker plug-in".

When and where to use the plug-in
---------------------------------

The importer should be an import-type plug-in that works from the
single-item view of an experiment.
E.g. GuiContext = EXPERIMENT, ITEM

It should only work for platforms that support importing raw data
into the database.
E.g. isInContext() should check:
Experiment.getRawDataType().isStoredInDb() == true.

It should also check that at least one of the raw bioassays in the
experiment doesn't have imported data already, and that there are
files attached that it is possible to use for the raw data import.

Parameter input
---------------

Step 1. The first step is to select which of the raw bioassays we
should import raw data to. The requirement is that the raw bioassays
don't already have raw data and have files attached to them. In this
step the user should also select whether the worker plug-in and/or
file format should be selected manually or by trying to auto-detect a
suitable file format.

Step 2. In manual selection mode, the user is allowed to select a
worker plug-in/file format.
In auto-detection mode, the user should confirm the result of the
auto-detection or select a worker plug-in/file format if more than
one was found.

Step 3. This invokes the job configuration sequence for the selected
worker plug-in. This is not straightforward, since the worker plug-in
most likely has parameters that it is hard to provide values for. For
example, the worker plug-in most likely requires a single raw
bioassay item and a single file item. But we have many raw bioassays,
each one with different files. Another case is the Illumina IBS
platform which may have more than one file.

We need some kind of proxy wrapper so we can "fool" the job
configuration sequence to complete. Then, when the multi-import is
executing we need to replace the wrapper parameters with the real raw
bioassay and file(s).

This parameter wrapping can be designed in a couple of different
ways. The best would be if we could let the worker plug-in tell us
which parameters to wrap and how to provide values for them. This can
be done by defining an interface that the worker plug-ins can
implement. But we also need to support existing plug-ins, and this
means that we somehow need to guess/assume a few things about them. I
can think of the following:

1. The worker plug-in must have a parameter asking for a single raw
   bioassay. The value for this parameter is set to each of the
   selected raw bioassays.

2. The worker plug-in must have at least one parameter asking for a
   file. The value for this parameter is set to the file that is
   attached to the raw bioassay that has the generic type
   FileType.RAW_DATA.

3. If the worker plug-in has more than one file parameter (for
   example the Illumina IBS platform), there must be the same number
   of files attached to the raw bioassay. In the Illumina case it
   doesn't matter which file we attach to which file parameter, and I
   don't know how we should be able to guess which file goes where if
   it does.

Step 4. When all parameters (error handling options, etc.)
for the worker plug-in have been selected, the job is queued like any
other job.

Running the plug-in
-------------------

The actual running of the multi-importer plug-in follows this
outline:

1. We loop over the selected raw bioassays.

2. An instance of the selected worker plug-in is created. The
   Plug-in API doesn't allow us to re-use a plug-in instance, so a
   new instance has to be created for each raw bioassay. There are a
   few things to be aware of:

   - The instance must be properly initialised in a way that is
     compatible with how the core initialises a plug-in.
   - Signalling must be set up if we want to support aborting the
     plug-in (and we do want that).

3. The job configuration sequence for the worker plug-in is started.
   The current raw bioassay and file are used as parameters according
   to the rules above. Other parameters, such as error handling
   options, are taken from the multi-importer job configuration.

4. The worker plug-in performs the import.

5. Steps 2 to 4 are repeated for each raw bioassay.

The result is reported as the number of raw bioassays that got
imported/failed. The multi-importer should support logging more
detailed information about the individual imports to a log file.
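
The outline above could be sketched roughly as follows. This is only
an illustration of the loop, not the actual Plug-in API: the Worker
interface, the DemoWorker class, the parameter names "rawBioAssay" and
"file", and the use of plain strings for raw bioassays and files are
all hypothetical stand-ins. Initialisation by the core and abort
signalling are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class MultiImportSketch {

    /** Hypothetical stand-in for a worker plug-in (not the real Plug-in API). */
    interface Worker {
        void configure(Map<String, String> parameters);
        void run() throws Exception;
    }

    /** Summary of one multi-import session. */
    static final class Result {
        final int imported;
        final int failed;
        final List<String> log;

        Result(int imported, int failed, List<String> log) {
            this.imported = imported;
            this.failed = failed;
            this.log = log;
        }
    }

    /** Toy worker used for the demo: fails when no file name is given. */
    static final class DemoWorker implements Worker {
        private Map<String, String> parameters;

        public void configure(Map<String, String> parameters) {
            this.parameters = parameters;
        }

        public void run() {
            String file = parameters.get("file");
            if (file == null || file.isEmpty()) {
                throw new IllegalStateException("no raw data file");
            }
        }
    }

    /**
     * Runs one worker per raw bioassay.
     * filesByBioAssay maps a raw bioassay name to its raw data file name;
     * sharedParameters holds the options chosen once for the whole job.
     */
    static Result runAll(Map<String, String> filesByBioAssay,
                         Map<String, String> sharedParameters,
                         Supplier<Worker> workerFactory) {
        int imported = 0;
        int failed = 0;
        List<String> log = new ArrayList<>();
        for (Map.Entry<String, String> entry : filesByBioAssay.entrySet()) {
            // A new worker instance for every raw bioassay (no re-use).
            Worker worker = workerFactory.get();
            // Shared options first, then the per-bioassay values that
            // replace the wrapped parameters.
            Map<String, String> parameters = new LinkedHashMap<>(sharedParameters);
            parameters.put("rawBioAssay", entry.getKey());
            parameters.put("file", entry.getValue());
            try {
                worker.configure(parameters);
                worker.run();
                imported++;
                log.add(entry.getKey() + ": imported from " + entry.getValue());
            } catch (Exception ex) {
                // One failed import does not stop the remaining ones.
                failed++;
                log.add(entry.getKey() + ": failed (" + ex.getMessage() + ")");
            }
        }
        return new Result(imported, failed, log);
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("rba-1", "scan1.txt");
        files.put("rba-2", "");
        Result result = runAll(files, new LinkedHashMap<>(), DemoWorker::new);
        System.out.println(result.imported + " imported, " + result.failed + " failed");
        result.log.forEach(System.out::println);
    }
}
```

In this sketch a failure in one worker is counted and logged, and the
loop continues with the next raw bioassay. Whether the real
multi-importer should continue or stop on the first failure is a
design decision that the outline above leaves open.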