Generic multi-importer for raw data
===================================

What it is
----------
A plug-in that can import data to multiple raw bioassays in
one go. The plug-in doesn't do the data import itself, but uses
already existing plug-ins that import raw data to a single raw
bioassay at a time.

The plug-in can only import data of a single type at a time. This
means that the actual plug-in/file format that performs the data
import must be the same for all raw bioassays that are imported
in a single session. We believe that this limitation is not a
very severe one, since in the typical use case there is only a
limited number of data types/file formats present in an experiment.
Most experiments probably only have one, and a dye-swap experiment
two.

In the text below we will call the actual import plug-in the
"worker plug-in".


When and where to use the plug-in
---------------------------------
The importer should be an import-type plug-in that works from
the single-item view of an experiment.
E.g. GuiContext = EXPERIMENT, ITEM

It should only work for platforms that support importing raw
data into the database. E.g. isInContext() should check:
Experiment.getRawDataType().isStoredInDb() == true.

It should also check that at least one of the raw bioassays in the
experiment doesn't have imported data already, and that there
are files attached that can be used for the raw data import.

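As a rough illustration, the checks described above could be combined
along the following lines. This is only a sketch: RawDataType,
RawBioAssay and Experiment are hypothetical stand-in records, not the
core classes, and returning null for "ok" and a message otherwise is an
assumption made for this example.

```java
import java.util.List;

final class ContextCheckSketch {

    // Hypothetical stand-ins; only what the check needs is modelled.
    record RawDataType(boolean storedInDb) {}

    record RawBioAssay(String name, boolean hasData, List<String> files) {
        // A candidate has no imported data yet and at least one attached file.
        boolean isImportCandidate() {
            return !hasData && !files.isEmpty();
        }
    }

    record Experiment(RawDataType rawDataType, List<RawBioAssay> rawBioAssays) {}

    // Assumed convention: null means the plug-in can be used,
    // otherwise the returned message explains why not.
    static String isInContext(Experiment experiment) {
        if (!experiment.rawDataType().storedInDb()) {
            return "Raw data for this platform is not stored in the database";
        }
        boolean anyCandidate = experiment.rawBioAssays().stream()
            .anyMatch(RawBioAssay::isImportCandidate);
        return anyCandidate
            ? null
            : "No raw bioassay without data and with attached files";
    }

    public static void main(String[] args) {
        RawBioAssay pending = new RawBioAssay("rba-1", false, List.of("rba-1.txt"));
        RawBioAssay done = new RawBioAssay("rba-2", true, List.of("rba-2.txt"));
        Experiment exp = new Experiment(new RawDataType(true), List.of(pending, done));
        String message = isInContext(exp);
        System.out.println(message == null ? "Multi-importer is available" : message);
    }
}
```

The same candidate test could be reused in step 1 of the parameter
input to build the list of selectable raw bioassays.
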

Parameter input
---------------

Step 1.
The first step is to select which raw bioassays we should import raw
data to. Only raw bioassays that don't already have raw data and that
have files attached to them can be selected.

In this step the user should also select whether the worker plug-in
and/or file format should be selected manually or by trying to
auto-detect a suitable file format.


Step 2.
In manual selection mode, the user is allowed to select a worker
plug-in/file format.

In auto-detection mode, the user should confirm the result of the
auto-detection or select a worker plug-in/file format if more than
one was found.

Step 3.
This invokes the job configuration sequence for the selected worker plug-in.
This is not straightforward, since the worker plug-in most likely has
parameters that are hard to provide values for at this point. For example,
the worker plug-in most likely requires a single raw bioassay item and a
single file item, but we have many raw bioassays, each one with different
files. Another case is the Illumina IBS platform, which may have more than
one file per raw bioassay.

We need some kind of proxy wrapper so we can "fool" the job configuration
sequence into completing. Then, when the multi-importer is executing, we need
to replace the wrapper parameters with the real raw bioassay and file(s).

This parameter wrapping can be designed in a couple of different ways.
The best would be if we could let the worker plug-in tell us which
parameters to wrap and how to provide values for them. This can be done by
defining an interface that the worker plug-ins can implement. But
we also need to support existing plug-ins, and this means that we somehow
need to guess/assume a few things about them. I can think of the following:

1. The worker plug-in must have a parameter asking for a single raw bioassay.
   The value for this parameter is set to each of the selected raw bioassays
   in turn.

2. The worker plug-in must have at least one parameter asking for a file.
   The value for this parameter is set to the file with the generic type
   FileType.RAW_DATA that is attached to the raw bioassay.

3. If the worker plug-in has more than one file parameter (for example the
   Illumina IBS platform), there must be the same number of files attached
   to the raw bioassay. In the Illumina case it doesn't matter which file we
   attach to which file parameter, and I don't know how we would be able to
   guess which file goes where if it did.

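To make the three rules a bit more concrete, here is a minimal sketch of
how the parameter values could be resolved for one raw bioassay. All
types and names (Kind, Parameter, AttachedFile, MultiImportAware,
bindByConvention) are hypothetical and invented for this example; the
rawData flag stands in for the generic type FileType.RAW_DATA. Requiring
exactly one raw bioassay parameter and assigning files in attachment
order are assumptions of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParameterBindingSketch {

    enum Kind { RAW_BIOASSAY, FILE, OTHER }

    record Parameter(String name, Kind kind) {}

    record AttachedFile(String name, boolean rawData) {}

    record RawBioAssay(String name, List<AttachedFile> files) {}

    // Hypothetical opt-in interface: a worker that implements it
    // decides itself how its parameters are bound.
    interface MultiImportAware {
        Map<String, Object> bindParameters(RawBioAssay rba);
    }

    static Map<String, Object> bind(Object worker, List<Parameter> declared, RawBioAssay rba) {
        if (worker instanceof MultiImportAware aware) {
            return aware.bindParameters(rba);
        }
        return bindByConvention(declared, rba); // existing plug-ins
    }

    // Fallback for existing plug-ins, following rules 1 to 3.
    static Map<String, Object> bindByConvention(List<Parameter> declared, RawBioAssay rba) {
        List<Parameter> itemParams = new ArrayList<>();
        List<Parameter> fileParams = new ArrayList<>();
        for (Parameter p : declared) {
            if (p.kind() == Kind.RAW_BIOASSAY) {
                itemParams.add(p);
            } else if (p.kind() == Kind.FILE) {
                fileParams.add(p);
            }
        }
        if (itemParams.size() != 1) {
            throw new IllegalArgumentException("Rule 1: expected one raw bioassay parameter");
        }
        if (fileParams.isEmpty()) {
            throw new IllegalArgumentException("Rule 2: expected at least one file parameter");
        }
        Map<String, Object> values = new LinkedHashMap<>();
        values.put(itemParams.get(0).name(), rba);
        if (fileParams.size() == 1) {
            // Rule 2: use the attached file flagged as raw data.
            AttachedFile raw = rba.files().stream()
                .filter(AttachedFile::rawData)
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException("Rule 2: no raw data file attached"));
            values.put(fileParams.get(0).name(), raw);
        } else {
            // Rule 3: same number of files as file parameters, order not significant.
            if (fileParams.size() != rba.files().size()) {
                throw new IllegalArgumentException("Rule 3: file count does not match");
            }
            for (int i = 0; i < fileParams.size(); i++) {
                values.put(fileParams.get(i).name(), rba.files().get(i));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        RawBioAssay rba = new RawBioAssay("rba-1", List.of(new AttachedFile("rba-1.txt", true)));
        List<Parameter> declared = List.of(
            new Parameter("rawBioAssay", Kind.RAW_BIOASSAY),
            new Parameter("file", Kind.FILE),
            new Parameter("onError", Kind.OTHER));
        System.out.println(bind(new Object(), declared, rba));
    }
}
```

Parameters of any other kind are left out of the returned map on
purpose, since their values come from the multi-importer job
configuration and are the same for every raw bioassay.
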
Step 4.
When all parameters (error handling options, etc.) for the worker plug-in
have been set, the job is queued like any other job.


Running the plug-in
-------------------
The actual running of the multi-importer plug-in follows this outline:

1. We loop over the selected raw bioassays.
2. An instance of the selected worker plug-in is created.
   The plug-in API doesn't allow us to re-use a plug-in
   instance, so a new instance has to be created for each
   raw bioassay. There are a few things to be aware of:
   - The instance must be properly initialised in a way that
     is compatible with how the core initialises a plug-in.
   - Signalling must be set up if we want to support aborting
     the plug-in (and we do want that).
3. The job configuration sequence for the worker plug-in is
   started. The current raw bioassay and file are used as
   parameters according to the rules above. Other parameters,
   such as error handling options, are taken from the
   multi-importer job configuration.
4. The worker plug-in performs the import.
5. Steps 2 to 4 are repeated for each raw bioassay.

The result is reported as the number of raw bioassays that got
imported/failed. The multi-importer should support logging more
detailed information about the individual imports to a log file.

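The outline above, including the imported/failed count, can be sketched
roughly as follows. This is not the plug-in API: Worker, Result and run
are hypothetical names, an AtomicBoolean stands in for the real abort
signalling, and a list of strings stands in for the log file. Continuing
with the next raw bioassay after a failure is an assumption of this
sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

final class MultiImportLoopSketch {

    // Hypothetical worker contract: one instance imports one raw bioassay.
    interface Worker {
        void importData(String rawBioAssay) throws Exception;
    }

    record Result(int imported, int failed, List<String> log) {}

    static Result run(List<String> rawBioAssays,
                      Supplier<Worker> workerFactory,
                      AtomicBoolean abort) {
        int imported = 0;
        int failed = 0;
        List<String> log = new ArrayList<>();
        for (String rba : rawBioAssays) {
            // Check the abort flag between imports.
            if (abort.get()) {
                log.add("Aborted before " + rba);
                break;
            }
            // A fresh worker instance for every raw bioassay.
            Worker worker = workerFactory.get();
            try {
                worker.importData(rba);
                imported++;
                log.add(rba + ": imported");
            } catch (Exception ex) {
                // Assumption: a failed import is counted and logged,
                // and the loop continues with the next raw bioassay.
                failed++;
                log.add(rba + ": failed: " + ex.getMessage());
            }
        }
        return new Result(imported, failed, log);
    }

    public static void main(String[] args) {
        Result result = run(
            List.of("rba-1", "rba-2"),
            () -> name -> System.out.println("Importing " + name),
            new AtomicBoolean(false));
        System.out.println(result.imported() + " imported, " + result.failed() + " failed");
    }
}
```

Passing a factory instead of a worker instance keeps the "new instance
per raw bioassay" requirement visible in the method signature.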