An experiment represents a study carried out using a set
of microarrays, including the analysis steps taken.
Raw data sets can be associated with an experiment, and may be
dissociated from it at any time.
The owner of an experiment also owns all the analysis steps
and other information contained in the experiment, and all access
control is done on the experiment level.
An experiment has a number of channels. This is the number of
intensities handled for each spot in the analysis. There is no
restriction on the number of channels in the raw data sets
associated with an experiment.
[Major implementation detail] It does not need to be possible to
query against more than one experiment at a time, so the bulk of
the data for an experiment may be stored in a set of tables
created specifically for that one experiment.
A bioassay represents a set of measurements across a number of
features, reporters, or other entities. Typically, it represents
intensities measured for the spots of a raw data set.
A bioassay consists of a number of spots. Each spot has:
a position number, unique within the bioassay
a reporter
as many intensity values as the experiment has channels
In addition to this, extra values may be attached. See below for
details on this.
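The spot structure above can be sketched as follows. This is only an illustration, not the actual storage layout: the class names, the string reporter identifier, and the add_spot helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Spot:
    position: int             # unique within the bioassay
    reporter: str             # hypothetical reporter identifier
    intensities: List[float]  # one value per experiment channel


@dataclass
class Bioassay:
    channels: int  # taken from the owning experiment
    spots: Dict[int, Spot] = field(default_factory=dict)

    def add_spot(self, spot: Spot) -> None:
        # Enforce the two constraints stated in the text.
        if spot.position in self.spots:
            raise ValueError(f"duplicate position {spot.position}")
        if len(spot.intensities) != self.channels:
            raise ValueError("expected one intensity per channel")
        self.spots[spot.position] = spot
```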
A bioassay always exists as part of a single bioassayset.
The bioassaysets of an experiment form a forest of bipartite
trees, with a transformation separating a bioassayset
from its parent bioassayset.
A transformation represents a filtering of the data
in a bioassayset (in which case it has a single child bioassayset),
or an arbitrary transformation (in which case there may be zero
or more child bioassaysets), or the extraction of intensity values
from the raw data.
If a bioassayset is not at the root level (i.e., if its parent
transformation is not a root), its
bioassays each have a set of parents, which must be part of the
bioassayset's parent bioassayset.
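As a rough sketch, the alternating structure of bioassaysets and transformations could be modelled like this. The class names, the kind labels ("extraction", "filter", "arbitrary") and the apply_transformation helper are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Transformation:
    kind: str  # "extraction", "filter" or "arbitrary" (labels are assumed)
    parent: Optional["BioAssaySet"] = None  # None for a root transformation
    children: List["BioAssaySet"] = field(default_factory=list)


@dataclass(eq=False)
class BioAssaySet:
    name: str
    transformation: Transformation  # the transformation that produced this set
    child_transformations: List[Transformation] = field(default_factory=list)

    def is_root(self) -> bool:
        # A bioassayset is at the root level if its parent transformation
        # has no parent bioassayset.
        return self.transformation.parent is None


def apply_transformation(kind, parent, child_names):
    """Create a transformation under `parent` with the named child sets."""
    if kind == "filter" and len(child_names) != 1:
        raise ValueError("a filtering has exactly one child bioassayset")
    t = Transformation(kind, parent)
    if parent is not None:
        parent.child_transformations.append(t)
    t.children.extend(BioAssaySet(name, t) for name in child_names)
    return t
```

Bioassaysets and transformations never link to their own kind directly, which is what makes each tree bipartite.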
Each bioassay points to the set of raw data sets from which its
intensity values are derived.
A root bioassayset may be created from any non-empty set of
raw data sets that are associated with the experiment. This
creation should be handled by a plugin, as it may be a complex
task. A plugin for the most common and simple case is described
in the next section.
A bioassayset may be marked as containing log ratios rather than
intensity values. This information is meant to be used by clients
only, and may be useful when the bioassays are created as
comparisons between pairs of bioassays.
Bioassays are annotatable, and should inherit annotations from
their upstream biomaterials, raw data sets and array slides...
Some transformations need to merge spots. This means that in the
general case positions alone will not be enough to identify the
parent spot(s) of a bioassay's spots. Therefore there should be
a position mapping table, where positions on a bioassay are
mapped to positions on its parent bioassay(s).
Either all bioassays of a bioassayset use the mapping table, or
none of them do.
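A minimal sketch of how a client might look up parent spots, with and without the mapping table. The in-memory dictionary stands in for the table, and the function name and argument layout are assumptions.

```python
from typing import Dict, List, Optional, Tuple

ParentSpot = Tuple[int, int]  # (parent bioassay id, parent position)


def parent_spots(
    position: int,
    parent_ids: List[int],
    mapping: Optional[Dict[int, List[ParentSpot]]] = None,
) -> List[ParentSpot]:
    """Return the parent spots of one spot on a bioassay.

    Without a mapping table, a spot corresponds to the same position on
    each parent bioassay. With one, the table alone decides, so merged
    spots can point at several parent positions.
    """
    if mapping is None:
        return [(pid, position) for pid in parent_ids]
    return list(mapping.get(position, []))
```

For example, a transformation that merges positions 1 and 2 of parent bioassay 10 into position 1 would store `{1: [(10, 1), (10, 2)]}`.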
There is a similar table for mapping to positions on
the raw data sets. A position on a bioassay may map to multiple
raw spots (it may also do this merely by being associated with
multiple raw data sets).
Either all bioassays of a bioassayset use the raw mapping table,
or none of them do.
A root bioassayset may use the raw mapping table. If a bioassayset
uses the raw mapping table, its descendants must also do so.
If a bioassayset uses the raw mapping table, each of its bioassays
may hold the id of an ancestor which had the same raw mapping as
itself. This will make it possible to avoid unnecessary duplication
of raw mappings, in the case that the transformation is a filtering
which does not operate on the raw data.
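The sharing of raw mappings could work roughly as sketched below. The two dictionaries are hypothetical stand-ins for database tables, and following a chain of several ancestor references is an assumption; the text only requires a single reference.

```python
def resolve_raw_mapping(bioassay_id, same_mapping_as, raw_mappings):
    """Find the raw mapping that applies to a bioassay.

    same_mapping_as: bioassay id -> id of an ancestor with the same mapping
    raw_mappings:    bioassay id -> mapping actually stored for that bioassay
    """
    seen = set()
    current = bioassay_id
    # Walk towards the ancestor that actually stores the mapping.
    while current in same_mapping_as:
        if current in seen:
            raise ValueError("cycle in ancestor references")
        seen.add(current)
        current = same_mapping_as[current]
    return raw_mappings[current]
```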
When a root bioassayset is created, the positions of its bioassays should,
if possible, uniquely define features on the array designs used for
the raw data sets. If a lack of LIMS information makes this
impossible, the positions should at the very least uniquely
define reporters. This means that if two bioassays are created from
raw data sets which have different array designs, they should have
non-overlapping position numbers, but if there is no array design
information the positions should at least be remapped so that no
two spots have the same position but different reporters. If any
bioassay spot ends up with a different position than the
corresponding raw spot, then the bioassayset must be stored with
a raw position mapping.
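For the case without array design information, one possible remapping strategy is sketched here. It is an assumption, not a prescribed algorithm: the first reporter seen at a raw position keeps that position, and any other reporter found at the same raw position is moved to a fresh one. The function name and input layout are hypothetical.

```python
def assign_positions(raw_data_sets):
    """raw_data_sets: name -> list of (raw position, reporter) pairs.

    Returns (assigned, needs_raw_mapping), where assigned maps
    (name, raw position) to the bioassay position.
    """
    all_positions = [pos for spots in raw_data_sets.values() for pos, _ in spots]
    next_free = max(all_positions, default=0) + 1
    position_for = {}  # (raw position, reporter) -> bioassay position
    taken = set()      # raw positions already claimed by some reporter
    assigned = {}
    needs_raw_mapping = False
    for name, spots in raw_data_sets.items():
        for raw_pos, reporter in spots:
            key = (raw_pos, reporter)
            if key not in position_for:
                if raw_pos not in taken:
                    taken.add(raw_pos)
                    position_for[key] = raw_pos
                else:
                    # Same raw position, different reporter: move it.
                    position_for[key] = next_free
                    next_free += 1
            pos = position_for[key]
            if pos != raw_pos:
                needs_raw_mapping = True
            assigned[(name, raw_pos)] = pos
    return assigned, needs_raw_mapping
```

The second return value reflects the rule above: as soon as any spot ends up at a position other than its raw position, the bioassayset has to be stored with a raw position mapping.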
Extra values may be attached to the spots of a bioassayset. A
privileged user must first define data types for extra values,
for example "standard deviation" or "error measure xyz". Other
value types may be allowed in the future, but for now it will be
enough to allow floating-point values. Each bioassayset has a
list of the extra data types its spots have, and each spot must
have exactly one value of each such data type (and NULL should be
allowed).
To make it easier to retain extra values through the analysis
steps, a bioassayset's list of extra data types may for each
extra data type point to an ancestor bioassayset whose spots
already have that extra data type attached. This of course requires
that the spot positions have not been remapped between the two
bioassaysets, and that the old bioassayset's extra values are
still valid for the newer bioassayset.
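A small sketch of how extra values might be looked up when a bioassayset's list of extra data types refers to an ancestor. The dictionary layout and function names are assumptions; None stands in for NULL.

```python
def extra_values_for(set_id, data_type, extra_types, stored):
    """Return the position -> value dictionary for one extra data type.

    extra_types: bioassayset id -> {data type: ancestor id, or None if the
                 values are stored with the bioassayset itself}
    stored:      (bioassayset id, data type) -> {position: value or None}
    """
    source = extra_types[set_id][data_type]  # KeyError if type not attached
    owner = set_id if source is None else source
    return stored[(owner, data_type)]


def is_complete(positions, values):
    # Each spot must have exactly one value per data type; None is allowed
    # as a value, but the entry itself must exist.
    return set(values) == set(positions)
```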
Values may be attached to the positions of a bioassayset. As with
values attached to spots, an admin must create the data type. All
positions must have an attached value, and again the list of
attached data types for a bioassayset may point to ancestor
bioassaysets that have the same attached values.
[Q] Do we need to duplicate the previous point
for per-reporter data? Maybe we're OK now that positions map
uniquely to reporters anyway?
This is described here because there's nowhere better to place
it at the moment.
Extracting one intensity value per channel from the raw data set
is not entirely trivial, as there may be many measures of intensity
available, and different ways of doing background correction.
A (properly privileged) user should be able to define intensity
measures for a given raw data type and number of channels. For
each channel, an intensity measure consists of a set of:
column specification
coefficient
flag: spot value or mean over raw data set
That is, an intensity measure says (for each channel) which columns
should be used, how much each column should contribute to the intensity
value, and whether the mean over all spots should be used instead
of the value for each individual spot.
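One reading of this definition is that each channel's intensity is a weighted sum of columns. The sketch below follows that reading; the Term class, the function name, and the column names used in the example are assumptions.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Dict, List


@dataclass(frozen=True)
class Term:
    column: str         # column specification
    coefficient: float  # how much the column contributes
    use_mean: bool      # True: mean over the raw data set; False: spot value


def extract_intensities(
    rows: List[Dict[str, float]], measure: List[List[Term]]
) -> List[List[float]]:
    """Compute one intensity per channel for every raw spot.

    rows:    one dict per raw spot, mapping column name to value
    measure: one list of terms per channel
    """
    if not rows:
        return []
    # Pre-compute the column means that the measure asks for.
    means = {
        t.column: fmean(row[t.column] for row in rows)
        for terms in measure
        for t in terms
        if t.use_mean
    }
    return [
        [
            sum(
                t.coefficient * (means[t.column] if t.use_mean else row[t.column])
                for t in terms
            )
            for terms in measure
        ]
        for row in rows
    ]
```

With this layout, a simple background-corrected measure for one channel would be a foreground column with coefficient 1 and a background column with coefficient -1.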
The bioassays of a root bioassayset may be created from different
intensity measures, and each such bioassay should hold information
about what intensity measure was used to create it.