BASE - Core specification - Experiments and analysis

Experiments and analysis

This document covers the details of how BASE groups data into experiments and performs analysis on it.

Contents

Experiment
Bioassayset and bioassay
The intensity measure plugin
Filtering data

Last updated: $Date: 2009-04-06 14:52:39 +0200 (må, 06 apr 2009) $

1. Experiment

An experiment represents a study carried out using a set of microarrays, together with the analysis steps taken.

Raw data sets can be associated with an experiment, and may be dissociated from it at any time.

The owner of an experiment also owns all the analysis steps and other information contained in the experiment, and all access control is done on the experiment level.

An experiment has a number of channels. This is the number of intensities handled for each spot in the analysis. There is no restriction on the number of channels in the raw data sets associated with an experiment.

[Major implementation detail] It does not need to be possible to query against more than one experiment at a time, so the bulk of the data for an experiment may be stored in a set of tables created specifically for that one experiment.
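As an illustration of the per-experiment storage idea, the following is a minimal sketch using SQLite. The table name pattern, the column names and the function create_experiment_table are assumptions made for this example only; this specification does not prescribe them.

```python
import sqlite3


def create_experiment_table(conn: sqlite3.Connection, experiment_id: int, channels: int) -> str:
    """Create a spot table dedicated to one experiment and return its name.

    Hypothetical layout: one REAL column per channel (ch1, ch2, ...).
    """
    if channels < 1:
        raise ValueError("an experiment needs at least one channel")
    # int() guards the identifier, since table names cannot be bound as parameters.
    table = f"experiment_{int(experiment_id)}_spots"
    channel_cols = ", ".join(f"ch{i} REAL" for i in range(1, channels + 1))
    conn.execute(
        f"CREATE TABLE {table} ("
        "bioassay_id INTEGER NOT NULL, "
        "position INTEGER NOT NULL, "
        "reporter_id INTEGER, "
        f"{channel_cols}, "
        "PRIMARY KEY (bioassay_id, position))"
    )
    return table
```

Because every query is scoped to one experiment, a client only ever needs the table name returned for that experiment.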

2. Bioassayset and bioassay

A bioassay represents a set of measurements across a number of features, reporters, or other entities. Typically, it represents intensities measured for the spots of a raw data set.

A bioassay consists of a number of spots. Each spot has:

a position number, unique within the bioassay
a reporter
as many intensity values as the experiment has channels

In addition to this, extra values may be attached. See below for details on this.
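The spot structure listed above could be modelled roughly as follows. This is a sketch under assumptions: the class names, the use of a string to identify a reporter, and the add_spot method are hypothetical and not part of this specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Spot:
    position: int                   # unique within the bioassay
    reporter: str                   # hypothetical: reporter identified by name
    intensities: Tuple[float, ...]  # one value per experiment channel


@dataclass
class Bioassay:
    channels: int                   # channel count of the owning experiment
    spots: Dict[int, Spot] = field(default_factory=dict)

    def add_spot(self, spot: Spot) -> None:
        # Enforce the two constraints stated in the list above.
        if spot.position in self.spots:
            raise ValueError(f"position {spot.position} already used in this bioassay")
        if len(spot.intensities) != self.channels:
            raise ValueError(
                f"expected {self.channels} intensities, got {len(spot.intensities)}"
            )
        self.spots[spot.position] = spot
```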

A bioassay always exists as part of a single bioassayset.

The bioassaysets of an experiment form a forest of bipartite trees, with a transformation separating a bioassayset from its parent bioassayset.

A transformation represents a filtering of the data in a bioassayset (in which case it has a single child bioassayset), or an arbitrary transformation (in which case there may be zero or more child bioassaysets), or the extraction of intensity values from the raw data.

If a bioassayset is not at the root level (i.e., if its parent transformation is not a root), its bioassays each have a set of parents, which must be part of the bioassayset's parent bioassayset.
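The alternation between bioassaysets and transformations can be sketched as two node types that only ever link to each other. The class names, the kind labels ("extract", "filter", "arbitrary") and the helper methods are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Transformation:
    kind: str                                # hypothetical labels: "extract", "filter", "arbitrary"
    source: Optional["BioassaySet"] = None   # None for a root transformation
    children: List["BioassaySet"] = field(default_factory=list)

    def add_child(self, name: str) -> "BioassaySet":
        # A filtering has a single child bioassayset.
        if self.kind == "filter" and self.children:
            raise ValueError("a filtering transformation has a single child bioassayset")
        child = BioassaySet(name=name, parent=self)
        self.children.append(child)
        return child


@dataclass(eq=False)
class BioassaySet:
    name: str
    parent: Transformation
    transformations: List[Transformation] = field(default_factory=list)

    def transform(self, kind: str) -> Transformation:
        t = Transformation(kind=kind, source=self)
        self.transformations.append(t)
        return t

    def is_root(self) -> bool:
        return self.parent.source is None

    def parent_set(self) -> Optional["BioassaySet"]:
        return self.parent.source
```

A bioassayset never points directly at another bioassayset; the parent bioassayset is always reached through the separating transformation.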

Each bioassay points to the set of raw data sets from which its intensity values are derived.

A root bioassayset may be created from any non-empty set of raw data sets that are associated with the experiment. This creation should be handled by a plugin, as it may be a complex task. A plugin for the most common and simple case is described in the next section.

A bioassayset may be marked as containing log ratios rather than intensity values. This information is meant to be used by clients only, and may be useful when the bioassays are created as comparisons between pairs of bioassays.

Bioassays are annotatable, and should inherit annotations from their upstream biomaterials, raw data sets and array slides...

Some transformations need to merge spots. This means that in the general case positions alone will not be enough to identify the parent spot(s) of a bioassay's spots. Therefore there should be a position mapping table, where positions on a bioassay are mapped to positions on its parent bioassay(s).

Either all bioassays of a bioassayset use the mapping table, or none of them do.
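A position mapping table of this kind might look like the sketch below, where each row links one child spot to one parent spot, so a merged spot simply has several rows. The row layout and the function name build_parent_index are assumptions, not something this specification defines.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical row layout:
# (child bioassay, child position, parent bioassay, parent position)
MappingRow = Tuple[str, int, str, int]
SpotRef = Tuple[str, int]


def build_parent_index(rows: List[MappingRow]) -> Dict[SpotRef, Set[SpotRef]]:
    """Group mapping rows so the parent spots of any child spot can be looked up."""
    index: Dict[SpotRef, Set[SpotRef]] = defaultdict(set)
    for child, child_pos, parent, parent_pos in rows:
        index[(child, child_pos)].add((parent, parent_pos))
    return dict(index)


# Example: a transformation merges parent positions 1 and 2 into child position 1.
rows = [
    ("merged", 1, "original", 1),
    ("merged", 1, "original", 2),
    ("merged", 2, "original", 3),
]
index = build_parent_index(rows)
```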

There is a similar table for mapping to positions on the raw data sets. A position on a bioassay may map to multiple raw spots (it may also do this merely by being associated with multiple raw data sets).

Either all bioassays of a bioassayset use the raw mapping table, or none of them do.

A root bioassayset may use the raw mapping table. If a bioassayset uses the raw mapping table, its descendants must also do so.

If a bioassayset uses the raw mapping table, each of its bioassays may hold the id of an ancestor which had the same raw mapping as itself. This will make it possible to avoid unnecessary duplication of raw mappings, in the case that the transformation is a filtering which does not operate on the raw data.
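The ancestor reference could be resolved along the following lines. This is a sketch only: the field names, the RawRef tuple and the choice to restrict the ancestor's mapping to the positions that survive in the descendant are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple

RawRef = Tuple[str, int]  # hypothetical: (raw data set id, raw position)


@dataclass(eq=False)
class Bioassay:
    name: str
    positions: Set[int]
    raw_mapping: Optional[Dict[int, List[RawRef]]] = None  # stored locally, if any
    same_raw_mapping_as: Optional["Bioassay"] = None       # ancestor with the same mapping


def resolve_raw_mapping(bioassay: Bioassay) -> Dict[int, List[RawRef]]:
    """Follow ancestor references until a stored raw mapping is found."""
    holder = bioassay
    while holder.raw_mapping is None:
        if holder.same_raw_mapping_as is None:
            raise LookupError(f"no raw mapping stored for {bioassay.name}")
        holder = holder.same_raw_mapping_as
    # A filtering may have removed spots, so keep only positions still present.
    return {p: refs for p, refs in holder.raw_mapping.items() if p in bioassay.positions}
```

With this arrangement a filtered bioassay stores no raw mapping rows of its own; only the reference to the ancestor is kept.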

When a root bioassayset is created, the positions of its bioassays should, if possible, uniquely define features on the array designs used for the raw data sets. If a lack of LIMS information makes this impossible, the positions should at the very least uniquely define reporters. This means that if two bioassays are created from raw data sets which have different array designs, they should have non-overlapping position numbers; if there is no array design information, the positions should at least be remapped so that no two spots have the same position but different reporters. If any bioassay spot ends up with a different position than the corresponding raw spot, then the bioassayset must be stored with a raw position mapping.
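One possible way to obtain non-overlapping positions when array design information is available is to give each array design its own block of position numbers. The sketch below assumes that raw positions run from 1 to the feature count and that raw data sets sharing a design have the same feature count; the function name and the input layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical input: (raw data set id, array design id, number of features)
RawSetInfo = Tuple[str, str, int]


def assign_positions(raw_sets: List[RawSetInfo]) -> Tuple[Dict[Tuple[str, int], int], bool]:
    """Map (raw data set, raw position) to a bioassay position.

    Returns the mapping and whether a raw position mapping must be stored.
    """
    offsets: Dict[str, int] = {}
    next_offset = 0
    mapping: Dict[Tuple[str, int], int] = {}
    for raw_id, design_id, n_features in raw_sets:
        if design_id not in offsets:
            # First time this design is seen: reserve a fresh block of positions.
            offsets[design_id] = next_offset
            next_offset += n_features
        base = offsets[design_id]
        for raw_pos in range(1, n_features + 1):
            mapping[(raw_id, raw_pos)] = base + raw_pos
    # A raw mapping is needed as soon as any spot has moved.
    needs_raw_mapping = any(pos != raw_pos for (_, raw_pos), pos in mapping.items())
    return mapping, needs_raw_mapping
```

Raw data sets with the same design share positions, so a given position always denotes the same feature, while a second design starts after the last position of the first.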

Extra values may be attached to the spots of a bioassayset. A privileged user must first define data types for extra values, for example "standard deviation" or "error measure xyz". Other value types may be allowed in the future, but for now it will be enough to allow floating-point values. Each bioassayset has a list of the extra data types its spots have, and each spot must have exactly one value of each such data type (and NULL should be allowed).

To make it easier to retain extra values through the analysis steps, a bioassayset's list of extra data types may for each extra data type point to an ancestor bioassayset whose spots already have that extra data type attached. This of course requires that the spot positions have not been remapped between the two bioassaysets, and that the old bioassayset's extra values are still valid for the newer bioassayset.
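A lookup that follows such ancestor pointers might be sketched as below. The class layout and the SpotKey tuple are assumptions; in particular, the key is assumed to identify the same spot in both bioassaysets, which holds only when positions have not been remapped.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

SpotKey = Tuple[str, int]  # hypothetical: (bioassay label, position)


@dataclass(eq=False)
class BioassaySet:
    name: str
    # Extra values stored on this bioassayset: data type -> spot -> value (None models NULL).
    own_values: Dict[str, Dict[SpotKey, Optional[float]]] = field(default_factory=dict)
    # Extra data types whose values live on an ancestor bioassayset.
    inherited_from: Dict[str, "BioassaySet"] = field(default_factory=dict)

    def extra_value(self, type_name: str, spot: SpotKey) -> Optional[float]:
        if type_name in self.own_values:
            return self.own_values[type_name][spot]
        if type_name in self.inherited_from:
            return self.inherited_from[type_name].extra_value(type_name, spot)
        raise KeyError(f"{self.name} has no extra data type {type_name!r}")
```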

Values may be attached to the positions of a bioassayset. As with values attached to spots, an admin must create the data type. All positions must have an attached value, and again the list of attached data types for a bioassayset may point to ancestor bioassaysets that have the same attached values.

[Q] Do we need to duplicate the previous point for per-reporter data? Maybe we're OK now that positions map uniquely to reporters anyway?

3. The intensity measure plugin

This is described here because there's nowhere better to place it at the moment.

Extracting one intensity value per channel from the raw data set is not entirely trivial, as there may be many measures of intensity available, and different ways of doing background correction. A (properly privileged) user should be able to define intensity measures for a given raw data type and number of channels. For each channel, an intensity measure consists of a set of:

column specification
coefficient
flag: spot value or mean over raw data set

That is, an intensity measure says (for each channel) which columns should be used, how much each column should contribute to the intensity value, and whether the mean over all spots should be used instead of the value for each individual spot.
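Read as a linear combination, an intensity measure for one channel could be evaluated as in the sketch below. The linear-combination reading, the Term class and the column names "F" and "B" are assumptions made for illustration; the specification only lists the three components.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Term:
    column: str             # column specification in the raw data
    coefficient: float      # how much the column contributes
    use_mean: bool = False  # True: use the mean over the raw data set, not the spot value


def channel_intensities(rows: List[Dict[str, float]], terms: List[Term]) -> List[float]:
    """Compute one intensity per raw spot for a single channel."""
    # Precompute data-set-wide means only for the terms that ask for them.
    means = {t.column: mean(r[t.column] for r in rows) for t in terms if t.use_mean}
    return [
        sum(t.coefficient * (means[t.column] if t.use_mean else row[t.column]) for t in terms)
        for row in rows
    ]


# Example: foreground minus background, per spot and with a global mean background.
rows = [{"F": 100.0, "B": 10.0}, {"F": 200.0, "B": 30.0}]
per_spot = channel_intensities(rows, [Term("F", 1.0), Term("B", -1.0)])
global_bg = channel_intensities(rows, [Term("F", 1.0), Term("B", -1.0, use_mean=True)])
```

In this example per_spot is [90.0, 170.0], and global_bg is [80.0, 180.0] because the mean background over the raw data set is 20.0.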

The bioassays of a root bioassayset may be created from different intensity measures, and each such bioassay should hold information about what intensity measure was used to create it.

4. Filtering data

The task of the filtering system is to filter the spots of one bioassayset, producing a new bioassayset.
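In its simplest form, such a filter keeps the spots that satisfy a predicate and leaves their positions unchanged, so no position mapping is needed. The data layout and the function name filter_bioassayset below are assumptions for illustration; the filtering system itself is not specified further here.

```python
from typing import Callable, Dict, Tuple

Intensities = Tuple[float, ...]
# Hypothetical layout: bioassay name -> position -> intensities
Bioassays = Dict[str, Dict[int, Intensities]]


def filter_bioassayset(
    source: Bioassays, keep: Callable[[int, Intensities], bool]
) -> Bioassays:
    """Build a new bioassayset containing only the spots accepted by keep.

    The source bioassayset is left untouched.
    """
    return {
        name: {pos: vals for pos, vals in spots.items() if keep(pos, vals)}
        for name, spots in source.items()
    }


# Example: drop spots where any channel is below an intensity of 100.
source = {"b1": {1: (50.0, 400.0), 2: (900.0, 800.0)}}
result = filter_bioassayset(source, lambda pos, vals: min(vals) >= 100.0)
```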