BASE - Core specification - Hybridizations and raw data

Hybridizations and raw data

This document covers the details of how hybridizations and raw data are handled by BASE.

Contents

Hybridizations
Scans
Raw data sets
Raw data
Spot images

See also

Implementation overview

Last updated: $Date: 2009-04-06 14:52:39 +0200 (må, 06 apr 2009) $

1. Hybridizations

A hybridization is attached to a list of labeled extracts. The same labeled extract may be used several times. The position in the list does not mean anything to BASE, but may be used by plugins that are later used to create derived data from raw data.

A hybridization may be attached to an array slide.

The hybridization may be dissociated from the array slide and the labeled extracts at any time.

A hybridization protocol may be picked.

A hybridization may be annotated, but annotations on its labeled extracts are not transferred to it.

2. Scans

A scan (or image acquisition) represents the scanning of a slide.

A hybridization may have any number of scans.

A scan is associated with a scanner as well as with a scanning protocol.

Images may be attached to a scan.

An image consists of a pointer to an uploaded file together with the following information: which channel(s) it relates to, its format (TIFF or JPEG), whether it is a preview or the full image, and whether it should be used for generating spot images.

3. Raw data sets

A raw data set describes the result of applying some software to a set of images (obtained from scanning a microarray) in order to quantify the spots and identify them with features or reporters. This includes the generated spot quantifications, which we refer to as raw data.

A raw data set normally belongs to a scan, but it should also be possible to create raw data sets with no connection to a scan.

It should be possible to attach a scan-less raw data set to a scan at a later stage.

A raw data set can be associated with a software item.

The file(s) generated by the software should be attached to the raw data set. Typically this is one file, which we refer to as a raw result file. [NOTE] As it is implemented, only one file can be attached.

A raw data set may point to the array design it relates to, if any, but only if the array design has features.

The array design of a raw data set is typically that of its hybridization's array slide, but it doesn't have to be. A raw data set created without connection to a scan may still point to an array design.

4. Raw data

Because different software produces different sets of spot measurements, it should be possible to define new types of raw data.

There is a single table which is used for all raw data types, in which information common to all types is stored.

The columns common to all types of raw data are at least:

id of the raw data set
position in the raw data set (typically N for the Nth spot in a raw data file)
id of the reporter thought to occupy the spot
id of the feature which corresponds to this spot, if any. This is only allowed if the raw data set has an array design, and the features must match the spot's coordinates and reporter.
physical coordinates of the spot (possibly in pixels)
grid coordinates of the spot, including meta coordinates.
user-provided flagging (see below)
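
As an illustration only, the common columns listed here could be modelled as a single record. The sketch below is not the BASE schema: the class name, the field names, the choice of which fields are optional, and the validation helper are all hypothetical, chosen to mirror the list item by item.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawDataSpot:
    """Type-independent columns of one raw data spot (hypothetical names)."""
    raw_data_set_id: int
    position: int                       # typically N for the Nth spot in the file
    reporter_id: int                    # reporter thought to occupy the spot
    feature_id: Optional[int] = None    # only allowed with an array design
    x: Optional[float] = None           # physical coordinates, possibly pixels
    y: Optional[float] = None
    meta_row: Optional[int] = None      # grid coordinates, including
    meta_column: Optional[int] = None   # the meta coordinates
    row: Optional[int] = None
    column: Optional[int] = None
    flag_id: Optional[int] = None       # user-provided flagging, if any


def validate_feature(spot: RawDataSpot, has_array_design: bool) -> None:
    """Reject a feature reference when the raw data set has no array design."""
    if spot.feature_id is not None and not has_array_design:
        raise ValueError("feature_id requires a raw data set with an array design")
```

The helper covers only the array design rule; checking that the feature matches the spot's coordinates and reporter would need access to the feature itself and is left out.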

For each type of raw data, there is one table with type-specific data. This table is described in detail in the database. For each column, the following is recorded:

Column name and type
Whether the column holds an intensity, a standard deviation, or neither
Whether the column holds a foreground value, a background value, or neither
Whether the column holds a mean or median or neither
An optional label id, in the case that the raw data type concerns itself with labels.
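
The per-column description could be sketched as a small record with one enumerated field per question in the list. This is an assumption-laden illustration: the enum and class names, the `sql_type` field, and the example column `ch1_fg_mean` are hypothetical and do not come from BASE or from any particular quantification software.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Quantity(Enum):
    INTENSITY = "intensity"
    STANDARD_DEVIATION = "standard deviation"
    NEITHER = "neither"


class Region(Enum):
    FOREGROUND = "foreground"
    BACKGROUND = "background"
    NEITHER = "neither"


class Statistic(Enum):
    MEAN = "mean"
    MEDIAN = "median"
    NEITHER = "neither"


@dataclass(frozen=True)
class ColumnDescription:
    """Metadata recorded for one type-specific raw data column (hypothetical)."""
    name: str
    sql_type: str
    quantity: Quantity
    region: Region
    statistic: Statistic
    label_id: Optional[int] = None  # only for raw data types that use labels


# Hypothetical example: the mean foreground intensity of the first label.
example = ColumnDescription(
    name="ch1_fg_mean",
    sql_type="float",
    quantity=Quantity.INTENSITY,
    region=Region.FOREGROUND,
    statistic=Statistic.MEAN,
    label_id=1,
)
```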

In the table with raw data type specific columns, the spots should be identified by raw data set and position. We use the id of the rawdata entry.

The raw data set should store not only what type of raw data it contains, but also which of the type-specific and non-type-specific columns it uses.

Raw spots may be flagged/commented by users. There should be a table with possible comments (modifiable by some users only), and each spot may point to such a comment. This is the only property of a raw spot that may change after the raw data set is added. The raw data set should know the datetime of the last change to one of its spots.

5. Spot images

By spot images we mean small images of the individual spots of a raw data set, meant to convey information about the morphology of spots. These images are meant to be shown to users, often many at a time from an arbitrary set of spots.

It should be possible to generate spot images from a raw data set whose spots have physical coordinates specified, if it is connected to a scan which has sufficient images attached, and if it has no more than three channels. If it has more than three channels a user may select up to three images for the spot image generation.
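
The channel rule can be sketched as a small selection function. The function name, its arguments and the error messages are hypothetical. The sketch assumes that a raw data set with at most three channels uses all of them, and that a larger set requires an explicit user selection of one to three channels.

```python
MAX_CHANNELS = 3  # spot images use at most three channels


def channels_for_spot_images(channels, selected=None):
    """Return the channels to use when generating spot images.

    channels: all channel numbers of the raw data set.
    selected: the user's choice, needed only when there are more than three.
    """
    channels = list(channels)
    if len(channels) <= MAX_CHANNELS:
        return channels
    if not selected:
        raise ValueError("more than three channels: a selection is required")
    selected = list(selected)
    if len(selected) > MAX_CHANNELS:
        raise ValueError("at most three channels may be selected")
    if any(c not in channels for c in selected):
        raise ValueError("selected channel is not part of the raw data set")
    return selected
```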

By sufficient images we mean one high-resolution TIFF image per channel, possibly stored in a single file.

The user may need to enter information about how to scale and offset the physical spot coordinates to get the corresponding image coordinates. This information might be extracted from the raw result file (or from the images).
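
A minimal sketch of such a conversion follows. It assumes a simple linear mapping applied per axis, with the scale applied before the offset; the specification does not fix the order, and the function and parameter names are hypothetical.

```python
def to_image_coords(x, y, scale_x, scale_y, offset_x, offset_y):
    """Map physical spot coordinates to integer pixel coordinates.

    Assumes image = physical * scale + offset on each axis.
    """
    return (round(x * scale_x + offset_x), round(y * scale_y + offset_y))
```

For example, with a scale of 0.5 pixels per physical unit and offsets of 5 and -10 pixels, a spot at (1000, 200) maps to pixel (505, 90).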

The size of the area to cut out for each spot image needs to be given by the user.

The input images cannot be visualized without modification, as the dynamic range of the scanner far exceeds that of the user's screen, and most spots would be completely black if rescaled to 8 bits per gun. Therefore, the colors of each spot should be rescaled to use the full intensity range, with the same rescaling done on all 1-3 channels. Gamma correction may also be applied before going to 8 bpg.
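
The rescaling could look roughly like the sketch below. It assumes that "full intensity range" means stretching between the smallest and largest value found across all channels of the spot, and that gamma correction raises the normalized value to the power 1/gamma; neither detail is stated in the specification, and the function name is hypothetical.

```python
def rescale_to_8bit(channels, gamma=1.0):
    """Rescale the raw intensities of one spot to 0-255.

    channels: list of 1-3 lists of raw pixel intensities, one list per channel.
    The same minimum and maximum are used for every channel, so the relative
    brightness of the channels is preserved.
    """
    lo = min(min(c) for c in channels)
    hi = max(max(c) for c in channels)
    span = hi - lo
    result = []
    for channel in channels:
        scaled = []
        for value in channel:
            fraction = (value - lo) / span if span else 0.0
            # Gamma correction is applied before quantizing to 8 bits.
            scaled.append(round(255 * fraction ** (1.0 / gamma)))
        result.append(scaled)
    return result
```

With two channels `[[0, 200], [400, 1000]]` and no gamma correction, the shared range is 0 to 1000 and the result is `[[0, 51], [102, 255]]`.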

The spot images should be saved as JPEG or some other format with good compression. To avoid an excessive number of small space-wasting files, they should be lumped together in reasonable numbers before compression. With JPEG, this means that each spot image should be a square with side divisible by 8 pixels (to avoid interference between spots).
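
JPEG compresses an image in blocks of 8 by 8 pixels, so a tile whose side is a multiple of 8 never shares a block with its neighbor. The sketch below shows one way to round a requested spot size up to such a side and to place tiles in a combined image. The function names and the row-by-row layout are assumptions, not part of the specification.

```python
JPEG_BLOCK = 8  # JPEG compresses the image in 8x8 pixel blocks


def padded_side(requested_side, block=JPEG_BLOCK):
    """Round the requested spot image side up to a multiple of the block size."""
    if requested_side <= 0:
        raise ValueError("side must be positive")
    return -(-requested_side // block) * block  # ceiling division


def tile_origin(index, side, tiles_per_row):
    """Top-left pixel of the index-th spot image in a row-by-row tiling."""
    row, column = divmod(index, tiles_per_row)
    return (column * side, row * side)
```

A requested side of 20 pixels becomes 24, and with four tiles per row the sixth tile (index 5) starts at pixel (24, 24).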

The scales, offsets, spotsize, gamma correction and JPEG quality value should be stored in the database, along with the identities of the images used to create spot images.

It should be possible to remove and re-generate spot images. When no spot images exist for a raw data set, the parameters for generating them may be altered, but when spot images exist they may not.
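
The locking rule can be illustrated with a small settings object that refuses changes while spot images exist. The class, its methods and the parameter names used in the example are hypothetical. In BASE the parameters are stored in the database, whereas this sketch keeps them in memory.

```python
class SpotImageSettings:
    """Generation parameters that are frozen while spot images exist."""

    def __init__(self, **params):
        self._params = dict(params)
        self.images_exist = False

    def get(self, name):
        return self._params[name]

    def set(self, name, value):
        if self.images_exist:
            raise RuntimeError("remove the spot images before changing parameters")
        self._params[name] = value

    def generate(self):
        # Placeholder for the actual image generation.
        self.images_exist = True

    def remove(self):
        # Removing the images unlocks the parameters again.
        self.images_exist = False
```

A typical sequence is: set the parameters, generate, remove, adjust the parameters, and generate again.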