Store data in files

NOTE! This document is outdated and has been replaced with newer documentation. See Core API : Using files to store data

This document explains how to use BASE for storing data in original files instead of in the database. This API solves the following problems:

Contents

  1. Diagram of classes and methods
  2. Asking the user for files
  3. Link to the selected files
  4. Validate the file and extract metadata
  5. Import data into the database
  6. Pre-installed platforms
  7. Remaining problems

See also

Last updated: $Date: 2008-09-11 22:01:44 +0200 (to, 11 sep 2008) $

1. Diagram of classes and methods

2. Asking the user for files

Given that we have a FileXxxAble object (for example a RawBioAssay or ArrayDesign), we use getPlatform() to load the associated platform. This is a required property. After executing the query returned by Platform.getFileXxxTypes(), we have a list of FileXxxType objects. Each one describes a specific type of file that can be used on the given platform. For example, the Affymetrix platform has a CEL file type for raw data and a CDF file type for feature data.

In fact, we can get the list of FileXxxType objects for any type of item using the simple code below:

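DbControl dc = ...
FileXxxAble item = ...
// The platform is a required property of the item
Platform platform = item.getPlatform();
// Query for the file types that apply to this type of item
List<FileXxxType> fileTypes =
  platform.getFileXxxTypes(item.getItemType()).list(dc);
// Now, ask the user to select one file for each type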

3. Link to the selected files

When the user has selected the file(s), we must store the links to them in the database. This is done via a FileSet. A file set contains zero, one, or more files. The only limitation is that it can contain only one file of each FileXxxType. Call FileSet.addMember() to store a file in the file set. If a file already exists for the given file type, it is replaced; otherwise a new entry is created.
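
The replace-or-create behaviour described above can be sketched with a map keyed by file type. This is only an illustration: FileSetSketch is a hypothetical stand-in, not the BASE FileSet class, and it uses plain strings for file type IDs and file paths instead of real FileXxxType and File items.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the file set described above; not the BASE class.
public class FileSetSketch {
    // Maps a file type ID (e.g. "generic.raw") to the path of the linked file,
    // so there can never be more than one file per file type.
    private final Map<String, String> members = new LinkedHashMap<>();

    // Links a file to the given file type. Returns the path that was
    // replaced, or null if a new entry was created.
    public String addMember(String fileTypeId, String path) {
        return members.put(fileTypeId, path);
    }

    // Returns the path linked to the given file type, or null if none.
    public String getMember(String fileTypeId) {
        return members.get(fileTypeId);
    }

    // Number of files currently in the set.
    public int size() {
        return members.size();
    }

    public static void main(String[] args) {
        FileSetSketch set = new FileSetSketch();
        set.addMember("generic.raw", "/home/demo/scan1.txt");
        set.addMember("generic.printmap", "/home/demo/print.tam");
        // Same file type again: the first raw data file is replaced.
        set.addMember("generic.raw", "/home/demo/scan2.txt");
        System.out.println(set.size() + " files, raw = " + set.getMember("generic.raw"));
    }
}
```

Keying on the file type is what enforces the "one file of each type" limitation; no separate check is needed when a member is added.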

4. Validate the file and extract metadata

Validation and extraction of metadata are important if we want data in files to be equivalent to data in the database. Both are normally performed when a file is added to a file set.

Each FileXxxType may store the class name of a FileValidator and a MetadataReader. If so, they are used when a file is added to the file set. Note that if the same class is used for both validation and metadata reading, only one instance is created.

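FileXxxAble item = ...
FileXxxType type = ...
File file = ...

FileValidator validator = type.getValidator();
MetadataReader reader = type.getMetadataReader();

// Tell the validator which file and item it is working with
validator.setFile(file);
validator.setItem(item);
// Repeat for 'reader' if not same as 'validator'
if (reader != validator)
{
  reader.setFile(file);
  reader.setItem(item);
}
// Validate first, then extract the metadata
validator.validate();
reader.writeMetadata();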

All validators and metadata readers should extend AbstractFileHandler. The reason is that we may have to add more methods to the FileHandler interface in the future; AbstractFileHandler will then provide default implementations.
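
As a rough sketch of how these pieces could fit together, the example below defines simplified versions of the interfaces, an abstract base class with default implementations, and a factory that creates only one instance when the validator and reader classes are the same. Everything here is an assumption made for illustration: the interfaces are reduced to the methods used in the snippet above, files and items are plain strings, and CelHandler, Handlers and create() are hypothetical names that do not come from BASE.

```java
public class FileHandlerSketch {
    // Simplified stand-ins for the interfaces named in the text.
    public interface FileHandler {
        void setFile(String path);
        void setItem(String itemName);
    }
    public interface FileValidator extends FileHandler {
        void validate();
    }
    public interface MetadataReader extends FileHandler {
        void writeMetadata();
    }

    // Default implementations, so concrete handlers only add their own logic.
    public abstract static class AbstractFileHandler implements FileHandler {
        protected String path;
        protected String itemName;
        @Override public void setFile(String path) { this.path = path; }
        @Override public void setItem(String itemName) { this.itemName = itemName; }
    }

    // Hypothetical handler that is both validator and metadata reader.
    public static class CelHandler extends AbstractFileHandler
            implements FileValidator, MetadataReader {
        public String metadata;
        @Override public void validate() {
            if (path == null || !path.endsWith(".cel")) {
                throw new IllegalArgumentException("Not a CEL file: " + path);
            }
        }
        @Override public void writeMetadata() {
            metadata = itemName + " uses " + path;
        }
    }

    // The validator/reader pair created for one file type.
    public static final class Handlers {
        public final FileValidator validator;
        public final MetadataReader reader;
        Handlers(FileValidator validator, MetadataReader reader) {
            this.validator = validator;
            this.reader = reader;
        }
    }

    // Creates the handlers; a single instance is shared if both classes are the same.
    public static Handlers create(Class<? extends FileValidator> validatorClass,
                                  Class<? extends MetadataReader> readerClass) {
        try {
            FileValidator validator = validatorClass.getDeclaredConstructor().newInstance();
            MetadataReader reader = validatorClass.equals(readerClass)
                ? (MetadataReader) validator
                : readerClass.getDeclaredConstructor().newInstance();
            return new Handlers(validator, reader);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create file handler", e);
        }
    }

    public static void main(String[] args) {
        Handlers h = create(CelHandler.class, CelHandler.class);
        h.validator.setFile("scan.cel");
        h.validator.setItem("Raw bioassay 1");
        h.validator.validate();
        h.reader.writeMetadata();
        System.out.println(((CelHandler) h.reader).metadata);
    }
}
```

Because the instance is shared, the file and item only need to be set once, and state collected during validation is still available when the metadata is written.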

5. Import data into the database

TODO....

...but I think this is done by the already existing plug-ins in more or less the same manner as before. They may benefit from the already selected file(s), so it would probably be a good idea to make them aware of the FileSet so that they can offer good default values.

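// Get a file to use as the default value
File defaultFile = null;
RawBioAssay rba = ...
FileSet fileSet = rba.getFileSet();
if (fileSet != null)
{
  // Assumption: the members are FileSetMember objects; the list must be
  // typed for the getFile() call below to compile
  List<FileSetMember> members = fileSet.getMembers(DataType.RAW_DATA);
  if (!members.isEmpty())
  {
    defaultFile = members.get(0).getFile();
  }
}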

The auto detect option of the web interface should also be made aware of this.

6. Pre-installed platforms

BASE ships with a number of pre-installed platforms. It is important that the external IDs of the platforms and file types are not changed.

Platform                File types
Name        ID          Data type     Name                 ID
Generic     generic     RAW_DATA      Raw data file        generic.raw
                        FEATURE_DATA  Print map            generic.printmap
                        FEATURE_DATA  Reporter map         generic.reportermap
Affymetrix  affymetrix  RAW_DATA      Affymetrix CEL file  affymetrix.cel
                        FEATURE_DATA  Affymetrix CDF file  affymetrix.cdf

On servers that are upgraded from a previous release, existing items are assigned the generic platform unless the array design is an Affymetrix chip and the raw bioassay has the Affymetrix raw data type.
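
Client code that depends on the pre-installed external IDs could keep them in a single lookup table, which also makes it obvious what breaks if an ID is changed. The sketch below only mirrors the table in this section; PreinstalledPlatformsSketch and isKnownFileType() are hypothetical names and are not part of BASE.

```java
import java.util.List;
import java.util.Map;

// Hypothetical lookup of the pre-installed external IDs listed in this section.
public class PreinstalledPlatformsSketch {
    // Platform external ID -> external IDs of its file types.
    public static final Map<String, List<String>> FILE_TYPES_BY_PLATFORM = Map.of(
        "generic", List.of("generic.raw", "generic.printmap", "generic.reportermap"),
        "affymetrix", List.of("affymetrix.cel", "affymetrix.cdf"));

    // True if the file type is pre-installed for the given platform.
    public static boolean isKnownFileType(String platformId, String fileTypeId) {
        return FILE_TYPES_BY_PLATFORM
            .getOrDefault(platformId, List.of())
            .contains(fileTypeId);
    }

    public static void main(String[] args) {
        System.out.println(isKnownFileType("affymetrix", "affymetrix.cel"));
        System.out.println(isKnownFileType("generic", "affymetrix.cel"));
    }
}
```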

7. Remaining problems