29.3.1. Using files to store data

BASE 2.5 introduced the option of storing data in files instead of importing it into the database. Files can be attached to any item that implements the FileStoreEnabled interface. Currently these are RawBioAssay and ArrayDesign. The ability to store data in files is not a replacement for storing data in the database. It is possible (for some platforms/raw data types) to have data in files and in the database at the same time. We would have liked to enforce that (raw) data is always present in files, but this would not be backwards compatible with older installations, so there are three cases: data in files only, data in the database only, and data in both files and the database.

Not all three cases are supported for all types of data. This is controlled by the Platform class, which may disallow storing data in the database. To check this, call Platform.isFileOnly() and/or Platform.getRawDataType(). If isFileOnly() returns true, the platform can't store data in the database. If it returns false, more information can be obtained by calling getRawDataType(), which may return:

One major change from earlier BASE versions is the way raw data types are registered. The raw-data-types.xml file should only be used for raw data types that are stored in the database. The storage tag has been deprecated and BASE will refuse to start if it finds a raw data type definition with storage="file".

For backwards compatibility reasons, each Platform that can only store data in files will create "virtual" raw data type objects internally. These raw data types all return false from the RawDataType.isStoredInDb() method. They also have a back-link to the platform/variant that created them: RawDataType.getPlatform() and RawDataType.getVariant(). These two methods will always return null when called on a raw data type that can be stored in the database.

Diagram of classes and methods

Figure 29.21. Store data in files

This is a rather large set of classes and methods. The ultimate goal is to be able to create links between a RawBioAssay / ArrayDesign and File items and to provide some metadata about the files. The FileStoreUtil class is one of the most important ones. It is intended to make it easy for plug-in (and other) developers to access the files without having to deal with platform or file type objects. The API is best described by a set of use-case examples.

Use case: Asking the user for files for a given item

A client application must know what types of files it makes sense to ask the user for. In some cases, data may be split into more than one file so we need a generic way to select files.

Given a FileStoreEnabled item, we want to find out which DataFileType items can be used for that item. The DataFileType.getQuery(FileStoreEnabled) method can be used for this. Internally, the method uses the results of the FileStoreEnabled.getPlatform() and FileStoreEnabled.getVariant() methods to restrict the query so that it only returns file types for the given platform and/or variant. If the item doesn't have a platform or variant, the query will return all file types that are associated with the given item type. In any case, we get a list of DataFileType items, each one representing a specific file type that we should ask the user about. Examples:

  1. The Affymetrix platform defines CEL as a raw data file and CDF as an array design (reporter map) file. If we have a RawBioAssay the query will only return the CEL file type and the client can ask the user for a CEL file.

  2. The Generic platform defines PRINT_MAP and REPORTER_MAP for array designs. If we have an ArrayDesign the query will return those two items.
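The selection rule behind these two examples can be sketched with a small in-memory model. This is only an illustration under stated assumptions: FileTypeInfo, FileTypeQuerySketch and typesFor() are hypothetical stand-ins that are not part of the BASE API, and the registrations are simply the ones named in the examples. In BASE itself the lookup is done by DataFileType.getQuery(FileStoreEnabled).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a file type registered for an item type and a platform
class FileTypeInfo
{
   final String name;
   final String itemType;
   final String platform;

   FileTypeInfo(String name, String itemType, String platform)
   {
      this.name = name;
      this.itemType = itemType;
      this.platform = platform;
   }
}

class FileTypeQuerySketch
{
   // Registrations taken from the two examples in the text
   static final List<FileTypeInfo> REGISTERED = Arrays.asList(
      new FileTypeInfo("CEL", "RawBioAssay", "Affymetrix"),
      new FileTypeInfo("CDF", "ArrayDesign", "Affymetrix"),
      new FileTypeInfo("PRINT_MAP", "ArrayDesign", "Generic"),
      new FileTypeInfo("REPORTER_MAP", "ArrayDesign", "Generic")
   );

   // Names of the file types to ask the user about. A null platform models
   // an item without a platform: all types for the item type are returned.
   static List<String> typesFor(String itemType, String platform)
   {
      List<String> result = new ArrayList<>();
      for (FileTypeInfo type : REGISTERED)
      {
         boolean sameItemType = type.itemType.equals(itemType);
         boolean samePlatform = platform == null || type.platform.equals(platform);
         if (sameItemType && samePlatform) result.add(type.name);
      }
      return result;
   }

   public static void main(String[] args)
   {
      System.out.println(typesFor("RawBioAssay", "Affymetrix"));
      System.out.println(typesFor("ArrayDesign", "Generic"));
      System.out.println(typesFor("ArrayDesign", null));
   }
}
```

The sketch ignores platform variants; a fuller model would add the variant as a third filter in the same way.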

It might also be interesting to know the currently selected file for each file type and if the platform has set the required flag for a particular file type. Here is a simple code example that may be useful to start from:

DbControl dc = ...
FileStoreEnabled item = ...
Platform platform = item.getPlatform();
PlatformVariant variant = item.getVariant();

// Get list of DataFileTypes used by the platform
ItemQuery<DataFileType> query =
   DataFileType.getQuery(item);
List<DataFileType> types = query.list(dc);

// Always check the hasFileSet() method first to avoid
// creating the file set if it doesn't exist
FileSet fileSet = item.hasFileSet() ?
   item.getFileSet() : null;
   
for (DataFileType type : types)
{
   // Get the current file, if any
   FileSetMember member = fileSet == null || !fileSet.hasMember(type) ?
      null : fileSet.getMember(type);
   File current = member == null ? 
      null : member.getFile();
   
   // Check if a file is required by the platform
   PlatformFileType pft = platform == null ? 
      null : platform.getFileType(type, variant);
   boolean isRequired = pft == null ? 
      false : pft.isRequired();
      
   // Now we can do something with this information to
   // let the user select a file ...
}
[Note] Also remember to catch PermissionDeniedException

The above code may look complicated, but this is mostly because of all the checks for null values. Remember that many things are optional and may return null. Another thing to look out for is PermissionDeniedException. The logged-in user may not have access to all items. The above example doesn't include any code for this since it would have made it too complex.

Use case: Link, validate and extract metadata from the selected files

When the user has selected the file(s) we must store the links to them in the database. This is done with a FileSet object. A file set can contain any number of files. The only limitation is that it can only contain one file for each file type. Call FileSet.setMember() to store a file in the file set. If a file already exists for the given file type it is replaced, otherwise a new entry is created. The following program example assumes that we have a map that relates each DataFileType to a File. When all files have been added we call FileSet.validate() to validate the files and extract metadata.

DbControl dc = ...
FileStoreEnabled item = ...
Map<DataFileType, File> files = ...

// Store the selected files in the fileset
FileSet fileSet = item.getFileSet();
for (Map.Entry<DataFileType, File> entry : files.entrySet())
{
   DataFileType type = entry.getKey();
   File file = entry.getValue();
   fileSet.setMember(type, file);
}

// Validate the files and extract metadata
fileSet.validate(dc, true);
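The "one file for each file type" rule can be illustrated with a plain map. The following is a minimal sketch and not the BASE implementation: FileSetSketch is a hypothetical stand-in in which file types and files are represented by strings, and the return value of its setMember() is a choice made for this sketch only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a file set; file types and files are plain strings
class FileSetSketch
{
   private final Map<String, String> members = new LinkedHashMap<>();

   // Store a file for the given file type. Returns the file that was
   // replaced, or null if the file type had no file before.
   String setMember(String fileType, String file)
   {
      return members.put(fileType, file);
   }

   String getMember(String fileType)
   {
      return members.get(fileType);
   }

   int size()
   {
      return members.size();
   }

   public static void main(String[] args)
   {
      FileSetSketch fileSet = new FileSetSketch();
      fileSet.setMember("CEL", "sample1.cel");
      // Same file type again: the first file is replaced, not added to
      fileSet.setMember("CEL", "sample1-rescan.cel");
      System.out.println(fileSet.size() + " member: " + fileSet.getMember("CEL"));
   }
}
```

Keying the map on the file type is what makes a second call for the same type a replacement rather than an addition.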

Validation and extraction of metadata is important since we want data in files to be equivalent to data in the database. The validation and metadata extraction is done by the core when FileSet.validate() is called. The process is partly pluggable since each DataFileType can name a class that should do the validation and/or metadata extraction.

[Note] Note

FileSet.validate() only validates files whose file types have specified plug-ins that can do the validation and metadata extraction. The method doesn't throw any exceptions. Instead, all validation errors are returned as a list of Throwable objects. The validation result is also stored for each file and can be accessed with FileSetMember.isValid() and FileSetMember.getErrorMessage().

Here is the general outline of what is going on in the core:

  1. The core checks the DataFileType of all members in the file set and creates DataFileValidator and DataFileMetadataReader objects. Only one instance of each class is created. If the file set contains members that have the same validator or metadata reader, they will all share the same instance.

  2. Each validator/metadata reader class is initialised with calls to DataFileHandler.setItem() and DataFileHandler.setFile().

  3. Each validator is called. The result of the validation is saved for each file and can be retrieved by FileSetMember.isValid() and FileSetMember.getErrorMessage().

  4. Each metadata reader is called, unless the metadata reader is the same class as the validator and the validation failed. If the metadata reader is a different class, it is called even if the validation failed.

[Note] Only one instance of each validator class is created

The validation/metadata extraction is not done until all files have been added to the file set. If the same validator/metadata reader is used for more than one file, the same instance is reused. That is, setFile() is called once for each file/file type pair, while the validate() and extractMetadata() methods are only called once.
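The outline of the validation process can be simulated with a short, self-contained sketch. It rests on several assumptions: HandlerSketch and ValidationFlowSketch are hypothetical stand-ins and not the core classes, handler classes are identified by plain names, files are strings, and a file name ending in ".bad" is arbitrarily treated as invalid. The sketch also tracks failure per validator, whereas the core stores the result for each file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical handler that can act as validator, metadata reader, or both
class HandlerSketch
{
   final List<String> files = new ArrayList<>();
   int validateCalls = 0;
   int extractCalls = 0;

   void setFile(String file)
   {
      files.add(file);
   }

   // Sketch rule: any file name ending with ".bad" fails validation
   void validate()
   {
      validateCalls++;
      for (String file : files)
      {
         if (file.endsWith(".bad"))
         {
            throw new IllegalStateException("Invalid file: " + file);
         }
      }
   }

   void extractMetadata()
   {
      extractCalls++;
   }
}

class ValidationFlowSketch
{
   // One shared instance per (hypothetical) handler class name
   final Map<String, HandlerSketch> handlers = new LinkedHashMap<>();

   // Each member is {file, validator class name, metadata reader class name}
   List<Throwable> validate(List<String[]> members)
   {
      Set<String> validators = new LinkedHashSet<>();
      Set<String> readers = new LinkedHashSet<>();
      for (String[] member : members)
      {
         // Create each handler once and register the file with it once
         for (String name : new LinkedHashSet<>(Arrays.asList(member[1], member[2])))
         {
            handlers.computeIfAbsent(name, k -> new HandlerSketch()).setFile(member[0]);
         }
         validators.add(member[1]);
         readers.add(member[2]);
      }

      // Each validator runs once; errors are collected instead of thrown
      List<Throwable> errors = new ArrayList<>();
      Set<String> failed = new HashSet<>();
      for (String name : validators)
      {
         try
         {
            handlers.get(name).validate();
         }
         catch (RuntimeException ex)
         {
            errors.add(ex);
            failed.add(name);
         }
      }

      // A reader is skipped only if it is also a validator that failed
      for (String name : readers)
      {
         if (!failed.contains(name)) handlers.get(name).extractMetadata();
      }
      return errors;
   }
}
```

With two files sharing a hypothetical "MapValidator" and "MapReader", each handler receives two setFile() calls but only one validate() or extractMetadata() call, and the separate reader still runs when the validator fails.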

All validators and metadata extractors should extend the AbstractDataFileHandler class. The reason is that we may want to add more methods to the DataFileHandler interface in the future. The AbstractDataFileHandler will be used to provide default implementations for backwards compatibility.
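The reasoning behind this recommendation can be shown with a generic sketch of the pattern. All names below (HandlerApiSketch, AbstractHandlerSketch, CelValidatorSketch, describe()) are hypothetical and only mirror the roles of the interface, the abstract base class and a concrete handler; they are not the BASE classes.

```java
// Hypothetical stand-in for the handler interface
interface HandlerApiSketch
{
   void setFile(String file);
   void validate();
   // Imagine that this method is added to the interface in a later release
   String describe();
}

// Hypothetical stand-in for the abstract base class. It supplies default
// implementations, including one for the method that was added later.
abstract class AbstractHandlerSketch implements HandlerApiSketch
{
   protected String file;

   @Override
   public void setFile(String file)
   {
      this.file = file;
   }

   @Override
   public String describe()
   {
      return "handler for " + file;
   }
}

// A handler written before describe() existed. Because it extends the
// abstract class it still compiles and inherits the default behaviour.
class CelValidatorSketch extends AbstractHandlerSketch
{
   @Override
   public void validate()
   {
      if (file == null || !file.endsWith(".cel"))
      {
         throw new IllegalArgumentException("Not a CEL file: " + file);
      }
   }

   public static void main(String[] args)
   {
      HandlerApiSketch handler = new CelValidatorSketch();
      handler.setFile("sample1.cel");
      handler.validate();
      System.out.println(handler.describe());
   }
}
```

A handler that implemented the interface directly would stop compiling as soon as describe() was added, which is the situation the abstract class is meant to prevent.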

Use case: Import data into the database

This should be done by existing plug-ins in the same way as before. A slight modification is needed since it is good if the importers are made aware of already selected files in the FileSet to provide good default values. The FileStoreUtil class is very useful in cases like this:

RawBioAssay rba = ...
DbControl dc = ...

// Get the current raw data file, if any
List<File> rawDataFiles = 
   FileStoreUtil.getGenericDataFiles(dc, rba, FileType.RAW_DATA);
File defaultFile = rawDataFiles.size() > 0 ?
   rawDataFiles.get(0) : null;
   
// Create parameter asking for input file - use current as default
PluginParameter<File> fileParameter = new PluginParameter<File>(
   "file",
   "Raw data file",
   "The file that contains the raw data that you want to import",
   new FileParameterType(defaultFile, true, 1)
);

An import plug-in should also save the file that was used to the file set:

RawBioAssay rba = ...
DbControl dc = ...
// The file the user selected to import from
File rawDataFile = (File)job.getValue("file");

// Save the file to the fileset. The method will check which file 
// type the platform uses as the raw data type. As a fallback the
// GENERIC_RAW_DATA type is used
FileStoreUtil.setGenericDataFile(dc, rba, FileType.RAW_DATA, 
   DataFileType.GENERIC_RAW_DATA, rawDataFile);
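The fallback behaviour described in the comment above can be sketched as a simple lookup. This is an illustration only: GenericTypeFallbackSketch and resolve() are hypothetical, file types are represented by strings, and the map from a generic type to the platform's own file type is an assumption about how the choice could be modelled, not how FileStoreUtil is implemented.

```java
import java.util.Collections;
import java.util.Map;

class GenericTypeFallbackSketch
{
   // platformTypes maps a generic type (for example "RAW_DATA") to the
   // file type the platform uses for it. A null map models an item
   // without a platform. The fallback is used when no mapping exists.
   static String resolve(Map<String, String> platformTypes, String genericType, String fallback)
   {
      String specific = platformTypes == null ? null : platformTypes.get(genericType);
      return specific != null ? specific : fallback;
   }

   public static void main(String[] args)
   {
      // Affymetrix uses CEL as its raw data file type
      Map<String, String> affymetrix = Collections.singletonMap("RAW_DATA", "CEL");
      System.out.println(resolve(affymetrix, "RAW_DATA", "GENERIC_RAW_DATA"));
      System.out.println(resolve(null, "RAW_DATA", "GENERIC_RAW_DATA"));
   }
}
```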

Use case: Using raw data from files in an experiment

Just as before, an experiment is still locked to a single RawDataType. This is a design issue that would break too many things if changed. If data is stored in files the experiment is also locked to a single Platform. This has been designed to have as little impact on existing plug-ins as possible. In most cases, the plug-ins will continue to work as before.

A plug-in (using data from the database) that needs to check if it can be used within an experiment can still do:

Experiment e = ...
RawDataType rdt = e.getRawDataType();
if (rdt.isStoredInDb())
{
   // Check number of channels, etc...
   // ... run plug-in code ...
}

A newer plug-in which uses data from files should do:

Experiment e = ...
DbControl dc = ...
RawDataType rdt = e.getRawDataType();
if (!rdt.isStoredInDb())
{
   // Check that platform/variant is supported
   Platform p = rdt.getPlatform(dc);
   PlatformVariant v = rdt.getVariant(dc);
   // ...

   // Get data files
   File aFile = FileStoreUtil.getDataFile(dc, ...);
   
   // ... run plug-in code ...
}