Opened 8 months ago

Closed 7 months ago

#2321 closed task (fixed)

Implement a tool for migrating raw bioassays to derived bioassays

Reported by: Nicklas Nordborg
Owned by: everyone
Priority: major
Milestone: BASE 3.19.11
Component: core
Version:
Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

Future development of BASE may remove features that are rarely used anymore, for example raw bioassays with raw data imported into the database, experiments, array LIMS, etc.

We have a lot of data at the raw bioassay level, but it is somewhat problematic because a raw bioassay is a dead end: it is not possible to create child items from it for further analysis. Until now this has been worked around by adding more files and/or annotations to an existing raw bioassay.

It would be a lot more flexible if we could move existing raw bioassays to the derived bioassay level instead. In theory it could be done with batch importers, but in practice it would be better to implement a tool for that.

The general idea is to create a derived bioassay copy of each raw bioassay. The platform/variant/rawdatatype are used to map to a subtype. Existing annotations, files, any-to-any links, etc. are re-linked to the new derived bioassays (they will no longer be available on the raw bioassay).

Below is a more detailed description (not yet complete):

Database columns

RawBioAssays                            DerivedBioAssays  Comment
id                                      id                A new ID is generated
version                                 version           Copy
diskusage_id                            -                 Not used
annotationset_id                        annotationset_id  Copy and clear
fileset_id                              fileset_id        Copy and clear
entry_date                              entry_date        Copy
platform_id / variant_id / rawdatatype  subtype_id        Platform and rawdata type is mapped to a subtype
job_id                                  job_id            Copy
protocol_id                             protocol_id       Copy
software_id                             software_id       Copy
arraydesign_id                          -                 Create an AnyToAny-link
bioassay_id                             -                 Link via ParentDerivedBioassays and ParentPhysicalBioAssays
extract_id                              extract_id        Copy
name                                    name              Copy
description                             description       Copy
removed_by                              removed_by        Copy
itemkey_id                              itemkey_id        Copy
projectkey_id                           projectkey_id     Copy
owner                                   owner             Copy
has_data                                -                 Not used
spots                                   -                 Not used
file_spots                              -                 Not used
bytes                                   -                 Not used
-                                       is_root           false
-                                       kit_id            null
-                                       hardware_id       null
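
The column mapping can be sketched as a plain function. This is only an illustration in Python using dictionaries as rows; the function names (to_derived_row, cleared_raw_row) are hypothetical and the real tool works against the BASE database, not dictionaries.

```python
# Columns that the table above marks as "Copy" (or "Copy and clear").
COPIED = [
    "version", "annotationset_id", "fileset_id", "entry_date", "job_id",
    "protocol_id", "software_id", "extract_id", "name", "description",
    "removed_by", "itemkey_id", "projectkey_id", "owner",
]


def to_derived_row(raw, new_id, subtype_id):
    """Build a DerivedBioAssays row (a dict) from a RawBioAssays row."""
    derived = {col: raw[col] for col in COPIED}
    derived["id"] = new_id              # a new ID is generated
    derived["subtype_id"] = subtype_id  # mapped from platform/rawdata type
    derived["is_root"] = False
    derived["kit_id"] = None
    derived["hardware_id"] = None
    return derived


def cleared_raw_row(raw):
    """'Copy and clear': the raw bioassay gives up its annotation and file set."""
    return {**raw, "annotationset_id": None, "fileset_id": None}
```

Columns marked "Not used" (diskusage_id, has_data, spots, file_spots, bytes) are simply not carried over; arraydesign_id and bioassay_id become links and are not part of the row itself.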

Annotations

Annotations can be moved to the new derived bioassay by updating the AnnotationSets table with the new id.

AnnotationSets Comment
id Keep
version +1
item_type Change 264 (=RAWBIOASSAY) to 268 (=DERIVEDBIOASSAY)
item_id Change to new id
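
A minimal sketch of this update, written in Python against SQLite for illustration. The table and column names follow the listing above and may not match the physical schema; the function name and the id_map argument (raw bioassay id to new derived bioassay id) are assumptions.

```python
import sqlite3

RAWBIOASSAY, DERIVEDBIOASSAY = 264, 268


def relink_annotation_sets(conn, id_map):
    """Point annotation sets of migrated raw bioassays at the new items.

    id_map maps a raw bioassay id to its new derived bioassay id.
    """
    conn.executemany(
        "UPDATE AnnotationSets "
        "SET version = version + 1, item_type = ?, item_id = ? "
        "WHERE item_type = ? AND item_id = ?",
        [(DERIVEDBIOASSAY, new_id, RAWBIOASSAY, old_id)
         for old_id, new_id in id_map.items()],
    )
```

Because each statement also matches on item_type, a row that has already been moved cannot be picked up again even if a new id happens to equal another raw bioassay's old id.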

NOTE! Cached annotations (in the static.cache/snapshots-v5 directory) must be deleted. ~The simplest thing is to delete the entire directory and everything in it.~ It is easy to delete them from the code since we already have the SnapshotManager.removeSnapshots() method.

NOTE! Annotation types that are enabled for raw bioassays but not derived bioassays need to be updated. We can implement this in the tool as well. We would need to insert an entry into the AnnotationTypeItems table for each annotation type:

AnnotationTypeItems Comment
annotationtype_id Id of annotation type
item_type 268
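
One way to express this is a single INSERT ... SELECT that adds the missing entries. The sketch below uses Python with SQLite and assumes the two columns listed above are the only ones that matter; the function name is hypothetical.

```python
import sqlite3

RAWBIOASSAY, DERIVEDBIOASSAY = 264, 268


def enable_for_derived(conn):
    """Enable every raw bioassay annotation type for derived bioassays too."""
    conn.execute(
        "INSERT INTO AnnotationTypeItems (annotationtype_id, item_type) "
        "SELECT r.annotationtype_id, ? FROM AnnotationTypeItems r "
        "WHERE r.item_type = ? AND NOT EXISTS ("
        "  SELECT 1 FROM AnnotationTypeItems d "
        "  WHERE d.annotationtype_id = r.annotationtype_id AND d.item_type = ?)",
        (DERIVEDBIOASSAY, RAWBIOASSAY, DERIVEDBIOASSAY),
    )
```

The NOT EXISTS clause skips annotation types that are already enabled for derived bioassays, so running it a second time adds nothing.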

Files

Files can be moved to the new derived bioassay by updating the FileSets table with the new id.

FileSets Comment
id Keep
version +1
item_type Change 264 (=RAWBIOASSAY) to 268 (=DERIVEDBIOASSAY)

But, we also need to address the fact that the file types of the members in the file set must match the new item type. We can either change the file types to new types or we can change the item type on the existing file types. In the first case we update the FileSetMembers table:

FileSetMembers Comment
id Keep
version +1
fileset_id Keep
datafiletype_id Update to new type
other columns Keep

If there are any remaining file types for raw bioassays that have been migrated, we need to change them to derived bioassays instead. This is done in the DataFileTypes table:

DataFileTypes Comment
id Keep
version +1
item_type Change 264 (=RAWBIOASSAY) to 268 (=DERIVEDBIOASSAY)
other columns Keep

Note that we don't do this for all file types, but only for the types that are used by the migrated raw bioassays.
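
The two file type options can be sketched together as below. This is an illustration in Python with SQLite, not the actual implementation; the function name, the fileset_ids argument (the file sets that were moved) and the type_map argument (old datafiletype_id to new one) are assumptions.

```python
import sqlite3

RAWBIOASSAY, DERIVEDBIOASSAY = 264, 268


def migrate_file_types(conn, fileset_ids, type_map):
    """Fix file types for the members of migrated file sets."""
    if not fileset_ids:
        return
    marks = ",".join("?" * len(fileset_ids))
    # Option 1: switch members to a different file type where one was mapped.
    # Assumes no mapped-to type is itself a key in type_map.
    for old, new in type_map.items():
        conn.execute(
            "UPDATE FileSetMembers "
            "SET version = version + 1, datafiletype_id = ? "
            f"WHERE datafiletype_id = ? AND fileset_id IN ({marks})",
            (new, old, *fileset_ids),
        )
    # Option 2: raw bioassay file types still used by a migrated file set
    # change their item type instead. Unused file types are left alone.
    conn.execute(
        "UPDATE DataFileTypes SET version = version + 1, item_type = ? "
        "WHERE item_type = ? AND id IN ("
        f"  SELECT datafiletype_id FROM FileSetMembers WHERE fileset_id IN ({marks}))",
        (DERIVEDBIOASSAY, RAWBIOASSAY, *fileset_ids),
    )
```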

Any-to-any links

Links are moved to the new derived bioassay by updating the AnyToAny table. We need to check both the source and target of the links.

AnyToAny Comment
id Keep
version +1
name Keep
description Keep
from_id Keep or change to new id
from_type Change 264 (=RAWBIOASSAY) to 268 (=DERIVEDBIOASSAY)
to_id Keep or change to new id
to_type Change 264 (=RAWBIOASSAY) to 268 (=DERIVEDBIOASSAY)
uses_to Keep
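
A sketch of the link update, in Python with SQLite for illustration only. It handles each link once so that a link with raw bioassays at both ends gets a single version increment; the function name and id_map argument are assumptions.

```python
import sqlite3

RAWBIOASSAY, DERIVEDBIOASSAY = 264, 268


def relink_any_to_any(conn, id_map):
    """Move any-to-any links, checking both the source and the target."""
    rows = conn.execute(
        "SELECT id, from_id, from_type, to_id, to_type FROM AnyToAny "
        "WHERE from_type = ? OR to_type = ?",
        (RAWBIOASSAY, RAWBIOASSAY),
    ).fetchall()
    for link_id, from_id, from_type, to_id, to_type in rows:
        changed = False
        if from_type == RAWBIOASSAY and from_id in id_map:
            from_id, from_type, changed = id_map[from_id], DERIVEDBIOASSAY, True
        if to_type == RAWBIOASSAY and to_id in id_map:
            to_id, to_type, changed = id_map[to_id], DERIVEDBIOASSAY, True
        if changed:
            conn.execute(
                "UPDATE AnyToAny SET version = version + 1, "
                "from_id = ?, from_type = ?, to_id = ?, to_type = ? WHERE id = ?",
                (from_id, from_type, to_id, to_type, link_id),
            )
```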

Change history

The change history is moved to the new derived bioassay by updating the ChangeHistoryDetails table. This will leave the old raw bioassay without a change history. I think we should insert a new entry representing the migration. We should also insert a new entry for the derived bioassay.

ChangeHistoryDetails Comment
id Keep
version +1
history_id Keep
change_type Keep
item_id Change to new id
item_type Change 264 (=RAWBIOASSAY) to 268 (=DERIVEDBIOASSAY)
change_info Keep
old_value Keep
new_value Keep

A new entry in the ChangeHistory table is created that represents the migration:

ChangeHistory Comment
id Generated
version 0
time Current timestamp
user_id Id of root user
session_id Id of current session
client_id Id of a new client (net.sf.basedb.clients.rba2dba-migration)
project_id Null
plugin_id Null
job_id Null
name Migrate raw bioassays to derived bioassays

New entries for the raw bioassay and the new derived bioassay in the ChangeHistoryDetails table:

ChangeHistoryDetails Comment
id Generated
version 0
history_id Id of current history
change_type 2 (=UPDATE)
item_id Id of raw bioassay or derived bioassay
item_type 264 (=RAWBIOASSAY) or 268 (=DERIVEDBIOASSAY)
change_info Migrated <name-of-rba> from raw bioassay to derived bioassay
old_value Null
new_value Null
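
The ordering matters here: the existing history is moved first and the two migration entries are added afterwards, so the new entry for the raw bioassay is not moved along with the rest. A minimal sketch in Python with SQLite, assuming the ChangeHistory entry already exists and its id is passed in as history_id; the function name is hypothetical.

```python
import sqlite3

RAWBIOASSAY, DERIVEDBIOASSAY = 264, 268
UPDATE = 2  # change_type value for an update


def migrate_history(conn, history_id, old_id, new_id, name):
    """Move the change history and log the migration on both items."""
    # Step 1: move all existing details to the new derived bioassay.
    conn.execute(
        "UPDATE ChangeHistoryDetails "
        "SET version = version + 1, item_id = ?, item_type = ? "
        "WHERE item_type = ? AND item_id = ?",
        (new_id, DERIVEDBIOASSAY, RAWBIOASSAY, old_id),
    )
    # Step 2: only now add one migration entry per item.
    info = f"Migrated {name} from raw bioassay to derived bioassay"
    conn.executemany(
        "INSERT INTO ChangeHistoryDetails (version, history_id, change_type, "
        "item_id, item_type, change_info, old_value, new_value) "
        "VALUES (0, ?, ?, ?, ?, ?, NULL, NULL)",
        [(history_id, UPDATE, old_id, RAWBIOASSAY, info),
         (history_id, UPDATE, new_id, DERIVEDBIOASSAY, info)],
    )
```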

Job parameters

Some jobs have items as parameters and the items can be raw bioassays. This information is stored in the ItemValues table.

ItemValues Comment
id Keep
data_class Change net.sf.basedb.core.data.RawBioAssayData to net.sf.basedb.core.data.DerivedBioAssayData
data_class_id Change to new id
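
As an illustration, the update could look like the following Python/SQLite sketch. The class names are taken from the table above; the function name and id_map argument are assumptions.

```python
import sqlite3

RAW_CLASS = "net.sf.basedb.core.data.RawBioAssayData"
DERIVED_CLASS = "net.sf.basedb.core.data.DerivedBioAssayData"


def relink_item_values(conn, id_map):
    """Rewrite job parameters that reference migrated raw bioassays."""
    conn.executemany(
        "UPDATE ItemValues SET data_class = ?, data_class_id = ? "
        "WHERE data_class = ? AND data_class_id = ?",
        [(DERIVED_CLASS, new_id, RAW_CLASS, old_id)
         for old_id, new_id in id_map.items()],
    )
```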

Item lists

All item lists with raw bioassays as members are converted to item lists with derived bioassays. We need to change entries in the ItemListMembers table:

ItemListMembers Comment
item_id Change to new id
list_id Keep

And in the ItemLists table:

ItemLists Comment
version +1
member_type Change 264 (=RAWBIOASSAY) to 268 (=DERIVEDBIOASSAY)
subtype_id Rawdata type is mapped to a subtype
rawdatatype Null
size Updated to match count
All other columns Keep

NOTE! Synchronization filters are not updated since it is not possible to automatically make a working filter in all cases. It may not be enough to just change the type from raw bioassay to derived bioassay.

All item lists that have at least one synchronization filter that is used on the raw bioassay level are marked with {:} to make them easy to spot in the web interface. The filters need to be manually updated and when they have been fixed the marking can be removed.
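
The list conversion can be sketched as follows, in Python with SQLite for illustration. The members of each list are rebuilt in one go so that overlapping old and new ids cannot be confused. The function name, the id_map argument and the subtype_for dictionary (rawdata type to subtype id) are assumptions, as is the choice to drop members that have no entry in id_map. Synchronization filters and the marking of lists are not covered.

```python
import sqlite3

RAWBIOASSAY, DERIVEDBIOASSAY = 264, 268


def convert_item_lists(conn, id_map, subtype_for):
    """Convert raw bioassay item lists to derived bioassay item lists."""
    lists = conn.execute(
        "SELECT id, rawdatatype FROM ItemLists WHERE member_type = ?",
        (RAWBIOASSAY,),
    ).fetchall()
    for list_id, rawdatatype in lists:
        old = conn.execute(
            "SELECT item_id FROM ItemListMembers WHERE list_id = ?", (list_id,)
        ).fetchall()
        # Translate every member id; the set removes any duplicates.
        new = sorted({id_map[m] for (m,) in old if m in id_map})
        conn.execute("DELETE FROM ItemListMembers WHERE list_id = ?", (list_id,))
        conn.executemany(
            "INSERT INTO ItemListMembers (list_id, item_id) VALUES (?, ?)",
            [(list_id, m) for m in new],
        )
        conn.execute(
            "UPDATE ItemLists SET version = version + 1, member_type = ?, "
            "subtype_id = ?, rawdatatype = NULL, size = ? WHERE id = ?",
            (DERIVEDBIOASSAY, subtype_for.get(rawdatatype), len(new), list_id),
        )
```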

Change History (26)

comment:1 by Nicklas Nordborg, 8 months ago

Description: modified (diff)

comment:2 by Nicklas Nordborg, 8 months ago

In 8200:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Added an entry point to the OneTimeFix implementation. The migration can be started with onetimefix.sh migrate_rba2dba -u <root> -p <password> -c <config>

comment:3 by Nicklas Nordborg, 8 months ago

In 8201:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Started to implement the migration utility. It currently counts all raw bioassays and loads the information about them that is needed. Nothing is created or moved yet.

comment:4 by Nicklas Nordborg, 8 months ago

In 8202:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Implemented creation of derived bioassay.

comment:5 by Nicklas Nordborg, 8 months ago

In 8203:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Implemented re-linking annotation sets to the new derived bioassay.

comment:6 by Nicklas Nordborg, 8 months ago

In 8204:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Adding links from the new derived bioassay to the parent derived bioassay and physical bioassays.

comment:7 by Nicklas Nordborg, 8 months ago

In 8205:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Link the file set to the new derived bioassay.

comment:8 by Nicklas Nordborg, 8 months ago

Description: modified (diff)

comment:9 by Nicklas Nordborg, 8 months ago

In 8206:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Move any-to-any links to the new derived bioassays. Needed an extra index on the 'from_type' and 'from_id' columns to make the update query perform well.

comment:10 by Nicklas Nordborg, 8 months ago

In 8207:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Move the change history to the new derived bioassay. There was an extra complication with the two log entries that are added as part of the migration since we do not want to move them to the derived bioassay. Solved by forcing a flush() before the log entries are added.

comment:11 by Nicklas Nordborg, 8 months ago

Description: modified (diff)

comment:12 by Nicklas Nordborg, 8 months ago

Description: modified (diff)

comment:13 by Nicklas Nordborg, 8 months ago

Description: modified (diff)

comment:14 by Nicklas Nordborg, 8 months ago

In 8208:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Convert item lists with raw bioassays to item lists with derived bioassays.

comment:15 by Nicklas Nordborg, 8 months ago

Description: modified (diff)

comment:16 by Nicklas Nordborg, 8 months ago

In 8209:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Item lists also need to update subtype and rawdata type and size.

comment:17 by Nicklas Nordborg, 8 months ago

In 8210:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Ordering the raw bioassays by id should keep the same order for the derived bioassays. This can be useful since id order is more or less chronological order.

comment:18 by Nicklas Nordborg, 8 months ago

In 8211:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Implemented a simple mapping from rawdata type and/or platform to subtype.
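
How such a mapping might be resolved is sketched below. This is purely hypothetical: the lookup order (most specific match first), the use of None as a wildcard and the function name are assumptions, and the actual configuration file format is not shown in this ticket.

```python
def resolve_subtype(mapping, platform_id, rawdatatype):
    """Return the subtype id for a raw bioassay, or None if nothing matches.

    mapping keys are (platform_id, rawdatatype) pairs where either part
    may be None, meaning "any".
    """
    candidates = (
        (platform_id, rawdatatype),  # both must match
        (None, rawdatatype),         # rawdata type only
        (platform_id, None),         # platform only
    )
    for key in candidates:
        if key in mapping:
            return mapping[key]
    return None
```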

comment:19 by Nicklas Nordborg, 8 months ago

In 8212:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Item lists that have syncfilters that depend on raw bioassay filters are marked with {:} to make them easy to find and fix.

comment:20 by Nicklas Nordborg, 8 months ago

Description: modified (diff)

comment:21 by Nicklas Nordborg, 8 months ago

In 8213:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Data file types must also be handled when moving the file set from a raw bioassay to a derived bioassay. Since a file type can only be associated with a single type of item there are two options.

  1. Change to a different type on the file
  2. Change the type of item on the file type


The first option can now be specified in the configuration file by specifying a mapping in the fifth column. The second option is applied to all remaining file types that are used among the files that have been migrated.

comment:22 by Nicklas Nordborg, 8 months ago

Description: modified (diff)

comment:23 by Nicklas Nordborg, 8 months ago

In 8215:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

The migration now also creates a link to the ArrayDesign that was associated with the raw bioassay.

comment:24 by Nicklas Nordborg, 8 months ago

In 8216:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Validate that the ID values in the configuration file actually exist in the database before starting the migration.

comment:25 by Nicklas Nordborg, 8 months ago

In 8217:

References #2321: Implement a tool for migrating raw bioassays to derived bioassays

Added a flag for doing a dry-run.
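
A common way to get dry-run behaviour is to run every step inside one transaction and roll it back at the end instead of committing. The sketch below shows that pattern in Python with SQLite; it is an assumption about the general technique, not a description of how the flag is implemented in BASE, and the function name and steps argument are hypothetical.

```python
import sqlite3


def run_migration(conn, steps, dry_run=False):
    """Run all steps in one transaction; roll back if this is a dry run."""
    try:
        for step in steps:
            step(conn)
    except Exception:
        conn.rollback()  # never leave a half-finished migration behind
        raise
    if dry_run:
        conn.rollback()  # everything ran, but nothing is kept
    else:
        conn.commit()
```

A dry run like this still executes every statement, so problems such as missing ids or constraint violations show up before any data is changed.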

comment:26 by Nicklas Nordborg, 7 months ago

Resolution: fixed
Status: new → closed