Opened 16 years ago

Closed 12 years ago

Last modified 12 years ago

#1153 closed enhancement (fixed)

Handling short read transcript sequence data

Reported by: base Owned by: Nicklas Nordborg
Priority: critical Milestone: BASE 3.0
Component: core Version:
Keywords: Cc:

Description

This may not be a "core" issue; it might be achieved with core or contributed plugins alone. As with most new technologies, some core changes might be needed though.

Bob MacCallum wrote in http://www.mail-archive.com/basedb-users@lists.sourceforge.net/msg01559.html

I'm just thinking out loud about how to incorporate high throughput
transcriptome sequencing data into BASE.  It's some way off, but I'm assuming
that it will be cheap and quantitative enough to replace arrays at some point
during the renewal period of our project (2009-2014).

1. Create an "array design" with all genes of interest (ideally this would be
   the largest set possible, e.g. known genes + predicted genes of all
   qualities, perhaps even predicted genes from the new sequence data).  The
   layout would be fictitious, of course (what's the minimum one can get away
   with?).

2. Create a rawbioassay to correspond to each sequencing run.

Then *one* of 3a/b/c for each sequencing run/rawbioassay:

3a. Outside BASE, align the new sequences to genome or transcript sequences
    and calculate "intensities" for each gene on the "array design" and dump
    into a tab delimited raw data file.  Attach that file to the rawbioassay
    and import numeric data as usual.

3b. Upload the text file of sequences to the raw bioassay's "data file".
    Create a BASE plugin to do the alignment and quantification as in 3a,
    and load the numeric data into the database.

3c. Similar to 3b, but calculate the intensities at the "create root bioassay"
    step, similar to the Affymetrix RMA plugin.

4. continue with analysis as normal.  biosources, samples etc can be linked to
   the bioassay too, of course.

I guess a new raw data type (for "Generic" platform) would have to be
created for 3a (and 3b?) but that's not difficult.

Is it possible to go with 3a, but also attach the sequence file to the raw
bioassay (or scan?) - something like keeping tiff files for scans?  Just for
documentation purposes.

Any thoughts from the community or developers?

Jari suggested starting a ticket here.
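
Option 3a in the description can be sketched outside BASE roughly as follows. This is only an illustration, not BASE code: it assumes the alignment step has already produced (read id, gene id) pairs, it uses plain read counts as the "intensity", and the function names and column headers (`count_reads_per_gene`, `write_raw_data`, "Gene ID", "Intensity") are hypothetical.

```python
import csv
import io
from collections import Counter


def count_reads_per_gene(alignments, genes):
    """Count aligned reads per gene; genes without reads get 0.

    `alignments` is an iterable of (read_id, gene_id) pairs, assumed to
    come from an external aligner. Reads hitting genes that are not on
    the "array design" are skipped.
    """
    known = set(genes)
    counts = Counter(gene for _read, gene in alignments if gene in known)
    return {gene: counts.get(gene, 0) for gene in genes}


def write_raw_data(intensities, out):
    """Write a header line and one tab-delimited row per gene."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Gene ID", "Intensity"])  # hypothetical column names
    for gene, value in intensities.items():
        writer.writerow([gene, value])


if __name__ == "__main__":
    design = ["geneA", "geneB", "geneC"]  # the fictitious "array design"
    hits = [("r1", "geneA"), ("r2", "geneA"), ("r3", "geneC"), ("r4", "geneX")]
    buf = io.StringIO()
    write_raw_data(count_reads_per_gene(hits, design), buf)
    print(buf.getvalue(), end="")
```

Every gene on the design gets a row even when no read hits it, so the resulting file has the same shape for every sequencing run.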

Attachments (10)

sequencing-uml-draft-1.png (74.9 KB ) - added by Nicklas Nordborg 13 years ago.
First draft of a UML-like diagram
sequencing-draft-1.txt (5.0 KB ) - added by Nicklas Nordborg 13 years ago.
More details and thoughts about the diagram
sequencing-biomaterials-draft-2.png (97.7 KB ) - added by Nicklas Nordborg 13 years ago.
Second draft: UML diagram for biomaterials
sequencing-rawdata-draft-2.png (73.0 KB ) - added by Nicklas Nordborg 13 years ago.
Second draft: UML diagram for raw data
sequencing-draft-2.txt (4.9 KB ) - added by Nicklas Nordborg 13 years ago.
sequencing-update-script-draft-2.txt (5.5 KB ) - added by Nicklas Nordborg 13 years ago.
sequencing-biomaterials-3.png (97.7 KB ) - added by Nicklas Nordborg 13 years ago.
Third version: UML diagram for biomaterials (most of this has been implemented)
sequencing-rawdata-3.png (73.2 KB ) - added by Nicklas Nordborg 13 years ago.
Third version: UML diagram for physical bioassays+raw data (most of this has been implemented)
sequencing-biomaterials-4.png (97.7 KB ) - added by Nicklas Nordborg 13 years ago.
sequencing-rawdata-4.png (73.2 KB ) - added by Nicklas Nordborg 13 years ago.


Change History (50)

in reply to:  description comment:1 by base, 16 years ago

by "documentation purposes" I really mean "archival purposes"

comment:2 by Jari Häkkinen, 15 years ago

Milestone: BASE 2.x+

comment:3 by Nicklas Nordborg, 13 years ago

Milestone: BASE 2.x+ → BASE 2.18
Owner: changed from everyone to Nicklas Nordborg

We'll try to get something going for this in the next release. We are currently working on some documents and an updated database design. We'll probably add a lot of new stuff (e.g. items) to BASE. We'll look into the possibility of assigning a "Project type" to a project, which should make the web interface switch between different setups:

  • Hide menu items that are not relevant
  • Make sure annotations are inherited along the correct path
  • Changes to the "item overview" function and its validation options
  • ... and probably more to come...

comment:4 by Nicklas Nordborg, 13 years ago

Priority: minor → critical

comment:5 by Nicklas Nordborg, 13 years ago

Status: new → assigned

comment:6 by Nicklas Nordborg, 13 years ago

(In [5597]) References #1153: Handling short read transcript sequence data

Started working on database diagrams.

by Nicklas Nordborg, 13 years ago

Attachment: sequencing-uml-draft-1.png added

First draft of a UML-like diagram

by Nicklas Nordborg, 13 years ago

Attachment: sequencing-draft-1.txt added

More details and thoughts about the diagram

comment:7 by Nicklas Nordborg, 13 years ago

(In [5619]) References #1153: Handling short read transcript sequence data

The updated UML diagram.

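by Nicklas Nordborg, 13 years ago

Attachment: sequencing-biomaterials-draft-2.png added

Second draft: UML diagram for biomaterials

by Nicklas Nordborg, 13 years ago

Attachment: sequencing-rawdata-draft-2.png added

Second draft: UML diagram for raw data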

by Nicklas Nordborg, 13 years ago

Attachment: sequencing-draft-2.txt added

by Nicklas Nordborg, 13 years ago

Attachment: sequencing-update-script-draft-2.txt added

comment:8 by Nicklas Nordborg, 13 years ago

(In [5632]) References #1153: Handling short read transcript sequence data

Replaced Label with Tag. It re-uses much of the old code including the database table.

The code compiles and most test programs pass. Not all GUI pages in the web client work. Since LabeledExtract is going to be removed, functionality related to it may not behave as expected.

comment:9 by Nicklas Nordborg, 13 years ago

(In [5641]) References #1153: Handling short read transcript sequence data

Removed LabeledExtract and related classes. Extract takes its place in most situations, including a temporary link to hybridizations. All test programs pass and the web client is more or less working. The next step is to replace Hybridization with PhysicalBioAssay.

comment:10 by Nicklas Nordborg, 13 years ago

(In [5642]) References #1153: Handling short read transcript sequence data

Replaced Hybridization with PhysicalBioAssay. All test programs pass and the web client is usable. Terminology has to be modified in some places. Links to Scan remain but will be replaced later.

comment:11 by Nicklas Nordborg, 13 years ago

(In [5652]) References #1153: Handling short read transcript sequence data

Replaced Scan with DerivedBioAssaySet. The code is once again in a state where it compiles. I am working with the test programs to make sure that what has been done so far is working. The web client can't be used at the moment. The current design is a bit different from the UML. I'll update this later as I implement the remaining functionality.

comment:12 by Nicklas Nordborg, 13 years ago

(In [5653]) References #1153: Handling short read transcript sequence data

Added a link between DerivedBioAssay and Extract so that we know where the data is coming from. Item overview loading has been updated to handle this link.

comment:13 by Nicklas Nordborg, 13 years ago

(In [5657]) References #1153: Handling short read transcript sequence data

Added web pages for editing and viewing derived bioassay sets. They'll probably need to be polished up a bit.

The final link to raw bioassay is still not possible. I think we have to re-design the link somewhat. In BASE 2 we used the array index to link back to the correct (labeled) extracts. We can't use that in BASE 3 since for sequencing experiments there doesn't have to be any relation between extracts in the same lane of the flow cell. The idea was to provide a direct link to the extract, but then we lose one of the parents in 2-channel microarray experiments...

comment:14 by Nicklas Nordborg, 13 years ago

(In [5662]) References #1153: Handling short read transcript sequence data

Removed the old "hack" with UsedQuantity and the dummy column used to store the array index. This has now been replaced with BioMaterialEventSource which is almost a full-fledged item and it should be easier to handle and create queries using the information.

comment:15 by Nicklas Nordborg, 13 years ago

(In [5663]) References #1153: Handling short read transcript sequence data

Changed the way parent biomaterials are handled. The pooled property has been replaced with parentType instead. There are two reasons:

  1. Removing the LabeledExtract class has turned all labeled extracts into Extract items, which are not pooled.
  2. It doesn't make sense to talk about "pooled" biomaterials when there is only a single parent.
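
A rough sketch of why a parent type carries more information than a pooled flag. This is not the BASE implementation: the class and attribute names are hypothetical, the parent type is derived from the parents here instead of being stored, and requiring all parents to share one item type is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ItemType(Enum):
    BIOSOURCE = "biosource"
    SAMPLE = "sample"
    EXTRACT = "extract"


@dataclass
class BioMaterial:
    name: str
    item_type: ItemType
    parents: List["BioMaterial"] = field(default_factory=list)

    @property
    def parent_type(self) -> Optional[ItemType]:
        """Item type shared by all parents, or None when there are none."""
        types = {p.item_type for p in self.parents}
        if len(types) > 1:
            # Assumption of this sketch: parents may not be of mixed types.
            raise ValueError("parents must all be of the same item type")
        return next(iter(types), None)


patient = BioMaterial("patient-1", ItemType.BIOSOURCE)
s1 = BioMaterial("sample-1", ItemType.SAMPLE, [patient])
s2 = BioMaterial("sample-2", ItemType.SAMPLE, [patient])
merged = BioMaterial("sample-3", ItemType.SAMPLE, [s1, s2])
```

Here `s1` has a single biosource parent and `merged` has two sample parents. Both cases are described by the parent type alone, without having to call a single-parent item "pooled".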

The web GUI for biosources and samples has been updated. Extracts do not work yet.

Overview loaders have been updated.

Batch importers need to be fixed. They currently assume a single parent of the parent item type.

Also added some of the button/select taglibs to provide more control over the look and feel in some cases.

comment:16 by Nicklas Nordborg, 13 years ago

(In [5664]) References #1153: Handling short read transcript sequence data

Updated the web GUI for extracts to handle the new parent item scheme. Also made some minor changes to the sample and physical bioassay GUI that, among other things, fix some export issues.

comment:17 by Nicklas Nordborg, 13 years ago

(In [5665]) References #1153: Handling short read transcript sequence data

Edit dialog didn't load parent items correctly for existing items.

comment:18 by Nicklas Nordborg, 13 years ago

(In [5667]) References #1153: Handling short read transcript sequence data

Fixes the sample and extract batch importers so that they can import both types of parent items.

Display the 'tag' for an extract in listings and on the view page.

Added 'none' option to parent type filter.

by Nicklas Nordborg, 13 years ago

Attachment: sequencing-biomaterials-3.png added

Third version: UML diagram for biomaterials (most of this has been implemented)

by Nicklas Nordborg, 13 years ago

Attachment: sequencing-rawdata-3.png added

Third version: UML diagram for physical bioassays+raw data (most of this has been implemented)

comment:19 by Nicklas Nordborg, 13 years ago

(In [5668]) References #1153: Handling short read transcript sequence data

Update UML diagrams for biomaterials and raw data.

comment:20 by Nicklas Nordborg, 13 years ago

(In [5685]) References #1153: Handling short read transcript sequence data

Simplified the design by only keeping DerivedBioAssay between PhysicalBioAssay and RawBioAssay. This should make batch importers, validation, etc. easier to implement and also provides better backwards compatibility with array experiments. The web GUI is usable but may need improvements in some cases.

comment:21 by Nicklas Nordborg, 13 years ago

(In [5688]) References #1153: Handling short read transcript sequence data

Fixes some test program failures. Started to re-enable all tests in the TestItemImporter (batch importers). Also need to create a batch DerivedBioAssayImporter that replaces the ScanImporter and add support for linking to extracts from raw bioassays.

comment:22 by Nicklas Nordborg, 13 years ago

(In [5690]) References #1153: Handling short read transcript sequence data

Fixed 'set owner' test so that it now uses a Tag instead of a Label.

comment:23 by Nicklas Nordborg, 13 years ago

(In [5692]) References #1153: Handling short read transcript sequence data

Fixed the 'Illumina raw data importer' plug-in. As a side-effect it now has support for attaching raw bioassays to more than one scan (derived bioassay). Actually, the attachment to scans was not fully implemented in the old version and didn't work as expected.

comment:24 by Nicklas Nordborg, 13 years ago

(In [5694]) References #1153: Handling short read transcript sequence data

Extended the Illumina test case to also create extracts to make sure that the raw bioassay -> extract linking is working as expected.

comment:25 by Nicklas Nordborg, 13 years ago

(In [5695]) References #1153: Handling short read transcript sequence data

Fixes the isUsed() and getUsingItems() methods for extract and derived bioassay.

comment:26 by Nicklas Nordborg, 13 years ago

(In [5696]) References #1153: Handling short read transcript sequence data

Implemented DerivedBioAssayImporter that replaces the scan importer. Added support for linking to extracts in the raw bioassay importer.

comment:27 by Nicklas Nordborg, 13 years ago

(In [5697]) References #1153: Handling short read transcript sequence data

Re-added image import functionality to ScanImporter.

comment:28 by Nicklas Nordborg, 13 years ago

(In [5720]) References #1153: Handling short read transcript sequence data

Delete extracts after test.

by Nicklas Nordborg, 13 years ago

Attachment: sequencing-biomaterials-4.png added

by Nicklas Nordborg, 13 years ago

Attachment: sequencing-rawdata-4.png added

comment:29 by Nicklas Nordborg, 13 years ago

The last UML diagrams simplify the path between PhysicalBioAssay and RawBioAssay. Only the DerivedBioAssay remains and it can either represent data for the entire physical bioassay (if there is no link to an Extract) or data for a single extract (if there is a link).
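
The simplified path can be pictured with a small sketch. The class names follow the comment above, but the fields and the `covered_extracts` helper are hypothetical and only illustrate the "optional link to an Extract" rule, not the actual BASE classes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Extract:
    name: str


@dataclass(frozen=True)
class PhysicalBioAssay:
    name: str
    extracts: Tuple[Extract, ...]  # extracts placed on e.g. a flow cell


@dataclass(frozen=True)
class DerivedBioAssay:
    name: str
    physical_bioassay: PhysicalBioAssay
    extract: Optional[Extract] = None  # None: data for the whole bioassay

    def covered_extracts(self) -> Tuple[Extract, ...]:
        """Extracts whose data this derived bioassay represents."""
        if self.extract is None:
            return self.physical_bioassay.extracts
        return (self.extract,)


e1, e2 = Extract("extract-1"), Extract("extract-2")
flow_cell = PhysicalBioAssay("flow-cell-1", (e1, e2))
whole = DerivedBioAssay("all-data", flow_cell)         # no extract link
single = DerivedBioAssay("extract-1-data", flow_cell, e1)  # one extract
```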

comment:30 by Nicklas Nordborg, 13 years ago

(In [5727]) References #1153: Handling short read transcript sequence data

Display job and plug-in information for derived bioassays. Added 'Run analysis plug-in' to toolbar on single-item page (derived bioassay).

comment:31 by Nicklas Nordborg, 13 years ago

(In [5739]) References #1153: Handling short read transcript sequence data

Changed some names to be more generic and not so Illumina-specific.

comment:32 by Nicklas Nordborg, 13 years ago

(In [5740]) References #1153: Handling short read transcript sequence data

The old cached annotation snapshots can't be used since item type codes have changed for some items.

comment:33 by Nicklas Nordborg, 13 years ago

(In [5748]) References #1153: Handling short read transcript sequence data

The item overview should now be able to load the complete tree from all ends.

comment:34 by Nicklas Nordborg, 13 years ago

(In [5749]) References #1153: Handling short read transcript sequence data

Loading parent annotatable items should now be fixed so that the proper upstream path for loading Extracts from PhysicalBioAssay is used.

comment:35 by Nicklas Nordborg, 13 years ago

(In [5750]) References #1153: Handling short read transcript sequence data

Selecting parent extract in raw bioassay edit dialog and derived bioassay edit dialog should now be working properly.

comment:36 by Nicklas Nordborg, 13 years ago

(In [5752]) References #1153: Handling short read transcript sequence data

Adding inverse link between extract->rawbioassay so that the list page can be used for filtering.

comment:37 by Nicklas Nordborg, 13 years ago

(In [5773]) References #1153: Handling short read transcript sequence data

  • Adds a 'Cufflinks' raw data type with five columns: coverage, fpkm, fpkm_lo, fpkm_hi and status
  • Define FPKM_TRACKING file type and MIME type.
  • Adds left(string, index) as a JEP formula so that we can parse out the chromosome from the 'locus' column in the tracking files.
  • Define two new configurations for the raw data importer that parses cufflinks isoform files.
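
As an illustration of what the importer configuration has to do, here is a small Python sketch (not the JEP formula or the BASE importer itself). It assumes a tab-delimited tracking file whose header uses the Cufflinks column names `tracking_id`, `locus`, `coverage`, `FPKM`, `FPKM_conf_lo`, `FPKM_conf_hi` and `FPKM_status`, and a locus of the form `chromosome:start-end`. How `left()` is combined with the position of the colon in the real configuration is not stated in this ticket, so that part is a guess, and the sample row is made up.

```python
import csv
import io


def left(text, index):
    """Return the first `index` characters of `text`."""
    return text[:index]


def parse_fpkm_tracking(handle):
    """Yield one dict per data row of a tab-delimited tracking file."""
    for row in csv.DictReader(handle, delimiter="\t"):
        locus = row["locus"]
        yield {
            "id": row["tracking_id"],
            # Everything before the first ':' is taken as the chromosome.
            "chromosome": left(locus, locus.index(":")),
            "coverage": float(row["coverage"]),
            "fpkm": float(row["FPKM"]),
            "fpkm_lo": float(row["FPKM_conf_lo"]),
            "fpkm_hi": float(row["FPKM_conf_hi"]),
            "status": row["FPKM_status"],
        }


# Made-up sample with a reduced set of columns.
SAMPLE = (
    "tracking_id\tlocus\tcoverage\tFPKM\tFPKM_conf_lo\tFPKM_conf_hi\tFPKM_status\n"
    "T1\tchr2:100-900\t12.5\t3.2\t1.1\t5.3\tOK\n"
)

rows = list(parse_fpkm_tracking(io.StringIO(SAMPLE)))
```

The sketch does no error handling: a row with a non-numeric value or a locus without a colon raises `ValueError`.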

comment:38 by Nicklas Nordborg, 13 years ago

(In [5807]) References #1153: Handling short read transcript sequence data

Lots of changes to item overview generation and validation. Most important are:

  • Subtype validation has been improved and hopefully it now works when validating both from the top (biosource) and bottom (experiment)
  • Parent extract validation of raw bioassay, derived bioassay and physical bioassay. Ensure that everything is properly linked (eg. matching extracts in all levels).
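
The "matching extracts in all levels" check could look something like the sketch below. The exact rules used by the item overview are not spelled out in this ticket, so the three checks, the function name and the messages are assumptions made for illustration.

```python
def validate_extract_links(physical_extracts, derived_extract, raw_extract):
    """Return a list of problems in one physical -> derived -> raw chain.

    `physical_extracts` is the set of extract names on the physical
    bioassay. The other two arguments are the extract names linked from
    the derived bioassay and the raw bioassay, or None if there is no link.
    """
    problems = []
    if derived_extract is not None and derived_extract not in physical_extracts:
        problems.append("derived bioassay extract is not on the physical bioassay")
    if raw_extract is not None and raw_extract not in physical_extracts:
        problems.append("raw bioassay extract is not on the physical bioassay")
    if None not in (derived_extract, raw_extract) and derived_extract != raw_extract:
        problems.append("raw and derived bioassays link to different extracts")
    return problems


ok = validate_extract_links({"e1", "e2"}, "e1", "e1")
mismatch = validate_extract_links({"e1", "e2"}, "e1", "e2")
```

An empty list means the chain is consistent. A missing link at the derived level is not treated as an error here, since a derived bioassay without an extract represents the whole physical bioassay.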

comment:39 by Nicklas Nordborg, 12 years ago

Resolution: fixed
Status: assigned → closed

comment:40 by Nicklas Nordborg, 12 years ago

(In [5832]) References #1153: Handling short read transcript sequence data

Batch importers incorrectly reported errors when adding extracts to other physical bioassay positions than '1'.
