Opened 15 years ago

Closed 14 years ago

Last modified 14 years ago

#486 closed task (fixed)

Import raw data from the Illumina platform

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: minor
Milestone: BASE 2.4
Component: coreplugins
Version:
Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

A discussion about importing raw data from the Illumina platform was started on the mailing list. See:

http://www.mail-archive.com/basedb-users@lists.sourceforge.net/msg00500.html

Because the file format appears to be incompatible with the existing import plugin, a new plugin or some other solution is needed. One working solution (by Jeremy Davis-Turak) is to use an R script to split the data file into one file per hybridization and then use the regular import plugin on each file.

Attachments (8)

Illumina conversion.txt (1.8 KB) - added by base 15 years ago.
R script to convert a multi-array Illumina .csv output file into multiple single-array files.
extended-properties.xml (5.2 KB) - added by base 15 years ago.
Extended properties file edited by Jeremy Davis-Turak to include reporter columns used by Illumina
gene_profile_Rat_sample.csv (54.3 KB) - added by base 14 years ago.
gene_profile_Rat_sample.2.csv (54.3 KB) - added by base 14 years ago.
Sample rat data.
RatRef_12_v1_11222119_A.sample.csv (37.6 KB) - added by base 14 years ago.
Sample rat annotation file.
gene_profile_human_sample.csv (13.9 KB) - added by base 14 years ago.
Sample human annotation file. Note the different number of columns per array (2) compared to the rat file, and the different number of arrays per chip (6, where rat has 12).
1401771138_A.sample.txt (2.4 KB) - added by base 14 years ago.
Example of a file resulting from the R script. Note that the column names are general.
illumina-raw-data.xml (1.8 KB) - added by Nicklas Nordborg 14 years ago.
This replaces the older raw-data-types.xml, which incorrectly set one column to string type. This file contains only the Illumina definition. Copy and paste it into your existing raw-data-types.xml file. If you have already used the old file, the database must be changed manually.


Change History (18)

comment:1 Changed 15 years ago by Jari Häkkinen

Milestone: BASE 2.x+

Changed 15 years ago by base

Attachment: Illumina conversion.txt added

R script to convert a multi-array Illumina .csv output file into multiple single-array files.

Changed 15 years ago by base

Attachment: extended-properties.xml added

Extended properties file edited by Jeremy Davis-Turak to include reporter columns used by Illumina

comment:2 Changed 15 years ago by Jari Häkkinen

Milestone: BASE 2.x+ → BASE 2.4

comment:3 Changed 14 years ago by Nicklas Nordborg

Description: modified (diff)

comment:4 Changed 14 years ago by Nicklas Nordborg

Description: modified (diff)

Changed 14 years ago by base

Attachment: gene_profile_Rat_sample.csv added

Changed 14 years ago by base

Attachment: gene_profile_Rat_sample.2.csv added

Sample rat data.

Changed 14 years ago by base

Attachment: RatRef_12_v1_11222119_A.sample.csv added

Sample rat annotation file.

Changed 14 years ago by base

Attachment: gene_profile_human_sample.csv added

Sample human annotation file. Note the different number of columns per array (2) compared to the rat file, and the different number of arrays per chip (6, where rat has 12).

Changed 14 years ago by base

Attachment: 1401771138_A.sample.txt added

Example of a file resulting from the R script. Note that the column names are general.

comment:5 Changed 14 years ago by Nicklas Nordborg

Description: modified (diff)

comment:6 Changed 14 years ago by Nicklas Nordborg

From Jeremy Davis-Turak:

Here is a brief summary of what the data looks like:

1) Annotation data: CSV file. It's too bad that it's a CSV, because some of the fields contain commas!

2) Data: (header is on ~ line 8)

a) For each set of chips that are processed at the same time, there is one resulting file. Thus, if you did two rat chips (each of which has 12 arrays on them), you would have 24 arrays contained in one file.

b) Depending on the settings of the software at the time of scanning, you can have somewhere from 1-8 data columns per array (I don't know the exact range, but I know that it's variable).

c) The first column contains the probe IDs, the rest of them are data.

d) Each data column name is a concatenation of 3 things:

  1. The data type (e.g. 'AVG_Signal' or 'BEAD_STDEV')
  2. The chip number (10 digits)
  3. A capital letter indicating the position of the array on the chip (e.g. A-F for human, A-H for mouse, or A-L for rat)

EXAMPLE: the first 8 columns in my rat file are:

AVG_Signal-1677718123_A
BEAD_STDEV-1677718123_A
Avg_NBEADS-1677718123_A	
Detection-1677718123_A	
AVG_Signal-1677718123_B	
BEAD_STDEV-1677718123_B	
Avg_NBEADS-1677718123_B	
Detection-1677718123_B

.... and a number of columns later, they transition smoothly to the next chip:

Avg_NBEADS-1677718123_L	
Detection-1677718123_L	
AVG_Signal-1677718142_A	
BEAD_STDEV-1677718142_A	
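
As an illustration only (not part of the original R script), the naming scheme described above could be decoded in Python roughly as follows. The pattern assumes exactly the layout shown in the examples: a data type, a '-', a 10-digit chip number, a '_', and a single capital letter. The function name parse_column_name is hypothetical.

```python
import re

# Assumed layout, taken from the examples above:
#   <data type>-<10-digit chip number>_<capital letter for array position>
COLUMN_RE = re.compile(
    r"^(?P<data_type>[^-]+)-(?P<chip>\d{10})_(?P<position>[A-Z])$"
)


def parse_column_name(name):
    """Split a data column header into (data_type, chip, position).

    Hypothetical helper; surrounding whitespace (such as a stray tab)
    is stripped before matching.
    """
    match = COLUMN_RE.match(name.strip())
    if match is None:
        raise ValueError(f"Unrecognized column header: {name!r}")
    return (
        match.group("data_type"),
        match.group("chip"),
        match.group("position"),
    )
```

Note that the data type itself contains an underscore, so the header has to be split on '-' before the chip number and position are separated on '_'.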

In my R script, you have to hard-code the number of data columns per array, and the number of arrays per chip.
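
The attached R script is not reproduced here. The following Python sketch only illustrates the same idea of cutting the file into per-array pieces using a hard-coded number of data columns per array. The function name split_by_array and its parameters are hypothetical, and the sketch assumes the header and rows have already been read into lists with the probe ID in the first column.

```python
def split_by_array(header, rows, columns_per_array):
    """Yield (array_header, array_rows) for each array in the file.

    header: list of column names, probe ID column first.
    rows: list of data rows, each a list as long as header.
    columns_per_array: hard-coded number of data columns per array.
    """
    data_columns = len(header) - 1
    if data_columns % columns_per_array != 0:
        raise ValueError("Data columns do not divide evenly into arrays")
    for start in range(1, len(header), columns_per_array):
        end = start + columns_per_array
        # Keep the probe ID column in every per-array piece.
        array_header = [header[0]] + header[start:end]
        array_rows = [[row[0]] + row[start:end] for row in rows]
        yield array_header, array_rows
```

Unlike the R script, this sketch does not need the number of arrays per chip, because consecutive chips simply continue in the same column sequence.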

comment:7 Changed 14 years ago by Nicklas Nordborg

Status: new → assigned

comment:8 Changed 14 years ago by Nicklas Nordborg

An example file from the Tab2Mage project (http://tab2mage.sourceforge.net/examples/magetab/illumina) seems to include all 8 column types:

MIN_Signal
AVG_Signal
MAX_Signal
NARRAYS
ARRAY_STDEV
BEAD_STDEV
Avg_NBEADS
Detection

Here is the plan for implementation:

  1. Scan the data file until the line that starts with TargetID is found. This line contains the column headers.
  2. Extract the column names and array names by splitting each column header on '-'. The last part is used as the name of the raw bioassay and the first part is mapped to an Illumina raw data property.
  3. Optionally, validate that all arrays have the same number of columns.
  4. The user should be able to select one scan, one protocol and one software that all raw bioassays are associated with.
  5. There is no coordinate information, so it will not be possible to associate the raw bioassays with an array design unless we fake coordinates, for example: block=1, column=1, row=line number in file.
  6. It should be possible to start the plug-in from the single-item view of an experiment. If so, all raw bioassays will be included in the experiment.
  7. The plug-in should also support the same options for number format, character set, error handling, etc. as the regular raw data importer.
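
The parsing steps of the plan above (finding the header line, splitting the headers, validating column counts and faking coordinates) could be sketched in Python as follows. This is not the plug-in code committed for this ticket. All function names are hypothetical, and the column delimiter is left as a parameter because the ticket does not state it.

```python
def read_header(lines, delimiter):
    """Consume lines until one starts with 'TargetID'; return its columns."""
    for line in lines:
        if line.startswith("TargetID"):
            return line.rstrip("\r\n").split(delimiter)
    raise ValueError("No line starting with 'TargetID' found")


def group_columns(header):
    """Map each array name to a list of (column index, property name)."""
    arrays = {}
    for index, column in enumerate(header[1:], start=1):
        # Split on '-': first part is the property, last part the array.
        property_name, dash, array_name = column.strip().rpartition("-")
        if not dash:
            raise ValueError(f"Column header without '-': {column!r}")
        arrays.setdefault(array_name, []).append((index, property_name))
    return arrays


def validate_column_counts(arrays):
    """Raise if the arrays do not all have the same number of columns."""
    counts = {len(columns) for columns in arrays.values()}
    if len(counts) > 1:
        raise ValueError(f"Arrays differ in column count: {sorted(counts)}")


def fake_coordinates(line_number):
    """Fake a position as suggested in the plan: block=1, column=1."""
    return {"block": 1, "column": 1, "row": line_number}
```

If read_header is given an iterator (for example an open file), the iterator is left positioned at the first data row, so the same object can then be used to read the raw data lines.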

comment:9 Changed 14 years ago by Nicklas Nordborg

Resolution: fixed
Status: assigned → closed

(In [3626]) Fixes #486: Import raw data from the Illumina platform

Changed 14 years ago by Nicklas Nordborg

Attachment: illumina-raw-data.xml added

This replaces the older raw-data-types.xml, which incorrectly set one column to string type. This file contains only the Illumina definition. Copy and paste it into your existing raw-data-types.xml file. If you have already used the old file, the database must be changed manually.

comment:10 Changed 14 years ago by Nicklas Nordborg

(In [3636]) References #486: Added Jeremy Davis-Turak to credits.txt for helping out with example data files.
