Opened 17 years ago

Closed 17 years ago

Last modified 17 years ago

#486 closed task (fixed)

Import raw data from the Illumina platform

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: minor Milestone: BASE 2.4
Component: coreplugins Version:
Keywords: Cc:

Description (last modified by Nicklas Nordborg)

A discussion about importing raw data from the Illumina platform was started on the mailing list. See:

http://www.mail-archive.com/basedb-users@lists.sourceforge.net/msg00500.html

Since the file format does not appear to be compatible with the existing import plugin, a new plugin or some other solution is needed. One working solution (by Jeremy Davis-Turak) is to use an R script to split the data file into one file per hybridization and then use the regular import plugin.

Attachments (8)

Illumina conversion.txt (1.8 KB ) - added by base 17 years ago.
R script to convert a multi-array Illumina .csv output file into multiple single-array files.
extended-properties.xml (5.2 KB ) - added by base 17 years ago.
Extended properties file edited by Jeremy Davis-Turak to include reporter columns used by Illumina
gene_profile_Rat_sample.csv (54.3 KB ) - added by base 17 years ago.
gene_profile_Rat_sample.2.csv (54.3 KB ) - added by base 17 years ago.
Sample rat data.
RatRef_12_v1_11222119_A.sample.csv (37.6 KB ) - added by base 17 years ago.
Sample rat annotation file.
gene_profile_human_sample.csv (13.9 KB ) - added by base 17 years ago.
Sample human annotation file. Note the different number of columns per array (2) compared to the rat file, and the different number of arrays per chip (6, where rat has 12).
1401771138_A.sample.txt (2.4 KB ) - added by base 17 years ago.
Example of a file resulting from the R script. Note that the column names are general.
illumina-raw-data.xml (1.8 KB ) - added by Nicklas Nordborg 17 years ago.
This replaces the older raw-data-types.xml, which incorrectly set one column to string type. This file only contains the Illumina definition. Copy and paste it into your existing raw-data-types.xml file. If you have already used the old file, the database must be changed manually.


Change History (18)

comment:1 by Jari Häkkinen, 17 years ago

Milestone: BASE 2.x+

by base, 17 years ago

Attachment: Illumina conversion.txt added

R script to convert a multi-array Illumina .csv output file into multiple single-array files.

by base, 17 years ago

Attachment: extended-properties.xml added

Extended properties file edited by Jeremy Davis-Turak to include reporter columns used by Illumina

comment:2 by Jari Häkkinen, 17 years ago

Milestone: BASE 2.x+ → BASE 2.4

comment:3 by Nicklas Nordborg, 17 years ago

Description: modified (diff)

comment:4 by Nicklas Nordborg, 17 years ago

Description: modified (diff)

by base, 17 years ago

Attachment: gene_profile_Rat_sample.csv added

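by base, 17 years ago

Attachment: gene_profile_Rat_sample.2.csv added

Sample rat data.

by base, 17 years ago

Attachment: RatRef_12_v1_11222119_A.sample.csv added

Sample rat annotation file.

by base, 17 years ago

Attachment: gene_profile_human_sample.csv added

Sample human annotation file. Note the different number of columns per array (2) compared to the rat file, and the different number of arrays per chip (6, where rat has 12).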

by base, 17 years ago

Attachment: 1401771138_A.sample.txt added

Example of a file resulting from the R script. Note that the column names are general.

comment:5 by Nicklas Nordborg, 17 years ago

Description: modified (diff)

comment:6 by Nicklas Nordborg, 17 years ago

From Jeremy Davis-Turak:

Here is a brief summary of what the data looks like:

1) Annotation data: CSV file. It's too bad that it's a CSV, because some of the fields contain commas!

2) Data: (header is on ~ line 8)

a) For each set of chips that are processed at the same time, there is one resulting file. Thus, if you did two rat chips (each of which has 12 arrays on them), you would have 24 arrays contained in one file.

b) Depending on the settings of the software at the time of scanning, you can have somewhere from 1-8 data columns per array (I don't know the exact range, but I know that it's variable).

c) The first column contains the probe IDs, the rest of them are data.

d) Each data column name is a concatenation of 3 things:

  1. The data type (e.g. 'AVG_Signal' or 'BEAD_STDEV')
  2. The chip number (10 digits)
  3. A capital letter indicating the position of the array on the chip (A-F for human, A-H for mouse, or A-L for rat)

EXAMPLE: the first 8 columns in my rat file are:

AVG_Signal-1677718123_A
BEAD_STDEV-1677718123_A
Avg_NBEADS-1677718123_A	
Detection-1677718123_A	
AVG_Signal-1677718123_B	
BEAD_STDEV-1677718123_B	
Avg_NBEADS-1677718123_B	
Detection-1677718123_B

.... and a number of columns later, they transition smoothly to the next chip:

Avg_NBEADS-1677718123_L	
Detection-1677718123_L	
AVG_Signal-1677718142_A	
BEAD_STDEV-1677718142_A	

In my R script, you have to hard-code the number of data columns per array, and the number of arrays per chip.
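
The naming scheme described above suggests that both numbers could be derived from the column headers instead of being hard-coded. The following is a minimal Python sketch of that idea, not the attached R script. It assumes every data column header has the form `<data type>-<chip number>_<position>` exactly as in the example; the names `parse_column` and `group_columns` are hypothetical.

```python
import re
from collections import OrderedDict

# Assumed header layout: <data type>-<10-digit chip number>_<array letter>
COLUMN_RE = re.compile(r"^(?P<dtype>.+)-(?P<chip>\d{10})_(?P<pos>[A-Z])$")


def parse_column(header):
    """Split one data column header into (data type, chip number, position)."""
    match = COLUMN_RE.match(header.strip())
    if match is None:
        raise ValueError("unexpected column header: %r" % header)
    return match.group("dtype"), match.group("chip"), match.group("pos")


def group_columns(headers):
    """Map each array name (chip_position) to a list of (column index, data type).

    headers[0] is taken to be the probe ID column and is skipped.
    """
    arrays = OrderedDict()
    for index, header in enumerate(headers[1:], start=1):
        dtype, chip, pos = parse_column(header)
        arrays.setdefault(chip + "_" + pos, []).append((index, dtype))
    return arrays
```

With this grouping, the number of data columns per array is the length of each list, and the number of arrays per chip is the number of keys that share a chip number.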

comment:7 by Nicklas Nordborg, 17 years ago

Status: new → assigned

comment:8 by Nicklas Nordborg, 17 years ago

An example file from the Tab2Mage project (http://tab2mage.sourceforge.net/examples/magetab/illumina) seems to include all 8 column types:

MIN_Signal
AVG_Signal
MAX_Signal
NARRAYS
ARRAY_STDEV
BEAD_STDEV
Avg_NBEADS
Detection

Here is the plan for implementation:

  1. Scan the data file until the line that starts with TargetID is found. This line contains the column headers.
  2. Extract the column names and array names by splitting each column header on '-'. The last part is used as the name of the raw bioassay and the first part is mapped to an Illumina raw data property.
  3. Optionally, validate that all arrays have the same number of columns.
  4. The user should be able to select one scan, one protocol and one software that all raw bioassays are associated with.
  5. There is no coordinate information, so it will not be possible to associate the raw bioassays with an array design unless we fake coordinates, for example: block=1, column=1, row=line number in file.
  6. It should be possible to start the plug-in from the single-item view of an experiment. If so, all raw bioassays will be included in the experiment.
  7. The plug-in should also support the same options for number format, character set, error handling, etc. as the regular raw data importer.
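
The header scan, the header split, the column-count check and the fake coordinates from the plan can be illustrated with a short Python sketch. This is only an illustration of the plan, not the plug-in's actual implementation. The function names, the comma delimiter and the choice to split on the last '-' are assumptions.

```python
import csv


def find_header(lines, marker="TargetID"):
    """Return (line number, column headers) for the first line starting with marker."""
    for line_no, line in enumerate(lines, start=1):
        if line.startswith(marker):
            # Assumption: the header line is comma-separated.
            return line_no, next(csv.reader([line.rstrip("\r\n")]))
    raise ValueError("no line starting with %r found" % marker)


def split_headers(headers):
    """Map array name -> list of property names by splitting each header on '-'."""
    arrays = {}
    for header in headers[1:]:  # the first column holds the probe IDs
        prop, sep, array = header.strip().rpartition("-")
        if not sep:
            raise ValueError("column header without '-': %r" % header)
        arrays.setdefault(array, []).append(prop)
    return arrays


def validate(arrays):
    """Raise ValueError unless all arrays have the same number of columns."""
    counts = {len(props) for props in arrays.values()}
    if len(counts) > 1:
        raise ValueError("arrays differ in column count: %s" % sorted(counts))


def fake_coordinates(line_no):
    """Fake feature coordinates as suggested: block=1, column=1, row=line number."""
    return {"block": 1, "column": 1, "row": line_no}
```

Splitting on the last '-' keeps a property name intact should it ever contain a '-' itself; none of the eight column types listed above do, so splitting on the first '-' would work equally well for them.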

comment:9 by Nicklas Nordborg, 17 years ago

Resolution: fixed
Status: assigned → closed

(In [3626]) Fixes #486: Import raw data from the Illumina platform

by Nicklas Nordborg, 17 years ago

Attachment: illumina-raw-data.xml added

This replaces the older raw-data-types.xml, which incorrectly set one column to string type. This file only contains the Illumina definition. Copy and paste it into your existing raw-data-types.xml file. If you have already used the old file, the database must be changed manually.

comment:10 by Nicklas Nordborg, 17 years ago

(In [3636]) References #486: Added Jeremy Davis-Turak to credits.txt for helping out with example data files.
