#486 closed task (fixed)
Import raw data from the Illumina platform
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | minor | Milestone: | BASE 2.4 |
Component: | coreplugins | Version: | |
Keywords: | Cc: |
Description
A discussion about importing raw data from the Illumina platform was started on the mailing list. See:
http://www.mail-archive.com/basedb-users@lists.sourceforge.net/msg00500.html
Since the file format does not appear to be compatible with the existing import plugin, a new plugin or some other solution is needed. One working solution (by Jeremy Davis-Turak) is to use an R script to split the data file into one file per hybridization and then use the regular import plugin.
Attachments (8)
Change History (18)
comment:1 by , 18 years ago
Milestone: | → BASE 2.x+ |
---|
by , 18 years ago
Attachment: | Illumina conversion.txt added |
---|
by , 18 years ago
Attachment: | extended-properties.xml added |
---|
Extended properties file edited by Jeremy Davis-Turak to include reporter columns used by Illumina
comment:2 by , 17 years ago
Milestone: | BASE 2.x+ → BASE 2.4 |
---|
comment:3 by , 17 years ago
Description: | modified (diff) |
---|
comment:4 by , 17 years ago
Description: | modified (diff) |
---|
by , 17 years ago
Attachment: | gene_profile_Rat_sample.csv added |
---|
by , 17 years ago
Attachment: | gene_profile_human_sample.csv added |
---|
Sample human annotation file. Note the different number of columns per array (2) compared to the rat file, and the different number of arrays per chip (6, where rat has 12).
by , 17 years ago
Attachment: | 1401771138_A.sample.txt added |
---|
Example of a file resulting from the R script. Note that the column names are general.
comment:5 by , 17 years ago
Description: | modified (diff) |
---|
comment:6 by , 17 years ago
From Jeremy Davis-Turak:
Here is a brief summary of what the data looks like:
1) Annotation data: CSV file. It's too bad that it's a CSV, because some of the fields contain commas!
2) Data: (the header is on about line 8)
a) For each set of chips that are processed at the same time, there is one resulting file. Thus, if you did two rat chips (each of which has 12 arrays), you would have 24 arrays contained in one file.
b) Depending on the software settings at the time of scanning, you can have somewhere from 1 to 8 data columns per array (I don't know the exact range, but I know that it's variable).
c) The first column contains the probe IDs; the rest are data.
d) Each data column name is a concatenation of 3 things:
- The data type (e.g. 'AVG_Signal' or 'BEAD_STDEV')
- The chip number (10 digits)
- A capital letter indicating the position of the array on the chip (i.e. A-F for human, A-H for mouse, or A-L for rat)
EXAMPLE: the first 8 columns in my rat file are:
AVG_Signal-1677718123_A BEAD_STDEV-1677718123_A Avg_NBEADS-1677718123_A Detection-1677718123_A AVG_Signal-1677718123_B BEAD_STDEV-1677718123_B Avg_NBEADS-1677718123_B Detection-1677718123_B
.... and a number of columns later, they transition smoothly to the next chip:
Avg_NBEADS-1677718123_L Detection-1677718123_L AVG_Signal-1677718142_A BEAD_STDEV-1677718142_A
In my R script, you have to hard-code the number of data columns per array, and the number of arrays per chip.
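The naming scheme described above suggests that the layout could be inferred from the header instead of being hard-coded. Below is a minimal Python sketch of that idea, not the R script itself. It assumes every data column name follows the `<data type>-<10-digit chip number>_<capital letter>` pattern from the example; the names `parse_column` and `infer_layout` are made up for illustration.

```python
import re
from collections import OrderedDict

# Assumed pattern: <data type>-<10-digit chip number>_<capital letter>
COLUMN_RE = re.compile(r"^(?P<dtype>.+)-(?P<chip>\d{10})_(?P<pos>[A-Z])$")


def parse_column(name):
    """Split e.g. 'AVG_Signal-1677718123_A' into (data type, chip, position)."""
    m = COLUMN_RE.match(name)
    if m is None:
        raise ValueError("Unexpected column name: %r" % name)
    return m.group("dtype"), m.group("chip"), m.group("pos")


def infer_layout(columns):
    """Return {chip: {position: [data types]}}, preserving file order."""
    layout = OrderedDict()
    for name in columns:
        dtype, chip, pos = parse_column(name)
        layout.setdefault(chip, OrderedDict()).setdefault(pos, []).append(dtype)
    return layout


# The first 8 data columns quoted in the comment above.
header = ("AVG_Signal-1677718123_A BEAD_STDEV-1677718123_A "
          "Avg_NBEADS-1677718123_A Detection-1677718123_A "
          "AVG_Signal-1677718123_B BEAD_STDEV-1677718123_B "
          "Avg_NBEADS-1677718123_B Detection-1677718123_B").split()

layout = infer_layout(header)
for chip, arrays in layout.items():
    per_array = sorted({len(types) for types in arrays.values()})
    print(chip, "arrays:", len(arrays), "columns per array:", per_array)
```

With the full header of a real file, the same loop would report the number of arrays per chip and the number of data columns per array, so neither would need to be entered by hand.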
comment:7 by , 17 years ago
Status: | new → assigned |
---|
comment:8 by , 17 years ago
An example file from the Tab2Mage project (http://tab2mage.sourceforge.net/examples/magetab/illumina) seems to include all 8 column types:
MIN_Signal AVG_Signal MAX_Signal NARRAYS ARRAY_STDEV BEAD_STDEV Avg_NBEADS Detection
Here is the plan for implementation:
- Scan the data file until the line that starts with TargetID is found. This contains the column headers.
- Extract the column names and array names by splitting each column header on '-'. The last part is used as names on the raw bioassays and the first part is mapped to Illumina raw data properties.
- Optionally, validate that all arrays have the same number of columns.
- The user should be able to select one scan, one protocol and one software that all raw bioassays are associated with.
- There is no coordinate information, so it will not be possible to associate the raw bioassays with an array design unless we fake coordinates, for example: block=1, column=1, row=line number in file.
- It should be possible to start the plug-in from the single-item view of an experiment. If so, all raw bioassays will be included in the experiment.
- The plug-in should also support the same options for number format, character set, error handling, etc. as the regular raw data importer.
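The parsing steps of the plan above (find the header line, split the headers, validate, fake coordinates) can be sketched roughly as follows. This is a Python illustration only, not the plug-in's actual code. It assumes a tab-separated file, keeps all values as strings (number format handling is left out), and uses invented sample data; the function name `read_illumina` is hypothetical.

```python
import io


def read_illumina(stream, sep="\t"):
    """Return {array name: [row, ...]} where each row is a dict with the
    probe ID, faked coordinates and the values keyed by data type."""
    numbered = enumerate(stream, start=1)

    # Scan until the line that starts with TargetID: the column headers.
    headers = None
    for line_no, line in numbered:
        if line.startswith("TargetID"):
            headers = line.rstrip("\r\n").split(sep)
            break
    if headers is None:
        raise ValueError("No line starting with 'TargetID' found")

    # Split each header on '-': the first part is the data type,
    # the last part is the array (raw bioassay) name.
    columns = []    # (column index, data type, array name)
    per_array = {}  # array name -> list of data types
    for idx, header in enumerate(headers[1:], start=1):
        dtype, dash, array = header.rpartition("-")
        if not dash:
            raise ValueError("Header without '-': %r" % header)
        columns.append((idx, dtype, array))
        per_array.setdefault(array, []).append(dtype)

    # Optional validation: all arrays have the same number of columns.
    if len({len(types) for types in per_array.values()}) > 1:
        raise ValueError("Arrays have different numbers of data columns")

    # Read the data, faking coordinates: block=1, column=1, row=line number.
    result = {array: [] for array in per_array}
    for line_no, line in numbered:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\r\n").split(sep)
        rows = {}
        for idx, dtype, array in columns:
            row = rows.setdefault(array, {"TargetID": fields[0], "block": 1,
                                          "column": 1, "row": line_no,
                                          "values": {}})
            row["values"][dtype] = fields[idx]
        for array, row in rows.items():
            result[array].append(row)
    return result


# Invented sample: two preamble lines, a header and two probes.
sample = io.StringIO(
    "Made-up preamble line\n"
    "\n"
    "TargetID\tAVG_Signal-1677718123_A\tDetection-1677718123_A"
    "\tAVG_Signal-1677718123_B\tDetection-1677718123_B\n"
    "probe_1\t101.5\t0.99\t98.2\t0.97\n"
    "probe_2\t12.0\t0.40\t15.3\t0.45\n"
)
assays = read_illumina(sample)
for name, rows in assays.items():
    print(name, len(rows), "rows; first:", rows[0])
```

Here the faked row coordinate is the physical line number in the file, which is one possible reading of "line number in file"; counting from the first data line would work equally well as long as it is applied consistently.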
comment:9 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
by , 17 years ago
Attachment: | illumina-raw-data.xml added |
---|
This replaces the older raw-data-types.xml, which incorrectly set one column to string type. This file contains only the Illumina definition; copy and paste it into your existing raw-data-types.xml file. If you have used the old file before, the database must be changed manually.
R script to convert a multi-array Illumina .csv output file into multiple single-array files.