Package net.sf.basedb.util.gtf
Class GtfInputStream
java.lang.Object
java.io.InputStream
net.sf.basedb.util.gtf.GtfInputStream
- All Implemented Interfaces:
Closeable, AutoCloseable
Input stream implementation that reads from a GTF file and converts it to
a simple tab-separated file with a single line of column headers. This is
useful because the regular FlatFileParser and other tools can then be used
to parse the resulting stream.

The first line in the file is used as a template line. The first 8 columns
are fixed. The 9th column contains attributes as key/value pairs, which are
converted to additional columns in the output. The GTF specification requires
that gene_id and transcript_id are present, which means that the output will
contain at least 10 columns. Subsequent lines are parsed in the same way and
their attributes are lined up with those of the first line. Note that any
attributes that are not present in the first line are skipped.

The parser also has an option to skip lines with a transcript_id+seqname
combination that is not unique. Normally, a GTF file contains multiple entries
with the same IDs, but in most cases this is not of interest when importing
data to BASE. This option also removes the feature, start, end, score, strand
and frame columns from the output.

Lines that can't be split into at least 9 columns (e.g. comment lines starting
with #) are not parsed and are forwarded without modification.
- Since:
- 3.0
- Author:
- Nicklas
- Last modified
- $Date: 2015-05-12 11:27:08 +0200 (ti, 12 maj 2015) $
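The way the template line determines the output header, as described in the class description above, can be illustrated with a minimal standalone sketch. It is not the BASE implementation: the class name `GtfHeaderSketch`, the regular expression and the use of the standard GTF column names as header names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GtfHeaderSketch
{
    // Standard GTF names of the 8 fixed columns (assumed to serve as headers)
    static final String[] FIXED =
        {"seqname", "source", "feature", "start", "end", "score", "strand", "frame"};

    // Hypothetical pattern for one attribute of the form: key "value";
    static final Pattern ATTRIBUTE = Pattern.compile("(\\S+)\\s+\"([^\"]*)\"\\s*;?");

    // Builds a tab-separated header line from the first data line of a GTF file
    static String headerFor(String templateLine)
    {
        String[] cols = templateLine.split("\t", -1);
        if (cols.length < 9)
        {
            throw new IllegalArgumentException("Template line needs at least 9 columns");
        }
        List<String> header = new ArrayList<>(List.of(FIXED));
        // Each attribute key in the 9th column becomes one extra output column
        Matcher m = ATTRIBUTE.matcher(cols[8]);
        while (m.find()) header.add(m.group(1));
        return String.join("\t", header);
    }

    public static void main(String[] args)
    {
        String line = "chr1\tsrc\texon\t100\t200\t.\t+\t.\t"
            + "gene_id \"G1\"; transcript_id \"T1\";";
        System.out.println(headerFor(line));
    }
}
```

With only the two mandatory attributes present, the header has 8 + 2 = 10 columns, which matches the minimum stated in the description.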
-
Nested Class Summary
-
Field Summary
Modifier and Type                     Field
private final Pattern                 ATTRIBUTE_PATTERN
private GtfInputStream.Attribute[]    attributes
private byte[]                        buffer
private final Charset                 charset
private int                           geneIdIndex
private int                           index
private int                           lineNum
private final InputStream             master
private final BufferedReader          reader
private final boolean                 skipRepeatedTranscriptIds
private int                           transcriptIdIndex
-
Constructor Summary
Constructor and Description
GtfInputStream(InputStream master, String charset, boolean skipRepeatedTranscriptIds)
    Create a new input stream reading from the master.
-
Method Summary
Modifier and Type, Method and Description
private StringBuffer appendLine(StringBuffer sb, String[] columns, GtfInputStream.Attribute[] attr)
    Append columns to the buffer and separate each with a tab.
int available()
void close()
private String[] getNextLine
    Read the next line from the GTF file and split it on the tab character.
int getNumLines()
    Get the number of lines parsed so far.
int getNumUniqueTranscriptIds()
    Get the number of unique transcript ids found so far.
boolean markSupported()
private void parseAttributes(String template)
    Parse attributes from the given template string.
int read()
int read(byte[] b)
int read(byte[] b, int off, int len)
private byte[] readMore()
    Read more data from the GTF file.
void reset()
Methods inherited from class java.io.InputStream
mark, nullInputStream, readAllBytes, readNBytes, readNBytes, skip, skipNBytes, transferTo
-
Field Details
-
master
-
reader
-
charset
-
ATTRIBUTE_PATTERN
-
buffer
private byte[] buffer
-
index
private int index
-
lineNum
private int lineNum
-
attributes
-
geneIdIndex
private int geneIdIndex
-
transcriptIdIndex
private int transcriptIdIndex
-
skipRepeatedTranscriptIds
private final boolean skipRepeatedTranscriptIds
-
transcriptIds
-
-
Constructor Details
-
GtfInputStream
public GtfInputStream(InputStream master, String charset, boolean skipRepeatedTranscriptIds) throws IOException
Create a new input stream reading from the master.
- Parameters:
master - The master input stream
charset - The character set used in the file
skipRepeatedTranscriptIds - TRUE to skip lines with non-unique values for transcript_id+seqname
- Throws:
IOException
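Because the converted output is an ordinary tab-separated stream, it can be consumed with standard Java I/O. The sketch below reads rows from any InputStream; the helper name `readRows` and the sample data are hypothetical. To keep the example self-contained, a ByteArrayInputStream stands in for the converted stream. In BASE the stream would presumably be created with the constructor documented above, for example `new GtfInputStream(fileIn, "UTF-8", true)`.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ReadConvertedStreamSketch
{
    // Reads every line from a tab-separated stream and splits it into columns.
    // The stream is closed when reading is done (try-with-resources).
    static List<String[]> readRows(InputStream in) throws IOException
    {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
            new InputStreamReader(in, StandardCharsets.UTF_8)))
        {
            String line;
            while ((line = r.readLine()) != null) rows.add(line.split("\t", -1));
        }
        return rows;
    }

    public static void main(String[] args) throws IOException
    {
        // Stand-in data shaped like converted output; no GtfInputStream is used here
        byte[] data = "seqname\tsource\tgene_id\ttranscript_id\nchr1\tsrc\tG1\tT1\n"
            .getBytes(StandardCharsets.UTF_8);
        List<String[]> rows = readRows(new ByteArrayInputStream(data));
        System.out.println(rows.size() + " rows, " + rows.get(0).length + " columns");
    }
}
```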
-
-
Method Details
-
read
- Specified by:
read in class InputStream
- Throws:
IOException
-
read
- Overrides:
read in class InputStream
- Throws:
IOException
-
read
- Overrides:
read in class InputStream
- Throws:
IOException
-
available
- Overrides:
available in class InputStream
- Throws:
IOException
-
close
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Overrides:
close in class InputStream
- Throws:
IOException
-
markSupported
public boolean markSupported()
- Overrides:
markSupported in class InputStream
-
reset
- Overrides:
reset in class InputStream
- Throws:
IOException
-
getNumLines
public int getNumLines()
Get the number of lines parsed so far.
-
getNumUniqueTranscriptIds
public int getNumUniqueTranscriptIds()
Get the number of unique transcript ids found so far.
-
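How unique transcript ids might be tracked when skipRepeatedTranscriptIds is enabled can be sketched with a set keyed on the transcript_id+seqname combination. This is an assumption about the approach, not the documented implementation; the class name, method names and key separator are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class UniqueTranscriptSketch
{
    // Combinations of transcript_id and seqname seen so far
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a transcript_id+seqname combination is seen,
    // and false for every repeat (such lines would be skipped)
    boolean isFirstOccurrence(String transcriptId, String seqname)
    {
        // The "@" separator is an arbitrary choice for this sketch
        return seen.add(transcriptId + "@" + seqname);
    }

    // Number of distinct combinations registered so far
    int getNumUnique()
    {
        return seen.size();
    }
}
```

A line would be forwarded only when `isFirstOccurrence` returns true, so the same transcript_id on a different seqname still counts as a new entry.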
readMore
Read more data from the GTF file. Typically one additional line is read and stored in the buffer. Do not call this method unless it is certain that the existing buffer has been completely read by the reader of this input stream.
- Throws:
IOException
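The buffering contract described for readMore can be illustrated with a small standalone stream that serves bytes from a per-line buffer and refills it only once the buffer is exhausted. The field names mirror those listed in this class (buffer, index, reader), but the class `LineBufferStreamSketch` is a simplified assumption and performs no GTF conversion.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class LineBufferStreamSketch extends InputStream
{
    private final BufferedReader reader;
    private byte[] buffer = new byte[0]; // null once the end of input is reached
    private int index = 0;               // next unread position in the buffer

    LineBufferStreamSketch(Reader source)
    {
        this.reader = new BufferedReader(source);
    }

    @Override
    public int read() throws IOException
    {
        // Refill only when every byte of the current buffer has been handed out
        while (buffer != null && index >= buffer.length) buffer = readMore();
        if (buffer == null) return -1;
        return buffer[index++] & 0xFF;
    }

    // Reads one more line and returns it as the new buffer, or null at end of input.
    // Resets the index, so unread bytes in the old buffer would be lost.
    private byte[] readMore() throws IOException
    {
        String line = reader.readLine();
        index = 0;
        return line == null ? null : (line + "\n").getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public void close() throws IOException
    {
        reader.close();
    }
}
```

The sketch shows why the warning matters: calling readMore while unread bytes remain would replace the buffer and silently drop them.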
-
getNextLine
Read the next line from the GTF file and split it on the tab character.
- Throws:
IOException
-
parseAttributes
Parse attributes from the given template string. The first time this method is called all attributes are accepted and their order is remembered. Subsequent calls accept only values for the remembered attributes.
- Throws:
IOException
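The remember-then-filter behaviour described above might look roughly like the following sketch. The regular expression is a guess at the attribute syntax and may differ from the real ATTRIBUTE_PATTERN. Returning an empty string for an attribute that is missing on a later line is also an assumption, since the documentation does not say what happens in that case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeTemplateSketch
{
    // Hypothetical pattern for pairs of the form: key "value";
    private static final Pattern PAIR = Pattern.compile("(\\S+)\\s+\"([^\"]*)\"\\s*;?");

    // Attribute names in the order they appeared in the first parsed string
    private List<String> names = null;

    // Returns one value per remembered attribute, in the remembered order
    String[] parse(String attributeColumn)
    {
        Map<String, String> found = new HashMap<>();
        List<String> order = new ArrayList<>();
        Matcher m = PAIR.matcher(attributeColumn);
        while (m.find())
        {
            // Keep the first value if a key is repeated within one line
            if (found.putIfAbsent(m.group(1), m.group(2)) == null) order.add(m.group(1));
        }
        if (names == null) names = order; // first call: accept all attributes
        String[] values = new String[names.size()];
        for (int i = 0; i < values.length; i++)
        {
            // Keys that were not in the first line are never looked up, so they are dropped
            values[i] = found.getOrDefault(names.get(i), "");
        }
        return values;
    }
}
```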
-
appendLine
Append columns to the buffer and separate each with a tab. If attributes are given, the first 8 (or 2 if skipRepeatedTranscriptIds=true) columns are appended, then each of the attributes is appended. If no attributes are given, all columns are copied as they are.
- Parameters:
sb - The buffer to append to
columns - The regular columns (must be at least 8)
attr - The attributes to add
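The column selection described for appendLine can be sketched as follows. This is a simplified stand-in, not the actual method: it uses StringBuilder and plain String values instead of StringBuffer and GtfInputStream.Attribute, takes the skip flag as a parameter, and the trailing newline is an assumption.

```java
class AppendLineSketch
{
    // Appends tab-separated columns to sb. When attr is null, all columns are
    // copied as they are. Otherwise the first 8 columns (or the first 2 when
    // skipRepeated is true) are followed by the attribute values.
    static StringBuilder appendLine(StringBuilder sb, String[] columns,
        String[] attr, boolean skipRepeated)
    {
        int fixed = attr == null ? columns.length : (skipRepeated ? 2 : 8);
        for (int i = 0; i < fixed; i++)
        {
            if (i > 0) sb.append('\t');
            sb.append(columns[i]);
        }
        if (attr != null)
        {
            for (String value : attr) sb.append('\t').append(value);
        }
        return sb.append('\n');
    }

    public static void main(String[] args)
    {
        String[] cols = {"chr1", "src", "exon", "100", "200", ".", "+", ".", "raw"};
        String[] attr = {"G1", "T1"};
        System.out.print(appendLine(new StringBuilder(), cols, attr, true));
    }
}
```

The 2-column case keeps only the first two columns of the line, consistent with the class description's statement that feature, start, end, score, strand and frame are removed when repeated transcript ids are skipped.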
-