3.2.4: 2013-12-06

net.sf.basedb.util.gtf
Class GtfInputStream

java.lang.Object
  extended by java.io.InputStream
      extended by net.sf.basedb.util.gtf.GtfInputStream
All Implemented Interfaces:
Closeable

public class GtfInputStream
extends InputStream

Input stream implementation that reads from a GTF file and converts it to a simple tab-separated file with a single line of column headers. This is useful because the regular FlatFileParser and other tools can then be used to parse the resulting stream.

The first line in the file is used as a template line. The first 8 columns are fixed. The 9th column contains attributes as key/value pairs, which are converted to additional columns in the output. The GTF specification requires that gene_id and transcript_id are present, which means that the output will contain at least 10 columns. Subsequent lines are parsed in the same way and their attributes are lined up with those of the first line. Note that any attributes that are not present in the first line are skipped.

The parser also has an option to skip lines with a transcript_id+seqname that is not unique. Normally, a GTF file contains multiple entries with the same IDs, but in most cases this is not of interest when importing data to BASE. This option also removes the feature, start, end, score, strand and frame columns from the output.
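The option to skip repeated transcript_id+seqname combinations can be pictured with a small standalone sketch. This is not the BASE source code: the class name TranscriptFilterSketch, its methods and the '@' separator used to build the key are all hypothetical and chosen only for illustration. The one thing taken from this page is the idea of tracking the combinations seen so far in a Set<String>.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of BASE.
class TranscriptFilterSketch {
    // Keys already seen; each key combines transcript_id and seqname.
    private final Set<String> seen = new HashSet<>();

    /**
     * Returns true the first time a transcript_id + seqname pair is offered
     * and false for every repeat, so the caller can skip the repeated line.
     * The '@' separator is an arbitrary choice made for this sketch.
     */
    boolean accept(String transcriptId, String seqname) {
        return seen.add(transcriptId + "@" + seqname);
    }

    /** Number of distinct pairs accepted so far. */
    int uniqueCount() {
        return seen.size();
    }
}
```

A caller would test accept(...) for each parsed line and drop the line when it returns false. Because the seqname is part of the key, the same transcript id on two different sequences counts as two unique entries.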

Since:
3.0
Author:
Nicklas
Last modified:
$Date: 2011-09-29 13:27:33 +0200 (Thu, 29 Sep 2011) $

Nested Class Summary
(package private) static class GtfInputStream.Attribute
           
 
Field Summary
private  Pattern ATTRIBUTE_PATTERN
           
private  GtfInputStream.Attribute[] attributes
           
private  byte[] buffer
           
private  Charset charset
           
private  int geneIdIndex
           
private  int index
           
private  int lineNum
           
private  InputStream master
           
private  BufferedReader reader
           
private  boolean skipRepeatedTranscriptIds
           
private  int transcriptIdIndex
           
private  Set<String> transcriptIds
           
 
Constructor Summary
GtfInputStream(InputStream master, String charset, boolean skipRepeatedTranscriptIds)
          Create a new input stream reading from the master.
 
Method Summary
private  StringBuffer appendLine(StringBuffer sb, String[] columns, GtfInputStream.Attribute[] attr)
          Append the first 8 columns to the buffer and then add all values from the attributes.
 int available()
           
 void close()
           
private  String[] getNextLine()
          Read the next line from the GTF file and split on tab character into 9 or 10 columns.
 int getNumLines()
          Get the number of lines parsed so far.
 int getNumUniqueTranscriptIds()
          Get the number of unique transcript ids found so far.
private  void init()
          Initialize the converter by reading the first line from the GTF file.
 boolean markSupported()
           
private  void parseAttributes(String template)
          Parse attributes from the given template string.
 int read()
           
 int read(byte[] b)
           
 int read(byte[] b, int off, int len)
           
private  byte[] readMore()
          Read more data from the GTF file.
 void reset()
           
 
Methods inherited from class java.io.InputStream
mark, skip
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

master

private final InputStream master

reader

private final BufferedReader reader

charset

private final Charset charset

ATTRIBUTE_PATTERN

private final Pattern ATTRIBUTE_PATTERN

buffer

private byte[] buffer

index

private int index

lineNum

private int lineNum

attributes

private GtfInputStream.Attribute[] attributes

geneIdIndex

private int geneIdIndex

transcriptIdIndex

private int transcriptIdIndex

skipRepeatedTranscriptIds

private final boolean skipRepeatedTranscriptIds

transcriptIds

private final Set<String> transcriptIds

Constructor Detail

GtfInputStream

public GtfInputStream(InputStream master,
                      String charset,
                      boolean skipRepeatedTranscriptIds)
               throws IOException
Create a new input stream reading from the master.

Parameters:
master - The master input stream
charset - The character set used in the file
skipRepeatedTranscriptIds - TRUE to skip lines with non-unique values for transcript_id+seqname
Throws:
IOException

Method Detail

read

public int read()
         throws IOException
Specified by:
read in class InputStream
Throws:
IOException

read

public int read(byte[] b)
         throws IOException
Overrides:
read in class InputStream
Throws:
IOException

read

public int read(byte[] b,
                int off,
                int len)
         throws IOException
Overrides:
read in class InputStream
Throws:
IOException

available

public int available()
              throws IOException
Overrides:
available in class InputStream
Throws:
IOException

close

public void close()
           throws IOException
Specified by:
close in interface Closeable
Overrides:
close in class InputStream
Throws:
IOException

markSupported

public boolean markSupported()
Overrides:
markSupported in class InputStream

reset

public void reset()
           throws IOException
Overrides:
reset in class InputStream
Throws:
IOException

getNumLines

public int getNumLines()
Get the number of lines parsed so far.


getNumUniqueTranscriptIds

public int getNumUniqueTranscriptIds()
Get the number of unique transcript ids found so far.


init

private void init()
           throws IOException
Initialize the converter by reading the first line from the GTF file. The attributes are extracted and a header row is created with the 8 required columns plus one new column for each attribute.

Throws:
IOException

readMore

private byte[] readMore()
                 throws IOException
Read more data from the GTF file. Typically one additional line is read and stored in the buffer. Do not call this method unless it is certain that the existing buffer has been completely read by the reader of this input stream.

Throws:
IOException
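The buffering contract described for readMore can be illustrated with a simplified stream that refills its buffer one line at a time. This is a sketch under assumptions, not the BASE implementation: the class LineBufferedStreamSketch is hypothetical, it passes lines through unchanged instead of converting them, and it always terminates a line with '\n'.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical stream, not part of BASE; lines are passed through unchanged.
class LineBufferedStreamSketch extends InputStream {
    private final BufferedReader reader;
    private final Charset charset;
    private byte[] buffer = new byte[0];
    private int index = 0;

    LineBufferedStreamSketch(BufferedReader reader, Charset charset) {
        this.reader = reader;
        this.charset = charset;
    }

    // Replace the buffer with the next line; returns null at end of input.
    // Only safe to call once every byte of the old buffer has been handed
    // out, because the old buffer contents are discarded.
    private byte[] readMore() throws IOException {
        String line = reader.readLine();
        if (line == null) return null;
        buffer = (line + "\n").getBytes(charset);
        index = 0;
        return buffer;
    }

    @Override
    public int read() throws IOException {
        // Refill only when the current buffer is exhausted.
        if (index >= buffer.length && readMore() == null) return -1;
        return buffer[index++] & 0xFF;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}
```

The check index >= buffer.length before the call to readMore is what enforces the precondition: a refill happens only after the previous line has been fully consumed.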

getNextLine

private String[] getNextLine()
                      throws IOException
Read the next line from the GTF file and split it on the tab character into 9 or 10 columns. If the line doesn't contain at least 9 columns, an exception is thrown. If the 10th column exists, it must be a comment column that starts with a # character.

Throws:
IOException
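The splitting and validation rules above could look roughly like the following. This is a minimal sketch, not the actual method body: the class GtfLineSplitSketch, the lineNum parameter and the error messages are hypothetical, and using a split limit of 10 is an assumption about how extra tab characters are handled.

```java
import java.io.IOException;

// Hypothetical helper, not part of BASE.
class GtfLineSplitSketch {
    /**
     * Split one GTF line on tab characters into 9 or 10 columns.
     * The limit of 10 keeps any further tabs inside the last column.
     */
    static String[] split(String line, int lineNum) throws IOException {
        String[] columns = line.split("\t", 10);
        if (columns.length < 9) {
            throw new IOException("Line " + lineNum
                + ": expected at least 9 columns, found " + columns.length);
        }
        // An optional 10th column is only allowed when it is a comment.
        if (columns.length == 10 && !columns[9].startsWith("#")) {
            throw new IOException("Line " + lineNum
                + ": column 10 must be a comment starting with '#'");
        }
        return columns;
    }
}
```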

parseAttributes

private void parseAttributes(String template)
                      throws IOException
Parse attributes from the given template string. The first time this method is called all attributes are accepted and their order is remembered. Subsequent calls accept only values for the remembered attributes.

Throws:
IOException
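The template behaviour, where the first call fixes the attribute names and later calls only fill in values, can be sketched as follows. This is an illustration and not the BASE implementation: the class AttributeTemplateSketch is hypothetical, the regular expression is an assumed form of the key "value"; syntax rather than the real ATTRIBUTE_PATTERN, and using an empty string for an attribute missing on a later line is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of BASE.
class AttributeTemplateSketch {
    // Assumed syntax: key "value"; with optional quotes around the value.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("(\\w+)\\s+\"?([^\";]*)\"?\\s*;");

    // Attribute names in the order they appeared on the template line.
    private final List<String> names = new ArrayList<>();
    // Values for the most recently parsed line, aligned with names.
    private final List<String> values = new ArrayList<>();
    private boolean templateParsed = false;

    /** First call remembers names and order; later calls only fill values. */
    void parse(String column9) {
        if (templateParsed) {
            // Reset the values but keep the remembered names and order.
            for (int i = 0; i < values.size(); i++) values.set(i, "");
        }
        Matcher m = ATTRIBUTE.matcher(column9);
        while (m.find()) {
            String name = m.group(1);
            String value = m.group(2).trim();
            if (!templateParsed) {
                names.add(name);
                values.add(value);
            } else {
                int i = names.indexOf(name);
                if (i >= 0) values.set(i, value); // unknown names are skipped
            }
        }
        templateParsed = true;
    }

    List<String> names() { return names; }
    List<String> values() { return values; }
}
```

Because values are stored by the remembered position, attributes that appear in a different order on later lines still end up in the column defined by the first line.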

appendLine

private StringBuffer appendLine(StringBuffer sb,
                                String[] columns,
                                GtfInputStream.Attribute[] attr)
Append the first 8 columns to the buffer and then add all values from the attributes.

Parameters:
sb - The buffer to append to
columns - The regular columns (must be at least 8)
attr - The attributes to add
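Assembling one output row from the fixed columns and the attribute values might look like the sketch below. It is not the actual method: the class AppendLineSketch is hypothetical, it uses StringBuilder and a plain list of strings in place of StringBuffer and the package-private GtfInputStream.Attribute type, and ending the row with '\n' is an assumption.

```java
import java.util.List;

// Hypothetical helper, not part of BASE.
class AppendLineSketch {
    /**
     * Append columns[0..7] and then every attribute value, separated by
     * tabs, and finish the row with a newline. Extra columns are ignored.
     */
    static StringBuilder appendLine(StringBuilder sb, String[] columns,
                                    List<String> attributeValues) {
        if (columns.length < 8) {
            throw new IllegalArgumentException("need at least 8 columns");
        }
        for (int i = 0; i < 8; i++) {
            if (i > 0) sb.append('\t');
            sb.append(columns[i]);
        }
        for (String value : attributeValues) {
            sb.append('\t').append(value);
        }
        return sb.append('\n');
    }
}
```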
