Class GtfInputStream

java.lang.Object
java.io.InputStream
net.sf.basedb.util.gtf.GtfInputStream
All Implemented Interfaces:
Closeable, AutoCloseable

public class GtfInputStream extends InputStream
Input stream implementation that reads from a GTF file and converts it to a simple tab-separated file with a single line of column headers. This is useful because the regular FlatFileParser and other tools can then be used to parse the resulting stream.

The first line in the file is used as a template line. The first 8 columns are fixed. The 9th column contains attributes as key/value pairs, which are converted to additional columns in the output. The GTF specification requires that gene_id and transcript_id are present, which means that the output will contain at least 10 columns. Subsequent lines are parsed in the same way, and their attributes are lined up with those of the first line. Note that any attributes that are not present in the first line are skipped.

The parser also has an option to skip lines with a transcript_id+seqname combination that is not unique. Normally, a GTF file contains multiple entries with the same IDs, but in most cases this is not of interest when importing data to BASE. This option also removes the feature, start, end, score, strand and frame columns from the output.

Lines that can't be split into at least 9 columns (e.g. comment lines starting with #) are not parsed and are forwarded without modification.
Since:
3.0
Author:
Nicklas
Last modified:
$Date: 2015-05-12 11:27:08 +0200 (ti, 12 maj 2015) $
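The conversion described in the class summary can be illustrated with a minimal, self-contained sketch. It is not the BASE implementation: the class name `GtfToTabSketch`, the header names for the eight fixed columns, the use of an empty string for attributes missing from a later line, and the placement of the header directly before the first data row are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and header labels are assumptions.
final class GtfToTabSketch {
    // Assumed header names for the eight fixed GTF columns.
    static final String[] FIXED =
        {"seqname", "source", "feature", "start", "end", "score", "strand", "frame"};

    // Split the 9th column into key/value pairs, keeping their order.
    // Assumes the form: key "value"; key "value"; (no ';' inside values).
    static Map<String, String> attributes(String col9) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String part : col9.split(";")) {
            String p = part.trim();
            int space = p.indexOf(' ');
            if (p.isEmpty() || space < 0) continue; // nothing usable
            String key = p.substring(0, space);
            String value = p.substring(space + 1).trim().replace("\"", "");
            result.put(key, value);
        }
        return result;
    }

    static List<String> convert(List<String> gtfLines) {
        List<String> out = new ArrayList<>();
        List<String> attrNames = null; // fixed by the first parsable line
        for (String line : gtfLines) {
            String[] cols = line.split("\t", -1);
            if (cols.length < 9) {
                out.add(line); // e.g. comment lines: forwarded unchanged
                continue;
            }
            Map<String, String> attrs = attributes(cols[8]);
            if (attrNames == null) {
                // The first line acts as the template for attribute columns.
                attrNames = new ArrayList<>(attrs.keySet());
                List<String> header = new ArrayList<>(Arrays.asList(FIXED));
                header.addAll(attrNames);
                out.add(String.join("\t", header));
            }
            List<String> row = new ArrayList<>(Arrays.asList(cols).subList(0, 8));
            for (String name : attrNames) {
                // Attributes not in the template are dropped; missing ones
                // become empty cells (an assumption of this sketch).
                row.add(attrs.getOrDefault(name, ""));
            }
            out.add(String.join("\t", row));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "# a comment line",
            "chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id \"G1\"; transcript_id \"T1\";");
        convert(input).forEach(System.out::println);
    }
}
```

In this sketch the attribute order of the first line decides the column order for every later line, which mirrors the "template line" behaviour described above.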
  • Field Details

    • master

      private final InputStream master
    • reader

      private final BufferedReader reader
    • charset

      private final Charset charset
    • ATTRIBUTE_PATTERN

      private final Pattern ATTRIBUTE_PATTERN
    • buffer

      private byte[] buffer
    • index

      private int index
    • lineNum

      private int lineNum
    • attributes

      private GtfInputStream.Attribute[] attributes
    • geneIdIndex

      private int geneIdIndex
    • transcriptIdIndex

      private int transcriptIdIndex
    • skipRepeatedTranscriptIds

      private final boolean skipRepeatedTranscriptIds
    • transcriptIds

      private final Set<String> transcriptIds
  • Constructor Details

    • GtfInputStream

      public GtfInputStream(InputStream master, String charset, boolean skipRepeatedTranscriptIds) throws IOException
      Create a new input stream reading from the master.
      Parameters:
      master - The master input stream
      charset - The character set used in the file
      skipRepeatedTranscriptIds - TRUE to skip lines with non-unique values for transcript_id+seqname
      Throws:
      IOException
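The effect of skipRepeatedTranscriptIds can be pictured as a set of already-seen transcript_id+seqname keys. The following is a hedged sketch of that idea, not the actual code: the class `TranscriptFilterSketch`, its method names, and the tab used as key separator are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; not the BASE implementation.
final class TranscriptFilterSketch {
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time a transcript_id+seqname pair is seen,
    // false for every repeat (such lines would be skipped).
    boolean accept(String transcriptId, String seqname) {
        // Assumption: a tab is a safe separator because the columns were
        // produced by splitting the line on tabs.
        return seen.add(transcriptId + "\t" + seqname);
    }

    // Number of distinct pairs accepted so far.
    int uniqueCount() {
        return seen.size();
    }

    public static void main(String[] args) {
        TranscriptFilterSketch filter = new TranscriptFilterSketch();
        System.out.println(filter.accept("T1", "chr1")); // first occurrence
        System.out.println(filter.accept("T1", "chr1")); // repeat
        System.out.println(filter.uniqueCount());
    }
}
```

`Set.add` returns false when the element is already present, so a single call both records the key and reports whether the line is a repeat.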
  • Method Details

    • read

      public int read() throws IOException
      Specified by:
      read in class InputStream
      Throws:
      IOException
    • read

      public int read(byte[] b) throws IOException
      Overrides:
      read in class InputStream
      Throws:
      IOException
    • read

      public int read(byte[] b, int off, int len) throws IOException
      Overrides:
      read in class InputStream
      Throws:
      IOException
    • available

      public int available() throws IOException
      Overrides:
      available in class InputStream
      Throws:
      IOException
    • close

      public void close() throws IOException
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Overrides:
      close in class InputStream
      Throws:
      IOException
    • markSupported

      public boolean markSupported()
      Overrides:
      markSupported in class InputStream
    • reset

      public void reset() throws IOException
      Overrides:
      reset in class InputStream
      Throws:
      IOException
    • getNumLines

      public int getNumLines()
      Get the number of lines parsed so far.
    • getNumUniqueTranscriptIds

      public int getNumUniqueTranscriptIds()
      Get the number of unique transcript ids found so far.
    • readMore

      private byte[] readMore() throws IOException
      Read more data from the GTF file. Typically one additional line is read and stored in the buffer. Do not call this method unless it is certain that the existing buffer has been completely read by the reader of this input stream.
      Throws:
      IOException
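The buffer/index/readMore arrangement is a common way to expose line-oriented text as an InputStream. Below is a minimal sketch of that pattern under stated assumptions: the class `LineBufferStreamSketch` and its `drain` helper are hypothetical, each line is passed through unchanged rather than converted, and lines are always terminated with "\n".

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not the BASE implementation.
final class LineBufferStreamSketch extends InputStream {
    private final BufferedReader reader;
    private final Charset charset;
    private byte[] buffer = new byte[0]; // bytes of the current line
    private int index = 0;               // next byte to hand out

    LineBufferStreamSketch(BufferedReader reader, Charset charset) {
        this.reader = reader;
        this.charset = charset;
    }

    @Override
    public int read() throws IOException {
        if (index >= buffer.length) {
            // Refill only once the old buffer is fully consumed.
            byte[] more = readMore();
            if (more == null) return -1; // end of input
            buffer = more;
            index = 0;
        }
        return buffer[index++] & 0xFF;
    }

    // Reads one more line and encodes it; returns null at end of input.
    private byte[] readMore() throws IOException {
        String line = reader.readLine();
        if (line == null) return null;
        return (line + "\n").getBytes(charset);
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    // Convenience helper for the example: read everything into a String.
    static String drain(String text) {
        try (InputStream in = new LineBufferStreamSketch(
                new BufferedReader(new StringReader(text)), StandardCharsets.UTF_8)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int b;
            while ((b = in.read()) != -1) out.write(b);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(drain("a\tb\nc"));
    }
}
```

Replacing the buffer while unread bytes remain would silently drop data, which is why the refill sits behind the `index >= buffer.length` check.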
    • getNextLine

      private String[] getNextLine() throws IOException
      Read the next line from the GTF file and split it on the tab character.
      Throws:
      IOException
    • parseAttributes

      private void parseAttributes(String template) throws IOException
      Parse attributes from the given template string. The first time this method is called all attributes are accepted and their order is remembered. Subsequent calls accept only values for the remembered attributes.
      Throws:
      IOException
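The "remember on first call, filter on later calls" behaviour can be sketched as follows. This is an illustration under assumptions: the regular expression is a guess at the attribute syntax (the real ATTRIBUTE_PATTERN is not shown on this page), values missing from a later line are returned as empty strings, and the class `AttributeTemplateSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; not the BASE implementation.
final class AttributeTemplateSketch {
    // Assumed syntax: key "value"; the real pattern may differ.
    private static final Pattern ATTR =
        Pattern.compile("(\\S+)\\s+\"([^\"]*)\"\\s*;");

    private final List<String> names = new ArrayList<>();
    private boolean templateSeen = false;

    // Returns the values in the order fixed by the first call.
    String[] parse(String col9) {
        Matcher m = ATTR.matcher(col9);
        if (!templateSeen) {
            // First call: accept every attribute and remember its position.
            while (m.find()) {
                if (!names.contains(m.group(1))) names.add(m.group(1));
            }
            templateSeen = true;
            m.reset();
        }
        String[] values = new String[names.size()];
        Arrays.fill(values, ""); // assumption: missing attributes are empty
        while (m.find()) {
            int i = names.indexOf(m.group(1));
            if (i >= 0) values[i] = m.group(2); // unknown attributes are skipped
        }
        return values;
    }

    List<String> names() {
        return names;
    }

    public static void main(String[] args) {
        AttributeTemplateSketch p = new AttributeTemplateSketch();
        System.out.println(Arrays.toString(p.parse("gene_id \"G1\"; transcript_id \"T1\";")));
        System.out.println(Arrays.toString(p.parse("transcript_id \"T2\"; extra \"x\"; gene_id \"G2\";")));
    }
}
```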
    • appendLine

      private StringBuffer appendLine(StringBuffer sb, String[] columns, GtfInputStream.Attribute[] attr)
      Append columns to the buffer, separating them with tabs. If attributes are given, the first 8 columns (or the first 2 if skipRepeatedTranscriptIds=true) are appended, followed by each of the attributes. If no attributes are given, all columns are copied as they are.
      Parameters:
      sb - The buffer to append to
      columns - The regular columns (must be at least 8)
      attr - The attributes to add
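The column selection described for appendLine can be sketched like this. The sketch simplifies the real signature: it uses StringBuilder and plain String values instead of StringBuffer and GtfInputStream.Attribute, passes the skip option as a parameter, and terminates each line with "\n". All of these are assumptions, as is the class name `AppendLineSketch`.

```java
// Illustrative sketch only; not the BASE implementation.
final class AppendLineSketch {
    static StringBuilder appendLine(StringBuilder sb, String[] columns,
                                    String[] attr, boolean skipRepeated) {
        if (attr == null) {
            // No attributes: copy all columns as they are.
            sb.append(String.join("\t", columns));
        } else {
            // Keep 8 fixed columns, or only the first 2 when repeated
            // transcript ids are skipped.
            int fixed = skipRepeated ? 2 : 8;
            for (int i = 0; i < fixed; i++) {
                if (i > 0) sb.append('\t');
                sb.append(columns[i]);
            }
            for (String value : attr) {
                sb.append('\t').append(value);
            }
        }
        return sb.append('\n'); // assumption: one output line per call
    }

    public static void main(String[] args) {
        String[] cols = {"chr1", "src", "exon", "1", "100", ".", "+", ".", "raw"};
        String[] attr = {"G1", "T1"};
        System.out.print(appendLine(new StringBuilder(), cols, attr, false));
        System.out.print(appendLine(new StringBuilder(), cols, attr, true));
    }
}
```

With the skip option enabled, a line with two attributes yields four output columns: the first two fixed columns plus the two attribute values.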