Class GtfInputStream

  • All Implemented Interfaces:
    Closeable, AutoCloseable

    public class GtfInputStream
    extends InputStream
    Input stream implementation that reads from a GTF file and converts it to a simple tab-separated file with a single line of column headers. This is useful because the regular FlatFileParser and other tools can then parse the resulting stream.

    The first line in the file is used as a template line. The first 8 columns are fixed. The 9th column contains attributes as key/value pairs, which are converted to additional columns in the output. The GTF specification requires that gene_id and transcript_id are present, which means that the output will contain at least 10 columns. Subsequent lines are parsed in the same way, and their attributes are lined up with those of the first line. Note that any attributes that are not present in the first line are skipped.

    The parser also has an option to skip lines with a transcript_id+seqname combination that is not unique. Normally, a GTF file contains multiple entries with the same IDs, but in most cases this is not of interest when importing data to BASE. This option also removes the feature, start, end, score, strand and frame columns from the output.

    Lines that can't be split into at least 9 columns (e.g. comment lines starting with #) are not parsed; they are forwarded without modification.
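
    The conversion described above can be sketched in a few lines of standalone Java. This is only an illustration, not the BASE implementation: the class name GtfToTabSketch, the header names for the 8 fixed columns, the attribute regular expression, and the use of an empty string for a missing attribute are all assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of BASE.
final class GtfToTabSketch {
    // Assumed attribute syntax: key "value"; (the real ATTRIBUTE_PATTERN is not documented here).
    private static final Pattern ATTR = Pattern.compile("(\\S+)\\s+\"([^\"]*)\"\\s*;?");

    // Parse the 9th column into key/value pairs, keeping the order of appearance.
    static Map<String, String> attributes(String column9) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(column9);
        while (m.find()) result.put(m.group(1), m.group(2));
        return result;
    }

    static String convert(String gtf) {
        StringBuilder out = new StringBuilder();
        List<String> attrNames = null; // set by the first line that has 9 or more columns
        for (String line : gtf.split("\n")) {
            String[] cols = line.split("\t", -1);
            if (cols.length < 9) {
                // Comment lines and other short lines are forwarded unchanged.
                out.append(line).append('\n');
                continue;
            }
            Map<String, String> attrs = attributes(cols[8]);
            if (attrNames == null) {
                // Template line: remember the attribute order and emit the header.
                attrNames = new ArrayList<>(attrs.keySet());
                out.append("seqname\tsource\tfeature\tstart\tend\tscore\tstrand\tframe");
                for (String name : attrNames) out.append('\t').append(name);
                out.append('\n');
            }
            for (int i = 0; i < 8; i++) {
                if (i > 0) out.append('\t');
                out.append(cols[i]);
            }
            // Attributes missing from the template are skipped; missing values become "".
            for (String name : attrNames) out.append('\t').append(attrs.getOrDefault(name, ""));
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String gtf = "chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id \"g1\"; transcript_id \"t1\";\n";
        System.out.print(convert(gtf));
    }
}
```

    In this sketch the first line with at least 9 columns acts as the template, so a leading comment line is forwarded before the header row.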
    Since:
    3.0
    Author:
    Nicklas
    Last modified:
    $Date: 2015-05-12 11:27:08 +0200 (ti, 12 maj 2015) $
    • Field Detail

      • charset

        private final Charset charset
      • ATTRIBUTE_PATTERN

        private final Pattern ATTRIBUTE_PATTERN
      • buffer

        private byte[] buffer
      • index

        private int index
      • lineNum

        private int lineNum
      • geneIdIndex

        private int geneIdIndex
      • transcriptIdIndex

        private int transcriptIdIndex
      • skipRepeatedTranscriptIds

        private final boolean skipRepeatedTranscriptIds
      • transcriptIds

        private final Set<String> transcriptIds
    • Constructor Detail

      • GtfInputStream

        public GtfInputStream(InputStream master,
                              String charset,
                              boolean skipRepeatedTranscriptIds)
                       throws IOException
        Create a new input stream reading from the master.
        Parameters:
        master - The master input stream
        charset - The character set used in the file
        skipRepeatedTranscriptIds - TRUE to skip lines with non-unique values for transcript_id+seqname
        Throws:
        IOException
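
        The effect of skipRepeatedTranscriptIds can be illustrated with a set of already-seen keys. The following is a minimal sketch, assuming the key is simply transcript_id and seqname joined by a tab; the class UniqueTranscriptFilter and its methods are hypothetical and the way BASE builds its key is not documented here.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not part of BASE.
final class UniqueTranscriptFilter {
    private final Set<String> seen = new HashSet<>();

    // Returns true only the first time a transcript_id+seqname pair is offered.
    boolean accept(String transcriptId, String seqname) {
        // Assumption: a tab cannot occur inside either value, so it is a safe separator.
        return seen.add(transcriptId + "\t" + seqname);
    }

    // Number of distinct transcript_id+seqname pairs seen so far.
    int size() {
        return seen.size();
    }

    public static void main(String[] args) {
        UniqueTranscriptFilter filter = new UniqueTranscriptFilter();
        System.out.println(filter.accept("t1", "chr1")); // first occurrence
        System.out.println(filter.accept("t1", "chr1")); // repeated pair
        System.out.println(filter.size());
    }
}
```

        A line would be written to the output only when accept returns true, so the same transcript_id on a different seqname is still kept.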
    • Method Detail

      • getNumLines

        public int getNumLines()
        Get the number of lines parsed so far.
      • getNumUniqueTranscriptIds

        public int getNumUniqueTranscriptIds()
        Get the number of unique transcript ids found so far.
      • readMore

        private byte[] readMore()
                         throws IOException
        Read more data from the GTF file. Typically one additional line is read and stored in the buffer. Do not call this method unless it is certain that the existing buffer has been completely read by the reader of this input stream.
        Throws:
        IOException
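
        The refill-on-demand pattern behind readMore can be sketched as follows. This is an illustration under stated assumptions, not the BASE code: the class LineRefillStream, the helper drain, the fixed UTF-8 encoding, and the rule that every line is re-emitted with a trailing newline are all hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not part of BASE.
final class LineRefillStream extends InputStream {
    private final BufferedReader source;
    private byte[] buffer = new byte[0]; // bytes of the current line; null at end of input
    private int index = 0;               // next unread position in the buffer

    LineRefillStream(BufferedReader source) {
        this.source = source;
    }

    // Replaces the buffer with the next line. Unread bytes would be lost,
    // so this is only called once the buffer has been fully consumed.
    private byte[] readMore() throws IOException {
        String line = source.readLine();
        buffer = line == null ? null : (line + "\n").getBytes(StandardCharsets.UTF_8);
        index = 0;
        return buffer;
    }

    @Override
    public int read() throws IOException {
        if (buffer == null) return -1;
        if (index >= buffer.length && readMore() == null) return -1;
        return buffer[index++] & 0xFF;
    }

    // Convenience helper for trying the stream on an in-memory string.
    static String drain(String text) {
        try (InputStream in = new LineRefillStream(new BufferedReader(new StringReader(text)))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(drain("a\tb\nc\td"));
    }
}
```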
      • getNextLine

        private String[] getNextLine()
                              throws IOException
        Read the next line from the GTF file and split it on the tab character.
        Throws:
        IOException
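
        Reading and splitting a line might look like the sketch below. The class TabLineReader is hypothetical, and returning null at end of input and wrapping IOException in UncheckedIOException are assumptions made to keep the example short. The limit of -1 passed to String.split keeps trailing empty columns, which a plain split("\t") would drop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch; not part of BASE.
final class TabLineReader {
    private final BufferedReader reader;
    private int lineNum = 0; // number of lines read so far

    TabLineReader(String text) {
        this.reader = new BufferedReader(new StringReader(text));
    }

    // Returns the columns of the next line, or null at end of input (an assumption).
    String[] getNextLine() {
        try {
            String line = reader.readLine();
            if (line == null) return null;
            lineNum++;
            return line.split("\t", -1); // -1 keeps trailing empty columns
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int getNumLines() {
        return lineNum;
    }

    public static void main(String[] args) {
        TabLineReader r = new TabLineReader("chr1\tsrc\texon\n# comment");
        String[] cols;
        while ((cols = r.getNextLine()) != null) {
            System.out.println(cols.length + " column(s)");
        }
    }
}
```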
      • parseAttributes

        private void parseAttributes(String template)
                              throws IOException
        Parse attributes from the given template string. The first time this method is called all attributes are accepted and their order is remembered. Subsequent calls accept only values for the remembered attributes.
        Throws:
        IOException
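
        The "remember the order on the first call" behaviour can be sketched with a small stateful helper. The class AttributeTemplate, its regular expression, and the empty string used for an attribute missing from a later line are assumptions for illustration; the real ATTRIBUTE_PATTERN and GtfInputStream.Attribute type are not shown in this documentation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of BASE.
final class AttributeTemplate {
    // Assumed attribute syntax: key "value";
    private static final Pattern ATTR = Pattern.compile("(\\S+)\\s+\"([^\"]*)\"\\s*;?");

    private List<String> names; // null until the first call to parse

    // Returns the attribute values in the order fixed by the first call.
    String[] parse(String column9) {
        Map<String, String> found = new LinkedHashMap<>();
        Matcher m = ATTR.matcher(column9);
        while (m.find()) found.put(m.group(1), m.group(2));

        if (names == null) {
            // First call: accept every attribute and remember the order.
            names = new ArrayList<>(found.keySet());
        }
        String[] values = new String[names.size()];
        for (int i = 0; i < values.length; i++) {
            // Later calls: only remembered attributes are used; extras are ignored.
            values[i] = found.getOrDefault(names.get(i), "");
        }
        return values;
    }

    List<String> names() {
        return names;
    }

    public static void main(String[] args) {
        AttributeTemplate t = new AttributeTemplate();
        t.parse("gene_id \"g1\"; transcript_id \"t1\";");
        System.out.println(t.names());
        System.out.println(String.join(",", t.parse("transcript_id \"t2\"; extra \"x\";")));
    }
}
```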
      • appendLine

        private StringBuffer appendLine(StringBuffer sb,
                                        String[] columns,
                                        GtfInputStream.Attribute[] attr)
        Append columns to the buffer, separating them with tabs. If attributes are given, the first 8 columns (or the first 2 if skipRepeatedTranscriptIds=true) are appended, followed by each of the attributes. If no attributes are given, all columns are copied as they are.
        Parameters:
        sb - The buffer to append to
        columns - The regular columns (there must be at least 8)
        attr - The attributes to add
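
        The column selection described for appendLine can be sketched as below. This is a simplified stand-in, not the BASE method: it uses StringBuilder and plain String values instead of StringBuffer and GtfInputStream.Attribute, passes skipRepeatedTranscriptIds as an explicit parameter, and ends each line with a newline, all of which are assumptions.

```java
// Hypothetical sketch; not part of BASE.
final class AppendLineSketch {
    static StringBuilder appendLine(StringBuilder sb, String[] columns, String[] attr,
                                    boolean skipRepeatedTranscriptIds) {
        if (attr == null) {
            // No attributes: copy all columns unchanged.
            sb.append(String.join("\t", columns));
        } else {
            // Keep seqname and source only when repeated transcript ids are skipped,
            // otherwise keep all 8 fixed columns.
            int fixed = skipRepeatedTranscriptIds ? 2 : 8;
            for (int i = 0; i < fixed; i++) {
                if (i > 0) sb.append('\t');
                sb.append(columns[i]);
            }
            for (String value : attr) sb.append('\t').append(value);
        }
        return sb.append('\n'); // assumption: one output line per call
    }

    public static void main(String[] args) {
        String[] cols = {"chr1", "src", "exon", "1", "100", ".", "+", "."};
        String[] attr = {"g1", "t1"};
        System.out.print(appendLine(new StringBuilder(), cols, attr, false));
        System.out.print(appendLine(new StringBuilder(), cols, attr, true));
    }
}
```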