public class FlatFileParser
extends java.lang.Object
Example
# Example of a parsable file, not actual format of a GenePix file
section info
Type=GenePix Results 1.3
DateTime=2002/09/04 13:59:48
Scanner=GenePix 4000B [83306]
section data
Block Column Row Name ID
1 1 1 "Ly68_Lymphocyte antigen 68" "M000205_01"
1 2 1 "Bag1_Bcl2-associated athanogene 1" "M000209_01"
1 3 1 "Rps16_Ribosomal protein S16" "M000213_01"
1 4 1 "Col4a1_Procollagen, type IV, alpha 1" "M000229_01"
1 5 1 "Ace_Angiotensin converting enzyme" "M000233_01"
1 6 1 "Cd5_CD5 antigen" "M000237_01"
1 7 1 "Psme1_Protease (prosome, macropain) 28 s"
How to use
The parsing is controlled by regular expressions. Start by creating a new FlatFileParser object. Use the various set methods to provide the regular expressions used to match the data and headers. Use the setInputStream(InputStream, String) method to specify a file to parse, and parseHeaders() to start the parsing. Note! Even if you know that the file doesn't contain any headers, you should always call this method since the parser must initialize itself. If there are sections in the file, use nextSection() first to control which section you are parsing from.
When the headers have been found, use the hasMoreData() and nextData() methods in a loop to read all data from the section.
Example
FlatFileParser ffp = new FlatFileParser();
ffp.setHeaderRegexp(Pattern.compile("(.*)=(.*)"));
ffp.setDataHeaderRegexp(Pattern.compile("Block\\tColumn\\tRow\\tName\\tID"));
ffp.setDataSplitterRegexp(Pattern.compile("\\t"));
ffp.setIgnoreRegexp(Pattern.compile("#.*"));
ffp.setMinDataColumns(5);
ffp.setMaxDataColumns(5);
ffp.setInputStream(FileUtil.getInputStream(path_to_file), Config.getCharset());
ffp.parseHeaders();
// Print every line that parseHeaders() read
for (int i = 0; i < ffp.getLineCount(); i++)
{
   FlatFileParser.Line line = ffp.getLine(i);
   System.out.println(i+":"+line.type()+":"+line.line());
}
// Read the data lines that follow the data header
int i = 0;
while (ffp.hasMoreData())
{
   FlatFileParser.Data data = ffp.nextData();
   System.out.println(i+":"+data.columns()+":"+data.line());
   i++;
}
Mapping column values
With the FlatFileParser.Data object you can only access the data by column index (0-based) and all values are returned as strings. Another approach is to use Mappers. A mapper takes a string template and inserts the values of the data columns where you specify. Here are some examples:
\1\
\row\
Row: \row\, Col:\col\
=2 * col('Radius')
The result can be retrieved either as a string or as a numeric value. It is even possible to create expressions that do a calculation on the value before it is returned. See the getMapper(String) method for more information.
Modifier and Type | Class and Description |
---|---|
static class | FlatFileParser.Data: This class holds data about a line parsed by the hasMoreData() method. |
static class | FlatFileParser.Line: This class holds data about a line parsed by the parseHeaders() method. |
static class | FlatFileParser.LineType: Represents the type of a line matched or unmatched by the parser. |
Modifier and Type | Field and Description |
---|---|
private java.util.regex.Pattern | bofMarker: The regular expression for matching the beginning-of-file marker. |
private java.lang.String | bofType: The value that was captured by the bofMarker pattern. |
private java.util.List<java.lang.String> | columnHeaders: List of the column names found by splitting the data header using the data splitter regexp. |
private java.util.regex.Pattern | dataFooter: The regular expression for matching the data footer line. |
private java.util.regex.Pattern | dataHeader: The regular expression for matching the data header line. |
private java.util.regex.Pattern | dataSplitter: The regular expression for splitting a data line. |
static int | DEFAULT_MAX_UNKNOWN_LINES: The default value for the number of unknown lines in a row that may be encountered by the parseHeaders method before it gives up. |
private boolean | emptyIsNull: If null should be returned for empty columns (instead of an empty string). |
private static java.util.regex.Pattern | findColumn: Pattern used to find column mappings in a string, i.e. abc \col\ def |
private java.util.regex.Pattern | header: The regular expression for matching a header line. |
private java.util.Map<java.lang.String,java.lang.String> | headers: Map of header lines parsed by the parseHeaders() method. |
private java.util.regex.Pattern | ignore: The regular expression for matching a comment line. |
private int | ignoredLines: Number of ignored lines in the nextData() method. |
private boolean | ignoreNonExistingColumns: If non-existing columns should be ignored (true) or result in an exception (false). |
private boolean | keepSkippedLines: If unknown or ignored lines should be kept. |
private java.util.List<FlatFileParser.Line> | lines: List of lines parsed by the parseHeaders() method. |
private int | maxDataColumns: The maximum number of allowed data columns for a line to be considered a data line. |
private int | maxUnknownLines: The maximum number of unknown lines to parse before giving up. |
private int | minDataColumns: The minimum number of allowed data columns for a line to be considered a data line. |
private FlatFileParser.Data | nextData: The next available data line as parsed by the hasMoreData() method. |
private FlatFileParser.Line | nextSection: The line that last matched the section pattern. |
private boolean | nullIsNull: If null should be returned for the string NULL (ignoring case) or not. |
private java.text.NumberFormat | numberFormat: The default number formatter to use for creating mappers. |
private long | parsedCharacters: The total number of parsed characters so far. |
private int | parsedDataLines: The number of data lines parsed in the current section so far. |
private int | parsedLines: The total number of lines parsed so far. |
private java.io.BufferedReader | reader: Reads from the given input stream. |
private java.util.regex.Pattern | section: The regular expression for matching the first line of a section. |
private java.util.List<FlatFileParser.Line> | skippedLines: List for keeping ignored and unknown lines in the nextData() method. |
private InputStreamTracker | tracker: For keeping track of the number of bytes parsed. |
private boolean | trimQuotes: If quotes should be trimmed from data values or not. |
private int | unknownLines: Number of unknown lines in the nextData() method. |
private boolean | useNullIfException: If null should be returned if a (numeric) value can't be parsed. |
Constructor and Description |
---|
FlatFileParser(): Create a new FlatFileParser object. |
Modifier and Type | Method and Description |
---|---|
private java.lang.String | convertToNull(java.lang.String value) |
java.lang.Integer | findColumnHeaderIndex(java.lang.String regex): Find the index of a column header using a regular expression for pattern matching. |
java.lang.String | getBofType(): Get the value captured by the BOF marker regular expression. |
java.lang.Integer | getColumnHeaderIndex(java.lang.String name): Get the index of a column header with a given name. |
java.util.List<java.lang.String> | getColumnHeaders(): Get all column headers that were found by splitting the line matching the setDataHeaderRegexp(Pattern) pattern using the setDataSplitterRegexp(Pattern) pattern. |
java.text.NumberFormat | getDefaultNumberFormat(): Get the default number format. |
java.lang.String | getHeader(java.lang.String name): Get the value of the header with the specified name. |
java.util.Set<java.lang.String> | getHeaderNames(): Get the names of all headers found by the parseHeaders() method. |
int | getIgnoredLines(): Get the number of lines that the last call to nextData() or hasMoreData() ignored because they matched the ignore regular expression. |
FlatFileParser.Line | getLine(int index): Get the line with the specified number. |
int | getLineCount(): Get the number of lines that the parseHeaders() method parsed. |
java.util.List<FlatFileParser.Line> | getLines(): Get the lines read by parseHeaders(). |
Mapper | getMapper(java.lang.String expression): Get a mapper using the default number format. |
Mapper | getMapper(java.lang.String expression, boolean nullIfException): Get a mapper using the default number format. |
Mapper | getMapper(java.lang.String expression, JepFunction... functions) |
Mapper | getMapper(java.lang.String expression, java.text.NumberFormat numberFormat): Get a mapper using a specific number format. |
Mapper | getMapper(java.lang.String expression, java.text.NumberFormat numberFormat, boolean nullIfException) |
Mapper | getMapper(java.lang.String expression, java.text.NumberFormat numberFormat, boolean nullIfException, JepFunction... functions): Create a mapper object that maps an expression string to a value. |
int | getNumSkippedLines(): Get the number of lines that the last call to nextData() or hasMoreData() ignored because they matched the ignore regular expression or couldn't be interpreted as data lines. |
long | getParsedBytes(): Get the number of parsed bytes so far. |
long | getParsedCharacters(): Get the number of parsed characters so far. |
int | getParsedDataLines(): Get the number of parsed data lines so far in the current section. |
int | getParsedLines(): Get the number of parsed lines so far. |
java.util.List<FlatFileParser.Line> | getSkippedLines(): Get lines that were skipped during the last call to nextData() or hasMoreData(). |
int | getUnknownLines(): Get the number of lines that the last call to nextData() or hasMoreData() ignored because they couldn't be interpreted as data lines. |
boolean | hasMoreData(): Check if the input stream contains more data. |
boolean | hasMoreSections(): Check if the input stream contains more sections. |
FlatFileParser.Data | nextData(): Get the next available data. |
FlatFileParser.Line | nextSection(): Get the next line that matches the section regular expression. |
FlatFileParser.LineType | parseHeaders(): Start parsing the input stream. |
boolean | parseToBof(): Parse the file until the beginning-of-file marker is found. |
void | setBofMarkerRegexp(java.util.regex.Pattern regexp): Set a regular expression that matches a beginning-of-file marker. |
void | setDataFooterRegexp(java.util.regex.Pattern regexp): Set a regular expression that can be matched against a data footer. |
void | setDataHeaderRegexp(java.util.regex.Pattern regexp): Set a regular expression that can be matched against the data header. |
void | setDataSplitterRegexp(java.util.regex.Pattern regexp): Set a regular expression that is used to split a data line into columns. |
void | setDefaultNumberFormat(java.text.NumberFormat numberFormat): Set the default number format to use when creating mappers. |
void | setHeaderRegexp(java.util.regex.Pattern regexp): Set a regular expression that can be matched against a header. |
void | setIgnoreNonExistingColumns(boolean ignoreNonExistingColumns): Specify if trying to create a mapper with one of the getMapper(String) methods for an expression which references a non-existing column should result in an exception or be ignored. |
void | setIgnoreRegexp(java.util.regex.Pattern regexp): Set a regular expression that is used to match a line that should be ignored. |
void | setInputStream(java.io.InputStream in, java.lang.String charsetName): Set the input stream that will be parsed. |
void | setKeepSkippedLines(boolean keep): If the nextData() and hasMoreData() methods should keep information about lines that were skipped because they matched the ignore pattern or couldn't be interpreted as data lines. |
void | setMaxDataColumns(int columns): Set the maximum number of columns a data line can contain in order for it to be counted as a data line. |
void | setMaxUnknownLines(int lines): The number of unknown lines in a row that can be parsed by the parseHeaders method before it gives up. |
void | setMinDataColumns(int columns): Set the minimum number of columns a data line must contain in order for it to be counted as a data line. |
void | setSectionRegexp(java.util.regex.Pattern regexp): Set a regular expression that can be matched against the section line. |
void | setTrimQuotes(boolean trimQuotes): Set if quotes around each data value should be removed or not. |
void | setUseNullIfEmpty(boolean emptyIsNull): Specify if null values should be returned instead of empty strings for columns that don't contain any value. |
void | setUseNullIfException(boolean useNullIfException): Specify if null should be returned if a (numeric) value can't be parsed. |
void | setUseNullIfNull(boolean nullIsNull): Specify if null values should be returned for strings having the value "NULL" (ignoring case). |
java.lang.String[] | trimQuotes(java.lang.String[] columns): Remove enclosing quotes (" or ') around all columns. |
public static final int DEFAULT_MAX_UNKNOWN_LINES
The default value for the number of unknown lines in a row that may be encountered by the parseHeaders method before it gives up.
See Also: setMaxUnknownLines, Constant Field Values
private static final java.util.regex.Pattern findColumn
private java.io.BufferedReader reader
private InputStreamTracker tracker
private java.util.regex.Pattern bofMarker
private java.util.regex.Pattern header
private java.util.regex.Pattern section
private java.util.regex.Pattern dataHeader
private java.util.regex.Pattern dataSplitter
private boolean trimQuotes
private java.util.regex.Pattern dataFooter
private int minDataColumns
private int maxDataColumns
private java.util.regex.Pattern ignore
private int maxUnknownLines
private boolean emptyIsNull
If null should be returned for empty columns (instead of an empty string).
private boolean useNullIfException
If null should be returned if a (numeric) value can't be parsed.
private boolean ignoreNonExistingColumns
private boolean nullIsNull
If null should be returned for the string NULL (ignoring case) or not.
private java.text.NumberFormat numberFormat
private java.lang.String bofType
private java.util.List<FlatFileParser.Line> lines
List of lines parsed by the parseHeaders() method.
private int parsedLines
private long parsedCharacters
private int parsedDataLines
private java.util.Map<java.lang.String,java.lang.String> headers
Map of header lines parsed by the parseHeaders() method. The map contains name -> value pairs.
private java.util.List<java.lang.String> columnHeaders
private FlatFileParser.Line nextSection
The line that last matched the section pattern.
private FlatFileParser.Data nextData
The next available data line as parsed by the hasMoreData() method.
private int ignoredLines
Number of ignored lines in the nextData() method.
private int unknownLines
Number of unknown lines in the nextData() method.
private boolean keepSkippedLines
If unknown or ignored lines should be kept.
See Also: getSkippedLines()
private java.util.List<FlatFileParser.Line> skippedLines
List for keeping ignored and unknown lines in the nextData() method.
public void setBofMarkerRegexp(java.util.regex.Pattern regexp)
Set a regular expression that matches a beginning-of-file marker (see parseToBof(), which can also be invoked manually).
The regular expression may contain a single capturing group. The matched value is returned by getBofType().
regexp - A regular expression
public void setHeaderRegexp(java.util.regex.Pattern regexp)
Set a regular expression that can be matched against a header. For example, if the file contains headers like:
"Type=GenePix Results 1.3"
"DateTime=2002/09/04 13:59:48"
To match this we can use the following regular expression: "(.*)=(.*)".
regexp - A regular expression
public void setSectionRegexp(java.util.regex.Pattern regexp)
Set a regular expression that can be matched against the section line. For example, if the file contains a section line like:
section info
To match this we can use the following regular expression: section (.*). This will match anything that starts with "section ". The section name will be in the capturing group.
regexp - A regular expression
public void setDataHeaderRegexp(java.util.regex.Pattern regexp)
Set a regular expression that can be matched against the data header. For example, if the file contains a data header like:
"Block"{tab}"Column"{tab}"Row"{tab}"Name"{tab}"ID" ...and so on
To match this we can use the following regular expression: "(.*?)"(\t"(.*?)"). This will match anything that has at least two columns. We could also be more specific and use: "Block"\t"Column"\t"Row"\t"Name"\t"ID"...
regexp - A regular expression
public void setDataSplitterRegexp(java.util.regex.Pattern regexp)
Set a regular expression that is used to split a data line into columns, for example \t. This regular expression is also used to split the data header line into column names, which can then be used in the getMapper(String) method.
regexp - A regular expression
See Also: setMinDataColumns, setMaxDataColumns
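The splitting step can be imitated with plain java.util.regex. This is only a sketch of the idea, not FlatFileParser's own code: the class SplitterSketch and its methods are hypothetical, the choice to keep trailing empty columns is an assumption, and the special values 0 and -1 of setMaxDataColumns are not modelled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of FlatFileParser.
final class SplitterSketch {
    // Split a line with the splitter pattern. The limit -1 keeps trailing
    // empty columns (an assumption about the desired behaviour).
    static List<String> split(Pattern splitter, String line) {
        return Arrays.asList(splitter.split(line, -1));
    }

    // Simplified column-count check: the line counts as data only if the
    // number of columns lies within [min, max].
    static boolean isDataLine(List<String> columns, int min, int max) {
        return columns.size() >= min && columns.size() <= max;
    }

    public static void main(String[] args) {
        Pattern tab = Pattern.compile("\\t");
        // The same pattern splits the data header into column names...
        List<String> names = split(tab, "Block\tColumn\tRow\tName\tID");
        // ...and each data line into column values.
        List<String> data = split(tab, "1\t2\t1\t\"Bag1_Bcl2-associated athanogene 1\"\t\"M000209_01\"");
        System.out.println(names);
        System.out.println(data.get(3) + " data=" + isDataLine(data, 5, 5));
    }
}
```

Note that the quotes are still part of the values after splitting; removing them is a separate step.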
public void setTrimQuotes(boolean trimQuotes)
trimQuotes - TRUE to remove quotes, FALSE to keep them
public void setMinDataColumns(int columns)
columns - The minimum number of columns
public void setMaxDataColumns(int columns)
columns - The maximum number of columns, or 0 for an unlimited number, or -1 to disable counting the number of columns
public void setDataFooterRegexp(java.util.regex.Pattern regexp)
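Set a regular expression that can be matched against a data footer. If a matching line is found by the hasMoreData method it will exit and no more data will be returned.
regexp - A regular expression
public void setIgnoreRegexp(java.util.regex.Pattern regexp)
Set a regular expression that is used to match a line that should be ignored, for example \#.*
regexp - A regular expression
public void setMaxUnknownLines(int lines)
The number of unknown lines in a row that can be parsed by the parseHeaders method before it gives up. The default value is specified by DEFAULT_MAX_UNKNOWN_LINES. This value is ignored while parsing data.
lines - The number of lines
public void setUseNullIfEmpty(boolean emptyIsNull)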
Specify if null values should be returned instead of empty strings for columns that don't contain any value.
emptyIsNull - TRUE to return null, FALSE to return an empty string
public void setUseNullIfNull(boolean nullIsNull)
Specify if null values should be returned for strings having the value "NULL" (ignoring case).
nullIsNull - TRUE to return null, FALSE to return the original string value
public void setKeepSkippedLines(boolean keep)
If the nextData() and hasMoreData() methods should keep information about lines that were skipped because they matched the ignore pattern or couldn't be interpreted as data lines. The default is FALSE. The number of lines that were skipped is always available regardless of this setting.
keep - TRUE to keep line information, FALSE to not
See Also: getSkippedLines(), getIgnoredLines(), getUnknownLines(), getNumSkippedLines()
public void setInputStream(java.io.InputStream in, java.lang.String charsetName)
in - The InputStream
charsetName - The name of the character set to use when parsing the file, or null to use the default charset specified by Config.getCharset()
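How a stream and a charset name might be combined into a line reader can be sketched with the standard library. The class CharsetReaderSketch is hypothetical, and the fallback parameter merely stands in for whatever Config.getCharset() supplies; nothing here is taken from the actual implementation.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of FlatFileParser.
final class CharsetReaderSketch {
    // Wrap the stream in a reader that decodes with the named charset,
    // or with the fallback when no name is given.
    static BufferedReader open(InputStream in, String charsetName, Charset fallback) {
        Charset charset = (charsetName == null) ? fallback : Charset.forName(charsetName);
        return new BufferedReader(new InputStreamReader(in, charset));
    }

    // Convenience wrapper that decodes the first line of a byte array.
    static String firstLine(byte[] bytes, String charsetName, Charset fallback) {
        try (BufferedReader reader = open(new ByteArrayInputStream(bytes), charsetName, fallback)) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = "Scanner=GenePix 4000B [83306]\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstLine(bytes, "ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(firstLine(bytes, null, StandardCharsets.UTF_8));
    }
}
```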
public boolean parseToBof() throws java.io.IOException
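Parse the file until the beginning-of-file marker is found. If no marker has been set with setBofMarkerRegexp(Pattern) or if the parsing of the file has already started, this method call is ignored.
Throws: java.io.IOException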
public java.lang.String getBofType()
public FlatFileParser.LineType parseHeaders() throws java.io.IOException
Start parsing the input stream. Each line is checked in the following order:
1. Does it match the section regular expression?
2. Does it match the header regular expression?
3. Does it match the data header regular expression?
4. Does it match the comment regular expression?
5. Can it be split by the data regular expression into the appropriate number of columns?
A line that passes none of the checks is given the type FlatFileParser.LineType.UNKNOWN and processing is continued with the next line. If too many unknown lines in a row have been found the method also returns. This should be considered as a failure to parse the specified file.
The method returns the type of the last line that was parsed as follows:
FlatFileParser.LineType.SECTION: The last line was a section. Header, data header or data may follow this line.
FlatFileParser.LineType.DATA_HEADER: The last line was the data header. It is expected that data should follow.
FlatFileParser.LineType.DATA: The last line was a data line. More data may follow.
FlatFileParser.LineType.UNKNOWN: The last line was of unknown format. The file could not be parsed.
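The order of the checks can be illustrated with a small stand-alone classifier. This is a sketch under assumptions, not the parser's implementation: the class LineClassifierSketch and its Kind enum are hypothetical (HEADER and IGNORED are names invented here), whole-line matching with matches() is assumed, and the patterns are the ones from the class-level example.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of FlatFileParser.
final class LineClassifierSketch {
    enum Kind { SECTION, HEADER, DATA_HEADER, IGNORED, DATA, UNKNOWN }

    static final Pattern SECTION = Pattern.compile("section (.*)");
    static final Pattern HEADER = Pattern.compile("(.*)=(.*)");
    static final Pattern DATA_HEADER = Pattern.compile("Block\\tColumn\\tRow\\tName\\tID");
    static final Pattern IGNORE = Pattern.compile("#.*");
    static final Pattern SPLITTER = Pattern.compile("\\t");

    // Apply the checks in order; the first one that succeeds decides the type.
    static Kind classify(String line, int minColumns, int maxColumns) {
        if (SECTION.matcher(line).matches()) return Kind.SECTION;
        if (HEADER.matcher(line).matches()) return Kind.HEADER;
        if (DATA_HEADER.matcher(line).matches()) return Kind.DATA_HEADER;
        if (IGNORE.matcher(line).matches()) return Kind.IGNORED;
        int columns = SPLITTER.split(line, -1).length;
        if (columns >= minColumns && columns <= maxColumns) return Kind.DATA;
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        String[] lines = {
            "# Example of a parsable file",
            "section info",
            "Type=GenePix Results 1.3",
            "Block\tColumn\tRow\tName\tID",
            "1\t1\t1\t\"Ly68_Lymphocyte antigen 68\"\t\"M000205_01\""
        };
        for (String line : lines) {
            System.out.println(classify(line, 5, 5) + ": " + line);
        }
    }
}
```

Because the checks are ordered, a comment line that happens to contain an equals sign would be treated as a header in this sketch; the patterns need to be chosen with the order in mind.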
Returns: The FlatFileParser.LineType of the last parsed line
Throws: java.io.IOException - If reading the file fails.
private java.lang.String convertToNull(java.lang.String value)
public java.util.Set<java.lang.String> getHeaderNames()
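Get the names of all headers found by the parseHeaders() method. To get the value of a header, use the getHeader(String) method.
public java.lang.String getHeader(java.lang.String name)
Get the value of the header with the specified name. This method should only be called after parseHeaders() has been completed.
name - The name of the header
See Also: getLine(int)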
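The name -> value lookup can be pictured as a map filled from the two capturing groups of the header pattern. The sketch below only assumes that first group = name and second group = value, as the "(.*)=(.*)" example suggests; the class HeaderMapSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of FlatFileParser.
final class HeaderMapSketch {
    // Collect name -> value pairs from every line the header pattern matches.
    static Map<String, String> collect(Pattern header, Iterable<String> lines) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = header.matcher(line);
            if (m.matches() && m.groupCount() >= 2) {
                headers.put(m.group(1), m.group(2));
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        Pattern header = Pattern.compile("(.*)=(.*)");
        Map<String, String> headers = collect(header, Arrays.asList(
            "Type=GenePix Results 1.3",
            "DateTime=2002/09/04 13:59:48",
            "section data"));
        System.out.println(headers.keySet());
        System.out.println(headers.get("DateTime"));
    }
}
```

Since the first group is greedy, a value that itself contains an equals sign would shift the split point to the last "=" on the line; a pattern such as "([^=]*)=(.*)" avoids that.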
public int getLineCount()
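Get the number of lines that the parseHeaders() method parsed.
public FlatFileParser.Line getLine(int index)
Get the line with the specified number. This method should only be called after parseHeaders() has been completed.
index - The line number, starting at 0
Returns: A Line object
See Also: getHeader(String)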
public java.util.List<FlatFileParser.Line> getLines()
Get the lines read by parseHeaders().
public java.util.List<java.lang.String> getColumnHeaders()
Get all column headers that were found by splitting the line matching the setDataHeaderRegexp(Pattern) pattern using the setDataSplitterRegexp(Pattern) pattern. This method should only be called after parseHeaders() has been called.
public java.lang.Integer getColumnHeaderIndex(java.lang.String name)
Get the index of a column header with a given name. This method should only be called after parseHeaders() has been called. If more than one header with the same name exists the index of the first is returned.
name - The name of the column header
See Also: findColumnHeaderIndex(String)
public java.lang.Integer findColumnHeaderIndex(java.lang.String regex)
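Find the index of a column header using a regular expression for pattern matching. This method should only be called after parseHeaders() has been called. If more than one header matches the regular expression only the first one found is returned.
regex - The regular expression used to match the header names
See Also: getColumnHeaderIndex(String)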
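The two lookups differ only in how a header is compared. A minimal sketch, assuming that a missing column yields null and that the regular expression must match the whole header name (neither is confirmed by the text above); ColumnIndexSketch is a hypothetical class.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of FlatFileParser.
final class ColumnIndexSketch {
    // Exact-name lookup: index of the first header equal to the name.
    static Integer indexByName(List<String> headers, String name) {
        int index = headers.indexOf(name);
        return index < 0 ? null : Integer.valueOf(index);
    }

    // Regex lookup: index of the first header the pattern matches.
    static Integer indexByRegex(List<String> headers, String regex) {
        Pattern pattern = Pattern.compile(regex);
        for (int i = 0; i < headers.size(); i++) {
            if (pattern.matcher(headers.get(i)).matches()) {
                return Integer.valueOf(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList("Block", "Column", "Row", "Name", "ID", "Name");
        System.out.println(indexByName(headers, "Name"));   // first of the two "Name" columns
        System.out.println(indexByRegex(headers, "Col.*"));
        System.out.println(indexByName(headers, "Radius"));
    }
}
```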
public void setDefaultNumberFormat(java.text.NumberFormat numberFormat)
numberFormat - The number format to use, or null to parse numbers with Float.valueOf or Double.valueOf
See Also: getMapper(String), getMapper(String, NumberFormat)
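The difference between parsing with a NumberFormat and with Double.valueOf shows up with locale-specific decimal separators. The sketch below is an illustration only: NumberParseSketch is a hypothetical class, and turning the ParseException into a NumberFormatException is a choice made here, not documented behaviour.

```java
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;

// Hypothetical helper for illustration; not part of FlatFileParser.
final class NumberParseSketch {
    // Parse with the given format, or with Double.valueOf when format is null.
    static double parse(String value, NumberFormat format) {
        if (format == null) {
            return Double.valueOf(value).doubleValue();
        }
        try {
            return format.parse(value).doubleValue();
        } catch (ParseException e) {
            throw new NumberFormatException(e.getMessage());
        }
    }

    public static void main(String[] args) {
        // Without a format only the Java syntax (decimal point) is accepted.
        System.out.println(parse("1.5", null));
        // A German number format reads the comma as the decimal separator.
        System.out.println(parse("1,5", NumberFormat.getInstance(Locale.GERMANY)));
    }
}
```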
public java.text.NumberFormat getDefaultNumberFormat()
public void setUseNullIfException(boolean useNullIfException)
Specify if null should be returned if a (numeric) value can't be parsed. If this setting is set to TRUE all mappers created by one of the getMapper(String) methods are wrapped in a NullIfExceptionMapper. It is not possible to log or get information about the exception.
useNullIfException - TRUE to return null, FALSE to throw an exception
public void setIgnoreNonExistingColumns(boolean ignoreNonExistingColumns)
Specify if trying to create a mapper with one of the getMapper(String) methods for an expression which references a non-existing column should result in an exception or be ignored.
ignoreNonExistingColumns - TRUE to ignore, or FALSE to throw an exception
public Mapper getMapper(java.lang.String expression)
Get a mapper using the default number format.
See Also: getMapper(String, NumberFormat, boolean)
public Mapper getMapper(java.lang.String expression, JepFunction... functions)
public Mapper getMapper(java.lang.String expression, boolean nullIfException)
Get a mapper using the default number format.
See Also: getMapper(String, NumberFormat, boolean)
public Mapper getMapper(java.lang.String expression, java.text.NumberFormat numberFormat)
Get a mapper using a specific number format.
See Also: getMapper(String, NumberFormat, boolean)
public Mapper getMapper(java.lang.String expression, java.text.NumberFormat numberFormat, boolean nullIfException)
See Also: getMapper(String, NumberFormat, boolean, JepFunction...)
public Mapper getMapper(java.lang.String expression, java.text.NumberFormat numberFormat, boolean nullIfException, JepFunction... functions)
Create a mapper object that maps an expression string to a value. Examples:
\1\
\row\
Row: \row\, Col:\col\
It is also possible to use expressions that are evaluated dynamically.
=2 * col('Radius')
If no column matching the exact name is found, the placeholder is interpreted as a regular expression which is checked against each of the column headers. In all cases, the first column header found is used if there are multiple matches.
If the expression is null, a mapper returning an empty string is returned, unless setUseNullIfEmpty(boolean) has been activated. In that case the mapper returns null.
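The placeholder substitution can be sketched as follows. Everything here is an assumption made for illustration: the class TemplateMapperSketch is hypothetical, the placeholder pattern is a guess at what a pattern like findColumn might look like, and numeric placeholders such as \1\ and dynamic expressions starting with = are not handled. Only the name resolution rule (exact name first, then regular expression, first match wins) follows the description above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of FlatFileParser.
final class TemplateMapperSketch {
    // Assumed placeholder syntax: a column name between two backslashes.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\\\(.+?)\\\\");

    // Replace every \name\ placeholder with the value of the resolved column.
    static String map(String template, List<String> headers, String[] row) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int index = resolve(headers, m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(row[index]));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Exact name first; otherwise treat the name as a regular expression.
    static int resolve(List<String> headers, String name) {
        int exact = headers.indexOf(name);
        if (exact >= 0) return exact;
        Pattern pattern = Pattern.compile(name);
        for (int i = 0; i < headers.size(); i++) {
            if (pattern.matcher(headers.get(i)).matches()) return i;
        }
        throw new IllegalArgumentException("No column matches: " + name);
    }

    public static void main(String[] args) {
        List<String> headers = Arrays.asList("row", "col", "Radius");
        String[] values = {"3", "7", "1.5"};
        System.out.println(map("Row: \\row\\, Col:\\col\\", headers, values));
        System.out.println(map("\\Rad.*\\", headers, values));
    }
}
```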
expression - The string containing the mapping expression
numberFormat - The number format the mapper should use for parsing numbers, or null to use Float.valueOf or Double.valueOf
nullIfException - TRUE to return a null value instead of throwing an exception when a value can't be parsed.
functions - Optional array with Jep functions that should be included in the parser
public boolean hasMoreData() throws java.io.IOException
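Check if the input stream contains more data. Each line is checked in the following order:
1. Does it match the data footer regular expression?
2. Does it match the section regular expression?
3. Does it match the ignore regular expression?
4. Can it be split by the data regular expression into the appropriate number of columns?
When the second check is true, the section can be retrieved with the nextSection method. If the third check is true, the line is ignored and the processing continues with the next line. If the fourth check is true, TRUE is returned and the data may be retrieved with the nextData method.
Throws: java.io.IOException - If there is an error reading from the input stream
See Also: nextData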
public java.lang.String[] trimQuotes(java.lang.String[] columns)
columns - The columns
public int getParsedLines()
public int getParsedDataLines()
public long getParsedCharacters()
See Also: getParsedBytes()
public long getParsedBytes()
See Also: getParsedCharacters()
public FlatFileParser.Data nextData() throws java.io.IOException
Returns: A Data object, or null if there is no more data
Throws: java.io.IOException - If there is an error reading from the input stream.
See Also: hasMoreData
public int getIgnoredLines()
Get the number of lines that the last call to nextData() or hasMoreData() ignored because they matched the ignore regular expression.
See Also: setIgnoreRegexp(Pattern), setKeepSkippedLines(boolean)
public int getUnknownLines()
Get the number of lines that the last call to nextData() or hasMoreData() ignored because they couldn't be interpreted as data lines.
See Also: setKeepSkippedLines(boolean)
public int getNumSkippedLines()
Get the number of lines that the last call to nextData() or hasMoreData() ignored because they matched the ignore regular expression or couldn't be interpreted as data lines.
See Also: getIgnoredLines(), getUnknownLines(), getSkippedLines()
public java.util.List<FlatFileParser.Line> getSkippedLines()
Get lines that were skipped during the last call to nextData() or hasMoreData(). The list is only available if setKeepSkippedLines(boolean) has been set to true (default is false).
See Also: setKeepSkippedLines(boolean)
public boolean hasMoreSections() throws java.io.IOException
Check if the input stream contains more sections, that is, lines matching the section regular expression. The parser will continue until a section line is found or end of file is reached. If the method returns TRUE the section may be retrieved with the nextSection() method. If the section regular expression isn't specified the method returns FALSE and won't parse any line.
Throws: java.io.IOException - If there is an error reading from the input stream
See Also: nextData()
public FlatFileParser.Line nextSection() throws java.io.IOException
Get the next line that matches the section regular expression.
Throws: java.io.IOException
See Also: hasMoreSections()