net.sf.basedb.util.parser.FlatFileParser

public class FlatFileParser extends Object

This class can be used to parse data from flat text files and from Excel workbooks in xlsx format. If the file is a text file it must follow a few simple rules:

Data must be organised into columns, with one record per line
Each data column must be separated by some special character or character sequence not occurring in the data, for example a tab or a comma. Data in fixed-size columns cannot be parsed.
Data may optionally be preceded by a data header, i.e. the names of the columns
The data header may optionally be preceded by file headers. A file header is something that can be split into a name-value pair.
The file may contain comments, which are ignored by the parser
The file may contain sections, where each section can contain a header and/or a data part

Example

# Example of a parsable file, not actual format of a GenePix file
section info
Type=GenePix Results 1.3
DateTime=2002/09/04 13:59:48
Scanner=GenePix 4000B [83306]

section data
Block   Column  Row     Name    ID
1       1       1       "Ly68_Lymphocyte antigen 68"    "M000205_01"
1       2       1       "Bag1_Bcl2-associated athanogene 1"     "M000209_01"
1       3       1       "Rps16_Ribosomal protein S16"   "M000213_01"
1       4       1       "Col4a1_Procollagen, type IV, alpha 1"  "M000229_01"
1       5       1       "Ace_Angiotensin converting enzyme"     "M000233_01"
1       6       1       "Cd5_CD5 antigen"       "M000237_01"
1       7       1       "Psme1_Protease (prosome, macropain) 28 s"

If the file is an Excel file the first sheet is automatically converted to a tab-separated text file by default. To use a different sheet call the setExcelSheet(String) method with the name or index of the sheet. Initial parsing and regular expression matching is always done against the text representation of the selected sheet. Note that empty cells on the top and left are usually cut away. When retrieving values via the FlatFileParser.Data class, the parser will typically go directly to the mapped cell in the Excel sheet and get the value, which means that numeric and date values don't have to be converted to and from strings if not needed. For example, if a date value is requested and the mapped cell is a date, that value is used as it is, but if the mapped cell is a string, the same parsers that are used for CSV files are used to convert the string to a date.

How to use
The parsing is controlled by regular expressions. Start by creating a new FlatFileParser object. Use the various set methods to provide the regular expressions used to match the data and headers.

Use the setInputStream(InputStream, String) method to specify a file to parse, and parseHeaders() to start the parsing. Note! Even if you know that the file doesn't contain any headers, you should always call this method since the parser must initialize itself. If there are sections in the file use nextSection() first to control which section you are parsing from.

When the headers have been found use the hasMoreData() and nextData() methods in a loop to read all data from the section.

Example

FlatFileParser ffp = new FlatFileParser();
ffp.setHeaderRegexp(Pattern.compile("(.*)=(.*)"));
ffp.setDataHeaderRegexp(Pattern.compile("Block\\tColumn\\tRow\\tName\\tID"));
ffp.setDataSplitterRegexp(Pattern.compile("\\t"));
ffp.setIgnoreRegexp(Pattern.compile("#.*"));
ffp.setMinDataColumns(5);
ffp.setMaxDataColumns(5);
ffp.setInputStream(FileUtil.getInputStream(path_to_file), Config.getCharset());
ffp.parseHeaders();
for (int i = 0; i < ffp.getLineCount(); i++)
{
   FlatFileParser.Line line = ffp.getLine(i);
   System.out.println(i+":"+line.type()+":"+line.line());
}
int i = 0;
while (ffp.hasMoreData())
{
   FlatFileParser.Data data = ffp.nextData();
   System.out.println(i+":"+data.columns()+":"+data.line());
   i++;
}

Mapping column values
With the FlatFileParser.Data object you can only access the data by column index (0-based). Another approach is to use mappers. A mapper takes a string template and inserts the values of the data columns where you specify. Here are some examples:

\1\
\row\
Row: \row\, Col:\col\
=2 * col('Radius')

The result can be retrieved either as a string, as a numeric value or as a date. It is even possible to create expressions that do a calculation on the value before it is returned. See the getMapper(String) method for more information.
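
The template substitution described above can be illustrated with a small standalone sketch. This is not the library's Mapper implementation; the class name TemplateMapperSketch, its map method and the pattern used to find \name\ references are all assumptions made for illustration, and only named columns are handled.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of FlatFileParser
class TemplateMapperSketch
{
   // Assumed pattern: a column name enclosed in backslashes, e.g. \row\
   private static final Pattern COLUMN_REF = Pattern.compile("\\\\([^\\\\]+)\\\\");

   // Replace every \name\ in the template with the value of the column with that name
   static String map(String template, List<String> headers, String[] values)
   {
      Matcher m = COLUMN_REF.matcher(template);
      StringBuilder sb = new StringBuilder();
      while (m.find())
      {
         int index = headers.indexOf(m.group(1));
         if (index < 0)
         {
            throw new IllegalArgumentException("No such column: " + m.group(1));
         }
         // quoteReplacement keeps '$' and '\' in the data from being interpreted
         m.appendReplacement(sb, Matcher.quoteReplacement(values[index]));
      }
      m.appendTail(sb);
      return sb.toString();
   }
}
```

With the headers "row" and "col" and the values "3" and "7", the template Row: \row\, Col:\col\ would be expanded to the string "Row: 3, Col:7" by this sketch.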

Version: 2.0
Author: Nicklas, Enell
Last modified: $Date$

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

FlatFileParser.Data

This class holds data about a line parsed by the hasMoreData() method.

(package private) static class

FlatFileParser.ExcelData

Subclass that is used to return data when the source file is an Excel file.

static class

FlatFileParser.Line

This class holds data about a line parsed by the parseHeaders() method.

static enum

FlatFileParser.LineType

Represents the type of a line matched or unmatched by the parser.
Field Summary

Fields

Modifier and Type

Field

Description

private Pattern

bofMarker

The regular expression for matching the beginning-of-file marker

private String

bofType

The value that was captured by the bofMarker pattern.

private List<String>

columnHeaders

List of the column names found by splitting the data header using the data splitter regexp.

private Pattern

dataFooter

The regular expression for matching the data footer line.

private Pattern

dataHeader

The regular expression for matching the data header line.

private Pattern

dataSplitter

The regular expression for splitting a data line.

private Formatter<Date>

dateFormat

The default date formatter to use when creating mappers.

static final int

DEFAULT_MAX_UNKNOWN_LINES

The default value for the number of unknown lines in a row that may be encountered by the parseHeaders method before it gives up.

private boolean

emptyIsNull

If null should be returned for empty columns (instead of an empty string).

private int

excelParsedLinesOffset

The value of the parsedLines when the current Excel sheet is parsed.

private XlsxToCsvUtil.SheetInfo

excelSheet

Excel sheet that is currently being parsed.

private String

excelSheetName

Name of the Excel sheet to parse if the file is an Excel file.

private XlsxToCsvUtil

excelWorkbook

Excel workbook that has been loaded.

private static final Pattern

findColumn

Pattern used to find column mappings in a string, i.e. abc \col\ def

private Pattern

header

The regular expression for matching a header line.

private Map<String,String>

headers

Map of header lines parsed by the parseHeaders() method.

private Pattern

ignore

The regular expression for matching a comment line.

private int

ignoredLines

Number of ignored lines in the nextData() method.

private boolean

ignoreNonExistingColumns

If non-existing columns should be ignored (true) or result in an exception (false)

private boolean

keepSkippedLines

If unknown or ignored lines should be kept.

private List<FlatFileParser.Line>

lines

List of lines parsed by the parseHeaders() method.

private int

maxDataColumns

The maximum number of allowed data columns for a line to be considered a data line.

private int

maxUnknownLines

The maximum number of unknown lines to parse before giving up.

private int

minDataColumns

The minimum number of allowed data columns for a line to be considered a data line.

private FlatFileParser.Data

nextData

The next available data line as parsed by the hasMoreData() method.

private FlatFileParser.Line

nextSection

The line that last matched the section.

private boolean

nullIsNull

If null should be returned for the string NULL (ignoring case) or not.

private NumberFormat

numberFormat

The default number formatter to use for creating mappers.

private boolean

parseAllExcelSheets

Flag to indicate if only a single (=false) or all (=true) Excel sheets should be parsed in one go.

private long

parsedCharacters

The total number of parsed characters so far.

private int

parsedDataLines

The number of data lines parsed in the current section so far.

private int

parsedLines

The total number of lines parsed so far.

private int

parsedSections

The number of sections parsed so far.

private BufferedReader

reader

Reads from the given input stream

private Pattern

section

The regular expression for matching the first line of a section.

private List<FlatFileParser.Line>

skippedLines

List for keeping ignored and unknown lines in the nextData() method.

private Formatter<Date>

timestampFormat

The default timestamp format to use when creating mappers.

private InputStreamTracker

tracker

For keeping track of the number of bytes parsed.

private boolean

trimQuotes

If quotes should be trimmed from data values or not.

private boolean

trimWhiteSpace

If white space should be trimmed from data values or not.

private int

unknownLines

Number of unknown lines in the nextData() method.

private boolean

useNullIfException

If null should be returned if a (numeric) value can't be parsed.
Constructor Summary

Constructors

Constructor

Description

FlatFileParser()

Create a new FlatFileParser object.
Method Summary

Modifier and Type

Method

Description

private String

convertToNull(String value)

Integer

findColumnHeaderIndex(String regex)

Find the index of a column header using a regular expression for pattern matching.

String

getBofType()

Get the value captured by the BOF marker regular expression.

Integer

getColumnHeaderIndex(String name)

Get the index of a column header with a given name.

List<String>

getColumnHeaders()

Get all column headers that were found by splitting the line matching the setDataHeaderRegexp(Pattern) pattern using the setDataSplitterRegexp(Pattern) pattern.

XlsxToCsvUtil

getCurrentExcelWorkbook()

If the input stream that is being parsed is an Excel document, this method returns information about it.

XlsxToCsvUtil.SheetInfo

getCurrentSheet()

If the input stream that is being parsed is an Excel document, this method returns information about the current worksheet.

Mapper

getDateMapper(String expression)

Get a mapper using the default date format.

Formatter<Date>

getDefaultDateFormat()

Get the default date format.

NumberFormat

getDefaultNumberFormat()

Get the default number format.

Formatter<Date>

getDefaultTimestampFormat()

Get the default timestamp format.

String

getExcelSheet()

Get the name of the Excel sheet that should be or is parsed.

String

getHeader(String name)

Get the value of the header with the specified name.

Set<String>

getHeaderNames()

Get the names of all headers found by the parseHeaders() method.

int

getIgnoredLines()

Get the number of lines that the last call to nextData() or hasMoreData() ignored because they matched the ignore regular expression.

FlatFileParser.Line

getLine(int index)

Get the line with the specified number.

int

getLineCount()

Get the number of lines that the parseHeaders() method parsed.

List<FlatFileParser.Line>

getLines()

Get the lines read by parseHeaders().

Mapper

getMapper(String expression)

Get a mapper using the default number format.

Mapper

getMapper(String expression, boolean nullIfException)

Get a mapper using the default number format.

Mapper

getMapper(String expression, NumberFormat numberFormat)

Get a mapper using a specific number format.

Mapper

getMapper(String expression, NumberFormat numberFormat, boolean nullIfException)

Mapper

getMapper(String expression, NumberFormat numberFormat, boolean nullIfException, JepFunction... functions)

Mapper

getMapper(String expression, NumberFormat numberFormat, Formatter<Date> dateFormat, boolean nullIfException, JepFunction... functions)

Create a mapper object that maps an expression string to a value.

Mapper

getMapper(String expression, Formatter<Date> dateFormat, boolean nullIfException)

Get a mapper using the specified date format.

Mapper

getMapper(String expression, JepFunction... functions)

int

getNumSkippedLines()

Get the number of lines that the last call to nextData() or hasMoreData() ignored because they matched the ignore regular expression or couldn't be interpreted as data lines.

boolean

getParseAllExcelSheets()

long

getParsedBytes()

Get the number of parsed bytes so far.

long

getParsedCharacters()

Get the number of parsed characters so far.

int

getParsedDataLines()

Get the number of parsed data lines so far in the current section.

int

getParsedLines()

Get the number of parsed lines so far.

int

getParsedSections()

Get the number of found sections so far.

List<FlatFileParser.Line>

getSkippedLines()

Get the lines that were skipped during the last call to nextData() or hasMoreData().

Mapper

getTimestampMapper(String expression)

Get a mapper using the default timestamp format.

int

getUnknownLines()

Get the number of lines that the last call to nextData() or hasMoreData() ignored because they couldn't be interpreted as data lines.

boolean

getUseNullIfException()

boolean

hasMoreData()

Check if the input stream contains more data.

boolean

hasMoreSections()

Check if the input stream contains more sections.

FlatFileParser.Data

nextData()

Get the next available data.

FlatFileParser.Line

nextSection()

Get the next line that matches the section regular expression.

FlatFileParser.LineType

parseHeaders()

Start parsing the input stream.

boolean

parseToBof()

Parse the file until the beginning-of-file marker is found.

void

setBofMarkerRegexp(Pattern regexp)

Set a regular expression that matches a beginning-of-file marker.

void

setDataFooterRegexp(Pattern regexp)

Set a regular expression that can be matched against a data footer.

void

setDataHeaderRegexp(Pattern regexp)

Set a regular expression that can be matched against the data header.

void

setDataSplitterRegexp(Pattern regexp)

Set a regular expression that is used to split a data line into columns.

void

setDefaultDateFormat(Formatter<Date> dateFormat)

Set the default date format to use when creating mappers.

void

setDefaultNumberFormat(NumberFormat numberFormat)

Set the default number format to use when creating mappers.

void

setDefaultTimestampFormat(Formatter<Date> timestampFormat)

Set the default timestamp format to use when creating mappers.

void

setExcelSheet(String name)

Set the name of Excel worksheet to parse if the given file is an Excel file, otherwise this is ignored.

void

setHeaderRegexp(Pattern regexp)

Set a regular expression that can be matched against a header.

void

setIgnoreNonExistingColumns(boolean ignoreNonExistingColumns)

Specify if trying to create a mapper with one of the getMapper(String) methods for an expression which references a non-existing column should result in an exception or be ignored.

void

setIgnoreRegexp(Pattern regexp)

Set a regular expression that is used to match a line that should be ignored.

void

setInputStream(InputStream in, String charsetOrSheetName)

Set the input stream that will be parsed.

void

setKeepSkippedLines(boolean keep)

If the nextData() and hasMoreData() methods should keep information about lines that were skipped because they matched the ignore pattern or could not be interpreted as data lines.

void

setMaxDataColumns(int columns)

Set the maximum number of columns a data line can contain in order for it to be counted as a data line.

void

setMaxUnknownLines(int lines)

The number of unknown lines in a row that can be parsed by the parseHeaders method before it gives up.

void

setMinDataColumns(int columns)

Set the minimum number of columns a data line must contain in order for it to be counted as a data line.

void

setParseAllExcelSheets(boolean parseAllExcelSheets)

If this flag is set and the source file is an Excel file, then all sheets will be parsed unless a named sheet is specified.

void

setSectionRegexp(Pattern regexp)

Set a regular expression that can be matched against the section line.

void

setTrimQuotes(boolean trimQuotes)

Set if quotes around each data value should be removed or not.

void

setTrimWhiteSpace(boolean trimWhiteSpace)

Set a flag indicating if white-space should be trimmed from start and end of data values.

void

setUseNullIfEmpty(boolean emptyIsNull)

Specify if null values should be returned instead of empty strings for columns that don't contain any value.

void

setUseNullIfException(boolean useNullIfException)

Specify if null should be returned if a (numeric) value can't be parsed.

void

setUseNullIfNull(boolean nullIsNull)

Specify if null values should be returned for strings having the value "NULL" (ignoring case).

String[]

trimQuotes(String[] columns)

Remove enclosing quotes (" or ') around all columns.

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- DEFAULT_MAX_UNKNOWN_LINES
  
  public static final int DEFAULT_MAX_UNKNOWN_LINES
  
  The default value for the number of unknown lines in a row that may be encountered by the parseHeaders method before it gives up.
  See Also:
  
  setMaxUnknownLines
  
  Constant Field Values
- findColumn
  
  private static final Pattern findColumn
  
  Pattern used to find column mappings in a string, i.e. abc \col\ def
- reader
  
  private BufferedReader reader
  
  Reads from the given input stream
- tracker
  
  private InputStreamTracker tracker
  
  For keeping track of the number of bytes parsed.
- excelSheetName
  
  private String excelSheetName
  
  Name of the Excel sheet to parse if the file is an Excel file.
- parseAllExcelSheets
  
  private boolean parseAllExcelSheets
  
  Flag to indicate if only a single (=false) or all (=true) Excel sheets should be parsed in one go.
- excelWorkbook
  
  private XlsxToCsvUtil excelWorkbook
  
  Excel workbook that has been loaded.
- excelSheet
  
  private XlsxToCsvUtil.SheetInfo excelSheet
  
  Excel sheet that is currently being parsed.
- excelParsedLinesOffset
  
  private int excelParsedLinesOffset
  
  The value of the parsedLines when the current Excel sheet is parsed. Used to find the correct row number in the current sheet (=parsedLines - excelParsedLinesOffset)
- bofMarker
  
  private Pattern bofMarker
  
  The regular expression for matching the beginning-of-file marker
- header
  
  private Pattern header
  
  The regular expression for matching a header line. The expression must have two capturing groups.
- section
  
  private Pattern section
  
  The regular expression for matching the first line of a section. The expression must have one capturing group.
- dataHeader
  
  private Pattern dataHeader
  
  The regular expression for matching the data header line.
- dataSplitter
  
  private Pattern dataSplitter
  
  The regular expression for splitting a data line.
- trimQuotes
  
  private boolean trimQuotes
  
  If quotes should be trimmed from data values or not. Default true. Quotes are double or single quotes.
- trimWhiteSpace
  
  private boolean trimWhiteSpace
  
  If white space should be trimmed from data values or not. Default is false.
  
  Since:
  
  3.15.1
- dataFooter
  
  private Pattern dataFooter
  
  The regular expression for matching the data footer line.
- minDataColumns
  
  private int minDataColumns
  
  The minimum number of allowed data columns for a line to be considered a data line.
- maxDataColumns
  
  private int maxDataColumns
  
  The maximum number of allowed data columns for a line to be considered a data line.
- ignore
  
  private Pattern ignore
  
  The regular expression for matching a comment line.
- maxUnknownLines
  
  private int maxUnknownLines
  
  The maximum number of unknown lines to parse before giving up.
- emptyIsNull
  
  private boolean emptyIsNull
  
  If null should be returned for empty columns (instead of an empty string).
- useNullIfException
  
  private boolean useNullIfException
  
  If null should be returned if a (numeric) value can't be parsed.
- ignoreNonExistingColumns
  
  private boolean ignoreNonExistingColumns
  
  If non-existing columns should be ignored (true) or result in an exception (false)
- nullIsNull
  
  private boolean nullIsNull
  
  If null should be returned for the string NULL (ignoring case) or not.
- numberFormat
  
  private NumberFormat numberFormat
  
  The default number formatter to use for creating mappers.
- dateFormat
  
  private Formatter<Date> dateFormat
  
  The default date formatter to use when creating mappers.
- timestampFormat
  
  private Formatter<Date> timestampFormat
  
  The default timestamp format to use when creating mappers.
- bofType
  
  private String bofType
  
  The value that was captured by the bofMarker pattern.
- lines
  
  private List<FlatFileParser.Line> lines
  
  List of lines parsed by the parseHeaders() method.
- parsedLines
  
  private int parsedLines
  
  The total number of lines parsed so far.
- parsedSections
  
  private int parsedSections
  
  The number of sections parsed so far.
- parsedCharacters
  
  private long parsedCharacters
  
  The total number of parsed characters so far.
- parsedDataLines
  
  private int parsedDataLines
  
  The number of data lines parsed in the current section so far. This value is reset at each new section.
- headers
  
  private Map<String,String> headers
  
  Map of header lines parsed by the parseHeaders() method. The map contains name -> value pairs
- columnHeaders
  
  private List<String> columnHeaders
  
  List of the column names found by splitting the data header using the data splitter regexp.
- nextSection
  
  private FlatFileParser.Line nextSection
  
  The line that last matched the section.
- nextData
  
  private FlatFileParser.Data nextData
  
  The next available data line as parsed by the hasMoreData() method.
- ignoredLines
  
  private int ignoredLines
  
  Number of ignored lines in the nextData() method.
- unknownLines
  
  private int unknownLines
  
  Number of unknown lines in the nextData() method.
- keepSkippedLines
  
  private boolean keepSkippedLines
  
  If unknown or ignored lines should be kept.
  See Also:
  
  getSkippedLines()
- skippedLines
  
  private List<FlatFileParser.Line> skippedLines
  
  List for keeping ignored and unknown lines in the nextData() method.
Constructor Details
- FlatFileParser
  
  public FlatFileParser()
  
  Create a new FlatFileParser object.
Method Details
- setBofMarkerRegexp
  
  public void setBofMarkerRegexp(Pattern regexp)
  
  Set a regular expression that matches a beginning-of-file marker. This property should be set before starting to parse the file (otherwise it is ignored). The first method call that causes the parsing to be started will invoke parseToBof() (can also be invoked manually).
  The regular expression may contain a single capturing group. The matched value is returned by getBofType().
  
  Parameters:
  
  regexp - A regular expression
  
  Since:
  
  2.15
- setHeaderRegexp
  
  public void setHeaderRegexp(Pattern regexp)
  Set a regular expression that can be matched against a header. The regular expression must contain two capturing groups, the first should capture the name and the second the value of the header. For example, the file contains headers like:
  "Type=GenePix Results 1.3" "DateTime=2002/09/04 13:59:48"
  To match this we can use the following regular expression: "(.*)=(.*)".
  Parameters:
  
  regexp - A regular expression
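
As a rough illustration of how a header pattern with two capturing groups splits a line into a name and a value, here is a standalone sketch using only java.util.regex. The class HeaderRegexSketch and its parseHeader method are hypothetical, and the sketch assumes that the whole line must match the pattern (Matcher.matches()), which may differ from what the parser does internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of FlatFileParser
class HeaderRegexSketch
{
   // Same expression as in the example above: group 1 = name, group 2 = value
   static final Pattern HEADER = Pattern.compile("(.*)=(.*)");

   // Returns {name, value}, or null if the line is not a header
   static String[] parseHeader(String line)
   {
      Matcher m = HEADER.matcher(line);
      return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
   }
}
```

Because the first group is greedy, a line with more than one equals sign is split at the last one: "a=b=c" gives the name "a=b" and the value "c". If values may contain an equals sign, a reluctant first group such as (.*?)=(.*) splits at the first one instead.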
- setSectionRegexp
  
  public void setSectionRegexp(Pattern regexp)
  Set a regular expression that can be matched against the section line. For example, the file contains a section like:
  section info
  To match this we can use the following regular expression: section (.*). This matches any line that starts with "section ". The section name will be in the capturing group.
  Parameters:
  
  regexp - A regular expression
- setDataHeaderRegexp
  
  public void setDataHeaderRegexp(Pattern regexp)
  Set a regular expression that can be matched against the data header. For example, the file contains a data header like:
  "Block"{tab}"Column"{tab}"Row"{tab}"Name"{tab}"ID" ...and so on
  To match this we can use the following regular expression: "(.*?)"(\t"(.*?)"). This will match anything that has at least two columns. We could also be more specific and use: "Block"\t"Column"\t"Row"\t"Name"\t"ID"...
  Parameters:
  
  regexp - A regular expression
- setDataSplitterRegexp
  
  public void setDataSplitterRegexp(Pattern regexp)
  
  Set a regular expression that is used to split a data line into columns. To split on tabs we use: \t. This regular expression is also used to split the data header line into column names, which can then be used in the getMapper(String) method.
  Parameters:
  
  regexp - A regular expression
  
  See Also:
  
  setMinDataColumns
  
  setMaxDataColumns
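
The interplay between the data splitter, the column limits and quote trimming can be sketched with plain java.util.regex. The class DataLineSketch and its methods are hypothetical and are not the parser's own code; the sketch assumes that trailing empty columns are kept (split limit -1) and it does not model the special values 0 and -1 of setMaxDataColumns.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of FlatFileParser
class DataLineSketch
{
   // Split on tabs, as in the example above
   private static final Pattern SPLITTER = Pattern.compile("\\t");

   // Remove one pair of enclosing quotes (" or ') if present
   static String trimQuotes(String value)
   {
      if (value.length() >= 2)
      {
         char first = value.charAt(0);
         char last = value.charAt(value.length() - 1);
         if (first == last && (first == '"' || first == '\''))
         {
            return value.substring(1, value.length() - 1);
         }
      }
      return value;
   }

   // Returns the trimmed columns, or null if the column count is outside [min, max]
   static String[] splitDataLine(String line, int minColumns, int maxColumns)
   {
      String[] columns = SPLITTER.split(line, -1);
      if (columns.length < minColumns || columns.length > maxColumns) return null;
      for (int i = 0; i < columns.length; i++)
      {
         columns[i] = trimQuotes(columns[i]);
      }
      return columns;
   }
}
```

With both limits set to 5, the first data line of the class-level example is accepted as five columns, while the last line of that example, which has only four columns, is rejected.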
- setTrimQuotes
  
  public void setTrimQuotes(boolean trimQuotes)
  
  Set if quotes around each data value should be removed or not. A quote is either a double quote (") or a single quote ('). The default setting of this option is true.
  
  Parameters:
  
  trimQuotes - TRUE to remove quotes, FALSE to keep them
- setTrimWhiteSpace
  
  public void setTrimWhiteSpace(boolean trimWhiteSpace)
  
  Set a flag indicating if white-space should be trimmed from start and end of data values. The default setting is false.
  
  Parameters:
  
  trimWhiteSpace - TRUE to remove white-space, FALSE to keep it
  
  Since:
  
  3.15.1
- setMinDataColumns
  
  public void setMinDataColumns(int columns)
  
  Set the minimum number of columns a data line must contain in order for it to be counted as a data line.
  
  Parameters:
  
  columns - The minimum number of columns
- setMaxDataColumns
  
  public void setMaxDataColumns(int columns)
  
  Set the maximum number of columns a data line can contain in order for it to be counted as a data line.
  
  Parameters:
  
  columns - The maximum number of columns, or 0 for an unlimited number, or -1 to disable counting the number of columns
- setDataFooterRegexp
  
  public void setDataFooterRegexp(Pattern regexp)
  
  Set a regular expression that can be matched against a data footer. If a line matching this pattern is found while looking for data with the hasMoreData method it will exit and no more data will be returned.
  
  Parameters:
  
  regexp - A regular expression
- setIgnoreRegexp
  
  public void setIgnoreRegexp(Pattern regexp)
  
  Set a regular expression that is used to match a line that should be ignored. For example, the file may contain comments starting with a #: \#.*
  
  Parameters:
  
  regexp - A regular expression
- setMaxUnknownLines
  
  public void setMaxUnknownLines(int lines)
  
  The number of unknown lines in a row that can be parsed by the parseHeaders method before it gives up. The default value is specified by DEFAULT_MAX_UNKNOWN_LINES. This value is ignored while parsing data.
  
  Parameters:
  
  lines - The number of lines
- setUseNullIfEmpty
  
  public void setUseNullIfEmpty(boolean emptyIsNull)
  
  Specify if null values should be returned instead of empty strings for columns that don't contain any value.
  
  Parameters:
  
  emptyIsNull - TRUE to return null, FALSE to return an empty string
- setUseNullIfNull
  
  public void setUseNullIfNull(boolean nullIsNull)
  
  Specify if null values should be returned for strings having the value "NULL" (ignoring case).
  
  Parameters:
  
  nullIsNull - TRUE to return null, FALSE to return the original string value
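
The combined effect of the two null-handling flags can be summarised in a small sketch. The class NullConversionSketch is hypothetical; it borrows the name of the private convertToNull method only for readability, and the order of the checks is an assumption.

```java
// Hypothetical helper, not part of FlatFileParser
class NullConversionSketch
{
   // emptyIsNull corresponds to setUseNullIfEmpty, nullIsNull to setUseNullIfNull
   static String convertToNull(String value, boolean emptyIsNull, boolean nullIsNull)
   {
      if (value == null) return null;
      if (emptyIsNull && value.isEmpty()) return null;
      if (nullIsNull && "NULL".equalsIgnoreCase(value)) return null;
      return value;
   }
}
```

With both flags disabled the value is returned unchanged, so an empty column stays an empty string and the text "null" stays a four-character string.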
- setKeepSkippedLines
  
  public void setKeepSkippedLines(boolean keep)
  
  If the nextData() and hasMoreData() methods should keep information about lines that were skipped because they matched the ignore pattern or could not be interpreted as data lines. The default is FALSE. The number of lines that were skipped is always available regardless of this setting.
  Parameters:
  
  keep - TRUE to keep line information, FALSE to not
  
  See Also:
  
  getSkippedLines()
  
  getIgnoredLines()
  
  getUnknownLines()
  
  getNumSkippedLines()
- setParseAllExcelSheets
  
  public void setParseAllExcelSheets(boolean parseAllExcelSheets)
  
  If this flag is set and the source file is an Excel file, then all sheets will be parsed unless a named sheet is specified. Each sheet is handled like a section with the sheet name inside brackets ([name]). The regular expression for detecting a section is automatically updated to match this pattern.
  
  Since:
  
  3.15
- getParseAllExcelSheets
  
  public boolean getParseAllExcelSheets()
  Since:
  
  3.15
  
  See Also:
  
  setParseAllExcelSheets(boolean)
- setExcelSheet
  
  public void setExcelSheet(String name)
  
  Set the name of Excel worksheet to parse if the given file is an Excel file, otherwise this is ignored.
  
  Since:
  
  3.15
- getExcelSheet
  
  public String getExcelSheet()
  
  Get the name of the Excel sheet that should be or is parsed.
  
  Since:
  
  3.15
- getCurrentExcelWorkbook
  
  public XlsxToCsvUtil getCurrentExcelWorkbook()
  
  If the input stream that is being parsed is an Excel document, this method returns information about it.
  
  Returns:
  
  An XlsxToCsvUtil object or null if the stream is not an Excel document
  
  Since:
  
  3.15.1
- getCurrentSheet
  
  public XlsxToCsvUtil.SheetInfo getCurrentSheet()
  
  If the input stream that is being parsed is an Excel document, this method returns information about the current worksheet.
  
  Returns:
  
  An XlsxToCsvUtil.SheetInfo object or null if the stream is not an Excel document
  
  Since:
  
  3.15.1
- setInputStream
  
  public void setInputStream(InputStream in, String charsetOrSheetName)
  
  Set the input stream that will be parsed. The stream can be either a text CSV-like stream or an Excel workbook (xlsx). If the stream is an Excel workbook the following apply:
  If no date format has been specified, yyyy-MM-dd is used
  If no timestamp format has been specified, yyyy-MM-dd HH:mm:ss is used
  If no number format has been specified, 'dot' is used
  The data splitter regular expression is changed to \\t
  The section regular expression is changed to [.*] (if the getParseAllExcelSheets() flag is set)
  
  Parameters:
  
  in - The InputStream
  
  charsetOrSheetName - If CSV, the name of the character set to use when parsing the file, or null to use the default charset specified by Config.getCharset(). If Excel, the name or index of the worksheet in the workbook; the default is to parse the first sheet (index=0) or the whole workbook if the getParseAllExcelSheets() flag is set
  
  Since:
  
  2.1.1
- parseToBof
  
  public boolean parseToBof() throws IOException
  
  Parse the file until the beginning-of-file marker is found. If no regular expression has been set with setBofMarkerRegexp(Pattern) or if the parsing of the file has already started, this method call is ignored.
  
  Returns:
  
  TRUE if this call resulted in parsing and the BOF marker was found, FALSE otherwise
  
  Throws:
  
  IOException
  
  Since:
  
  2.15
- getBofType
  
  public String getBofType()
  
  Get the value captured by the BOF marker regular expression. If no capturing group was specified in the pattern this value is the string that matched the entire pattern.
  
  Returns:
  
  The matched value, or null if BOF matching has not been done
  
  Since:
  
  2.15
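
The rule for what getBofType() returns (the captured group if the pattern has one, otherwise the whole match) can be sketched as follows. The class BofMarkerSketch, the bofType helper and the "FILETYPE" marker are made up for illustration, and the sketch assumes that the marker must match a complete line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of FlatFileParser
class BofMarkerSketch
{
   // Returns the captured value, the whole match, or null if the line is not a BOF marker
   static String bofType(Pattern bofMarker, String line)
   {
      Matcher m = bofMarker.matcher(line);
      if (!m.matches()) return null;
      // Use the first capturing group if the pattern defines one
      return m.groupCount() > 0 ? m.group(1) : m.group();
   }
}
```

For the made-up line "FILETYPE results", the pattern FILETYPE (.*) gives "results", while FILETYPE .* without a group gives the whole line.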
- parseHeaders
  
  public FlatFileParser.LineType parseHeaders() throws IOException
  Start parsing the input stream. The parser will read a single line at a time. Each line is checked in the following order:
  
  Does it match the section regular expression?
  Does it match the header regular expression?
  Does it match the data header regular expression?
  Does it match the comment regular expression?
  Can it be split by the data regular expression into the appropriate number of columns?
  The first expression that matches stops the processing of that line. If the line matched a header or comment the parser continues with the next line. If the line matched the data header or data, the method returns. If none of the above is true the line is recorded as FlatFileParser.LineType.UNKNOWN and processing is continued with the next line. If too many unknown lines in a row have been found the method also returns. This should be considered as a failure to parse the specified file.
  The method returns the type of the last line that was parsed as follows:
  
  FlatFileParser.LineType.SECTION: The last line was a section. Header, data header or data may follow this line.
  FlatFileParser.LineType.DATA_HEADER: The last line was the data header. It is expected that data should follow.
  FlatFileParser.LineType.DATA: The last line was a data line. More data may follow.
  FlatFileParser.LineType.UNKNOWN: The last line was of unknown format. The file could not be parsed.
  Returns:
  
  The FlatFileParser.LineType of the last parsed line
  
  Throws:
  
  IOException - If reading the file fails.
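The order of the checks matters, because the first matching expression decides the line type. The following is a minimal sketch of that ordering using only java.util.regex. The class HeaderLineSketch, its local LineType enum, the patterns and the column limits are all hypothetical and chosen to fit the example file in the class description; they are not taken from the parser's implementation.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for the per-line checks done by parseHeaders().
final class HeaderLineSketch {
    // Local enum for the sketch only; not FlatFileParser.LineType.
    enum LineType { SECTION, HEADER, DATA_HEADER, COMMENT, DATA, UNKNOWN }

    // Assumed patterns for the example file in the class description.
    static final Pattern SECTION = Pattern.compile("section\\s+(.*)");
    static final Pattern HEADER = Pattern.compile("(.+?)=(.*)");
    static final Pattern DATA_HEADER = Pattern.compile("Block\\tColumn\\tRow\\tName\\tID");
    static final Pattern COMMENT = Pattern.compile("#.*");
    static final Pattern SPLITTER = Pattern.compile("\\t");
    static final int MIN_COLUMNS = 4;
    static final int MAX_COLUMNS = 5;

    // Checks are made in the documented order; the first match wins.
    static LineType classify(String line) {
        if (SECTION.matcher(line).matches()) return LineType.SECTION;
        if (HEADER.matcher(line).matches()) return LineType.HEADER;
        if (DATA_HEADER.matcher(line).matches()) return LineType.DATA_HEADER;
        if (COMMENT.matcher(line).matches()) return LineType.COMMENT;
        int columns = SPLITTER.split(line, -1).length;
        if (columns >= MIN_COLUMNS && columns <= MAX_COLUMNS) return LineType.DATA;
        return LineType.UNKNOWN;
    }
}
```

With these assumed patterns, a line such as `# key=value` is classified as a header rather than a comment, since the header check runs first. Patterns that overlap should be written with this ordering in mind.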
- convertToNull
  
  private String convertToNull(String value)
- getHeaderNames
  
  public Set<String> getHeaderNames()
  
  Get the names of all headers found by the parseHeaders() method. To get the value of a header, use the getHeader(String) method.
- getHeader
  
  public String getHeader(String name)
  
  Get the value of the header with the specified name. This method should only be used after parseHeaders() has been completed.
  Parameters:
  
  name - The name of the header
  
  Returns:
  
  The value of the header, or null if it was not found
  
  See Also:
  
  getLine(int)
- getLineCount
  
  public int getLineCount()
  
  Get the number of lines that the parseHeaders() method parsed.
  
  Returns:
  
  The number of lines parsed
- getLine
  
  public FlatFileParser.Line getLine(int index)
  
  Get the line with the specified number. This method should only be used after parseHeaders() has been completed.
  Parameters:
  
  index - The line number, starting at 0
  
  Returns:
  
  A Line object
  
  See Also:
  
  getHeader(String)
- getLines
  
  public List<FlatFileParser.Line> getLines()
  
  Get the lines read by parseHeaders().
  
  Returns:
  
  The lines in the order that they have been read.
- getColumnHeaders
  
  public List<String> getColumnHeaders()
  
  Get all column headers that were found by splitting the line matching the setDataHeaderRegexp(Pattern) pattern using the setDataSplitterRegexp(Pattern) pattern. This method should only be called after parseHeaders() has been called.
  
  Returns:
  
  A list containing the column headers, or null if no headers have been found
- getColumnHeaderIndex
  
  public Integer getColumnHeaderIndex(String name)
  
  Get the index of a column header with a given name. This method should only be called after parseHeaders() has been called. If more than one header with the same name exists the index of the first is returned.
  Parameters:
  
  name - The name of the column header
  
  Returns:
  
  The index, or null if no header with that name exists
  
  See Also:
  
  findColumnHeaderIndex(String)
- findColumnHeaderIndex
  
  public Integer findColumnHeaderIndex(String regex)
  
  Find the index of a column header using a regular expression for pattern matching. This method should only be called after parseHeaders() has been called. If more than one header matches the regular expression only the first one found is returned.
  Parameters:
  
  regex - The regular expression used to match the header names
  
  Returns:
  
  The index, or null if no header matches the regular expression or if the string is not a valid regular expression
  
  Since:
  
  2.5
  
  See Also:
  
  getColumnHeaderIndex(String)
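The difference between the exact lookup and the regular-expression lookup can be sketched as follows. This is an illustration under assumptions, not the parser's code: the class ColumnLookupSketch is hypothetical, and it assumes the regular expression must match the whole header name.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper mirroring the documented return values.
final class ColumnLookupSketch {
    // Exact name lookup: index of the first header with that name, or null.
    static Integer getIndex(List<String> headers, String name) {
        int index = headers.indexOf(name);
        return index >= 0 ? Integer.valueOf(index) : null;
    }

    // Regex lookup: index of the first matching header, or null if nothing
    // matches or the string is not a valid regular expression.
    static Integer findIndex(List<String> headers, String regex) {
        Pattern pattern;
        try {
            pattern = Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            return null;
        }
        for (int i = 0; i < headers.size(); i++) {
            // Assumption: the whole header name must match.
            if (pattern.matcher(headers.get(i)).matches()) {
                return Integer.valueOf(i);
            }
        }
        return null;
    }
}
```

Both methods return a boxed Integer so that "not found" can be reported as null, which means callers should check for null before unboxing.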
- setDefaultNumberFormat
  
  public void setDefaultNumberFormat(NumberFormat numberFormat)
  
  Set the default number format to use when creating mappers.
  Parameters:
  
  numberFormat - The number format to use, or null to parse numbers with Float.valueOf or Double.valueOf
  
  Since:
  
  2.2
  
  See Also:
  
  getMapper(String)
  
  getMapper(String, NumberFormat)
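A number format is mainly useful when the file uses a decimal separator that Double.valueOf does not accept. The sketch below shows the two code paths side by side. The class NumberFormatSketch and its parse method are hypothetical, and the conversion of ParseException to NumberFormatException is a choice made for this example only.

```java
import java.text.NumberFormat;
import java.text.ParseException;

// Hypothetical helper: parse with a NumberFormat, or fall back to
// Double.valueOf when no format is given.
final class NumberFormatSketch {
    static Double parse(String value, NumberFormat format) {
        if (format == null) {
            return Double.valueOf(value);
        }
        try {
            return Double.valueOf(format.parse(value).doubleValue());
        } catch (ParseException e) {
            // Rethrown unchecked so both code paths fail the same way.
            throw new NumberFormatException(e.getMessage());
        }
    }
}
```

For example, NumberFormat.getNumberInstance(Locale.GERMANY) accepts "1,5" as one and a half, while the null path rejects that string with a NumberFormatException.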
- getDefaultNumberFormat
  
  public NumberFormat getDefaultNumberFormat()
  
  Get the default number format.
  
  Returns:
  
  The number format, or null if none has been specified
  
  Since:
  
  2.2
- setDefaultDateFormat
  
  public void setDefaultDateFormat(Formatter<Date> dateFormat)
  
  Set the default date format to use when creating mappers. If null, dates are parsed with Type.DATE.parseString(), as described for the dateFormat parameter of getMapper.
  
  Since:
  
  3.15
- getDefaultDateFormat
  
  public Formatter<Date> getDefaultDateFormat()
  
  Get the default date format.
  
  Since:
  
  3.15
- setDefaultTimestampFormat
  
  public void setDefaultTimestampFormat(Formatter<Date> timestampFormat)
  
  Set the default timestamp format to use when creating mappers. If null, xxx is used.
  
  Since:
  
  3.15
- getDefaultTimestampFormat
  
  public Formatter<Date> getDefaultTimestampFormat()
  
  Get the default timestamp format.
  
  Since:
  
  3.15
- setUseNullIfException
  
  public void setUseNullIfException(boolean useNullIfException)
  
  Specify if null should be returned if a (numeric) value can't be parsed. If this setting is set to TRUE all mappers created by one of the getMapper(String) methods are wrapped in a NullIfExceptionMapper. It is not possible to log or get information about the exception.
  
  Parameters:
  
  useNullIfException - TRUE to return null, FALSE to throw an exception
  
  Since:
  
  2.4
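The wrapping behaviour can be pictured as a decorator that swallows the exception. The sketch below uses java.util.function.Function as a stand-in for the Mapper interface; the class NullIfExceptionSketch is hypothetical and is not the library's NullIfExceptionMapper.

```java
import java.util.function.Function;

// Hypothetical decorator: same idea as wrapping a mapper so that a value
// that cannot be parsed yields null instead of an exception.
final class NullIfExceptionSketch {
    static <T> Function<String, T> nullIfException(Function<String, T> parser) {
        return value -> {
            try {
                return parser.apply(value);
            } catch (RuntimeException e) {
                // The exception is discarded, so the caller cannot tell
                // a failed parse from a genuinely missing value.
                return null;
            }
        };
    }
}
```

The trade-off is the one noted above: the caller gets a simpler null check, but loses any information about why the value could not be parsed.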
- getUseNullIfException
  
  public boolean getUseNullIfException()
  
  Since:
  
  3.15
- setIgnoreNonExistingColumns
  
  public void setIgnoreNonExistingColumns(boolean ignoreNonExistingColumns)
  
  Specify if trying to create a mapper with one of the getMapper(String) methods for an expression which references a non-existing column should result in an exception or be ignored.
  
  Parameters:
  
  ignoreNonExistingColumns - TRUE to ignore, or FALSE to throw an exception
  
  Since:
  
  2.6
- getMapper
  
  public Mapper getMapper(String expression)
  
  Get a mapper using the default number format.
  See Also:
  
  getMapper(String, NumberFormat, boolean)
- getMapper
  
  public Mapper getMapper(String expression, JepFunction... functions)
- getMapper
  
  public Mapper getMapper(String expression, boolean nullIfException)
  
  Get a mapper using the default number format.
  Since:
  
  2.4
  
  See Also:
  
  getMapper(String, NumberFormat, boolean)
- getMapper
  
  public Mapper getMapper(String expression, NumberFormat numberFormat)
  
  Get a mapper using a specific number format.
  Since:
  
  2.2
  
  See Also:
  
  getMapper(String, NumberFormat, boolean)
- getMapper
  
  public Mapper getMapper(String expression, NumberFormat numberFormat, boolean nullIfException)
  Since:
  
  2.4
  
  See Also:
  
  getMapper(String, NumberFormat, boolean, JepFunction...)
- getDateMapper
  
  public Mapper getDateMapper(String expression)
  
  Get a mapper using the default date format.
  
  Since:
  
  3.15
- getTimestampMapper
  
  public Mapper getTimestampMapper(String expression)
  
  Get a mapper using the default timestamp format.
  
  Since:
  
  3.15
- getMapper
  
  public Mapper getMapper(String expression, Formatter<Date> dateFormat, boolean nullIfException)
  
  Get a mapper using the specified date format.
  
  Since:
  
  3.15
- getMapper
  
  public Mapper getMapper(String expression, NumberFormat numberFormat, boolean nullIfException, JepFunction... functions)
- getMapper
  
  public Mapper getMapper(String expression, NumberFormat numberFormat, Formatter<Date> dateFormat, boolean nullIfException, JepFunction... functions)
  Create a mapper object that maps an expression string to a value. An expression string is a regular string which contains placeholders where the data column values will be inserted. For example:
  \1\ \row\ Row: \row\, Col:\col\
  It is also possible to use expressions that are evaluated dynamically.
  =2 * col('Radius')
  If no column matching the exact name is found, the placeholder is interpreted as a regular expression which is checked against each of the column headers. In all cases, the first column header found is used if there are multiple matches.
  If the expression is null, a mapper returning an empty string is returned, unless setUseNullIfEmpty(boolean) has been activated. In that case the mapper returns null.
  Parameters:
  
  expression - The string containing the mapping expression
  
  numberFormat - The number format the mapper should use for parsing numbers, or null to use Float.valueOf or Double.valueOf
  
  dateFormat - The date format the mapper should use for parsing dates, or null to use Type.DATE.parseString()
  
  nullIfException - TRUE to return a null value instead of throwing an exception when a value can't be parsed.
  
  functions - Optional array with Jep functions that should be included in the parser
  
  Returns:
  
  A mapper object
  
  Since:
  
  3.15
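The placeholder part of the mapping can be sketched with a small substitution routine. This is only an approximation under stated assumptions: the class PlaceholderSketch is hypothetical, numeric placeholders are assumed to be 0-based column indexes, unresolved placeholders are replaced with an empty string, and expressions starting with `=` are not evaluated.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical sketch of \name\ and \index\ placeholder substitution.
final class PlaceholderSketch {
    // A placeholder is any text enclosed in a pair of backslashes.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\\\(.+?)\\\\");

    static String map(String expression, List<String> headers, String[] row) {
        if (expression == null) {
            return "";
        }
        Matcher m = PLACEHOLDER.matcher(expression);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer index = resolve(m.group(1), headers);
            String value = (index == null || index >= row.length) ? "" : row[index];
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Resolution order: numeric index, exact header name, then the key
    // interpreted as a regular expression; the first match is used.
    static Integer resolve(String key, List<String> headers) {
        if (key.matches("\\d+")) {
            return Integer.valueOf(key); // assumed to be 0-based
        }
        int exact = headers.indexOf(key);
        if (exact >= 0) {
            return Integer.valueOf(exact);
        }
        try {
            Pattern p = Pattern.compile(key);
            for (int i = 0; i < headers.size(); i++) {
                if (p.matcher(headers.get(i)).matches()) {
                    return Integer.valueOf(i);
                }
            }
        } catch (PatternSyntaxException e) {
            // Not a valid regular expression: treat as not found.
        }
        return null;
    }
}
```

In the real parser, what happens for a placeholder that references a non-existing column depends on setIgnoreNonExistingColumns(boolean); the sketch simply substitutes an empty string.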
- hasMoreData
  
  public boolean hasMoreData() throws IOException
  Check if the input stream contains more data. If it is unknown if there is more data or not, this method will start reading more lines from the stream. Each line is checked in the following order:
  
  Does it match the data footer regular expression?
  
  Does it match the section regular expression?
  
  Does it match the ignore regular expression?
  
  Can it be split by the data regular expression into the appropriate number of columns?
  
  If the first or second check is true, FALSE is returned and no more data may be retrieved, but a section may be retrieved with the nextSection method. If the third check is true, the line is ignored and processing continues with the next line. If the fourth check is true, TRUE is returned and the data may be retrieved with the nextData method.
  Returns:
  
  TRUE if there is more data, FALSE otherwise
  
  Throws:
  
  IOException - If there is an error reading from the input stream
  
  See Also:
  
  nextData
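The four checks above amount to a loop that stops at a footer or section line, skips ignored lines, and collects lines that split into the expected number of columns. The following is a minimal sketch of that loop. The class DataLoopSketch, the patterns and the fixed column count are hypothetical, and lines failing every check are simply skipped here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the hasMoreData()/nextData() loop.
final class DataLoopSketch {
    // Assumed patterns; a real file format would define its own.
    static final Pattern FOOTER = Pattern.compile("end of data");
    static final Pattern SECTION = Pattern.compile("section\\s+.*");
    static final Pattern IGNORE = Pattern.compile("\\s*");
    static final Pattern SPLITTER = Pattern.compile("\\t");
    static final int COLUMNS = 3;

    static List<String[]> readData(Iterator<String> lines) {
        List<String[]> data = new ArrayList<>();
        while (lines.hasNext()) {
            String line = lines.next();
            // Checks 1 and 2: a footer or a new section ends the data part.
            if (FOOTER.matcher(line).matches() || SECTION.matcher(line).matches()) {
                break;
            }
            // Check 3: ignored lines are skipped.
            if (IGNORE.matcher(line).matches()) {
                continue;
            }
            // Check 4: the line is data if it splits into the right number of columns.
            String[] columns = SPLITTER.split(line, -1);
            if (columns.length == COLUMNS) {
                data.add(columns);
            }
            // Anything else is an unknown line and is skipped in this sketch.
        }
        return data;
    }
}
```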
- trimQuotes
  
  public String[] trimQuotes(String[] columns)
  
  Remove enclosing quotes (" or ') around all columns.
  
  Parameters:
  
  columns - The columns
  
  Returns:
  
  The trimmed columns
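A quote-trimming step of this kind can be sketched as below. The class TrimQuotesSketch is hypothetical; it assumes that quotes are removed only when the same quote character both opens and closes the value, and it returns a new array, which may differ from what the parser does.

```java
// Hypothetical sketch of removing enclosing quotes from column values.
final class TrimQuotesSketch {
    static String[] trimQuotes(String[] columns) {
        String[] trimmed = new String[columns.length];
        for (int i = 0; i < columns.length; i++) {
            String value = columns[i];
            if (value != null && value.length() >= 2) {
                char first = value.charAt(0);
                char last = value.charAt(value.length() - 1);
                // Assumption: only a matching pair of " or ' is removed.
                if (first == last && (first == '"' || first == '\'')) {
                    value = value.substring(1, value.length() - 1);
                }
            }
            trimmed[i] = value;
        }
        return trimmed;
    }
}
```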
- getParsedLines
  
  public int getParsedLines()
  
  Get the number of parsed lines so far.
- getParsedSections
  
  public int getParsedSections()
  
  Get the number of found sections so far.
  
  Since:
  
  3.15
- getParsedDataLines
  
  public int getParsedDataLines()
  
  Get the number of parsed data lines so far in the current section. This value is reset for each new section.
- getParsedCharacters
  
  public long getParsedCharacters()
  
  Get the number of parsed characters so far. This value may or may not correspond to the number of parsed bytes depending on the character set of the file.
  See Also:
  
  getParsedBytes()
- getParsedBytes
  
  public long getParsedBytes()
  
  Get the number of parsed bytes so far. This value may or may not correspond to the number of parsed characters depending on the character set of the file.
  Since:
  
  2.5.1
  
  See Also:
  
  getParsedCharacters()
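The reason the two counters can differ is that some character sets encode a character with more than one byte. The short sketch below shows this with standard JDK calls only. The class ParsedSizeSketch is hypothetical and says nothing about how the parser tracks its counters; note also that String.length() counts UTF-16 code units.

```java
import java.nio.charset.Charset;

// Hypothetical helper comparing character and byte counts of a line.
final class ParsedSizeSketch {
    static int characters(String line) {
        return line.length();
    }

    static int bytes(String line, Charset charset) {
        return line.getBytes(charset).length;
    }
}
```

For the eight-character string "\u00C5ngstr\u00F6m", ISO-8859-1 uses one byte per character, while UTF-8 needs two bytes for each of the two non-ASCII letters.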
- nextData
  
  public FlatFileParser.Data nextData() throws IOException
  
  Get the next available data.
  Returns:
  
  A Data object, or null if there is no more data
  
  Throws:
  
  IOException - If there is an error reading from the input stream.
  
  See Also:
  
  hasMoreData
- getIgnoredLines
  
  public int getIgnoredLines()
  
  Get the number of lines that the last call to nextData() or hasMoreData() ignored because they matched the ignore regular expression.
  Returns:
  
  The number of ignored lines
  
  See Also:
  
  setIgnoreRegexp(Pattern)
  
  setKeepSkippedLines(boolean)
- getUnknownLines
  
  public int getUnknownLines()
  
  Get the number of lines that the last call to nextData() or hasMoreData() ignored because they couldn't be interpreted as data lines.
  Returns:
  
  The number of unknown lines
  
  See Also:
  
  setKeepSkippedLines(boolean)
- getNumSkippedLines
  
  public int getNumSkippedLines()
  
  Get the number of lines that the last call to nextData() or hasMoreData() ignored because they matched the ignore regular expression or couldn't be interpreted as data lines.
  Returns:
  
  The number of ignored or unknown lines
  
  See Also:
  
  getIgnoredLines()
  
  getUnknownLines()
  
  getSkippedLines()
- getSkippedLines
  
  public List<FlatFileParser.Line> getSkippedLines()
  
  Get the lines that were skipped during the last call to nextData() or hasMoreData(). The list is only available if setKeepSkippedLines(boolean) has been set to true (default is false).
  Returns:
  
  A list with the skipped lines
  
  See Also:
  
  setKeepSkippedLines(boolean)
- hasMoreSections
  
  public boolean hasMoreSections() throws IOException
  
  Check if the input stream contains more sections. If it is unknown whether there are more sections or not, this method will start reading more lines from the stream. Each line is checked to see if it matches the section regular expression. The parser will continue until a section line is found or the end of the file is reached. If the method returns TRUE, the section may be retrieved with the nextSection() method. If the section regular expression isn't specified, the method returns FALSE and won't parse any lines.
  Returns:
  
  TRUE if there are more sections, FALSE otherwise
  
  Throws:
  
  IOException - If there is an error reading from the input stream
  
  See Also:
  
  nextData()
- nextSection
  
  public FlatFileParser.Line nextSection() throws IOException
  
  Get the next line that matches the section regular expression.
  Returns:
  
  The line that matched the regular expression
  
  Throws:
  
  IOException
  
  See Also:
  
  hasMoreSections()
