Class CharsetDetector

java.lang.Object
net.sf.basedb.util.charset.CharsetDetector

public class CharsetDetector
extends Object
Utility class for testing whether a text stream can be parsed using a given character set. There are two sides to the testing. The technical side checks for invalid byte sequences and similar decoding errors. This works well for UTF-8 but cannot, for example, discriminate between the different ISO-8859-? or Windows-? encodings. The content side checks that the parsed content contains some expected text strings. With a careful choice of strings to look for, this can discriminate between the different ISO-8859-? or Windows-? encodings.
Since:
3.15
Author:
nicklas
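
The two sides of the testing described above can be illustrated with the standard java.nio.charset API alone. The following is a minimal sketch, not the implementation used by this class: the class name TwoSidedCheckSketch and its methods decodeStrict and matches are hypothetical, and a plain String.contains call stands in for the content test.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of net.sf.basedb
class TwoSidedCheckSketch
{
    /** Technical side: decode strictly; returns null if the bytes are invalid. */
    static String decodeStrict(byte[] data, Charset charset)
    {
        try
        {
            return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString();
        }
        catch (CharacterCodingException ex)
        {
            return null;
        }
    }

    /** Content side: the decoded text must also contain the expected string. */
    static boolean matches(byte[] data, Charset charset, String expected)
    {
        String text = decodeStrict(data, charset);
        return text != null && text.contains(expected);
    }

    public static void main(String[] args)
    {
        // "Price " followed by byte 0x80, which is the euro sign in windows-1252
        byte[] data = { 'P', 'r', 'i', 'c', 'e', ' ', (byte)0x80 };
        String euro = "\u20AC";
        Charset win1252 = Charset.forName("windows-1252");

        // A lone 0x80 byte is malformed UTF-8, so the technical check rejects it
        System.out.println(decodeStrict(data, StandardCharsets.UTF_8) != null); // false
        // ISO-8859-1 accepts every byte, so only the content check can reject it
        System.out.println(decodeStrict(data, StandardCharsets.ISO_8859_1) != null); // true
        System.out.println(matches(data, StandardCharsets.ISO_8859_1, euro)); // false
        System.out.println(matches(data, win1252, euro)); // true
    }
}
```

In this sketch both single-byte encodings pass the technical check, and only the expected euro sign tells them apart, which is the situation the content side is meant to handle.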
  • Field Details

    • charset

      private final Charset charset
    • lineTester

      private final StringDetector lineTester
    • parsingFailure

      private IOException parsingFailure
    • parsedBytes

      private long parsedBytes
    • parsedLines

      private int parsedLines
  • Constructor Details

    • CharsetDetector

      public CharsetDetector​(Charset charset)
      Create a detector for the given character set that only detects technical issues. Useful for UTF-8.
    • CharsetDetector

      public CharsetDetector​(Charset charset, StringDetector lineTester)
      Create a detector for the given character set that uses both technical and content-based detection. If no lineTester is given, only technical detection is used.
  • Method Details

    • getCharset

      public Charset getCharset()
      Get the character set this detector is configured to use.
    • testIt

      public boolean testIt​(InputStream in)
      Test if the given input stream can be parsed with the configured character set. The stream is read until the end is reached or until there is a decoding failure.
    • testIt

      public boolean testIt​(InputStream in, long maxBytes, int maxLines)
      Test if the given input stream can be parsed with the configured character set. The stream is read until maxBytes bytes or maxLines lines have been parsed, until the end is reached, or until there is a decoding failure.
      Parameters:
      maxBytes - Max number of bytes to parse or -1 to not use a limit
      maxLines - Max number of lines to parse or -1 to not use a limit
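
How a byte and line limit might interact with strict decoding can be sketched with a CharsetDecoder that is fed in chunks. This is an assumption-laden illustration, not the code behind testIt: the class LimitedDecodeSketch and its method canDecode are hypothetical, the line limit is only checked between chunks, and a multi-byte sequence cut off by the byte limit is deliberately not treated as an error.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical illustration class; not part of net.sf.basedb
class LimitedDecodeSketch
{
    /** Use -1 for maxBytes or maxLines to disable that limit. */
    static boolean canDecode(InputStream in, Charset charset, long maxBytes, int maxLines)
        throws IOException
    {
        CharsetDecoder decoder = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer bytes = ByteBuffer.allocate(8192);
        // Large enough to hold everything one full byte buffer can decode to
        CharBuffer chars = CharBuffer.allocate(
            (int)Math.ceil(bytes.capacity() * decoder.maxCharsPerByte()));
        long parsedBytes = 0;
        int parsedLines = 0;

        while ((maxBytes < 0 || parsedBytes < maxBytes)
            && (maxLines < 0 || parsedLines < maxLines))
        {
            // Never read past the byte limit
            int wanted = bytes.remaining();
            if (maxBytes >= 0) wanted = (int)Math.min(wanted, maxBytes - parsedBytes);
            int n = in.read(bytes.array(), bytes.position(), wanted);
            boolean endOfInput = n < 0;
            if (n > 0)
            {
                bytes.position(bytes.position() + n);
                parsedBytes += n;
            }

            bytes.flip();
            // With endOfInput=false an incomplete trailing sequence is kept for later
            CoderResult result = decoder.decode(bytes, chars, endOfInput);
            if (result.isError()) return false;

            // Count line breaks in the decoded chunk
            chars.flip();
            while (chars.hasRemaining())
            {
                if (chars.get() == '\n') parsedLines++;
            }
            chars.clear();
            bytes.compact();
            if (endOfInput) break;
        }
        return true;
    }
}
```

With limits in place, a file that is only invalid beyond the limit still passes in this sketch, so small limits trade accuracy for speed.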
    • getParsedBytes

      public long getParsedBytes()
      Get the number of bytes that the last test operation parsed.
    • getParsedLines

      public int getParsedLines()
      Get the number of lines that the last test operation parsed.
    • getParsingFailure

      public IOException getParsingFailure()
      If the last test failed, get the exception that was thrown by the parser.
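
The pattern of keeping the parser's exception for later inspection can be sketched with an InputStreamReader built on a strict decoder. This is an illustration under stated assumptions rather than this class's code: FailureCaptureSketch, test and getFailure are hypothetical names, and the sketch closes the stream when it is done.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical illustration class; not part of net.sf.basedb
class FailureCaptureSketch
{
    private final Charset charset;
    private IOException failure; // null if the last test passed

    FailureCaptureSketch(Charset charset)
    {
        this.charset = charset;
    }

    /** Read all lines with a strict decoder; remember the exception on failure. */
    boolean test(InputStream in)
    {
        failure = null;
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in,
            charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT))))
        {
            while (reader.readLine() != null)
            {
                // Only decoding matters here; the text itself is discarded
            }
            return true;
        }
        catch (IOException ex)
        {
            // A strict decoder reports invalid input as a CharacterCodingException,
            // which is a subclass of IOException
            failure = ex;
            return false;
        }
    }

    IOException getFailure()
    {
        return failure;
    }
}
```

A caller of such a class would check the boolean result first and only consult the stored exception when the test has failed.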