#2006 closed enhancement (fixed)

Use parallel gzip implementation when compressing files

Reported by: nicklas Owned by: nicklas
Priority: major Milestone: BASE 3.9
Component: core Version:
Keywords: Cc:

Description

Just found this: https://github.com/shevek/parallelgzip

Which seems interesting since we already know that pigz is a good performance booster: http://baseplugins.thep.lu.se/ticket/809#comment:5

I think we should try this out for the "Store compressed" option when saving files to the BASE file system. It may not improve things when loading files over the network, but there are cases where files are created locally.

The release export wizard developed for reggie (http://baseplugins.thep.lu.se/ticket/887) can generate files several GB in size. On my dev computer the throughput when storing compressed is 3 MB/s, compared to 20 MB/s when storing uncompressed.

Attachments (1)

pgzip-test.png (15.4 KB) - added by nicklas 18 months ago.
Using 7z to test a file that was created with PGZip


Change History (12)

comment:1 Changed 18 months ago by nicklas

  • Owner changed from everyone to nicklas
  • Status changed from new to assigned

comment:2 Changed 18 months ago by nicklas

Tested the parallel implementation with the release exporter. Throughput is up to 8 MB/s. Since the compressed file size is only about 25% of the uncompressed size, this means we also save time compared to the uncompressed alternative.
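To make the comparison concrete, here is a back-of-the-envelope calculation for a hypothetical 1 GB file. It assumes the 8 MB/s figure refers to compressed bytes written to storage (the comment does not say, so this interpretation is an assumption); all names and the file size are illustrative:

```java
public class ThroughputEstimate {
	public static void main(String[] args) {
		double fileMB = 1024.0;    // hypothetical 1 GB source file
		double ratio = 0.25;       // compressed size is ~25% of the original
		double gzipMBps = 8.0;     // measured rate when storing compressed
		double plainMBps = 20.0;   // measured rate when storing uncompressed

		// Uncompressed: all 1024 MB go through at 20 MB/s
		double plainSeconds = fileMB / plainMBps;
		// Compressed: only ~256 MB are written, at 8 MB/s
		double gzipSeconds = (fileMB * ratio) / gzipMBps;

		System.out.printf("uncompressed: %.0f s, compressed: %.0f s%n",
			plainSeconds, gzipSeconds);
	}
}
```

Under these assumptions the compressed path finishes in roughly 32 s versus roughly 51 s uncompressed, which is consistent with the claim that parallel gzip now beats the uncompressed alternative.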

comment:3 Changed 18 months ago by nicklas

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [7152]) Fixes #2006: Use parallel gzip implementation when compressing files

comment:4 Changed 18 months ago by nicklas

Reopened due to #2016.

comment:5 Changed 18 months ago by nicklas

  • Resolution fixed deleted
  • Status changed from closed to reopened

Changed 18 months ago by nicklas

Using 7z to test a file that was created with PGZip

comment:6 Changed 18 months ago by nicklas

Digging up the file from the internal BASE storage and testing it with 7z results in an error. See the attached pgzip-test.png ("Using 7z to test a file that was created with PGZip").

comment:7 Changed 18 months ago by nicklas

In theory it should be possible to handle the error while reading the file, since we can compare the actual number of bytes that have been read with the known size of the original file. Any errors that happen after that point can be ignored. For example, wrapping the GZIPInputStream with something like this seems to work:

@Override
public int read(byte[] buf, int start, int len)
  throws IOException
{
  try
  {
    return super.read(buf, start, len);
  }
  catch (EOFException ex)
  {
    // 'inf' is the Inflater inherited from InflaterInputStream;
    // only re-throw if we have not yet produced all expected bytes
    if (inf.getBytesWritten() != getSize()) throw ex;
  }
  return -1; // treat the trailer error as a normal end-of-stream
}
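A self-contained version of the idea could look like the sketch below. The class name, constructor, and the truncation in the demo are all hypothetical and only illustrate the mechanism: a gzip stream whose trailer is damaged raises EOFException from read(), which we swallow once all expected uncompressed bytes have been produced.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.*;

// Hypothetical wrapper, not the actual BASE class: tolerates a premature
// EOF in the gzip trailer once the known original size has been reached.
class TolerantGZIPInputStream extends GZIPInputStream {
	private final long expectedSize; // known size of the original file

	TolerantGZIPInputStream(InputStream in, long expectedSize) throws IOException {
		super(in);
		this.expectedSize = expectedSize;
	}

	@Override
	public int read(byte[] buf, int start, int len) throws IOException {
		try {
			return super.read(buf, start, len);
		}
		catch (EOFException ex) {
			// 'inf' is the protected Inflater from InflaterInputStream
			if (inf.getBytesWritten() != expectedSize) throw ex;
		}
		return -1; // all expected bytes delivered; ignore the broken trailer
	}
}

public class TolerantReadDemo {
	public static void main(String[] args) throws IOException {
		byte[] original = "hello gzip world".getBytes(StandardCharsets.UTF_8);
		ByteArrayOutputStream bos = new ByteArrayOutputStream();
		try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
			gz.write(original);
		}
		// Simulate a corrupt file by dropping the last 4 trailer bytes
		byte[] truncated = Arrays.copyOf(bos.toByteArray(), bos.size() - 4);

		ByteArrayOutputStream out = new ByteArrayOutputStream();
		try (InputStream in = new TolerantGZIPInputStream(
				new ByteArrayInputStream(truncated), original.length)) {
			byte[] buf = new byte[256];
			int n;
			while ((n = in.read(buf, 0, buf.length)) != -1) out.write(buf, 0, n);
		}
		System.out.println(new String(out.toByteArray(), StandardCharsets.UTF_8));
	}
}
```

The demo recovers the full original content even though the trailer is missing, which matches the behavior described above: the error is only ignored when the byte count agrees with the known file size.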

However, I am personally not so happy with this solution, which is more or less a "hack" to work around the real problem of corrupt files being created in the first place. I think we should either abandon the PGZip implementation or fix the writing of the file.

comment:8 Changed 18 months ago by nicklas

(In [7170]) References #2006 and #2016.

Removed the parallelgzip jar file and added the source files to the BASE core package instead. The intention is to fix the file size problem. The current code is the original code as downloaded from https://github.com/shevek/parallelgzip (version 1.0.1). The code does not compile because it uses non-standard annotations from the "javax.annotation" package.

comment:9 Changed 18 months ago by nicklas

(In [7171]) References #2006 and #2016.

Fixed compilation errors by removing @Nonnull and @Nonnegative annotations.

comment:10 Changed 18 months ago by nicklas

(In [7172]) References #2006 and #2016.

Changes the bytesWritten variable from int to long. This seems to trigger a proper close(). The saved file is 10 bytes larger than before and there are no errors when reading it from BASE or when testing it with 7z.
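The comment does not spell out why the int counter corrupted the stream, but one plausible factor with multi-GB files is plain integer overflow: an int byte counter silently wraps negative once more than 2 GiB have passed through it. A minimal illustration (all names and sizes are hypothetical, not taken from the parallelgzip source):

```java
public class OverflowDemo {
	public static void main(String[] args) {
		long fileSize = 3L * 1024 * 1024 * 1024; // 3 GiB, beyond Integer.MAX_VALUE
		long chunk = 64 * 1024;                  // write in 64 KiB chunks
		int bytesWrittenInt = 0;                 // counter as it was before the fix
		long bytesWrittenLong = 0;               // counter as it is after the fix

		for (long done = 0; done < fileSize; done += chunk) {
			bytesWrittenInt += chunk;  // silently wraps past 2^31 - 1
			bytesWrittenLong += chunk; // stays correct
		}
		System.out.println(bytesWrittenInt);  // negative: the int overflowed
		System.out.println(bytesWrittenLong); // 3221225472
	}
}
```

Any size or end-of-data check based on the wrapped int value would misbehave for files this large, which is consistent with the close()/trailer problems seen only on multi-GB exports.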

comment:11 Changed 18 months ago by nicklas

  • Resolution set to fixed
  • Status changed from reopened to closed