Opened 16 years ago

Closed 16 years ago

Last modified 16 years ago

#869 closed defect (fixed)

Upload + unpack of compressed file may corrupt data

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: blocker Milestone: BASE 2.5.1
Component: web Version:
Keywords: Cc:

Description (last modified by Nicklas Nordborg)

When uploading some compressed files and selecting to decompress them things may go wrong. In the published release it seems like an exception is always thrown: 
bad header in block 915 record 8, header magic is not 'ustar' or
unix-style zeros, it is '5649464895554', or 
(dec) 56, 49, 46, 48, 9, 55, 54
  at Source)
  at net.sf.basedb.plugins.TarFileUnpacker.unpack(
  at org.apache.jsp.filemanager.upload.upload_jsp._jspService(

It seems like the exception is happening after processing a file. I have changed the code to ignore the exception just to see what happens. Everything seems fins but there are two issues:

  • The decompressed file has been corrupted. This can be seen by them having different md5 values. The raw upload has the same md5 as I get if I run 'md5sum' from the local command prompt. Comparing the two files I find that the start to differ at line 393511. See below for examples.
  • The tar.bz2 file is not complete. It's size on the BASE server (1.0MB) is smaller than the original file (2.1 MB).

The most surprising thing is that if I upload the tar.bz2 without unpacking it the tar.bz2 file is uploaded correctly and it is then possible to decompress the file without corrupting the contents.

Examples of corruption

Original file around line 393511:

525	614	114.0	23.6	 20
526	614	250.0	29.6	 25
527	614	75.0	15.4	 25
528	614	455.5	52.1	 20
529	614	72.0	13.9	 25

Corrupted file:

525	614	114.0	23.6	 20
526	614	250.0	29.6	 25.0	 25
139	22	618.0	46.3	19	617	77.7	114.8	118.0	48.3.0	 25
34	62	633.0	16	627	77	64.0	32	6	 20
23	19	 20


In the cases I have found an exception is always thrown if the file is corrupted, but since the exception seems to happen when looking for the next file, I am not sure that this is true in all possible cases. In all cases it also seems to work if the compressed file is first uploaded without decompressing. Decompressing can be done using the "Run plugin" button from the single-item view page of the compressed file.

Test data

I have added the test data files I have used to the testdata repository referenced from the DeveloperInformation page. It is located in the ticket.869 subdirectory and contains three uncompressed files and three compressed. The only one that can be uploaded and decompressed at the same time is the MG_U74Av2.cdf.bz2 file.

Change History (6)

comment:1 by Nicklas Nordborg, 16 years ago

Description: modified (diff)

comment:2 by Nicklas Nordborg, 16 years ago

Owner: changed from everyone to Nicklas Nordborg
Status: newassigned

comment:3 by Nicklas Nordborg, 16 years ago

An interesting note is that it seems to work if the compressed file is a GZIP or ZIP. So far the only failing files are BZIP2 files.

comment:4 by Nicklas Nordborg, 16 years ago

After tons of debug output I think I have narrowed it down to the method. There are actually two problems with this method.

  1. It returns the byte value without converting it's range. A byte has a value between -128 and +127, but the read method should return a value in the range 0-255 with -1 as a special marker for end of stream. This means that a uploaded byte with the value 255 is returned as -1 from this method. This may be the cause for the truncation of the original file. This problem should be solved by converting the byte to an int and the bit-masking with 0xff.
  1. As soon as the upload has been completed it returns -1 to signal end of stream. It doesn't consider the case that some of the uploaded bytes may not have been transfered down the pipeline yet. This causes the last bytes of the file upload to never reach the BZIP unpacker, and may very well be the reason that the TAR unpacker fails later on. This problem can be solved by checking the 'available' variable just as the read(byte[]) method does.

comment:5 by Nicklas Nordborg, 16 years ago

Resolution: fixed
Status: assignedclosed

(In [4051]) Fixes #869: Upload + unpack of compressed file may corrupt data

comment:6 by Nicklas Nordborg, 16 years ago

I want to add some more information about how I solved this bug. It may be useful information in the future in case similar bugs should appear in other places.

There are so many Input/OutputStream involved in the file upload+unpacking so it is not very easy to find the spot were it is goes wrong.

First step was to add debug output in strategic places, in this case in the various read() methods of the streams involved. This produced huge log files, but luckily with enough information to find to problem. The important information here was to know which methods got called and what they returned. There are three cases:

File upload without unpacking

  1. First stream is FileUpload.UploadStream. Data is mostly read with read(byte[]) method.
  2. FileUtil.copy() copies this to File.UploadStream.

This happens in the upload.jsp script and is all ok and simple.

File upload with unpacking

  1. First stream is FileUpload.UploadStream. Data is mostly read with read() when using BZIP format. With GZIP and ZIP data is read with read(byte[]).
  2. Second stream is InputStreamTracker. Created by TarFileUnpacker to keep track of the number of bytes that has been read so far which is needed for progress reportering.
  3. Third stream is PushBackInputStream. Created by TarFileUnpacker to check the format of the stream (BZIP or GZIP).
  4. Forth stream is CBzip2InputStream or GZipInputStream.
  5. Fifth stream is TarInputStream.
  6. FileUtil.copy() copies this to File.UploadStream.

This case fails with BZIP format but not with GZIP. We can get rid of the InputStreamTracker and PushBackInputStream if we hardcode a few things. No change in the result. The debug output also shows that are returning negative values. Fixing that doesn't fix the problem, but the truncated file problem goes away.

Unapcking already uploaded file

This case is almost identical to the previous, except that the FileUpload.UploadStream is replaced with a instead. This makes the problem go away and thus the main suspect is of course the FileUpload.UploadStream class. And the main suspect method is of course the read() method since everything seems to work when read(byte[]) is used.

With this information it is easy to spot that the end-of-stream handling is different between the two methods. A final test is to add BufferedInputStream between streams 3 and 4. This effectively changes all read() calls to read(byte[]) calls, and now the problem is gone for BZIP as well.

Changing the implementation of the method solves the problem, but I leave the BufferedInputStream:s in place as well, since it reduces the number of calls going through the entire stream stack.

Note: See TracTickets for help on using tickets.