#869 closed defect (fixed)
Upload + unpack of compressed file may corrupt data
| Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
|---|---|---|---|
| Priority: | blocker | Milestone: | BASE 2.5.1 |
| Component: | web | Version: | |
| Keywords: | | Cc: | |
Description
When uploading some compressed files and selecting to decompress them, things may go wrong. In the published release it seems like an exception is always thrown:

    com.ice.tar.InvalidHeaderException: bad header in block 915 record 8,
    header magic is not 'ustar' or unix-style zeros, it is '5649464895554',
    or (dec) 56, 49, 46, 48, 9, 55, 54
        at com.ice.tar.TarInputStream.getNextEntry(Unknown Source)
        at net.sf.basedb.plugins.TarFileUnpacker.unpack(TarFileUnpacker.java:163)
        at org.apache.jsp.filemanager.upload.upload_jsp._jspService(upload_jsp.java:205)

It seems like the exception happens after processing a file. I have changed the code to ignore the exception just to see what happens. Everything seems fine, but there are two issues:
- The decompressed file has been corrupted. This can be seen from the files having different MD5 values: the raw upload has the same MD5 as I get when running 'md5sum' from the local command prompt (a sketch of such a check follows this list). Comparing the two files, I find that they start to differ at line 393511. See below for examples.
- The tar.bz2 file is not complete. Its size on the BASE server (1.0 MB) is smaller than the original file (2.1 MB).
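For reference, here is a minimal sketch of the kind of checksum comparison used above, equivalent to running 'md5sum' from the command prompt (the class name is made up and is not part of BASE):

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;

// Prints the MD5 checksum of a file; usage: java Md5Check <file>
public class Md5Check {
    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(args[0]);
        try {
            byte[] buf = new byte[8192];
            int n;
            // Feed the whole file through the digest in bulk reads.
            while ((n = in.read(buf)) != -1) md.update(buf, 0, n);
        } finally {
            in.close();
        }
        StringBuilder sum = new StringBuilder();
        for (byte b : md.digest()) sum.append(String.format("%02x", b & 0xff));
        System.out.println(sum);
    }
}
```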
The most surprising thing is that if I upload the tar.bz2 without unpacking it, the file is uploaded correctly, and it is then possible to decompress it without corrupting the contents.
Examples of corruption
Original file around line 393511:

    525 614 114.0 23.6 20
    526 614 250.0 29.6 25
    527 614 75.0 15.4 25
    528 614 455.5 52.1 20
    529 614 72.0 13.9 25

Corrupted file:

    525 614 114.0 23.6 20
    526 614 250.0 29.6 25.0 25 139 22 618.0 46.3 19 617 77.7 114.8 118.0 48.3.0 25 34 62 633.0 16 627 77 64.0 32 6 20 23 19 20

Workaround
In the cases I have found, an exception is always thrown if the file is corrupted, but since the exception seems to happen when looking for the next file, I am not sure this holds in all possible cases. In all cases it also seems to work if the compressed file is first uploaded without decompressing; decompressing can then be done using the "Run plugin" button from the single-item view page of the compressed file.
Test data
I have added the test data files I have used to the testdata repository referenced from the DeveloperInformation page. They are located in the ticket.869 subdirectory, which contains three uncompressed files and three compressed ones. The only one that can be uploaded and decompressed at the same time is the MG_U74Av2.cdf.bz2 file.
Change History (6)
comment:1 by , 17 years ago
| Description: | modified (diff) |
|---|---|
comment:2 by , 17 years ago
| Owner: | changed from to |
|---|---|
| Status: | new → assigned |
comment:3 by , 17 years ago
comment:4 by , 17 years ago
After tons of debug output I think I have narrowed it down to the `FileUpload.UploadStream.read()` method. There are actually two problems with this method:
- It returns the byte value without converting its range. A byte has a value between -128 and +127, but the read method should return a value in the range 0-255, with -1 as a special marker for end of stream. This means that an uploaded byte with the value 255 is returned as -1 from this method. This may be the cause of the truncation of the original file. This problem can be solved by converting the byte to an int and bit-masking it with 0xff.
- As soon as the upload has completed, it returns -1 to signal end of stream. It doesn't consider the case that some of the uploaded bytes may not have been transferred down the pipeline yet. This causes the last bytes of the file upload to never reach the BZIP unpacker, and may very well be the reason that the TAR unpacker fails later on. This problem can be solved by checking the 'available' variable just as the read(byte[]) method does (see the sketch after this list).
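A minimal sketch of what a fixed `read()` could look like; the field names (`buffer`, `pos`, `available`, `uploadComplete`) are assumptions for illustration, not the actual BASE code:

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch of a pipeline-backed upload stream with both fixes applied.
public abstract class UploadStreamSketch extends InputStream {
    protected byte[] buffer = new byte[8192];
    protected int pos = 0;
    protected int available = 0;        // buffered bytes not yet consumed
    protected volatile boolean uploadComplete = false;

    // Refills 'buffer' from the upload pipeline (elided here).
    protected abstract void fill() throws IOException;

    @Override
    public int read() throws IOException {
        // Fix 2: check 'available' before trusting the completion flag,
        // just as read(byte[]) does, so bytes still in the pipeline are
        // not dropped when the upload itself has finished.
        while (available == 0) {
            if (uploadComplete) return -1; // true end of stream
            fill();
        }
        available--;
        // Fix 1: mask with 0xff so the byte value 0xFF is returned as
        // 255 instead of -1, the end-of-stream marker.
        return buffer[pos++] & 0xff;
    }
}
```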
comment:5 by , 17 years ago
| Resolution: | → fixed |
|---|---|
| Status: | assigned → closed |
comment:6 by , 17 years ago
I want to add some more information about how I solved this bug. It may be useful in the future in case similar bugs appear in other places.
There are so many Input/OutputStreams involved in the file upload + unpacking that it is not easy to find the spot where it goes wrong.
The first step was to add debug output in strategic places, in this case in the various `read()` methods of the streams involved. This produced huge log files, but luckily with enough information to find the problem. The important information here was which methods were called and what they returned.
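The kind of wrapper this involves is easy to sketch with plain java.io (a made-up illustration, not the actual debug code):

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Made-up logging wrapper: reports which read method is called on a
// wrapped stream and what it returns.
class DebugInputStream extends FilterInputStream {
    private final String name;

    DebugInputStream(String name, InputStream in) {
        super(in);
        this.name = name;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        System.err.println(name + ".read() = " + b);
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        System.err.println(name + ".read(byte[]) = " + n);
        return n;
    }
}
```

There are three cases: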
File upload without unpacking
- First stream is `FileUpload.UploadStream`. Data is mostly read with the `read(byte[])` method.
- `FileUtil.copy()` copies this to `File.UploadStream`.

This happens in the upload.jsp script and is all OK and simple.
File upload with unpacking
- First stream is `FileUpload.UploadStream`. Data is mostly read with `read()` when using the BZIP format. With GZIP and ZIP, data is read with `read(byte[])`.
- Second stream is `InputStreamTracker`. Created by `TarFileUnpacker` to keep track of the number of bytes that have been read so far, which is needed for progress reporting.
- Third stream is `PushBackInputStream`. Created by `TarFileUnpacker` to check the format of the stream (BZIP or GZIP).
- Fourth stream is `CBzip2InputStream` or `GZipInputStream`.
- Fifth stream is `TarInputStream`.
- `FileUtil.copy()` copies this to `File.UploadStream`. (The shape of the whole stack is sketched in code below.)
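Since the BASE classes are internal, here is an analogous stack built only from standard classes, to show the layering: GZIP stands in for BZIP, and `java.io.PushbackInputStream` stands in for BASE's `PushBackInputStream`; the tracker and TAR layers are omitted.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;

// Analogous layered stack; decompresses args[0] (a .gz file) to args[1].
public class StackDemo {
    public static void main(String[] args) throws Exception {
        InputStream upload = new FileInputStream(args[0]);   // stands in for FileUpload.UploadStream
        PushbackInputStream sniff = new PushbackInputStream(upload, 2);
        byte[] magic = new byte[2];
        int got = sniff.read(magic);                         // peek at the format magic bytes
        if (got > 0) sniff.unread(magic, 0, got);            // push them back for the unpacker
        InputStream unzipped = new GZIPInputStream(sniff);   // stands in for CBzip2InputStream
        OutputStream out = new FileOutputStream(args[1]);    // stands in for File.UploadStream
        byte[] buf = new byte[8192];
        int n;
        while ((n = unzipped.read(buf)) != -1) out.write(buf, 0, n);
        out.close();
        unzipped.close();
    }
}
```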
This case fails with the BZIP format but not with GZIP. We can get rid of the `InputStreamTracker` and `PushBackInputStream` if we hardcode a few things, but that makes no change in the result. The debug output also shows that `FileUpload.UploadStream.read()` is returning negative values. Fixing that doesn't fix the whole problem, but the truncated-file problem goes away.
Unpacking an already uploaded file
This case is almost identical to the previous one, except that the `FileUpload.UploadStream` is replaced with a `java.io.FileInputStream`. This makes the problem go away, and thus the main suspect is of course the `FileUpload.UploadStream` class. The main suspect method is the `read()` method, since everything seems to work when `read(byte[])` is used.
With this information it is easy to spot that the end-of-stream handling is different between the two methods. A final test is to add a `BufferedInputStream` between streams 3 and 4. This effectively changes all `read()` calls to `read(byte[])` calls, and now the problem is gone for BZIP as well.
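That effect is easy to reproduce with plain java.io; this self-contained demo (not BASE code) shows that single-byte reads on a `BufferedInputStream` turn into a handful of bulk reads on the underlying stream:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferDemo {
    // A source that counts how its two read methods are called.
    static class CountingStream extends ByteArrayInputStream {
        int singleReads = 0, bulkReads = 0;
        CountingStream(byte[] data) { super(data); }
        @Override public synchronized int read() {
            singleReads++; return super.read();
        }
        @Override public synchronized int read(byte[] b, int off, int len) {
            bulkReads++; return super.read(b, off, len);
        }
    }

    public static void main(String[] args) throws IOException {
        CountingStream source = new CountingStream(new byte[10000]);
        InputStream in = new BufferedInputStream(source);
        while (in.read() != -1) { /* consume one byte at a time */ }
        // Prints single=0 bulk=3: all 10000 single-byte reads were
        // served from the buffer, which was filled by three bulk reads.
        System.out.println("single=" + source.singleReads
                + " bulk=" + source.bulkReads);
    }
}
```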
Changing the implementation of the `FileUpload.UploadStream.read()` method solves the problem, but I leave the `BufferedInputStream`s in place as well, since they reduce the number of calls going through the entire stream stack.
An interesting note is that it seems to work if the compressed file is a GZIP or ZIP file; so far the only failing files are BZIP2 files. This is consistent with the stream analysis above: with GZIP and ZIP the data is read with `read(byte[])`, which handles end-of-stream correctly, while the BZIP path goes through the broken single-byte `read()`.