#1238 closed task (fixed)
Check performance of BioAssaySetExporter
Reported by: | Nicklas Nordborg | Owned by: | Nicklas Nordborg |
---|---|---|---|
Priority: | critical | Milestone: | BASE 2.11 |
Component: | coreplugins | Version: | |
Keywords: | Cc: |
Description (last modified by )
Attachments (6)
Change History (20)
comment:1 by , 16 years ago
Description: | modified (diff) |
---|
comment:2 by , 16 years ago
Description: | modified (diff) |
---|
comment:3 by , 16 years ago
Owner: | changed from | to
---|---|
Status: | new → assigned |
comment:4 by , 16 years ago
I have now added the initial measurements which are made with a BASE version that is more or less identical to BASE 2.10. See the 'export-times-ref.txt' that is attached to this ticket.
Here is a summary of a few things that were found:
Number of selected reporter fields
This seems to be an important factor for the execution time: the more fields that are selected, the longer the export takes. The additional joins required to select just a single reporter field are not that important (compare Export 1 with Export 2 and Export 3). Since this information is repeated across bioassays, a possible performance enhancement is to split the export into two parts. The first part selects all reporter information and caches it in memory or in a file. The second part would then perform almost as well as Export 1.
Number of selected raw data fields
The execution time increases a lot even if only a single raw data field is selected. It increases further if all raw data fields are selected, but not as much as with reporter fields. It seems that the additional joins required to access raw data are important (compare Export 1 with Export 4 and Export 5). I think we need to analyze the SQL query to see if any performance enhancements are possible. The join involves the same number of tables as for reporter fields. The main difference is that the raw data table has a lot more records than the reporter table.
No big difference between file formats
There seems to be no real difference in execution time between the file formats. This is a bit surprising since the SQL queries used by the Matrix and Serial formats are structured somewhat differently. (Note that MeV can only be compared with "Export 3" since the MeV format always includes all reporter fields.) The main difference is that Matrix requires a sort on the position column. Another difference is that the Matrix export uses only a single query while the Serial export uses one query for each bioassay. It may be that those two effects cancel each other out. Experience has told us to keep the number of queries down, and I think it would be possible to rewrite the Serial export to use only a single query. Watching the progress bar for the Matrix export makes it clear that about 80-90% of the time is spent executing the SQL query and only about 10-20% iterating over the result and generating the file.
Next steps
- Add code in the 'performance' test program that adds jobs for the 6 types of export
- Repeat the tests with the filtered bioassay set
- Try to re-write the serial export to only use a single query.
- Try to re-write the export to get reporter information in a separate query, cache it, and then merge it with the main query
comment:5 by , 16 years ago
(In [4803]) References #1238: Check performance of BioAssaySetExporter
Added test code for generating export jobs.
by , 16 years ago
Attachment: | export-times-auster-ref.txt added |
---|
Reference measurements of BioAssaySetExporter on auster
comment:6 by , 16 years ago
Try to re-write the serial export to only use a single query.
This seems to have no effect. Using 1 query for everything or 80 queries each returning 1/80th of the data uses more or less the same time. The sort on the position column happens in the serial case as well, but it seems like it doesn't have any effect on the time.
This result may seem to contradict previous experience, particularly with the Lowess plug-in, which gained a lot in performance by decreasing the number of queries. But I think it can be explained by the columns that were used in the filter expression. In Lowess we filtered on an expression on the 'raw.block' number, but the BioAssaySetExporter filters on 'column'. The difference is that 'column' is part of the primary key but the block number isn't. Thus, finding rows with a matching 'column' value is quick, but finding the correct block means that all data in a bioassay set must be processed for each query. The end result in Lowess was that each query used the same amount of time as a query that returned all data in one go.
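The difference between filtering on a primary key column and on an unindexed column can be illustrated with a small sketch. It uses an in-memory SQLite database rather than MySQL, and the table and column names ('spots', 'col', 'block', 'intensity') are hypothetical, so it only shows the general principle and not the actual BASE schema or query plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical spot table: 'col' and 'position' form the primary key,
# 'block' is an ordinary, unindexed column.
conn.execute(
    "CREATE TABLE spots ("
    " col INTEGER NOT NULL,"
    " position INTEGER NOT NULL,"
    " block INTEGER NOT NULL,"
    " intensity REAL,"
    " PRIMARY KEY (col, position))"
)
conn.executemany(
    "INSERT INTO spots VALUES (?, ?, ?, ?)",
    [(c, p, p % 48, float(c * p)) for c in range(1, 81) for p in range(1, 501)],
)

def query_plan(sql, params):
    """Return SQLite's query plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)

# Filter on the leading primary key column: an index lookup.
plan_by_column = query_plan("SELECT intensity FROM spots WHERE col = ?", (7,))
# Filter on the unindexed column: every row has to be examined.
plan_by_block = query_plan("SELECT intensity FROM spots WHERE block = ?", (7,))
print(plan_by_column)
print(plan_by_block)
```

With the filter on 'col' the plan should report an index search, while the filter on 'block' should report a full table scan. Repeating the second kind of query once per block would therefore read the whole table each time.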
comment:7 by , 16 years ago
Try to re-write the export to get reporter information in a separate query, cache it, and then merge it with the main query
This seems a lot more promising. Initial results indicate a performance increase by a factor of up to at least 15. For example, a MeV export of 80 bioassays used to take 20+ minutes and now takes only 1.30. The performance gain is smaller for smaller bioassay sets (10 bioassays, 2.30 --> 0.30), but this is expected since the cached information is reused fewer times.
comment:8 by , 16 years ago
(In [4807]) References #1238: Check performance of BioAssaySetExporter
This "fix" makes the exporter use a separate query for all reporter data. Since this data is otherwise repeated for each bioassay, we expect to save a lot of time by not including it in the main query. The test results indeed show a big performance increase in Export 2, 3 and 6 and no change in Export 1, 4 and 5. This is as expected since Export 1, 4 and 5 don't include any reporter columns. Another interesting result is that the number of selected reporter columns no longer seems to affect the time. Compare the results for Export 1/2/3 and Export 5/6.
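The idea of loading the reporter data once and merging it into the main result can be sketched roughly as follows. This is a simplified illustration with an in-memory SQLite database and a hypothetical two-table schema ('reporters' and 'spots', joined on 'position'); it is not the code committed in [4807].

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema used only for this sketch.
conn.executescript("""
    CREATE TABLE reporters (position INTEGER PRIMARY KEY, name TEXT, symbol TEXT);
    CREATE TABLE spots (bioassay INTEGER, position INTEGER, intensity REAL);
""")
conn.executemany("INSERT INTO reporters VALUES (?, ?, ?)",
                 [(1, "R1", "abc"), (2, "R2", "def")])
conn.executemany("INSERT INTO spots VALUES (?, ?, ?)",
                 [(b, p, 10.0 * b + p) for b in (1, 2, 3) for p in (1, 2)])

# Part 1: load the reporter annotations once and keep them in memory.
reporter_cache = {
    position: (name, symbol)
    for position, name, symbol in conn.execute(
        "SELECT position, name, symbol FROM reporters")
}

# Part 2: the main query selects spot data only (no reporter joins);
# the reporter columns are merged in from the cache while writing.
lines = []
for bioassay, position, intensity in conn.execute(
        "SELECT bioassay, position, intensity FROM spots "
        "ORDER BY bioassay, position"):
    name, symbol = reporter_cache[position]
    lines.append(f"{bioassay}\t{position}\t{name}\t{symbol}\t{intensity}")

print("\n".join(lines))
```

Because the cache lookup costs the same no matter how many reporter columns are stored per position, a design like this would be consistent with the observation that the number of selected reporter columns no longer affects the export time.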
by , 16 years ago
Attachment: | export-times-auster-4807.txt added |
---|
Export times after the fix in [4807]
by , 16 years ago
Attachment: | export-times-grey-ref.txt added |
---|
Reference measurements of BioAssaySetExporter on grey
comment:9 by , 16 years ago
The first test results from 'grey' have finally arrived. The results are more or less the same as on auster. One difference is that we imported 1000 raw bioassays instead of 100 and are doing export tests with 10, 20, 40, 80, 160, 320 and 640 bioassays. The main results from 'auster' are also valid for 'grey', but there are some new observations.
Number of bioassays
In some tests there is a huge increase in time as the number of bioassays goes to 160 and above. The initial thought is that this is because MySQL can't perform the query with the available RAM and has to use a temporary file on disk. But this doesn't explain why the Serial export is slowing down, since it only works with a single bioassay at a time. If it can do 80 bioassays in 2 minutes it should be able to do 640 bioassays in 16 minutes (see Export 5, which needs about 1 hour).
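The linear extrapolation behind the 16-minute figure can be written out as a small calculation. The helper name is made up for this sketch, and the observed time is the rounded "about 1 hour" from the comment, so the resulting ratio is only approximate.

```python
def linear_estimate(ref_count, ref_minutes, target_count):
    """Extrapolate export time assuming cost grows linearly with bioassays."""
    return ref_minutes * target_count / ref_count

# Figures taken from the comment above: 80 bioassays in 2 minutes.
expected = linear_estimate(80, 2.0, 640)
# "About 1 hour" for Export 5 with 640 bioassays (rounded).
observed = 60.0
slowdown = observed / expected
print(expected, slowdown)
```

Under these assumptions the Serial export with 640 bioassays is almost four times slower than linear scaling would predict.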
Next step
- Patch 'grey' with the fix in [4807] and redo the tests.
by , 16 years ago
Attachment: | export-times-grey-4807.txt added |
---|
Export times on 'grey' after the fix in [4807]
comment:10 by , 16 years ago
(In [4810]) References #1238: Check performance of BioAssaySetExporter
Cleaning up old code.
by , 16 years ago
Attachment: | reference-times-grey.txt added |
---|
Reference measurements of other plug-ins on 'grey'
comment:11 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
comment:12 by , 16 years ago
(In [4820]) References #1238: Check performance of BioAssaySetExporter
Fixes a NullPointerException
comment:13 by , 16 years ago
(In [4821]) References #1238: Check performance of BioAssaySetExporter
The 'position' column needs special handling since technically it is part of the spot data, but in practice it is used as a reporter field. The fix is needed since otherwise the position would not be written for spots that don't have a reporter.
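A minimal sketch of the problem, assuming a hypothetical in-memory reporter cache keyed by position: if 'position' were written from the cached reporter record, spots without a reporter would lose it, so it has to be taken from the spot row instead. The names and data are illustrative only and are not taken from [4821].

```python
# Hypothetical cache: position 2 has no reporter attached.
reporter_cache = {1: "R1", 3: "R3"}
# Spot rows as (position, intensity) pairs.
spot_rows = [(1, 0.5), (2, 0.7), (3, 0.9)]

def export_line(position, intensity):
    # Take 'position' from the spot row itself; only the reporter-specific
    # column comes from the cache, and it may be missing.
    reporter = reporter_cache.get(position)
    return f"{position}\t{reporter or ''}\t{intensity}"

lines = [export_line(position, intensity) for position, intensity in spot_rows]
print("\n".join(lines))
```

The line for position 2 still starts with the position and simply leaves the reporter column empty.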
comment:14 by , 16 years ago
(In [4841]) References #1238: Check performance of BioAssaySetExporter
Fixes NullPointerException
Initial tests seem to indicate that the main factor is the number of selected fields. On the standard 'roles' test data set (4 assays, 35912 spots each) I get:
This is just an indication. More tests with larger data sets are required. There may also be other factors that don't show up until the data sets grow larger.