Opened 15 years ago

Closed 15 years ago

Last modified 15 years ago

#1238 closed task (fixed)

Check performance of BioAssaySetExporter

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: critical Milestone: BASE 2.11
Component: coreplugins Version:
Keywords: Cc:

Description (last modified by Nicklas Nordborg)

This is slightly related to #1237 and may be caused by the same thing that is causing #903.

Attachments (6)

reference-times.txt (1.1 KB ) - added by Nicklas Nordborg 15 years ago.
Reference measurements of other plug-ins
export-times-auster-ref.txt (3.4 KB ) - added by Nicklas Nordborg 15 years ago.
Reference measurements of BioAssaySetExporter on auster
export-times-auster-4807.txt (2.9 KB ) - added by Nicklas Nordborg 15 years ago.
Export times after the fix in [4807]
export-times-grey-ref.txt (2.6 KB ) - added by Nicklas Nordborg 15 years ago.
Reference measurements of BioAssaySetExporter on grey
export-times-grey-4807.txt (2.9 KB ) - added by Nicklas Nordborg 15 years ago.
Export times on 'grey' after the fix in [4807]
reference-times-grey.txt (1.2 KB ) - added by Nicklas Nordborg 15 years ago.
Reference measurements of other plug-ins on 'grey'

Change History (20)

comment:1 by Nicklas Nordborg, 15 years ago

Description: modified (diff)

comment:2 by Nicklas Nordborg, 15 years ago

Description: modified (diff)

Initial tests seem to indicate that the main factor is the number of selected fields. On the standard 'roles' test data set (4 assays, 35912 spots each) I get:

  • ~10 seconds for both serial and matrix formats if ch1+ch2+externalId is selected. The setting for "merge on reporter" doesn't affect the result.
  • ~70 seconds for MeV format.
  • ~70 seconds for both serial and matrix format if ch1+ch2+all reporter fields are selected.
  • ~30 seconds if ch1+ch2+about half of the reporter fields are selected.

This is just an indication. More tests with larger data sets are required. There may be other factors as well that don't show up until the data sets grow larger.

comment:3 by Nicklas Nordborg, 15 years ago

Owner: changed from everyone to Nicklas Nordborg
Status: new → assigned

by Nicklas Nordborg, 15 years ago

Attachment: reference-times.txt added

Reference measurements of other plug-ins

comment:4 by Nicklas Nordborg, 15 years ago

I have now added the initial measurements, which were made with a BASE version that is more or less identical to BASE 2.10. See the 'export-times-ref.txt' that is attached to this ticket.

Here is a summary of a few things that were found:

Number of selected reporter fields
This seems to be an important factor for the execution time. The more fields that are selected, the longer it takes. The additional joins that are required to select just a single reporter field are not that important (compare Export 1 with Export 2 and Export 3). Since this information is repeated across bioassays, a possible performance enhancement may be to split the export into two parts. The first part selects all reporter information and caches it in memory or in a file. The second part would then perform almost as Export 1.

Number of selected raw data fields
The execution time increases a lot even if only a single raw data field is selected. It increases more if all raw data fields are selected, but not as much as with reporter fields. It seems like the additional joins that are required to access raw data are important (compare Export 1 with Export 4 and Export 5). I think we may need to analyze the SQL query to see if any performance enhancements are possible. The join involves the same number of tables as for reporter fields. The main difference is that the raw data table has a lot more records than the reporter table.

No big difference between file formats
There seems to be no real difference in execution time between the file formats. Note that MeV can only be compared with "Export 3" since the MeV format always includes all reporter fields. The lack of difference is a bit surprising since the SQL queries used by the Matrix and Serial formats are structured differently. The main difference is that Matrix requires a sort on the position column. Another difference is that the Matrix export only uses a single query but the Serial export uses one query for each bioassay. It may be that those two effects cancel each other out. Experience has told us to keep the number of queries down, and I think it would be possible to rewrite the Serial export to use only a single query. By watching the progress bar for the Matrix export it is clear that it uses about 80-90% of the time to execute the SQL query and only about 10-20% to iterate the result and generate the file.

Next steps

  • Add code in the 'performance' test program that adds jobs for the 6 types of export
  • Repeat the tests with the filtered bioassay set
  • Try to re-write the serial export to only use a single query.
  • Try to re-write the export to get reporter information in a separate query, cache it, and then merge it with the main query

comment:5 by Nicklas Nordborg, 15 years ago

(In [4803]) References #1238: Check performance of BioAssaySetExporter

Added test code for generating export jobs.

by Nicklas Nordborg, 15 years ago

Attachment: export-times-auster-ref.txt added

Reference measurements of BioAssaySetExporter on auster

comment:6 by Nicklas Nordborg, 15 years ago

Try to re-write the serial export to only use a single query.

This seems to have no effect. Using 1 query for everything or 80 queries each returning 1/80th of the data uses more or less the same time. The sort on the position column happens in the serial case as well, but it seems like it doesn't have any effect on the time.

This result may seem to contradict previous experience, particularly with the Lowess plug-in, which gained a lot in performance by decreasing the number of queries. But I think it can be explained by the columns that were used in the filter expression. In Lowess we filtered on an expression on the 'raw.block' number, but the BioAssaySetExporter filters on 'column'. The difference is that 'column' is part of the primary key but block number isn't. Thus, finding rows with a matching 'column' value is quick, but finding the correct block means that all data in a bioassay set must be processed for each query. The end result in Lowess was that each query used the same amount of time as a query that returned all data in one go.
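
The difference between filtering on a primary-key column and on a non-indexed column can be illustrated with a small sketch. It uses SQLite as a stand-in for MySQL, and the table and column names ('spots', 'col', 'block') are invented for the illustration, so it only shows the general principle and not the actual BASE schema or queries.

```python
import sqlite3

# Hypothetical stand-in schema: SQLite instead of MySQL, invented names.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE spots (
        col INTEGER NOT NULL,       -- plays the role of 'column' (part of the primary key)
        position INTEGER NOT NULL,
        block INTEGER NOT NULL,     -- plays the role of 'raw.block' (not indexed)
        intensity REAL,
        PRIMARY KEY (col, position)
    )
    """
)
conn.executemany(
    "INSERT INTO spots VALUES (?, ?, ?, ?)",
    [(c, p, p % 48, float(p)) for c in range(1, 9) for p in range(1, 501)],
)


def plan(sql):
    """Return the access-path descriptions reported by EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_by_col = plan("SELECT * FROM spots WHERE col = 3")
plan_by_block = plan("SELECT * FROM spots WHERE block = 3")

print(plan_by_col)    # index search via the primary key
print(plan_by_block)  # scan of the whole table
```

With many small queries, the first kind of filter only touches the matching rows each time, while the second kind reads the whole table for every query.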

comment:7 by Nicklas Nordborg, 15 years ago

Try to re-write the export to get reporter information in a separate query, cache it, and then merge it with the main query

This seems a lot more promising. Initial results indicate a performance increase by a factor of up to at least 15. For example, a MeV export of 80 bioassays that used to take 20+ minutes now takes only 1.30. The performance gain is smaller for smaller bioassay sets (10 bioassays, 2.30 --> 0.30), but this is expected since the cached information is reused fewer times.

comment:8 by Nicklas Nordborg, 15 years ago

(In [4807]) References #1238: Check performance of BioAssaySetExporter

This "fix" makes the exporter use a separate query for all reporter data. Since this data is always repeated for each bioassay, we expect to save a lot of time by not including it in the main query. The test results indeed show a big performance increase in Export 2, 3 and 6 and no change in Export 1, 4 and 5. This is what was expected since 1, 4 and 5 don't include any reporter columns. Another interesting result is that the number of selected reporter columns doesn't seem to affect the time any more. Compare the results for Export 1/2/3 and Export 5/6.
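
The general idea of loading the reporter data once and merging it with the spot data in application code can be sketched as follows. This is only an illustration under assumptions: it uses SQLite, and the schema, table names and column names are invented, so it is not the code from [4807].

```python
import sqlite3

# Hypothetical miniature schema; names and data are invented for the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reporters (id INTEGER PRIMARY KEY, external_id TEXT, name TEXT);
    CREATE TABLE spots (bioassay INTEGER, position INTEGER, reporter_id INTEGER,
                        ch1 REAL, ch2 REAL);
    INSERT INTO reporters VALUES (1, 'R1', 'alpha'), (2, 'R2', 'beta');
    INSERT INTO spots VALUES
      (1, 1, 1, 10.0, 11.0), (1, 2, 2, 20.0, 21.0), (1, 3, NULL, 30.0, 31.0),
      (2, 1, 1, 12.0, 13.0), (2, 2, 2, 22.0, 23.0), (2, 3, NULL, 32.0, 33.0);
    """
)

# Step 1: a single query loads the reporter annotations into memory.
reporter_cache = {
    rid: (external_id, name)
    for rid, external_id, name in conn.execute(
        "SELECT id, external_id, name FROM reporters"
    )
}


def export_rows():
    """Step 2: the main query reads spot data only; reporter fields come from the cache."""
    query = (
        "SELECT bioassay, position, reporter_id, ch1, ch2 "
        "FROM spots ORDER BY bioassay, position"
    )
    for bioassay, position, reporter_id, ch1, ch2 in conn.execute(query):
        # Spots without a reporter get empty reporter fields but are still exported.
        external_id, name = reporter_cache.get(reporter_id, ("", ""))
        yield (bioassay, position, external_id, name, ch1, ch2)


rows = list(export_rows())
for row in rows:
    print("\t".join(str(value) for value in row))
```

Because the cache lookup costs the same no matter how many reporter fields are stored per entry, a design like this would be consistent with the observation that the number of selected reporter columns no longer affects the time.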

by Nicklas Nordborg, 15 years ago

Attachment: export-times-auster-4807.txt added

Export times after the fix in [4807]

by Nicklas Nordborg, 15 years ago

Attachment: export-times-grey-ref.txt added

Reference measurements of BioAssaySetExporter on grey

comment:9 by Nicklas Nordborg, 15 years ago

The first test results from 'grey' have finally arrived. The results are more or less the same as on auster. One difference is that we imported 1000 raw bioassays instead of 100 and are doing export tests with 10, 20, 40, 80, 160, 320 and 640 bioassays. The main results from 'auster' are also valid for 'grey', but there are some new observations.

Number of bioassays
In some tests there is a huge increase in time as the number of bioassays goes to 160 and above. The initial thought is that this is because MySQL can't perform the query with the available RAM but has to use a temporary file on disk. But this doesn't explain why the Serial export is slowing down, since it only works with a single bioassay at a time. If it can do 80 bioassays in 2 minutes it should be able to do 640 bioassays in 16 minutes (see Export 5, which needs about 1 hour).
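
The linear estimate above is simple arithmetic, shown here as a short sketch. The figure of 60 minutes is only the rounded "about 1 hour" from the text, not an exact measurement, and the function name is made up for the illustration.

```python
def linear_estimate_minutes(ref_bioassays, ref_minutes, target_bioassays):
    """Scale a reference timing linearly with the number of bioassays."""
    return ref_minutes * target_bioassays / ref_bioassays


expected = linear_estimate_minutes(80, 2.0, 640)  # 80 bioassays took about 2 minutes
observed = 60.0                                   # "about 1 hour" for Export 5 (rounded)
slowdown = observed / expected

print(f"expected {expected:.0f} min, observed about {observed:.0f} min, "
      f"{slowdown:.2f}x slower than linear scaling")
```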

Next step

  • Patch 'grey' with the fix in [4807] and redo the tests.

by Nicklas Nordborg, 15 years ago

Attachment: export-times-grey-4807.txt added

Export times on 'grey' after the fix in [4807]

comment:10 by Nicklas Nordborg, 15 years ago

(In [4810]) References #1238: Check performance of BioAssaySetExporter

Cleaning up old code.

by Nicklas Nordborg, 15 years ago

Attachment: reference-times-grey.txt added

Reference measurements of other plug-ins on 'grey'

comment:11 by Nicklas Nordborg, 15 years ago

Resolution: fixed
Status: assigned → closed

comment:12 by Nicklas Nordborg, 15 years ago

(In [4820]) References #1238: Check performance of BioAssaySetExporter

Fixes a NullPointerException

comment:13 by Nicklas Nordborg, 15 years ago

(In [4821]) References #1238: Check performance of BioAssaySetExporter

Need special handling of the 'position' column, since technically it is part of the spot data, but in reality it is used as a reporter field. The fix is needed since otherwise the position will not be written for spots that don't have a reporter.

comment:14 by Nicklas Nordborg, 15 years ago

(In [4841]) References #1238: Check performance of BioAssaySetExporter

Fixes NullPointerException
