Opened 9 years ago

Closed 9 years ago

#1878 closed enhancement (fixed)

Reduce memory footprint in raw data importer

Reported by: Nicklas Nordborg Owned by: everyone
Priority: major Milestone: BASE 3.3.3
Component: coreplugins Version:
Keywords: Cc:

Description

The memory footprint of the raw data importer can be up to several hundred megabytes when the raw bioassay is connected to an array design. The reason is that the features and reporters on the array design are pre-loaded (for performance reasons) and kept in memory.

It is, however, not necessary to keep all of this information in memory since we only need a few properties (e.g. feature id, position and reporter id). These properties could be kept in a special object, which can then be used as a proxy for the real ReporterData and FeatureData objects when inserting into the database.
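A minimal sketch of such a holder object, keeping only the three properties mentioned above. The class and accessor names are illustrative assumptions, not the actual BASE API:

```java
// Hypothetical compact holder: stores only the properties the importer
// needs per feature, instead of the full FeatureData/ReporterData entities.
public class FeatureInfo {
    private final int featureId;
    private final int position;
    private final int reporterId;

    public FeatureInfo(int featureId, int position, int reporterId) {
        this.featureId = featureId;
        this.position = position;
        this.reporterId = reporterId;
    }

    public int getFeatureId() { return featureId; }
    public int getPosition() { return position; }
    public int getReporterId() { return reporterId; }
}
```

Three `int` fields per feature are a few dozen bytes with object overhead, versus full entity objects with many mapped columns and associations, which is where the memory saving would come from.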

Change History (3)

comment:1 by Nicklas Nordborg, 9 years ago

This was tested with the UCSC_hg38_knownGenes_22sep2014 array design, which contains 104178 features. The memory footprint was around 200-300MB for the importer, and was more or less the same when testing with 1-3 parallel imports.

comment:2 by Nicklas Nordborg, 9 years ago

I have made some more tests after replacing the feature/reporter objects with proxy objects that only hold the minimal information that is needed. With only 1 import running there was not much difference, but I was able to run 7 imports in parallel with about the same memory footprint as 3 parallel imports used before. I guess the reason is that with only 1 import there was no need for garbage collection to kick in and reclaim what the preload had loaded but no longer used, whereas with 7 imports the garbage collector could reclaim that memory.

comment:3 by Nicklas Nordborg, 9 years ago

Resolution: fixed
Status: new → closed

(In [6586]) Fixes #1878: Reduce memory footprint in raw data importer

FeatureInfo instances are used to store the information we need after the preload. Single instances of ReporterProxy and FeatureProxy are then populated with the required values as they are needed.

Note that this only works because the batcher processes each raw data entry immediately and adds it to the batch queue. If it did not, the next raw data entry would overwrite the values for the one before it, causing incorrect reporter/feature mappings for the raw data.
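The reuse pattern and its hazard can be sketched as follows. All names (ReporterProxy, the batcher represented by a plain queue) are illustrative assumptions; this is not the BASE implementation, only a sketch of why immediate consumption matters:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical single mutable proxy that is re-populated for each raw
// data entry instead of allocating one object per entry.
class ReporterProxy {
    private int reporterId;
    void setReporterId(int id) { this.reporterId = id; }
    int getReporterId() { return reporterId; }
}

public class BatchSketch {
    // One proxy instance serves the whole import. This is safe only
    // because the batch queue consumes (copies) the values immediately;
    // if it held a reference to the proxy instead, the next iteration
    // would clobber the values and mix up reporter/feature mappings.
    static List<Integer> runImport(int[] reporterIds) {
        ReporterProxy proxy = new ReporterProxy();
        Queue<Integer> batchQueue = new ArrayDeque<>();
        for (int id : reporterIds) {
            proxy.setReporterId(id);            // overwrites the previous entry's value
            batchQueue.add(proxy.getReporterId()); // value copied out right away
        }
        return new ArrayList<>(batchQueue);
    }

    public static void main(String[] args) {
        System.out.println(runImport(new int[] { 11, 22, 33 }));
    }
}
```

The trade-off is the usual flyweight-style one: a single mutable instance saves allocations and heap, at the cost of a hidden ordering contract between producer and consumer.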
