Opened 10 years ago
Closed 10 years ago
#1878 closed enhancement (fixed)
Reduce memory footprint in raw data importer
Reported by: | Nicklas Nordborg | Owned by: | everyone |
---|---|---|---|
Priority: | major | Milestone: | BASE 3.3.3 |
Component: | coreplugins | Version: | |
Keywords: | | Cc: | |
Description
The memory footprint of the raw data importer can be up to several hundred megabytes when the raw bioassay is connected to an array design. The reason is that the features and reporters on the array design are pre-loaded (for performance reasons) and kept in memory.
It is however not necessary to keep all information in memory, since we only need a few properties (e.g. feature id, position and reporter id). These properties could be kept in a special object, which can then be used as a proxy for the real ReporterData and FeatureData objects when inserting to the database.
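The idea of keeping only a few properties per feature can be sketched roughly as follows. This is a minimal illustration, not the actual BASE code: the class `CompactFeature`, the `preload` method and the choice of an `int` position as the lookup key are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class CompactFeatureSketch {

    // Hypothetical minimal record: only the three properties the importer
    // needs, instead of a fully loaded feature/reporter object graph.
    static final class CompactFeature {
        final int featureId;
        final int position;
        final int reporterId;

        CompactFeature(int featureId, int position, int reporterId) {
            this.featureId = featureId;
            this.position = position;
            this.reporterId = reporterId;
        }
    }

    // Builds a position -> CompactFeature lookup from rows of
    // {featureId, position, reporterId}. In a real importer the rows
    // would come from a database query; here they are plain arrays.
    static Map<Integer, CompactFeature> preload(int[][] rows) {
        Map<Integer, CompactFeature> byPosition = new HashMap<>();
        for (int[] row : rows) {
            byPosition.put(row[1], new CompactFeature(row[0], row[1], row[2]));
        }
        return byPosition;
    }

    public static void main(String[] args) {
        Map<Integer, CompactFeature> lookup =
            preload(new int[][] {{1, 10, 100}, {2, 20, 200}});
        CompactFeature f = lookup.get(20);
        System.out.println(f.featureId + " -> reporter " + f.reporterId);
    }
}
```

Each entry holds three primitive fields, so the per-feature cost is roughly the object header plus the map entry, rather than everything a fully loaded entity would carry.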
Change History (3)
comment:1 by , 10 years ago
comment:2 by , 10 years ago
I have run some more tests after replacing the feature/reporter objects with proxy objects that hold only the minimal information needed. With only 1 import running there was not much difference, but I was able to run 7 imports in parallel with about the same memory footprint that 3 parallel imports used before. I guess the reason is that with only 1 import there was no need for garbage collection to kick in and clean up what the preload had loaded but no longer used, whereas with 7 imports the garbage collection could reclaim that memory.
comment:3 by , 10 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
(In [6586]) Fixes #1878: Reduce memory footprint in raw data importer
FeatureInfo instances are used to store the information we need after the preload. Single instances of ReporterProxy and FeatureProxy are then populated with the required values as they are needed.
Note that this only works because the batcher processes each raw data entry immediately and adds it to the batch queue. If it did not, the next raw data entry would overwrite the values for the one before it, causing incorrect reporter/feature mappings for the raw data.
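The caveat about reusing a single proxy instance can be illustrated with a small sketch. The names here (`MutableProxy`, `processImmediately`, `processDeferred`) are hypothetical and do not reflect the BASE batcher API; the point is only to contrast a consumer that reads the values right away with one that holds on to the shared instance.

```java
import java.util.ArrayList;
import java.util.List;

final class ReusedProxySketch {

    // Hypothetical shared proxy that is overwritten for every entry.
    static final class MutableProxy {
        int featureId;
        int reporterId;
    }

    // Safe: the values are copied out of the shared proxy as soon as
    // each entry is handled, so later overwrites cannot affect them.
    static List<int[]> processImmediately(int[][] infos) {
        MutableProxy proxy = new MutableProxy();
        List<int[]> queue = new ArrayList<>();
        for (int[] info : infos) {
            proxy.featureId = info[0];
            proxy.reporterId = info[1];
            queue.add(new int[] {proxy.featureId, proxy.reporterId});
        }
        return queue;
    }

    // Unsafe: the queue keeps references to the one shared proxy, so
    // every queued entry ends up showing the values of the last entry.
    static List<MutableProxy> processDeferred(int[][] infos) {
        MutableProxy proxy = new MutableProxy();
        List<MutableProxy> queue = new ArrayList<>();
        for (int[] info : infos) {
            proxy.featureId = info[0];
            proxy.reporterId = info[1];
            queue.add(proxy);
        }
        return queue;
    }

    public static void main(String[] args) {
        int[][] infos = {{1, 100}, {2, 200}, {3, 300}};
        System.out.println("immediate, first entry: "
            + processImmediately(infos).get(0)[0]);
        System.out.println("deferred, first entry: "
            + processDeferred(infos).get(0).featureId);
    }
}
```

In the deferred variant the first queued entry reports the feature id of the last input row, which is the kind of incorrect mapping the note above warns about.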
This was tested with the UCSC_hg38_knownGenes_22sep2014 array design that contains 104178 features. Memory footprint was around 200-300MB for the importer. It was more or less consistent when testing with 1-3 parallel imports.