Opened 9 years ago

Closed 9 years ago

#2000 closed enhancement (fixed)

Batch API for annotation handling

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: critical
Milestone: BASE 3.8
Component: core
Version:
Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

When updating a large number of annotations on a large number of items it is easy to run into problems:

  • The first-level cache in Hibernate can easily use up all available memory
  • Dirty-checking and SQL execution by Hibernate takes a long time
  • If the change history logging is enabled, this also takes a long time

The current annotation importer plug-in has been used for testing. It was used to import values for 140+ different annotation types to 4900+ items (samples).

The data file is 4MB. The work done by the annotation importer can be divided into the following steps. JConsole was used to check the memory usage and debug output to check the time.

||Action||Time||Memory||
||Parse the file and find the item to update (loaded by ID)||7 sec||~500MB||
||Update annotations||5 min||~500MB -> 1.5GB||
||Commit - Hibernate||12 min||~1.5GB||
||Commit - Change log||13 min||~1.5GB -> 1.9GB||

CPU usage may also be interesting. This is usually below 10% (less than a full single core). The CPU usage for Postgres is in the same range.

The main problems here are that the memory usage grows in the second step and that the last two steps take a long time.

In theory it should be possible to improve the second step a lot, since at this stage the annotation importer is only working with a single item at a time. We do not need Hibernate to keep things in the first-level cache. If we can manage this, the Hibernate commit step may be solved automatically as well. The change log step may be harder, since we are already using the stateless session here. However, it may be possible to replace this with our own batch SQL implementation, as we have already done for reporters and raw data.
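The idea of handling one item at a time without retaining it can be illustrated with a small, self-contained sketch. This is not the BASE implementation: `BatchWriter`, its batch size and the `sink` callback are hypothetical names chosen for illustration, and the sink merely stands in for whatever executes the batched SQL (for example a JDBC `executeBatch()` call).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: buffers rows and hands them off in fixed-size batches. */
final class BatchWriter<T> implements AutoCloseable {
    private final int batchSize;
    private final Consumer<List<T>> sink; // assumption: stands in for the batched SQL execution
    private final List<T> buffer = new ArrayList<>();
    private int flushCount = 0;

    BatchWriter(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Adds one row; flushes automatically when the buffer is full. */
    void add(T row) {
        buffer.add(row);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends buffered rows to the sink and forgets them, so memory use stays bounded. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // copy, so the sink may keep the list
        buffer.clear();
        flushCount++;
    }

    int pending() {
        return buffer.size();
    }

    int flushCount() {
        return flushCount;
    }

    /** Flushes whatever is left when the writer is closed. */
    @Override
    public void close() {
        flush();
    }
}
```

With a batch size of 3 and 7 rows, the sink would be called with 3, 3 and (on close) 1 rows. Nothing beyond the current batch is kept in memory, which is the property the first-level cache lacks.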

Using the annotation importer to delete the existing annotations turned out to be much worse. The initial parsing and updating of items used about the same time and amount of memory as when creating items. When committing, BASE needs to go through relations that may point to the deleted items and either delete them as well or nullify the reference (for example, any-to-any links and inherited/cloned annotations). This consumed more and more memory and reached a point where most of the time was spent doing GC. After 1.5 hours (60 minutes of which was GC) I gave up and killed Tomcat. I'll see what happens if Tomcat gets more memory...

Giving Tomcat 4GB instead of 2GB memory helped. The maximum low level was near 3GB. The steps outlined in the table above took more or less the same time as when inserting annotations. An additional hour was spent checking/removing references to the deleted annotations. Total time was over 1 hour 20 minutes.

Final note

After all changes in this ticket and in #2002 have been made the annotation importer has improved a lot. Using the same test data as in the table above, the time for importing new annotations is typically 4-5 minutes and for deleting 6-7 minutes. Memory usage is well below 1GB most of the time and garbage collection seems to be able to clean up so that no more than 0.5GB remains.

Change History (20)

comment:1 by Nicklas Nordborg, 9 years ago

Component: changed from web to core
Description: modified (diff)
Priority: changed from major to critical

comment:2 by Nicklas Nordborg, 9 years ago

Description: modified (diff)

comment:3 by Nicklas Nordborg, 9 years ago

Owner: changed from everyone to Nicklas Nordborg
Status: changed from new to assigned

comment:4 by Nicklas Nordborg, 9 years ago

(In [7121]) References #2000: Batch API for annotation handling

First version of the new AnnotationBatcher implementation. It should have support for creating, updating and deleting annotation values.

It lacks a lot of functionality:

  • It doesn't check annotation values (eg. if a String is a String, if a value is allowed by an enumerated annotation type, etc.)
  • It doesn't handle units
  • Changes are not recorded in the change log

The AnnotationFlatFileImporter has been modified to use the batcher. All possible options for updating are supported (eg. merge).


comment:5 by Nicklas Nordborg, 9 years ago

Initial test results are promising. Inserting the same number of annotations as in the table above is down to 4 minutes (without change history). Deleting the annotations again is down to 6 minutes. Garbage collection time is only a few seconds and memory usage can be as low as 0.5GB.

comment:6 by Nicklas Nordborg, 9 years ago

(In [7122]) References #2000: Batch API for annotation handling

Check that annotation values are valid as specified by the annotation type.
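The kind of check described here can be sketched as follows. This is only an illustration under assumptions: `ValueCheck`, its parameters and the convention that a `null` set means "not enumerated" are hypothetical and not taken from the BASE source.

```java
import java.util.Set;

/** Hypothetical validator: checks the Java type and, optionally, an enumeration. */
final class ValueCheck {
    private ValueCheck() {}

    /**
     * Throws IllegalArgumentException if the value is null, has the wrong
     * Java type or, for enumerated types, is not among the allowed values.
     * Assumption: allowed == null means the annotation type is not enumerated.
     */
    static void check(Object value, Class<?> expectedType, Set<?> allowed) {
        if (value == null) {
            throw new IllegalArgumentException("Value is null");
        }
        if (!expectedType.isInstance(value)) {
            throw new IllegalArgumentException(
                "Expected " + expectedType.getSimpleName()
                + " but got " + value.getClass().getSimpleName());
        }
        if (allowed != null && !allowed.contains(value)) {
            throw new IllegalArgumentException("Value not allowed: " + value);
        }
    }
}
```

For example, `ValueCheck.check("blue", String.class, Set.of("red", "green"))` would be rejected, while the same call with `"red"` would pass.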

Added support for units.

comment:7 by Nicklas Nordborg, 9 years ago

(In [7123]) References #2000: Batch API for annotation handling

Re-factored the annotation batcher to be based on the AbstractBatcher class. This helps us hook into the transaction and react to commits and rollbacks.

Annotation snapshots are not deleted until the batcher is closed, which minimizes the problem with a parallel transaction that re-creates the snapshot from the old data.

Improved logging.

comment:8 by Nicklas Nordborg, 9 years ago

(In [7124]) References #2000: Batch API for annotation handling

Fixes a NullPointerException when updating and the unit is null.

comment:9 by Nicklas Nordborg, 9 years ago

(In [7125]) References #2000: Batch API for annotation handling

Added a cache with unit converters so we don't have to create a new converter for every value.
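A cache of this kind could look roughly like the sketch below. The class name `ConverterCache`, the string key and the use of `DoubleUnaryOperator` as the converter type are assumptions made for illustration; the actual converter classes in BASE are not shown in this ticket.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical cache: creates each unit converter once and reuses it afterwards. */
final class ConverterCache {
    private final Map<String, DoubleUnaryOperator> cache = new HashMap<>();
    private final BiFunction<String, String, DoubleUnaryOperator> factory;
    private int created = 0;

    /** The factory builds a converter for a (fromUnit, toUnit) pair. */
    ConverterCache(BiFunction<String, String, DoubleUnaryOperator> factory) {
        this.factory = factory;
    }

    /** Returns the cached converter, creating it only on first use. */
    DoubleUnaryOperator get(String fromUnit, String toUnit) {
        return cache.computeIfAbsent(fromUnit + "->" + toUnit, key -> {
            created++;
            return factory.apply(fromUnit, toUnit);
        });
    }

    /** Number of converters actually created (useful for verifying reuse). */
    int createdCount() {
        return created;
    }
}
```

When thousands of values share the same pair of units, the factory is invoked once per pair instead of once per value.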

More cleanup when closing the batcher.

Added setValue() method for setting single-valued annotations.

comment:10 by Nicklas Nordborg, 9 years ago

(In [7126]) References #2000: Batch API for annotation handling

Added a call to checkBatchAnnotatableUsage() in all classes implementing the getAnnotationSet() method to prevent using both the batch API and the regular API in the same transaction.
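The guard can be pictured with the following minimal sketch. It does not reproduce the real checkBatchAnnotatableUsage() method, whose signature is not shown in this ticket: `AnnotationTx`, `registerBatcher()` and `checkRegularApiAllowed()` are hypothetical names, and the sketch only assumes that the transaction remembers whether a batcher has been created.

```java
/** Hypothetical per-transaction state used to keep the two APIs apart. */
final class AnnotationTx {
    private boolean batcherActive = false;

    /** Assumption: called when a batcher is created within this transaction. */
    void registerBatcher() {
        batcherActive = true;
    }

    /**
     * Plays the role of the check made from getAnnotationSet(): the regular
     * API refuses to work once the batch API is in use in the transaction.
     */
    void checkRegularApiAllowed() {
        if (batcherActive) {
            throw new IllegalStateException(
                "The regular annotation API can't be used in a transaction "
                + "where the batch API is in use");
        }
    }
}
```

Failing fast in this way avoids the two code paths writing conflicting changes to the same annotation set.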

comment:11 by Nicklas Nordborg, 9 years ago

(In [7127]) References #2000: Batch API for annotation handling

Several bug fixes and improvements.

In AnnotationBatcher:

  • Evict items from the second-level cache in Hibernate if we make any changes (eg. create a new annotation set)
  • Add support for merging values for annotation types that have an unlimited number of annotations.
  • Fixes issues with setting the correct unit for annotation types that support units. Now the unit should always be set, either to the default unit or the unit that was specified.

In AnnotationFlatFileImporter:

  • Merge is supported
  • Ignore annotations if there are too many values
  • Not removing annotations when no values have been specified

comment:12 by Nicklas Nordborg, 9 years ago

(In [7128]) References #2000: Batch API for annotation handling

Added support for change history logging. The logging is implemented by submitting the CurrentAnnotationInfo instance to the LoggingInterceptor. To make this work the logging system needed to be updated to do the logging immediately instead of keeping all changes in a temporary cache.

The existing AnnotationLogger has been updated so that it is able to take care of both the regular logging and the special logging.

Initial tests indicate that the logging adds 2-3 minutes to the total time both when adding and deleting annotations. Adding is now up to 7 minutes and deleting up to 10 minutes. This can probably be reduced by implementing the logging also as a batcher. But this is probably another ticket...

comment:13 by Nicklas Nordborg, 9 years ago

(In [7129]) References #2000: Batch API for annotation handling

Progress reporting now assigns the first 50% to file parsing and the last 50% to database actions. It is still a bit unbalanced but better than before when it was 75%/25%.
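The 50%/50% split amounts to simple arithmetic, sketched below. `ImportProgress` and its method names are hypothetical, and measuring the parsing phase in bytes and the database phase in items is an assumption made for the example.

```java
/** Hypothetical helper mapping two work phases onto one 0..100 progress scale. */
final class ImportProgress {
    private ImportProgress() {}

    /** Percent in the range 0..50 for the file-parsing phase. */
    static int parsing(long bytesRead, long totalBytes) {
        return scale(bytesRead, totalBytes, 0, 50);
    }

    /** Percent in the range 50..100 for the database phase. */
    static int database(long itemsDone, long totalItems) {
        return scale(itemsDone, totalItems, 50, 50);
    }

    private static int scale(long done, long total, int offset, int span) {
        if (total <= 0) {
            return offset + span; // nothing to do, so the phase counts as complete
        }
        long clamped = Math.max(0, Math.min(done, total));
        return offset + (int) (span * clamped / total);
    }
}
```

Halfway through parsing this gives 25%, and halfway through the database work it gives 75%.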

Removed debug output.

comment:14 by Nicklas Nordborg, 9 years ago

(In [7131]) References #2000: Batch API for annotation handling

Do not log old or new annotation values when disabled by the annotation type.

comment:15 by Nicklas Nordborg, 9 years ago

Description: modified (diff)

comment:16 by Nicklas Nordborg, 9 years ago

(In [7133]) References #2000: Batch API for annotation handling

The annotation batcher was deleting the same snapshot multiple times instead of the snapshots it was supposed to delete.

comment:17 by Nicklas Nordborg, 9 years ago

(In [7134]) References #2000: Batch API for annotation handling

Fixes a major bug where the annotation batcher executed the SQL for INSERT before DELETE causing newly inserted annotation values to be deleted immediately. This affected the case when updating an existing annotation since this is the only case where both the delete and insert statements are used.
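Why the order matters can be shown with a small in-memory model. This is a sketch under assumptions, not the batcher's actual SQL: `ValueTable` is a hypothetical stand-in for the values table, and it assumes that the DELETE removes every row belonging to an annotation id.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a table of annotation values. */
final class ValueTable {
    private static final class Row {
        final int annotationId;
        final String value;

        Row(int annotationId, String value) {
            this.annotationId = annotationId;
            this.value = value;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    /** Models INSERT: adds one value row for the annotation. */
    void insert(int annotationId, String value) {
        rows.add(new Row(annotationId, value));
    }

    /** Models DELETE ... WHERE annotation_id = ?; returns the number of removed rows. */
    int delete(int annotationId) {
        int before = rows.size();
        rows.removeIf(r -> r.annotationId == annotationId);
        return before - rows.size();
    }

    /** Returns the current values for the annotation, in insertion order. */
    List<String> valuesOf(int annotationId) {
        List<String> result = new ArrayList<>();
        for (Row r : rows) {
            if (r.annotationId == annotationId) {
                result.add(r.value);
            }
        }
        return result;
    }

    /** Correct update order: delete the old values first, then insert the new ones. */
    void update(int annotationId, List<String> newValues) {
        delete(annotationId);
        for (String v : newValues) {
            insert(annotationId, v);
        }
    }
}
```

Calling `insert` for the new values and then `delete` for the same id leaves the annotation without any values, which matches the symptom described in this comment. `update` avoids that by always deleting first.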

comment:18 by Nicklas Nordborg, 9 years ago

(In [7135]) References #2000: Batch API for annotation handling

Added a missing SQL command to mysql-queries.xml.

comment:19 by Nicklas Nordborg, 9 years ago

(In [7136]) References #2000: Batch API for annotation handling

The SQL was not correct. Remove [id].

comment:20 by Nicklas Nordborg, 9 years ago

Resolution: fixed
Status: changed from assigned to closed

All tests have now been completed successfully.
