Batch processing

Contents

  1. Diagram of classes and methods
  2. Initialising the batch system
  3. Using the batch system
  4. Creating a batchable data class
  5. Creating the batcher class
  6. Creating the utility class

Last updated: $Date: 2009-04-06 14:52:39 +0200 (må, 06 apr 2009) $

1. Diagram of classes and methods

2. Initialising the batch system

The batch system is initialised by the Application class when the application is started. The init() method of BatchUtil checks whether each class registered with Hibernate implements the BatchableData interface. For every class that does, SQL statements for insert, update and delete are generated using the information about database tables, columns and dialect found in the Hibernate configuration. The BatchUtil also keeps track of how property names map to parameter positions, so that the prepared statements can be filled with the correct property values.
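To make the statement generation more concrete, here is a minimal sketch of how insert, update and delete statements and the property name/parameter order coupling could be derived from a table name and a property-to-column map. The class BatchSqlSketch and its fields are hypothetical and are not the actual BatchUtil code; the sketch also ignores dialect-specific quoting and reads the mapping from a plain map instead of the Hibernate configuration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the actual BatchUtil implementation.
final class BatchSqlSketch
{
   final String insertSql;
   final String updateSql;
   final String deleteSql;

   // Property name -> 1-based parameter index in the insert statement
   final Map<String, Integer> insertParameterOrder = new LinkedHashMap<>();

   // 'columns' maps property name -> column name and must have a stable
   // iteration order (for example a LinkedHashMap)
   BatchSqlSketch(String table, String idColumn, Map<String, String> columns)
   {
      List<String> names = new ArrayList<>();
      List<String> marks = new ArrayList<>();
      List<String> assignments = new ArrayList<>();
      int index = 1;
      for (Map.Entry<String, String> e : columns.entrySet())
      {
         names.add(e.getValue());
         marks.add("?");
         assignments.add(e.getValue() + " = ?");
         // Remember which parameter position this property fills
         insertParameterOrder.put(e.getKey(), index++);
      }
      insertSql = "INSERT INTO " + table + " (" + String.join(", ", names)
         + ") VALUES (" + String.join(", ", marks) + ")";
      updateSql = "UPDATE " + table + " SET " + String.join(", ", assignments)
         + " WHERE " + idColumn + " = ?";
      deleteSql = "DELETE FROM " + table + " WHERE " + idColumn + " = ?";
   }
}
```

With such a map in place, filling a prepared statement amounts to looking up the parameter index for each property name and setting the property value at that position.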

There are some restrictions on which classes can be batchable. A class must be mapped to only one table, and each property of the class must be a simple one, i.e. the property can't be divided into several columns. Many-to-one and one-to-one associations are supported, but not collections or arrays. If the class is within those restrictions, it only has to implement BatchableData to be batchable.

3. Using the batch system

The batch system is used for inserting, updating and/or deleting large amounts of data efficiently. The disadvantage of this system is that it won't synchronize the objects with the database. For example, the ID of an object is not set when it is inserted. Only classes that implement the BatchableData tagging interface can be batched.

A single type of batchable data object usually has three classes: a data class, a batcher class and a utility class. The data class is the actual class that holds the data and is mapped with Hibernate. The batcher class is used for inserting, updating and deleting data in the database. The utility class is used to get new data objects and to get access to otherwise protected properties. We need the utility class to bridge the gap between classes in the net.sf.basedb.core.data package and their corresponding classes in the net.sf.basedb.core package. For example, a reporter may have a reporter type. The reporter type is not a batchable class, so we must always work with the net.sf.basedb.core.ReporterType class and not the net.sf.basedb.core.data.ReporterTypeData class. But ReporterData can only return ReporterTypeData objects. Therefore we need the Reporter class to convert the ReporterTypeData object into a ReporterType.

Here is an example of how to insert reporters:

// Get a DbControl
DbControl dc = ...;

// Create batcher for reporters
ReporterBatcher rb = ReporterBatcher.getNew(dc);

// Create a new reporter with external id
ReporterData rd = Reporter.getNew("reporter1");

// Adds the object to the batch insertion queue
rb.insert(rd); 

// Execute all the inserts
dc.commit();

insert() adds the object to a batch that is sent to the database when the batcher is flushed. The batcher is flushed when DbControl.commit() or Batcher.flush() is called. You can also set a batch size that makes the batcher flush automatically when the batch reaches that size. A default batch size is configured in the base.config file.

rb.setBatchSize(500); // Will flush when 500 objects have been added
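The automatic flush can be pictured with the following minimal sketch. AutoFlushSketch is a hypothetical stand-in for a batcher and not part of the BASE API; it only counts flushes instead of executing SQL, and it handles a single insert queue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a batcher; flushes are simulated, no SQL is executed.
final class AutoFlushSketch<T>
{
   private final List<T> insertQueue = new ArrayList<>();
   private int batchSize;
   private int flushCount = 0;
   private int totalInserted = 0;

   AutoFlushSketch(int batchSize)
   {
      this.batchSize = batchSize;
   }

   void setBatchSize(int batchSize)
   {
      this.batchSize = batchSize;
   }

   // Queue the object and flush automatically once the queue reaches the batch size
   void insert(T data)
   {
      insertQueue.add(data);
      if (insertQueue.size() >= batchSize) flush();
   }

   // Simulate sending the queued objects to the database, then empty the queue
   void flush()
   {
      if (insertQueue.isEmpty()) return;
      totalInserted += insertQueue.size();
      flushCount++;
      insertQueue.clear();
   }

   int getFlushCount()
   {
      return flushCount;
   }

   int getTotalInserted()
   {
      return totalInserted;
   }

   int getQueued()
   {
      return insertQueue.size();
   }
}
```

In this sketch, objects that remain in the queue below the size limit are only written by an explicit flush(), which corresponds to the flush triggered by DbControl.commit() in the real system.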

update() and delete() work like insert(), each with its own batch.

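// Update an existing reporter
ReporterData rd = Reporter.getByExternalId(dc, "reporter1");
rd.setExtended("species", "Mus musculus");
rb.update(rd);
dc.commit();
dc.close();

// Delete a reporter. This is a separate example: dc and rb are
// assumed to be newly created, as in the insert example above
ReporterData rd = Reporter.getByExternalId(dc, "reporter1");
rb.delete(rd);
dc.commit();
dc.close();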

Insert, update and delete each have their own batch, and the batches can be flushed separately.

rb.flushInsert();
rb.flushUpdate();
rb.flushDelete();
rb.flush(); // Will flush all batches

4. Creating a batchable data class

There are some restrictions on which classes can be batchable. A class must be mapped to only one table, and each property of the class must be a simple one, i.e. the property can't be divided into several columns. Many-to-one and one-to-one associations are supported, but not collections or arrays. If the class is within those restrictions, it only has to implement the BatchableData interface to be batchable.

public class Foo
   implements BatchableData
{
   // ...
}
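Because BatchableData is a tagging interface, a class can be recognised as batchable with a plain type check. The sketch below shows one way such a check could be written. The nested BatchableData, Foo and Bar types are local stand-ins for illustration, and findBatchable() is a hypothetical helper, not a method of the BASE API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of detecting a tagging interface.
final class TaggingSketch
{
   // Local stand-ins, declared here only to keep the sketch self-contained
   interface BatchableData {}
   static class Foo implements BatchableData {}
   static class Bar {}

   // Return the classes that carry the tagging interface
   static List<Class<?>> findBatchable(List<Class<?>> mapped)
   {
      List<Class<?>> result = new ArrayList<>();
      for (Class<?> c : mapped)
      {
         if (BatchableData.class.isAssignableFrom(c)) result.add(c);
      }
      return result;
   }
}
```

Note that the type check says nothing about the mapping restrictions listed above; those still have to be respected by the class itself.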

Many-to-one or one-to-one associations to non-batchable classes must have package private methods. We don't want to expose the data layer objects of those classes. The utility class is used to get/set those properties using Hibernate metadata methods.

The associated class must have proxies enabled. The reason is that batchable items can be loaded by a stateless session which doesn't have a first-level cache and doesn't use the second-level cache. This would cause all many-to-one and one-to-one associations to be fetched by extra SQL statements (bad) unless proxies are enabled.

private ReporterTypeData reporterType;
/**
   Get the {@link ReporterTypeData} of this reporter. Package private since
   we cannot expose the data object to client applications.
   @return The ReporterTypeData item
   @hibernate.many-to-one column="`reportertype_id`" not-null="false" 
      outer-join="false"
   @see Reporter#getReporterType(net.sf.basedb.core.DbControl, ReporterData)
*/
ReporterTypeData getReporterType()
{
   return reporterType;
}
void setReporterType(ReporterTypeData reporterType)
{
   this.reporterType = reporterType;
}

Associations to other batchable classes may have public get/set methods. In this case we must instead map the association with a cascade="evict" attribute. This will make sure that once the object reaches the client application it is not associated with any session and changes to it will not propagate to the database bypassing regular permission checks.

There is one catch however. The version of XDoclet we currently use doesn't support the cascade="evict" attribute. Therefore we must skip the Hibernate mapping for such properties and add it in an external xml file. For example, to use an external mapping for the AbstractFeatureData class the name of the external file should be hibernate-properties-AbstractFeatureData.xml.

<many-to-one
   name="reporter"
   class="net.sf.basedb.core.data.ReporterData"
   cascade="evict"
   fetch="select"
   update="false"
   insert="true"
   access="property"
   column="`reporter_id`"
   not-null="false"
/>

5. Creating the batcher class

Every batchable class must have its own batcher, because some things are class-specific. A batcher always inherits from AbstractBatcher or BasicBatcher, which provide the core services. The BasicBatcher is used for all items mapped with Hibernate. The Dynamic API has batchers that inherit from the AbstractBatcher class.

getNew()
This should be a static method that creates a new batcher object. The most important thing is that it must call the initPermissions() method, or no permissions will be set, resulting in a PermissionDeniedException when trying to use the batcher. The BasicBatcher fetches the role-based permissions for the logged-in user. This means that batchable items don't have individual permissions; they are always treated as a group.

// ReporterBatcher.java
public static ReporterBatcher getNew(DbControl dc)
   throws BaseException
{
   ReporterBatcher rb = new ReporterBatcher(dc);
   rb.initPermissions(0, 0);
   return rb;
}

getType()
The getType() method usually just returns a constant from the Item enumeration. The returned value is used for role-based permission checking.

validate()
The validate() method should validate the properties of a data object. It is called by the BasicBatcher before an object is inserted or updated. The validation should follow the rules for case 2 validation as discussed in the data validation document and the coding rules and guidelines document.

onBeforeCommit()
The onBeforeCommit() method is called after validation, just before the object is added to the insert or update batch. It can be useful for a batcher to override this method in case it needs to modify some property values, for example to set the last updated date. The BasicBatcher automatically increments the version property. The onBeforeCommit() method is not called when an object is deleted, since an object can be deleted using only its id value.

// ReporterBatcher.java
void onBeforeCommit(ReporterData data, Transactional.Action action)
   throws BaseException
{
   setPropertyValue(data, "lastUpdate", new Date());
}

onBeforeClose()
This method is defined by the AbstractBatcher class and is called by the close() method after it has called flush() but before the connection to the database has been closed. This allows a subclass to clean up any open resources and (more importantly) update properties on parent items. For example, the RawDataBatcher updates the spot count and disk usage on the raw bioassay. This method is only called once. Once the batcher has been closed, it cannot be used again.

// RawDataBatcher.java
void onBeforeClose()
   throws BaseException
{
   rawBioAssayData.setSpots(rawBioAssayData.getSpots() + 
      getTotalInsertCount());
   rawBioAssayData.setBytes(rawBioAssayData.getBytes() + bytes);
}

initPermissions()
The initPermissions() method sets the permissions for the entire batcher and is called once, directly after the batcher is created. The permissions apply to all objects handled by the batcher. It is not possible to have different permissions for different objects. For reporters this is not a problem, since the permissions are given by role permissions only. But for raw data, the permissions depend on the raw bioassay they belong to. This is solved by letting a single batcher only handle raw data for a single raw bioassay. By giving a raw bioassay as a parameter when creating the batcher, the permissions are known.

// RawDataBatcher.java
RawDataBatcher(DbControl dc, RawBioAssay rawBioAssay)
   throws BaseException
{
   super(dc);
   this.rawBioAssay = rawBioAssay;
   ...
}
...

void initPermissions(int granted, int denied)
   throws BaseException
{
   if (rawBioAssay.hasPermission(Permission.READ))
   {
      granted |= Permission.grant(Permission.READ);
   }
   if (rawBioAssay.hasPermission(Permission.WRITE))
   {
      granted |= Permission.grant(Permission.WRITE, Permission.DELETE, 
         Permission.CREATE);
   }
   super.initPermissions(granted, denied);
}

Note! Do not forget to call super.initPermissions().
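To illustrate why the call to the superclass matters, here is a minimal sketch of bitmask-based permissions. Everything in it is hypothetical: the constants, the BaseBatcher and ParentBoundBatcher classes, and the rule that denied bits override granted bits are assumptions made for illustration and do not reproduce the actual Permission or AbstractBatcher classes.

```java
// Hypothetical sketch; not the actual Permission or AbstractBatcher classes.
final class PermissionSketch
{
   static final int READ = 1;
   static final int WRITE = 2;
   static final int DELETE = 4;
   static final int CREATE = 8;

   static class BaseBatcher
   {
      // Stays 0 (no permissions) until initPermissions() has been called
      private int permissions = 0;

      void initPermissions(int granted, int denied)
      {
         // Assumption for this sketch: denied bits override granted bits
         permissions = granted & ~denied;
      }

      boolean hasPermission(int permission)
      {
         return (permissions & permission) == permission;
      }
   }

   // A batcher whose permissions depend on a parent item
   static class ParentBoundBatcher extends BaseBatcher
   {
      private final boolean parentReadable;
      private final boolean parentWritable;

      ParentBoundBatcher(boolean parentReadable, boolean parentWritable)
      {
         this.parentReadable = parentReadable;
         this.parentWritable = parentWritable;
      }

      @Override
      void initPermissions(int granted, int denied)
      {
         if (parentReadable) granted |= READ;
         if (parentWritable) granted |= WRITE | DELETE | CREATE;
         // Without this call the permissions field would remain 0
         super.initPermissions(granted, denied);
      }
   }
}
```

In this sketch, an override that forgets the super call leaves the batcher without any permissions, which mirrors the PermissionDeniedException described for getNew().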

6. Creating the utility class

The utility class is used to get/set properties that don't have public methods in the data layer. The methods are usually not public because we don't want the data object to be exposed to client applications.

// Reporter.java
public static ReporterType getReporterType(DbControl dc, ReporterData reporter)
   throws PermissionDeniedException, BaseException
{
   ReporterTypeData rtd = (ReporterTypeData)metaData.getPropertyValue(reporter, 
      "reporterType", EntityMode.POJO);
   return (ReporterType)dc.getItem(ReporterType.class, rtd);
}

public static void setReporterType(ReporterData reporter, ReporterType reporterType)
   throws PermissionDeniedException, BaseException
{
   if (reporterType != null) reporterType.checkPermission(Permission.USE);
   metaData.setPropertyValue(reporter, "reporterType", 
      reporterType == null ? null : reporterType.getData(), EntityMode.POJO);
}

The utility class can also be used as a place for static getNew(), getById() and getQuery() methods if there is no other natural place for those methods. For example, the Reporter class has all those methods as well as a getByExternalId() method. On the other hand the RawDataUtil class doesn't have those methods since the natural place is the RawBioAssay class.

public static ReporterData getNew(String externalId)
   throws InvalidDataException, BaseException
{
   if (externalId == null)
   {
      throw new InvalidUseOfNullException("externalId");
   }
   ReporterData rd = new ReporterData();
   rd.setExternalId(externalId);
   rd.setName("New reporter");
   return rd;
}

public static DataQuery<ReporterData> getQuery()
{
   return new DataQuery<ReporterData>(ReporterData.class);
}