Opened 13 years ago

Closed 12 years ago

#1630 closed task (fixed)

Migrate from MySQL to PostgreSQL

Reported by: Nicklas Nordborg Owned by: Nicklas Nordborg
Priority: major Milestone: BASE 3.1
Component: install Version:
Keywords: Cc:

Description

We are thinking of moving our BASE installation from MySQL to PostgreSQL. It shouldn't be too complicated since BASE already has support for both databases. To make the move easier we'll need some new features. Here is the general outline:

  • Create an empty database on PostgreSQL. We should create only tables and columns, skipping indexes, foreign keys, etc., and no data should be inserted. I think a slight modification to the initdb.sh script should make this possible.
  • Dump the existing MySQL database to a format that can be read by PostgreSQL's COPY command. This shouldn't be too complicated given that we only use vanilla column types. Dates and booleans are probably the most difficult data types. One dump file per table should make the import to PostgreSQL simple.
  • Import the dumped data into PostgreSQL. This should be straightforward for the static database. We'll need some metadata about the tables in the dynamic part since they need to be created. If possible, the dump files could include the CREATE TABLE part as well.
  • Create indexes and foreign keys. Ensure that sequences are properly reset so that they don't generate id values that have already been used (see the sketch after this list).
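
As a sketch of the sequence-reset step in the last item, here is a minimal JDBC example. The sequence and table names are assumptions for illustration, not BASE's actual schema mapping:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class SequenceReset {
    // Hypothetical sketch: after bulk-loading a table, bump its id sequence
    // past the highest imported value so new rows don't reuse existing ids.
    // "Reporters_id_seq" is an assumed name, not BASE's actual mapping.
    static void resetSequence(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("SELECT setval('\"Reporters_id_seq\"', "
                + "(SELECT COALESCE(MAX(id), 1) FROM \"Reporters\"))");
        }
    }
}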

Change History (28)

comment:1 by Nicklas Nordborg, 12 years ago

Owner: changed from everyone to Nicklas Nordborg
Status: new → assigned

comment:2 by Nicklas Nordborg, 12 years ago

(In [5852]) References #1630: Migrate from MySQL to PostgreSQL

Implemented the export functionality. It will currently only work on a MySQL database and generates a data file for each table. Hopefully, the files can be read by PostgreSQL, but we need to be able to create an empty database without foreign keys first...

comment:3 by Nicklas Nordborg, 12 years ago

(In [5853]) References #1630: Migrate from MySQL to PostgreSQL

Forgot to add the 'Migration.java' file in the last checkin...

Parts of the import functionality have been done. It should create a new empty database and import the static part of the database. The dynamic part is not imported, and no foreign keys or indexes are created.

comment:4 by Nicklas Nordborg, 12 years ago

(In [5854]) References #1630: Migrate from MySQL to PostgreSQL

The import step now imports all data and creates missing foreign keys and unique indexes. The resulting database should be usable in a BASE installation. There is still some error handling to implement.

comment:5 by Nicklas Nordborg, 12 years ago

(In [5855]) References #1630: Migrate from MySQL to PostgreSQL

Using a file to store which columns are exported. This will make it easier to detect if there is a configuration mismatch between the source and destination databases (e.g. different columns in some tables).

Set 'iso' datestyle when importing since the default is client-dependent.

Explicitly set 'UTF8' encoding when importing since the default is client-dependent.
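
A minimal sketch of how these two session settings might be pinned over JDBC (the connection setup itself is assumed):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class ImportSession {
    // Sketch: pin session settings before running COPY so dates and text
    // are parsed the same way regardless of the client's defaults.
    static void pinSettings(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("SET datestyle TO 'iso'");        // yyyy-mm-dd dates
            st.execute("SET client_encoding TO 'UTF8'"); // match dump files
        }
    }
}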

comment:6 by Nicklas Nordborg, 12 years ago

(In [5856]) References #1630: Migrate from MySQL to PostgreSQL

Documented the migration process. This should more or less complete this feature. I'll keep the ticket open until we have made some more tests.

comment:7 by Nicklas Nordborg, 12 years ago

(In [5857]) References #1630: Migrate from MySQL to PostgreSQL

Testing on the data from the demo server found a few issues:

  • The unique index creation at the end of the migration was interfering with the regular update script. It tried to create indexes on tables that did not yet exist. So this has been refactored, and the migration now uses the regular update first, and then a special mode for the unique index creation.

  • File-only platforms (raw data types) were not loaded and resulted in a NullPointerException if experiments using that platform had tables in the dynamic database. The platform raw data types are now re-initialized after the static database has been populated.

  • Added some boolean flags to the different modes, since this will make it easier to use different checks if we add more modes in the future.

comment:8 by Nicklas Nordborg, 12 years ago

(In [5865]) References #1630: Migrate from MySQL to PostgreSQL

Added instructions to run 'updateindexes.sh' as the last step.

comment:9 by Nicklas Nordborg, 12 years ago

(In [5867]) References #1630: Migrate from MySQL to PostgreSQL

Added note about superuser privileges requirement for PostgreSQL.

comment:10 by Nicklas Nordborg, 12 years ago

(In [5880]) References #1630: Migrate from MySQL to PostgreSQL

More info in progress reporter when counting rows.

comment:11 by Nicklas Nordborg, 12 years ago

(In [5882]) References #1630: Migrate from MySQL to PostgreSQL

Use an approximate number of rows instead of counting the actual number of rows when exporting. The reason is that counting can take a long time (>1 hour on a big database) and is only used for progress reporting to the console. The approximate row count is good enough for progress reporting.
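
A sketch of how an approximate count can be read from MySQL's catalog instead of doing a full table scan (the method and parameter names are placeholders; TABLE_ROWS is only an estimate for InnoDB tables):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class ApproxCount {
    // Sketch: information_schema.TABLES.TABLE_ROWS is an estimate, but it
    // is instant and good enough for progress reporting.
    static long approximateRows(Connection conn, String schema, String table)
        throws SQLException
    {
        String sql = "SELECT TABLE_ROWS FROM information_schema.TABLES "
            + "WHERE TABLE_SCHEMA = ? AND TABLE_NAME = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, schema);
            ps.setString(2, table);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }
}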

Fixed an issue with NUL (0x00) characters in some strings when exporting. Importing to PostgreSQL caused the following error:

org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0x00
  Where: COPY RawBioAssayHeaders, line 58878
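
One plausible fix, shown as a sketch (the ticket doesn't say whether the exporter strips or replaces the byte): PostgreSQL rejects 0x00 in UTF8 text columns, so it can be removed before the value is written to the dump file:

// Sketch: drop NUL bytes from a string value before writing it to the
// COPY data file; PostgreSQL rejects 0x00 in UTF8 text columns.
static String sanitize(String value) {
    return value.indexOf('\u0000') < 0 ? value : value.replace("\u0000", "");
}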

comment:12 by Nicklas Nordborg, 12 years ago

(In [5883]) References #1630: Migrate from MySQL to PostgreSQL

Changed the export to skip exporting data if the file already exists. The main reason is to be able to create another export for a single table that has failed to be imported (e.g. the NUL-value problem mentioned in [5882]) without having to run a full export.

comment:13 by Nicklas Nordborg, 12 years ago

(In [5884]) References #1630: Migrate from MySQL to PostgreSQL

Updated documentation with the changed behaviour for existing files.

comment:14 by Nicklas Nordborg, 12 years ago

The import performance is not so good for large tables. A test migration of our production server has been stuck on the following SQL statement for almost 20 hours now:

alter table "RawDataGenePix" add constraint FK529CB218EAB01E4E 
foreign key ("reporter_id") references "Reporters" (id)

While waiting for this to finish I have been doing some tests with a smaller data set. Running an ANALYZE command on all tables seems to have a good effect on performance; I could get the time down from 24 to 5 minutes. Running the foreign key creation within the same transaction that is doing the data import also seems to make it quicker. I don't know if this applies to a full-scale migration though, since the data set is much larger (e.g. the small data set may be cached in memory).
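
A sketch of the pattern being tested: ANALYZE is allowed inside a transaction block in PostgreSQL, so it can run right after the COPY and before the foreign key is added (the constraint name is the one from the statement above):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class AnalyzeThenLink {
    // Sketch: analyze the freshly loaded table inside the import
    // transaction so the planner has statistics before the FK check runs.
    static void analyzeAndAddForeignKey(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("ANALYZE \"RawDataGenePix\"");
            st.execute("ALTER TABLE \"RawDataGenePix\" "
                + "ADD CONSTRAINT FK529CB218EAB01E4E "
                + "FOREIGN KEY (\"reporter_id\") REFERENCES \"Reporters\" (id)");
        }
    }
}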

comment:15 by Nicklas Nordborg, 12 years ago

It seems like some time can be saved by not including a primary key definition in the CREATE TABLE statement. This is how the dumps generated by PostgreSQL do it:

CREATE TABLE ... -- without a primary key definition
COPY ...
ALTER TABLE ... ADD PRIMARY KEY

I'll run some more tests...

comment:16 by Nicklas Nordborg, 12 years ago

Here are some more detailed results from my testing. The test procedure uses the dumps of the Reporters and RawDataAgilent tables from our production server. The Reporters table has 2.5 million rows and the RawDataAgilent table has 10.7 million rows.

Here is the test procedure when the primary key is created first:

  1. Create the tables (with primary key)
  2. COPY to RawDataAgilent
  3. Optionally run ANALYZE
  4. COPY to Reporters
  5. Optionally run ANALYZE
  6. Create foreign key from RawDataAgilent.reporter_id -> Reporters.id

Results:

  • COPY data only (steps 1, 2 and 4): ~7 minutes
  • COPY + FOREIGN KEY (1, 2, 4 and 6): ~35 minutes (=current implementation)
  • COPY + ANALYZE (1-5): ~10.5 minutes
  • COPY + ANALYZE + FOREIGN KEY (1-6): ~11.5 minutes

So, copying the data takes around 7 minutes. ANALYZE-ing it adds about 3.5 minutes but removes 27 minutes from the foreign key creation.

And some more tests when the primary key is created after the data import:

  • COPY data only: ~6 minutes
  • COPY + PRIMARY KEY: ~6.5 minutes
  • COPY + PRIMARY KEY + ANALYZE + FOREIGN KEY: ~9 minutes
  • COPY + ANALYZE + PRIMARY KEY + FOREIGN KEY: ~8.5 minutes
  • COPY + PRIMARY KEY + FOREIGN KEY: ~8 minutes

So, it seems like there are a couple more minutes to gain if the primary key is created after all data has been copied. Note that everything is done in a single transaction. Splitting the steps into multiple transactions has the reverse effect.

The surprising thing here is that it gets faster again if we leave out the ANALYZE part... This could either be because the primary key creation is doing the same thing as ANALYZE (maybe as a side-effect), or it could be a caching issue. In the latter case, it is not certain that leaving out ANALYZE scales to a bigger data set.

comment:17 by Nicklas Nordborg, 12 years ago

(In [5888]) References #1630: Migrate from MySQL to PostgreSQL

Implemented some of the performance enhancements discussed above. The migration now works like this (a sketch of the per-table cycle follows the list):

  • Create tables (with primary key, since that is done by Hibernate). Foreign keys are skipped (just as before) but collected for later use.
  • Drop the primary key (keep info about name and columns)
  • Copy data into the database
  • Analyze the table
  • Re-create the primary key
  • When all tables in the static database have been filled with data, create the foreign keys collected in the first step.
  • Continue with the dynamic database using the same procedure as above.
  • Let Hibernate check if there are items missing (this used to create missing foreign keys, but should not do anything now).
  • Create missing unique indexes.
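
A hypothetical sketch of the per-table cycle described above; the table and constraint names are illustrative, not BASE's actual identifiers:

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class TableImportCycle {
    // Sketch: drop the primary key, bulk-load the data, analyze, then
    // re-create the primary key from the saved name and columns.
    static void importTable(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("ALTER TABLE \"Reporters\" DROP CONSTRAINT \"Reporters_pkey\"");
            // ... COPY data into "Reporters" here ...
            st.execute("ANALYZE \"Reporters\"");
            st.execute("ALTER TABLE \"Reporters\" "
                + "ADD CONSTRAINT \"Reporters_pkey\" PRIMARY KEY (id)");
        }
    }
}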

So... I hope this will perform better. We'll start another test run next week on a copy of the production server (by the way... it is still executing the same SQL... 28 hours and counting).

In the meantime, I'll see if I can clean up the code... it is beginning to look a bit messy...

comment:18 by Nicklas Nordborg, 12 years ago

Test migration finished in about 60 hours... I think we need to redo it with some optimizations next week.

comment:19 by Nicklas Nordborg, 12 years ago

(In [5889]) References #1630: Migrate from MySQL to PostgreSQL

Major refactoring of the migration functionality.

  1. We now create the schema just as for a normal installation. The migration import is then responsible for dropping things that may get in the way (foreign keys) or affect performance (primary keys, indexes, etc.). After the data has been imported it is the responsibility of the migration code to re-create everything that was removed. This approach feels a lot more robust than having (too) many different installation modes.

  2. Change the import to create its own database connection instead of relying on Hibernate. This means that we can use COPY...FROM STDIN, which means that we can now support remote database servers without relying on a common filesystem. Also, the database user no longer needs to be superuser (documentation needs to be updated). See the sketch after this list.

  3. Compress (with gzip) exported data and then decompress on the fly while importing. This can save a lot of disk space and maybe also increase performance due to less I/O.

  4. Progress reporting while importing has been improved. It should be updated much more often when possible.
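
A minimal sketch of item 2, assuming the PostgreSQL JDBC driver's CopyManager is what carries the COPY...FROM STDIN stream (the ticket doesn't say exactly how the import is implemented; file and table names are placeholders):

import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.util.zip.GZIPInputStream;
import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

class CopyImport {
    // Sketch: stream a gzipped dump into PostgreSQL over the wire with
    // COPY ... FROM STDIN, so no shared filesystem is needed.
    static void importTable(Connection conn, String file) throws Exception {
        CopyManager cm = conn.unwrap(PGConnection.class).getCopyAPI();
        try (InputStream in = new GZIPInputStream(new FileInputStream(file))) {
            cm.copyIn("COPY \"Reporters\" FROM STDIN", in);
        }
    }
}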

comment:20 by Nicklas Nordborg, 12 years ago

(In [5890]) References #1630: Migrate from MySQL to PostgreSQL

Updated the migration documentation.

comment:21 by Nicklas Nordborg, 12 years ago

(In [5891]) References #1630: Migrate from MySQL to PostgreSQL

Using a separate thread for gzip compression/decompression. Hopefully this will make the export/import a bit faster on multi-processor machines.
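
One common way to do this, shown as a sketch (the actual implementation in [5891] may differ): run decompression on its own thread and hand the bytes to the importer through a pipe, so the gzip work overlaps with the database write:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;

class ThreadedGunzip {
    // Sketch: decompress on a separate thread; the returned stream is
    // handed to the COPY call on the main thread.
    static InputStream open(String dumpFile) throws IOException {
        PipedInputStream forImport = new PipedInputStream(64 * 1024);
        PipedOutputStream fromGzip = new PipedOutputStream(forImport);
        Thread gunzip = new Thread(() -> {
            try (InputStream gz = new GZIPInputStream(new FileInputStream(dumpFile));
                 OutputStream out = fromGzip) {
                gz.transferTo(out); // available since Java 9
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        gunzip.setDaemon(true);
        gunzip.start();
        return forImport;
    }
}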

comment:22 by Nicklas Nordborg, 12 years ago

(In [5893]) References #1630: Migrate from MySQL to PostgreSQL

Made it possible to configure some options in the base.config file (need to write documentation!). It seems like the option to gzip the files is not performing so well. The compression is slower than reading from the database, so it takes longer to export. We save a lot of disk space, but the total time might be more important for the big production server.

comment:23 by Nicklas Nordborg, 12 years ago

(In [5894]) References #1630: Migrate from MySQL to PostgreSQL

Some enhancements for the export:

  • Use a fetch size for JDBC. The default is 20000 but this can be configured in base.config -> migrate.export.fetch-size. A larger size can improve performance but uses more memory (see the sketch after this list).
  • Write each table's data to a temporary file first. This means that an aborted/crashed export can be resumed where it left off (assuming that no changes have been made to the database).
  • A bit more efficient progress reporting and row counting.
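
A sketch of the fetch-size setup, assuming a driver that honors cursor-based fetching (for MySQL Connector/J this requires useCursorFetch=true on the connection; otherwise the hint is ignored):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class ExportStatement {
    // Sketch: stream rows in batches instead of materializing the whole
    // table in memory; the value mirrors migrate.export.fetch-size.
    static Statement create(Connection conn, int fetchSize) throws SQLException {
        Statement st = conn.createStatement(
            ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        st.setFetchSize(fetchSize); // e.g. the default of 20000
        return st;
    }
}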

comment:24 by Nicklas Nordborg, 12 years ago

Just a note about the performance of VirtualBox. We have been testing the migration on a VirtualBox virtual machine. At first the virtual machine only had a single CPU core assigned to it (out of 2*4). The performance was OK, but we figured it could be better, so we assigned 4 cores to the virtual machine. A few days passed, and when we returned to testing it was very slow. We didn't connect this with the number of CPUs, but after lots of other testing we switched back to a single CPU and performance got a lot better. With some other performance improvements found, the current implementation seems more or less limited by disk speed.

comment:25 by Nicklas Nordborg, 12 years ago

(In [5895]) References #1630: Migrate from MySQL to PostgreSQL

Documented the configuration options for the migration program.

comment:26 by Nicklas Nordborg, 12 years ago

(In [5896]) References #1630: Migrate from MySQL to PostgreSQL

Refactored the migration exporter to use special implementations for each column type, which can use more exact calls to get data from the database (e.g. getInt() instead of getObject()). String escaping has been made more efficient by using lookup arrays instead of regular expressions. Initial tests indicate a performance gain of 20-25%.
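
A hypothetical sketch of lookup-array escaping for COPY's text format (the actual table in [5896] may differ): precompute the escape sequences once, then scan each string in a single pass instead of applying regular expressions:

class CopyEscape {
    // Sketch: map each character that COPY's text format treats specially
    // to its escape sequence; everything else is copied through unchanged.
    private static final String[] ESCAPES = new String[128];
    static {
        ESCAPES['\\'] = "\\\\";
        ESCAPES['\n'] = "\\n";
        ESCAPES['\r'] = "\\r";
        ESCAPES['\t'] = "\\t";
    }

    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String esc = c < 128 ? ESCAPES[c] : null;
            if (esc != null) sb.append(esc); else sb.append(c);
        }
        return sb.toString();
    }
}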

A side-effect is that it should not be too difficult to make the exporter also generate data files that can be imported into MySQL. We would only need a few different implementations that format data a bit differently. The importer may require more work though, so this is nothing we are going to do right now.

Progress reporting is now weighted by the number of columns in each table. This seems to better reflect the actual time it takes. The reporter and raw data tables usually take much longer per row since they have more columns than other tables.

comment:27 by Nicklas Nordborg, 12 years ago

(In [5898]) References #1630: Migrate from MySQL to PostgreSQL

Need two more spaces to erase large numbers.

comment:28 by Nicklas Nordborg, 12 years ago

Resolution: fixed
Status: assigned → closed
Note: See TracTickets for help on using tickets.