Opened 16 years ago

Closed 15 years ago

#903 closed enhancement (fixed)

Experiment Explorer is slow

Reported by: base Owned by: Nicklas Nordborg
Priority: major Milestone: BASE 2.11
Component: core Version:
Keywords: Cc:

Description

Here's some code which is pretty much identical to http://base.thep.lu.se/browser/branches/2.5-stable/src/clients/web/net/sf/basedb/clients/web/ExperimentExplorer.java around line 730...SPOT_ALL

DynamicSpotQuery spot_query = bas.getSpotData();
spot_query.joinReporters(JoinType.INNER);
spot_query.restrict(
	Restrictions.eq(
		Dynamic.reporter("id"),
		Expressions.integer(reporter.getId())
		));
DynamicResultIterator spots = spot_query.iterate(dc);

When I run the above code and grab the SQL from mysql "show processlist", and run it, it takes 1.7 seconds (there are 54 bioassays in the bioassayset). The experiment explorer GUI also takes a few seconds to go from one reporter to another.

Here's the "explain".

EXPLAIN SELECT spt.column AS column, spt.position AS position, spt.ch1 AS ch1, spt.ch2 AS ch2 FROM base2dynamicdev.D2Spot spt INNER JOIN base2dynamicdev.D2Filter AS flt ON flt.cube = spt.cube AND flt.column = spt.column AND flt.position = spt.position INNER JOIN base2dynamicdev.D2Pos AS pos ON pos.cube = spt.cube AND pos.position = spt.position INNER JOIN Reporters AS rpt ON rpt.id = pos.reporter_id WHERE spt.cube = 3 AND spt.layer = 1 AND flt.filter = 1 AND (rpt.id = 12003)

+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------------------------------------+--------+--------------------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref                                                                 | rows   | Extra                    |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------------------------------------+--------+--------------------------+
|  1 | SIMPLE      | rpt   | const  | PRIMARY       | PRIMARY | 4       | const                                                               |      1 | Using index              |
|  1 | SIMPLE      | spt   | ref    | PRIMARY       | PRIMARY | 4       | const,const                                                         | 482406 |                          |
|  1 | SIMPLE      | pos   | eq_ref | PRIMARY       | PRIMARY | 6       | const,base2dynamicdev.spt.position                                  |      1 | Using where              |
|  1 | SIMPLE      | flt   | eq_ref | PRIMARY       | PRIMARY | 10      | const,const,base2dynamicdev.spt.column,base2dynamicdev.pos.position |      1 | Using where; Using index |
+----+-------------+-------+--------+---------------+---------+---------+---------------------------------------------------------------------+--------+--------------------------+

I'm using BASE 2.4.5 with some indices removed during a recent upgrade (by running dropindexes.sh - I can't figure out easily if that is relevant or not right now). If there are any relevant changes in 2.5 I will make the upgrade my number 1 priority and rerun these tests.

I notice also that a root bioassayset (where there is no need to query a filter layer) is not actually any quicker (also around 1.7 seconds).

EXPLAIN SELECT spt.column AS column, spt.position AS position, spt.ch1 AS ch1 , spt.ch2 AS ch2 FROM base2dynamicdev.D2Spot spt INNER JOIN base2dynamicdev.D2Pos AS pos ON pos.cube = spt.cube AND pos.position = spt.position INNER JOIN Reporters AS rpt ON rpt.id = pos.reporter_id WHERE spt.cube = 3 AND spt.layer = 1 AND (rpt.id = 15295)

+----+-------------+-------+--------+---------------+---------+---------+------------------------------------+--------+-------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref                                | rows   | Extra       |
+----+-------------+-------+--------+---------------+---------+---------+------------------------------------+--------+-------------+
|  1 | SIMPLE      | rpt   | const  | PRIMARY       | PRIMARY | 4       | const                              |      1 | Using index |
|  1 | SIMPLE      | spt   | ref    | PRIMARY       | PRIMARY | 4       | const,const                        | 482406 |             |
|  1 | SIMPLE      | pos   | eq_ref | PRIMARY       | PRIMARY | 6       | const,base2dynamicdev.spt.position |      1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+------------------------------------+--------+-------------+

Does anyone else get this behaviour?

many thanks,

Bob.

Change History (30)

comment:1 by base, 16 years ago

Just to clarify... there's some table scanning going on which I guess could be avoided through some smart indexing.

comment:2 by Nicklas Nordborg, 16 years ago

The indexes that were removed were just copies of the primary key and made insertions up to 100 times slower in some cases... See #796.

I think a query time of 1.7 seconds is a small price to pay if for example an analysis plug-in run in 30 minutes instead of several hours. Unless anyone come up with any working ideas, I am going to close this as "wontfix".

comment:3 by base, 16 years ago

Thanks for the extra info and link to tickets/bugfixes.

It seems like once all analysis on an Experiment is done, the corresponding VirtualDB could be "signed off" and the relevant indexes added (manually at the mysql prompt) to speed up this retrieval? In our case it only seems necessary on the big experiments.

I will see if that works...

comment:4 by base, 16 years ago

Here are the indexes that seem to restore Experiment Explorer read-efficiency:

CREATE INDEX tmp ON base2dynamic.DXFilter(position); CREATE INDEX tmp ON base2dynamic.DXPos(reporter_id);

This only gives excellent speedup on filtered, processed BioAssaySets. Browsing the root BioAssaySet is still quite slow. Non-filtered BioAssaySets are not quite as fast. I guess a couple more indexes are needed for those cases...

comment:5 by base, 16 years ago

Side note: In BASE1 I often whizzed through a smallish (up to 100) reporter list in the experiment explorer coloured by an annotation and sorted by ratio. You can easily see a differentially expressed gene. The 1.7 seconds would make this painful.

comment:6 by base, 16 years ago

This fixes exploring the root bioassayset and filterless ones:

CREATE INDEX tmp ON base2dynamicdev.DXSpot(position);

and doesn't break the others.

Please could you show me how to enable these three indexes in the source code? I'll try a fresh 2.5 installation and see what the analysis write performance is like. I certainly remember it sometimes being painful before the dropindexes fix.

many thanks bob.

comment:7 by Nicklas Nordborg, 16 years ago

The code is in VirtualColumn.java. Check [3810] for changes that disabled the index creation.

The indexes will not be re-added because of the performance problems. Our biggest test run (when the indexes were still active) with 800+ bioassays had been running for 2 days and reach just over 50% when the server gave up and it was barely possible to ssh to it. After the indexes were removed the same code took 3 hours.

One idea is to implement an option to mark an experiment as read-only. This could create some indexes (possible user-selectable) but also disable adding data to it. The index creation will probably need to be some kind of scheduled action which could be done at night or during the weekend, and only one experiment at a time.

comment:8 by base, 16 years ago

Thanks for the info and those are interesting suggestions for possible implementations.

If I enable indexing on 'position' by modifying line 94 of VirtualColumn.java, won't this enable indexing of that column in all four dynamic tables? It looks like only two need the index (from my limited trials).

There is no such problem for 'reporter_id', this column is only found in DXPos;

Could a small number of indexes as I have suggested still jeopardise writing to the tables? I'm happy to benchmark this on our medium-sized test case and report back. It's worth a shot, surely?

Bob.

comment:9 by Nicklas Nordborg, 16 years ago

If I enable indexing on 'position' by modifying line 94 of VirtualColumn?.java, won't this enable indexing of that column in all four dynamic tables?

Yes, it will add the index to all tables where 'position' column is used.

Could a small number of indexes as I have suggested still jeopardise writing to the tables?

The only way is to try it out. We removed 7 (I think) indexes from the D#Spot and D#Pos tables. I guess 3 indexes will not be as bad as 7, but I don't know were the limit is. I guess a lot depends on how much data you have and on the hardware.

comment:10 by base, 16 years ago

I am now doing some testing in BASE2.5, with the following modification to VirtualColumn:

Index: VirtualColumn.java
===================================================================
--- VirtualColumn.java  (revision 4133)
+++ VirtualColumn.java  (working copy)
@@ -91,7 +91,7 @@
        */
        public static final VirtualColumn POSITION = 
                new VirtualColumn("position", "position", Hibernate.INTEGER, 0, 
-                       !INDEXED, !NULLABLE, PRIMARY_KEY);
+                       INDEXED, !NULLABLE, PRIMARY_KEY);
 
        /**
                This column stores the data cube filter number.
@@ -116,7 +116,7 @@
        */
        public static final VirtualColumn REPORTER_ID =
                new VirtualColumn("reporterId", "reporter_id", Hibernate.INTEGER, 0, 
-                       !INDEXED, NULLABLE, !PRIMARY_KEY);
+                       INDEXED, NULLABLE, !PRIMARY_KEY);
 
        /**
                This column stores the id of a raw date object. For perfomance reasons it is

will post the results in a few minutes...

comment:11 by base, 16 years ago

Version  	BASE 2.5.2pre (build #4088; schema #46)
Web server 	Apache Tomcat/5.5.20
Database Server 	MySQL 5.0.21-max-log
Database Dialect 	org.hibernate.dialect.MySQLInnoDBDialect
JDBC Driver 	com.mysql.jdbc.Driver (version 5.0)
Java runtime 	Java(TM) SE Runtime Environment (1.6.0-b105), Sun Microsystems Inc.
Operating system 	Linux amd64 2.6.16.54-0.2.5-smp
Memory 	Total: 247.8 MB
Free: 193.7 MB
Max: 910.3 MB

CPU = 2 x AMD Opteron(tm) Processor 250 @ 2400.000 MHz

Results with position and reporter_id indexes:

56 GenePix bioassays x 19806 spots, root bioassayset creation Median FG - Median BG: 17 minutes

Filter on raw('flags')>=0: 2 minutes

And then the same thing on the same hardware in BASE2.4 without the extra indexes:

Version  	BASE 2.4.6pre (build #3938; schema #40)
Web server 	Apache Tomcat/5.5.20
Database Server 	MySQL 5.0.21-max-log
Database Dialect 	org.hibernate.dialect.MySQLInnoDBDialect
JDBC Driver 	com.mysql.jdbc.Driver (version 5.0)
Java runtime 	Java(TM) SE Runtime Environment (1.6.0-b105), Sun Microsystems Inc.
Operating system 	Linux amd64 2.6.16.54-0.2.5-smp
Memory 	Total: 900.1 MB
Free: 523.5 MB
Max: 910.3 MB

56 GenePix bioassays x 19806 spots, root bioassayset creation Median FG - Median BG: 12 minutes

Filter on raw('flags')>=0: 24 seconds

Take home message:

there's still quite some performance hit with just these 2 columns indexed, but we could probably live with it to avoid the hassle of adding indices for improved read performance.

comment:12 by Nicklas Nordborg, 16 years ago

Thanks for help with testing this. The results for the filter plugin seems to be near the same results as we have. The surprising thing is the big difference between the filter plugin and the root bioassayset creation. Do you have many data sets in the raw data table? Are all 56 raw data sets using the same array design? If not, the performance of the root bioassayset creation will be worse, but I don't think it can explain the entire difference.

In our tests we only had the test data and the root bioassayset creation took approximately 1.5 -- 2 times longer than the filter plugin. This is without any extra indexes. With the extra indexes the filter plugin doubles the time, but the root bioassayset creation could take up to 10 times longer.

comment:13 by base, 16 years ago

No problem. Happy to do what ever testing time permits.

I did forget to mention that the 56 bioassays are a 50/50 mix of 20k and 4k arrays. I also can't guarantee that the dynamic tables were holding equivalent amounts of data at the time of testing. I think they were because each test was done in a new experiment. My tests aren't perfect though: BASE2.4_without_indexes vs BASE2.5_with_some_indexes.

I do notice that it takes a few minutes to delete bioassaysets (from the trash) with these indexes added back. I have not benchmarked this against no_indexes though.

Feel free to close this or assign to BASE2.9 or something.

We have to patch our BASE code anyway (to modify the reporter columns, config files, migration code) so we are happy to include the VirtualColumn index patch in every major update (we hope to do svn update for minor updates).

Thanks for all your help with this.

PS. We hope to show you and other BASE folk our simplified API (built on your API) and on-the-fly analysis web front end (built with our API) soon.

comment:14 by Nicklas Nordborg, 16 years ago

I guess deletions may suffer from the same problems as insertions.

I think there is no correct answer for which columns to index. It will always depend on the type of queries you are running and the tests clearly show that it is no possible to optimize all cases.

The best would probably be to expose some kind of functionality that allows users to add/remove indexes manually. A "Performance FAQ" could list some tips. Perhaps we need a special "CREATE INDEX" permission, that is granted to, for example, Power users. The index creation can probably be a big performance hog. Indexes should probably count against a user's quota as well. Our test database shrunk with about a third after the indexes were removed.

I'll keep this ticket open until we either decide to do nothing or to implement something like the above. Or, if someone comes up with a different idea....

comment:15 by Jari Häkkinen, 16 years ago

Milestone: BASE 2.x+

comment:16 by Jari Häkkinen, 16 years ago

After my move to a more BASE user based position I begin to understand that the experiment explorer is almost unusable in en experiment where one has done many steps of analysis. I think we should lift this ticket from 2.x to something more definite like 2.9 or 2.10.

comment:17 by Nicklas Nordborg, 15 years ago

Milestone: BASE 2.x+BASE 2.11

comment:18 by Nicklas Nordborg, 15 years ago

Type: defectenhancement

comment:19 by Nicklas Nordborg, 15 years ago

I have identified five different queries that are used in the experiment overview:

The first two are related to loading reporter information:

SELECT COUNT(DISTINCT `rpt`.`id`)
FROM `base2dynamic`.`D1Spot` `spt` 
INNER JOIN `base2dynamic`.`D1Pos` AS `pos` 
  ON `pos`.`cube` = `spt`.`cube` AND `pos`.`position` = `spt`.`position`
INNER JOIN `Reporters` AS rpt 
  ON rpt.id = pos.`reporter_id` 
WHERE `spt`.`cube` = 3 AND `spt`.`layer` = 1 
<optional additional filter conditions>;
SELECT DISTINCT `rpt`.`id` AS `id`, `rpt`.`external_id` AS `sorted0` 
FROM `base2dynamic`.`D1Spot` `spt` 
INNER JOIN `base2dynamic`.`D1Pos` AS `pos` 
  ON `pos`.`cube` = `spt`.`cube` AND `pos`.`position` = `spt`.`position`  
INNER JOIN `Reporters` AS rpt 
  ON rpt.id = pos.`reporter_id`  
WHERE `spt`.`cube` = 3 AND `spt`.`layer` = 1 
<optional additional filter conditions>
ORDER BY `rpt`.`external_id` ASC 
limit 100;

The first query is counting the number of distinct reporters used in a bioassay set. It optionally includes a filter condition that is set on the "Reporter search" tab. The ORDER BY clause can also very depending on which column the user selects on the tab.

The second query loads the reporter information for the reporters in the bioassay set. The result of the two queries are cached.

The next three are related to loading spot information

SELECT COUNT(DISTINCT `spt`.`position`) AS `posCount` 
FROM `base2dynamic`.`D1Spot` `spt` 
LEFT JOIN `base2dynamic`.`D1Pos` AS `pos` 
  ON `pos`.`cube` = `spt`.`cube` AND `pos`.`position` = `spt`.`position`  
LEFT JOIN `Reporters` AS rpt 
  ON rpt.id = pos.`reporter_id`  
WHERE `spt`.`cube` = 3 AND `spt`.`layer` = 1 AND (`rpt`.`id` = 35866);

This query simply counts the number of distinct positions were a given reporter is found. This information is not cached, and the query is repeated on every page request.

SELECT DISTINCT `spt`.`position` AS `position` 
FROM `base2dynamic`.`D1Spot` `spt` 
LEFT JOIN `base2dynamic`.`D1Pos` AS `pos` 
  ON `pos`.`cube` = `spt`.`cube` AND `pos`.`position` = `spt`.`position`  
LEFT JOIN `Reporters` AS rpt 
  ON rpt.id = pos.`reporter_id`  
WHERE `spt`.`cube` = 3 AND `spt`.`layer` = 1 AND (`rpt`.`id` = 35866) 
ORDER BY `spt`.`position`  ASC 
limit 1, 1;

This query is used to load the position number of the current position. The parameters in the 'limit' clause varies depending on which position we want to look at. Eg. between 1 and the count found in the previous query. This information is not cached and the query is repeated on every page request.

SELECT 
  `spt`.`column` AS `column`, `spt`.`position` AS `position`, 
  `spt`.`ch1` AS `ch1`, `spt`.`ch2` AS `ch2` 
FROM `base2dynamic`.`D1Spot` `spt` 
WHERE `spt`.`cube` = 3 AND `spt`.`layer` = 1 AND (`spt`.`position` = 25175) 
ORDER BY `spt`.`position` ASC;

This is the final query that loads the actual spot data. It may look very different depending on which columns the user decides to display.

My first though is that we should cache and/or preload a lot more information. It should be possible to at least pre-load all reporter id:s and their matching positions. It should not pose any problem to keep that information in memory, and if it does, it would probablt be better to use a temporary file as a cache instead of querying the database on every page request. That should leave only the last query to worry about.

comment:20 by Nicklas Nordborg, 15 years ago

Owner: changed from everyone to Nicklas Nordborg
Status: newassigned

comment:21 by Nicklas Nordborg, 15 years ago

(In [4808]) References #903: Experiment Explorer is slow

Increased the cache to include all matching reporters and their positions. This means that we can get rid of 2 of the 3 queries that are loading spot/position information. I have no measurements of the performance increase, but the it feels a bit faster.

The initial caching of reporter/position information still takes some time. The information we need is very similar to Export 2 test in ticket #1238 and the time needed is about the same as after the fix from [4807].

The loading of spot data also takes some time when the number of bioassays grow large. I get around 3 seconds for 80 bioassays on 'auster'.

comment:22 by Nicklas Nordborg, 15 years ago

(In [4809]) References #903: Experiment Explorer is slow

Sort positions in ascending order.

comment:23 by Nicklas Nordborg, 15 years ago

I'll continue by checking the two remaining queries. The first query is used to load reporter/position infomation into the cache. It is only done when first entering Experiment explorer or when the "Reporter search" filter/sort order has changed.

SELECT 
  DISTINCT `pos`.`reporter_id` AS `reporterId`, `spt`.`position` AS `position`, `rpt`.`external_id` AS `sorted0` 
FROM `base2dynamic`.`D1Spot` `spt` 
LEFT JOIN `base2dynamic`.`D1Pos` AS `pos` 
  ON `pos`.`cube` = `spt`.`cube` AND `pos`.`position` = `spt`.`position`  
LEFT JOIN `Reporters` AS rpt 
  ON rpt.id = pos.`reporter_id`  
WHERE `spt`.`cube` = 4 AND `spt`.`layer` = 1 
<additional filter conditions as specified on the "Reporter search" tab>
ORDER BY `rpt`.`external_id` ASC, position ASC;

NOTE! The first colum in the "ORDER BY" part depends on the column used for sorting in the "Reporter search". The performance of this query is similar to the "Export 2" test in ticket #1238 and the time needed is about the same as after the fix from [4807]. As anyone can see the query is not a very fast one. It needs several seconds or even minutes on large data sets.

The second query is used to load spot information. It can take different forms, depending on if it is a filtered bioassay or not. The query may also have joins to the raw data table depending on which columns the user has selected to display. This query is executed on each page request.

SELECT 
  `spt`.`column` AS `column`, `spt`.`position` AS `position`, 
  `spt`.`ch1` AS `ch1`, `spt`.`ch2` AS `ch2` 
FROM `base2dynamic`.`D1Spot` `spt` 
WHERE `spt`.`cube` = 4 AND `spt`.`layer` = 1 AND (`spt`.`position` = 1531) 
ORDER BY `spt`.`position` ASC;

comment:24 by Nicklas Nordborg, 15 years ago

Regarding the first reporter/position caching query, I think there are some improvements that can be made. It seems like the major performance hit is on the join beteween D1Spot and D1Pos. We can make this go away if we split the query into two:

First, load all used position numbers in the current bioassays set:

select 
  distinct `spt`.`position` 
from `base2dynamic`.`D1Spot` `spt` 
where 
  `spt`.`cube` = 4 and `spt`.`layer` = 1

This query also takes some time but not as much as the original. The best is that this information is static. We can cache it as long as we like since it will never change. Thus, the query doesn't need to be repeated.

The second query:

SELECT 
  DISTINCT `pos`.`reporter_id` AS `reporterId`, `pos`.`position` AS `position`, `rpt`.`external_id` AS `sorted0` 
FROM `base2dynamic`.`D1Pos` `pos` 
LEFT JOIN `Reporters` AS rpt 
  ON rpt.id = pos.`reporter_id`  
WHERE `pos`.`cube` = 4 
ORDER BY `rpt`.`external_id` ASC;

This is more or less the same as the original query except that we don't use D1Spot anymore. This query is also rather quick, but it gives us too many results. We need to apply a filter in code that simply ignores position numbers that doesn't exists at the current bioassay set. For example, because they have been filtered away in an earlier step.

comment:25 by Nicklas Nordborg, 15 years ago

(In [4815]) References #903: Experiment Explorer is slow

Testing the idea of splitting reporter/position cache query.

comment:26 by Nicklas Nordborg, 15 years ago

I have no measured times on this by it feels a bit better. I would say that loading times for the first page is about half of what it was before. Moving back and forth between reporters are a lot quicker. When data sets grow larger (80+ bioassays) the execution time of the second query becomes noticable. But to improve that I think an index on the 'position' column is needed (the same thing was pointed out earlier in this ticket by Bob). I'll try this and will also check what effect this has on creating new bioassay sets and filtering.

comment:27 by Nicklas Nordborg, 15 years ago

(In [4817]) References #903: Experiment Explorer is slow

Testing the idea of using an index on the D#Spot.position column.

comment:28 by Nicklas Nordborg, 15 years ago

Results from 'grey'

# = Number of assays R = No index I = With index 1 = First run 2 = Second run

#	R1	R2	I1	I2
------------------------------------
10	0.41	0.20	0.29	1.29
20	1.10	1.17	2.29	4.21
40	3.57	2.44	5.15	13.57
80	6.04	7.33	7.40	39.10
160	18.42	13.39	18.53	20.33
320	31.45	36.57	35.16	44.48
640	71.07	78.37	81.31	82.20
  • Execution time for root bioassay goes up. Expect at least 25-50% longer times, but occasionally over 500%.
  • The size of the database file corresponding to the D#Spot table increases with about 75%. Space taken by the index is about 40% of the total file size.
  • Experiment explorer is considerably faster for large experiments (80+ bioassays). There is no noticable delay when clicking around as was the case before.

I have not tested the filter or Lowess plug-in. I don't expect filtering to be affected, since it doesn't insert data into the table with the index. Lowess would be affected just as much as the root bioassay set creation, but since ticket #1257 has changed the Lowess implementation we can't compare the measurments any more.

comment:29 by Nicklas Nordborg, 15 years ago

So, a decision needs to be made. Should we have an index on the 'position' column or not?

I vote for not having an index. The main reason is the occasional performance drop that affects the entire server for large periods. Even though it doesn't happen very often in a single experiment, on a server which hosts multiple experiments it would happen often enough to annoy other users that are trying to use the server. The performance penalty of not having the index in experiment explorer is not big and will only affect the server momentarily. In any case, the caching of reporter/position information will still be the major performance bottleneck and this is not affected by the index.

Another reaseon is that the index is only added to new experiments. Existing experiments will either have to do without an index or we need to execute some SQL to create the index. I have done that on 'grey' which needed almost 5 hours to create the index on a 10Gb file. On the production server the total size of tables that needs the index is over 120Gb.

comment:30 by Nicklas Nordborg, 15 years ago

Resolution: fixed
Status: assignedclosed

(In [4829]) Fixes #903: Experiment Explorer is slow

Store list of position in the static file cache. Removed index on 'position' column.

Note: See TracTickets for help on using tickets.