Opened 6 years ago

Closed 6 years ago

#1618 closed enhancement (fixed)

Improve "Reporter search" performance in experiment explorer

Reported by: nicklas Owned by: nicklas
Priority: major Milestone: BASE 3.0
Component: core Version:
Keywords: Cc:


Since it seems like #1616 doesn't improve the performance I'll create a separate ticket for this.

Change History (8)

comment:1 Changed 6 years ago by nicklas

I made some changes to the query fetching the reporter information. Instead of starting out at the spot data table (D#Spot) I only included the position data table (D#Pos). It's a big change, and even the with the bioassayset with 100 bioassays the reporter search is almost instant (and I still have the 2.5M reporters in the reporter table!). Also with 500 bioassays it is quick enough (faster than it was with 10 bioassays before).

The drawback is that this approach returns too many reporters (see comment:24:ticket:903 about the second query). This makes the gui a bit confusing if analysis has filtered away a lot of positions in previous steps since the number of reporters matching in the "Reporter search" will not match match the number of reporters in the "Reporter view" tab.

comment:2 Changed 6 years ago by nicklas

One idea I had was that we should duplicate the reporter_id column in the D#Spot table so that we can join in the reporters directly instead of having to go via the D#Pos table. I have tested this and unfortunately it doesn't make much difference. It is slightly faster but only by 10-25% or so. The "reporter search" still feels slow. The query used is something like

   `rpt`.`id` AS `id`, 
   `rpt`.`name` AS `name`, 
   `rpt`.`external_id` AS `externalId` 
FROM `base217dynamic`.`D3Spotx` `spt` 
INNER JOIN `Reporters` AS rpt 
  ON = spt.`reporter_id` 
  `spt`.`cube` = 3 AND `spt`.`layer` = 1 
  <optional condition from "Reporter search">
ORDER BY `rpt`.`name` ASC

But this is still with 2.5M reporters in the reporter table. I'll see what happens if the number of reporters is less. If that is better then #1616 may help if combined with a duplicate reporter_id column.

comment:3 Changed 6 years ago by nicklas

Not much better even if the last test is repeated with a smaller (~50k) reporter table. The only conclusion is that it is the spot data table that is causing the slow performance. In the tests I have made this table contains around ~24M rows and it will continue to grow as the analysis tree gets bigger. The current size is with 4 root bioassay sets only (total 660 bioassays). So, I think we need to keep the spot data out of the "reporter search" functionality. I'll try some more things out...

comment:4 Changed 6 years ago by nicklas

I hope I finally some good news. I made some initial testing with a completely different approach for the "reporter search". The idea is to simply skip all joins to the dynamic tables and only use the reporter table. For this to work, we need to know which reporters that should be listed in the search result. But this is relatively easy if we build the reporter/position cache first (this used to be done when switching to the "reporter view" page) and then simply restrict the search by reporter-id. It's fairly quick and produces a result within a second or less even with 500 assays. The only drawback is that the query now only return exactly those reporters that are listed on the current page. The regular paging mechanism and other functions (eg. create reporter list) that operate on the complete result set will not work without modifications but that should be relatively easy to fix. Best of all, this fix doesn't require any additional info in the database tables, and no special upgrade script is needed.

I also discovered that the "reporter view" page probably can be improved by listing all bioassays as a restriction in the query (eg. WHERE column IN (<list of bioassays>)). Seems like the query optimizer can do a lot better job with the explicit information and find each row immediately without having to do a table scan. My guess is that it can now use the complete primary key information (cube, layer, column and position) instead of only (cube and layer).

My 500 assays test case went from 6 seconds to load a page down to less than a second. Here is the resulting query (from my 10 assays test case):

  `spt`.`column` AS `column`, `spt`.`position` AS `position`, 
  `spt`.`ch1` AS `ch1`, `spt`.`ch2` AS `ch2` 
FROM `base217dynamic`.`D3Spot` `spt` 
WHERE `spt`.`cube` = 1 AND `spt`.`layer` = 1 
  AND (`spt`.`column` IN (1,2,3,4,5,6,7,8,9,10)) 
  AND (`spt`.`position` = 13733) 
ORDER BY `spt`.`position` ASC

The spt.column IN (...) part is what was added. It doesn't change the result of the query, but seems to improve the speed a lot.

comment:5 Changed 6 years ago by nicklas

  • Owner changed from everyone to nicklas
  • Status changed from new to assigned

comment:6 Changed 6 years ago by nicklas

(In [5711]) References #1618: Improve "Reporter search" performance in experiment explorer

Different implementation on "Reporter search" that uses the cached reporter ids and use a regular query against the Reporters table only. As a result, the "Create reporter list" function needed to fixed as it can no longer use the query. Instead it also uses the cached reporter ids.

Added filter condition for bioassays when loading the "Reporter view" page since it seems to be a lot quicker (see comment 4).

Fixed an issue with generating incorrect SQL when filtering on a reporter list. I think this broke as part of #903, but has gone unnoticed since then. A join to the reporter list table was missing due to no auto-join support in DynamicPositionQuery and DynamicReporterQuery. Auto-join support has now been added to those query implementations.

Everything is a lot quicker now on my development machine. I have 2.5M reporters in the reporter table and an experiment with 24M rows of spot data in 500 bioassays. So far no action seems to take more than a couple of seconds to complete.

comment:7 Changed 6 years ago by nicklas

(In [5757]) References #1618: Improve "Reporter search" performance in experiment explorer

Removing debug output.

comment:8 Changed 6 years ago by nicklas

  • Resolution set to fixed
  • Status changed from assigned to closed

Tested with some data from the production server. Except for the initial stage which caches position mappings, clicking around in experiment explorer is <1 second to load a page. Here are some numbers:

  • Raw bioassays: 480
  • Number of spots/bioassay set: 23M or less
  • Number of total spots in experiment: 180M
  • Number of reporters in database: 2.5M
Note: See TracTickets for help on using tickets.