Opened 13 years ago

Closed 13 years ago

#837 closed defect (fixed)

Lowess plugin. Error: Sum of weigths in line_fit is not positive

Reported by: base
Owned by: everyone
Priority: trivial
Milestone: BASE 2.6
Component: coreplugins
Version:
Keywords:
Cc:

Description

Sorry if this should go to baseplugins trac, but as a core plugin maybe it should go here.

I just ran BASE2 lowess on 54 bioassays and got the error below (right at the end of the job). I think the problem could relate to having three different array designs in the same experiment. When I run on a subset with the same array design it works fine (I just ran it again to check).

cheers, Bob.

View job -- Run plugin: Normalization: Lowess
Name 	Run plugin: Normalization: Lowess
Description 	
Priority 	5 (1 = highest, 10 = lowest)
Status 	Error: Sum of weigths in line_fit is not positive
Percent complete 	100%
Created 	2007-11-22 11:58:24
Started 	2007-11-22 11:58:39
Ended 	2007-11-22 12:10:50
Server 	bio-iisrv1.bio.ic.ac.uk
User 	Bob MacCallum
Experiment 	- none -
Plugin 	Normalization: Lowess
Configuration 	- none -

net.sf.basedb.core.BaseException: Sum of weigths in line_fit is not positive
at net.sf.basedb.plugins.LowessNormalization.weightedLeastSquaresRegression(LowessNormalization.java:574)
at net.sf.basedb.plugins.LowessNormalization.lowess(LowessNormalization.java:487)
at net.sf.basedb.plugins.LowessNormalization.run(LowessNormalization.java:297)
at net.sf.basedb.core.PluginExecutionRequest.invoke(PluginExecutionRequest.java:89)
at net.sf.basedb.core.InternalJobQueue$JobRunner.run(InternalJobQueue.java:421)
at java.lang.Thread.run(Thread.java:619)

Job parameters

Blockgroup size 	1
Child description 	
Child name 	All hybs - bg subtracted - no bad spots - lowess
Minimum log(intensity) step 	0.1
Window size (fraction of points) 	0.33
Iterations 	4
Source bioassay set 	All hybs - bg subtracted - no bad spots

BASE version info:

Version  	BASE 2.4.6pre (build #3938; schema #40)
Web server 	Apache Tomcat/5.5.20
Database Server 	MySQL 5.0.21-max-log
Database Dialect 	org.hibernate.dialect.MySQLInnoDBDialect
JDBC Driver 	com.mysql.jdbc.Driver (version 5.0)
Java runtime 	Java(TM) SE Runtime Environment (1.6.0-b105), Sun Microsystems Inc.
Operating system 	Linux amd64 2.6.16.53-0.16-smp
Memory 	Total: 359.1 MB, Free: 99.1 MB, Max: 910.3 MB

Attachments (1)

base_lowess.cc (5.9 KB) - added by Nicklas Nordborg 13 years ago.
BASE 1 implementation of lowess


Change History (11)

comment:1 Changed 13 years ago by Nicklas Nordborg

I don't think it has anything to do with the array designs being different. Lowess works on a single bioassay at a time and the algorithm doesn't depend on the array design information. I am pretty sure that the bug is triggered by the specific data in one (or more) of the bioassays.

I took a quick look at the code and I think there is a risk that "Sum of weigths" can be 0. The weights are calculated on line 605 and if 'distance < 1' for all values in the list then all weights will also be 0. I don't know enough about the Lowess algorithm to say if it is possible that all distances are < 1 or not. Maybe if there is only a single value in the list... There seems to be code doing medians, least squares, etc. and as far as I know that doesn't work very well if there are not enough data points.

Does that give you any clue? Can you find any bioassay which has few data points? The Lowess plug-in works on each block separately so you should look for a bioassay with few data points in a single block.
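
To illustrate how the weight sum can become zero, here is a minimal Java sketch. It is not the code from LowessNormalization.java: the class and method names are made up, and it assumes the textbook tricube weight (zero once the scaled distance reaches 1), which may differ from what line 605 actually does.

```java
// Hypothetical sketch; not the BASE plug-in code.
final class TricubeWeightsSketch {

    /**
     * Tricube weights for the points xs around x0, with distances
     * scaled by the largest distance found in the window (assumption).
     */
    static double[] weights(double[] xs, double x0) {
        double maxDistance = 0.0;
        for (double x : xs) {
            maxDistance = Math.max(maxDistance, Math.abs(x - x0));
        }
        double[] w = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            // For a single point this is 0.0 / 0.0, which is NaN.
            double d = Math.abs(xs[i] - x0) / maxDistance;
            // NaN < 1.0 is false, so the weight falls through to 0.
            w[i] = d < 1.0 ? Math.pow(1.0 - d * d * d, 3) : 0.0;
        }
        return w;
    }

    /** Plain sum, as a line fit would need before dividing by it. */
    static double sum(double[] values) {
        double s = 0.0;
        for (double v : values) {
            s += v;
        }
        return s;
    }
}
```

Under these assumptions, a list with a single value gives a largest distance of 0, a scaled distance of NaN and a weight of 0, so the sum is not positive. With two values, evaluated at one of them, the sum is 1 and a fit can proceed.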

comment:2 Changed 13 years ago by Nicklas Nordborg

Lines 279-280 make sure that a block is ignored if there are no data points. Maybe the limit should be 1?

comment:3 Changed 13 years ago by Johan Enell

This is not a defect but a limitation in the algorithm. The dataset that weightedLeastSquaresRegression is working on is too small, and it is too small because you have set the parameter 'Blockgroup size' to 1. That means that Lowess is working on one block at a time. When I use it in my analysis I set that value to 16, 24, or 48 when using an ArrayDesign that contains 48 blocks. In your case, with different ArrayDesigns, I would set that value to the largest number of blocks in the designs.

Having 1 as the default value is not good and I have opened a ticket on that, #838.
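
As a rough illustration of why 'Blockgroup size' matters, the sketch below counts how many spots end up in each block group. It assumes that blocks are numbered from 1 and that consecutive blocks are grouped together; the plug-in may group blocks differently, and all names here are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the real grouping in the plug-in may differ.
final class BlockGroupSketch {

    /** Zero-based group index for a 1-based block number (assumed scheme). */
    static int groupOf(int block, int blockGroupSize) {
        if (block < 1 || blockGroupSize < 1) {
            throw new IllegalArgumentException("block and blockGroupSize must be >= 1");
        }
        return (block - 1) / blockGroupSize;
    }

    /** Counts the spots per block group, given the block number of each spot. */
    static Map<Integer, Integer> spotsPerGroup(int[] blockOfSpot, int blockGroupSize) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int block : blockOfSpot) {
            counts.merge(groupOf(block, blockGroupSize), 1, Integer::sum);
        }
        return counts;
    }
}
```

With a group size of 1, every block is normalized on its own, so a block with only a handful of unfiltered spots forms a tiny data set. With a group size equal to the number of blocks, all spots land in one group.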

comment:4 Changed 13 years ago by Nicklas Nordborg

I think it is a defect since it obviously produces a result for some bioassays in the bioassay set but not all. I think the plug-in should filter out/ignore data sets that are too small to be usable, that is, whenever the algorithm is fed too few data points (because of splitting into blocks, heavy filtering, or any other reason).

I checked the BASE 1 plug-in code and it seems to behave differently: it only outputs a warning if that happens. It still produces a result for the cases that work. I attach the BASE 1 code.

Changed 13 years ago by Nicklas Nordborg

Attachment: base_lowess.cc added

BASE 1 implementation of lowess

comment:5 Changed 13 years ago by Nicklas Nordborg

I have investigated the code a bit more and done some test runs. I have found that the calculation fails if there are not enough data points in a block group to make the windowSize > 1. Even I, with limited statistical knowledge, can understand that doing a least-squares fit with only one data point is rather useless.

The calculations seem to work if windowSize >= 2, but... How many data points are really needed in a window to make Lowess useful? With only two data points in a window the least-squares fit will not change anything?

How should we handle this in the plug-in? It is easy to check the window size and filter out block groups that don't contain enough data points.

Why does the block group parameter exist in the first place? Shouldn't we always use all spots in a bioassay?
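
The relation between the number of data points and windowSize can be sketched as follows. This assumes that the window size is the 'Window size (fraction of points)' parameter times the number of points, rounded down; the rounding actually used by the plug-in is not shown in this ticket, and the names are hypothetical.

```java
// Hypothetical sketch; the plug-in's own rounding may differ.
final class WindowSizeSketch {

    /** Window size as a fraction of the points in a block group, rounded down. */
    static int windowSize(int numPoints, double fraction) {
        return (int) Math.floor(fraction * numPoints);
    }

    /** Smallest number of points that reaches the given minimum window size. */
    static int minPointsFor(int minWindowSize, double fraction) {
        if (fraction <= 0.0) {
            throw new IllegalArgumentException("fraction must be positive");
        }
        int n = 0;
        while (windowSize(n, fraction) < minWindowSize) {
            n++;
        }
        return n;
    }
}
```

Under that assumption and the fraction 0.33 from the job parameters, a block group with 6 points gives a window of 1 point, and at least 7 points are needed to reach windowSize >= 2.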

comment:6 in reply to:  5 Changed 13 years ago by Johan Enell

Replying to nicklas:

The calculations seem to work if windowSize >= 2, but... How many data points are really needed in a window to make Lowess useful? With only two data points in a window the least-squares fit will not change anything?

If we make a filter, I think that we are only obliged to make sure that the algorithm runs. Set the filter on windowSize >= 2. To help the user I think that we should print an error if windowSize is less than 100.
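
The two thresholds suggested here could be combined into a single check, roughly as in the sketch below. The enum, the constants and the method are hypothetical names for illustration, not the change that was eventually committed.

```java
// Hypothetical sketch of the proposed filter; not the committed fix.
final class WindowCheckSketch {

    enum Decision { SKIP, WARN, OK }

    /** Below this the weighted fit cannot run at all. */
    static final int MIN_WINDOW = 2;

    /** Below this the fit runs, but the user should be told the window is small. */
    static final int WARN_BELOW = 100;

    static Decision check(int windowSize) {
        if (windowSize < MIN_WINDOW) {
            return Decision.SKIP;
        }
        if (windowSize < WARN_BELOW) {
            return Decision.WARN;
        }
        return Decision.OK;
    }
}
```

A block group would then be skipped for a window of 0 or 1, normalized with a message for windows from 2 to 99, and normalized silently from 100 upwards.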

Why does the block group parameter exist in the first place? Shouldn't we always use all spots in a bioassay?

In most cases you want to run lowess on all spots. That's why I opened #838. But in some cases your slides have been affected by some technical issues and you don't want those to affect the normalization. This is of course a critical decision because it affects the algorithm.

comment:7 Changed 13 years ago by base

Bob here...

You are right, it's not the different array designs.

I've narrowed it down to a set of 14 hybs (same design). None of them look particularly strange - indeed they all lowessed fine in BASE1. Some of them look very centred around zero already, could this be causing the problem?

I'm just running lowess on already-lowessed bioassays to see if this triggers the bug. No, that worked fine.

A little more info: The bioassayset has 14 hybs, 41216 spots and 3386 reporters. The overview plots all have roughly the same number of spots (around 3000).

Will look into this more later today.

comment:8 Changed 13 years ago by Johan Enell

Milestone: BASE 2.6

A more informative printout is needed explaining why the normalization was aborted. The default block number should be 0 (zero).

comment:9 Changed 13 years ago by Nicklas Nordborg

Priority: major → trivial

comment:10 Changed 13 years ago by Nicklas Nordborg

Resolution: fixed
Status: new → closed

(In [4062]) Fixes #837: Lowess plugin. Error: Sum of weigths in line_fit is not positive

Changed the error message to something more understandable.
