Opened 18 years ago

Closed 17 years ago

#573 closed defect (fixed)

Trim whitespace when checking for unique values

Reported by: Nicklas Nordborg
Owned by: Nicklas Nordborg
Priority: major
Milestone: BASE 2.4
Component: core
Version:
Keywords:
Cc:

Description (last modified by Nicklas Nordborg)

Relates to ticket:574 and ticket:469

From the mailing list by Bob MacCallum:
http://sourceforge.net/mailarchive/forum.php?thread_name=17964.59444.651455.264772%40bio-iisrv1.bio.ic.ac.uk&forum_name=basedb-users

  1. I think there's some inconsistent handling of trailing spaces in the "reporter ID" column of a genepix .gpr file. For example I can import reporters, and create an array design from the file pasted below, but I can't then import the raw data!

(the following is just 8 lines long - if the long lines get mangled, I'll send a copy by mail on request)

> ATF    1.0
> 27    43    Type=GenePix Results 1.4
> "Block"    "Column"    "Row"    "Name"    "ID"    "X"    "Y"    "Dia."    "F635 Median"    "F635 Mean"    "F635 SD"    "B635 Median"    "B635 Mean"    "B635 SD"    "% > B635+1SD"    "% > B635+2SD"    "F635 % Sat."    "F532 Median"    "F532 Mean"    "F532 SD"    "B532 Median"    "B532 Mean"    "B532 SD"    "% > B532+1SD"    "% > B532+2SD"    "F532 % Sat."    "Ratio of Medians"    "Ratio of Means"    "Median of Ratios"    "Mean of Ratios"    "Ratios SD"    "Rgn Ratio"    "Rgn R²"    "F Pixels"    "B Pixels"    "Sum of Medians"    "Sum of Means"    "Log Ratio"    "F635 Median - B635"    "F532 Median - B532"    "F635 Mean - B635"    "F532 Mean - B532"    "Flags"
> 1    1    1    "demoA"    "demorep1"    1690    5730    110    183    181    42    59    62    25    100    98    0    276    270    48    64    65    13    100    100    0    0.585    0.592    0.570    0.576    1.357    0.591    0.782    80    621    336    328    -0.774    124    212    122    206    0
> 1    2    1    "demoB"    "demorep2 "    1910    5730    120    114    137    175    57    61    37    71    21    0    346    341    80    63    65    35    96    95    0    0.201    0.288    0.192    0.209    2.379    0.398    0.094    120    716    340    358    -2.312    57    283    80    278    0
> 1    3    1    "demoC"    "demorep3"    2110    5740    110    145    148    43    63    68    30    92    68    0    208    214    48    69    74    43    98    93    0    0.590    0.586    0.599    0.541    1.987    0.504    0.582    80    566    221    230    -0.761    82    139    85    145    0
> 1    4    1    "demoD"    "demorep4"    2300    5730    110    185    187    51    59    63    23    100    96    0    298    294    57    64    67    24    100    98    0    0.538    0.557    0.526    0.538    1.599    0.549    0.730    80    590    360    358    -0.893    126    234    128    230    0

the stacktrace from the raw data import is:

> net.sf.basedb.core.BaseException: Item not found: Reporter mismatch: The feature has reporter 'demorep2' whereas you have given 'demorep2 ' on line 6: 1 2 1 "demoB" "de...
> at net.sf.basedb.plugins.AbstractFlatFileImporter.doImport(AbstractFlatFileImporter.java:592)
> at net.sf.basedb.plugins.AbstractFlatFileImporter.run(AbstractFlatFileImporter.java:442)
> at net.sf.basedb.core.PluginExecutionRequest.invoke(PluginExecutionRequest.java:88)
> at net.sf.basedb.core.InternalJobQueue$JobRunner.run(InternalJobQueue.java:420)
> at java.lang.Thread.run(Thread.java:619)
> Caused by: net.sf.basedb.core.ItemNotFoundException: Item not found: Reporter mismatch: The feature has reporter 'demorep2' whereas you have given 'demorep2 '
> at net.sf.basedb.core.RawDataBatcher.doInsert(RawDataBatcher.java:390)
> at net.sf.basedb.core.RawDataBatcher.insert(RawDataBatcher.java:343)
> at net.sf.basedb.plugins.RawDataFlatFileImporter.handleData(RawDataFlatFileImporter.java:544)
> at net.sf.basedb.plugins.AbstractFlatFileImporter.doImport(AbstractFlatFileImporter.java:570)
> ... 4 more

I think BASE1 was more tolerant.

Leading and trailing blanks are trimmed from more or less all values before they are inserted into the database, which explains why you get "demorep2" instead of "demorep2 ". I guess we never thought of doing the same when checking whether a reporter (or anything else with a unique value) already exists in the database. I think several other places are affected by the same problem. I'll add this as a bug in our trac database. In the meantime, you can try a splitter regexp that also removes whitespace. Try something like \s*\t\s* instead of just \t. I have not tested this, but it might be enough to make it work.
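The mismatch described above, and the suggested splitter workaround, can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class `TrimmedLookupSketch`, its in-memory map, and the methods `insert`, `findRaw`, `findTrimmed` and `split` are hypothetical and are not taken from the BASE code base. The only thing borrowed from the ticket is the idea (trim on insert, but not on lookup) and the regexp \s*\t\s*.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from the BASE code base.
class TrimmedLookupSketch {

    // Stand-in for a table with a unique reporter ID column.
    private final Map<String, Integer> reporterIds = new HashMap<>();

    // Values are trimmed before they are stored, as described in the ticket.
    void insert(String externalId, int dbId) {
        reporterIds.put(externalId.trim(), dbId);
    }

    // Lookup that uses the raw value: this is the behaviour that fails.
    Integer findRaw(String externalId) {
        return reporterIds.get(externalId);
    }

    // Lookup that applies the same normalization as insert.
    Integer findTrimmed(String externalId) {
        return reporterIds.get(externalId.trim());
    }

    // Splits one data line with the given splitter regexp, keeping empty trailing columns.
    static List<String> split(String line, String splitterRegex) {
        return Arrays.asList(line.split(splitterRegex, -1));
    }

    public static void main(String[] args) {
        TrimmedLookupSketch sketch = new TrimmedLookupSketch();
        sketch.insert("demorep2 ", 2);                         // stored as "demorep2"

        System.out.println(sketch.findRaw("demorep2 "));       // null: the reported mismatch
        System.out.println(sketch.findTrimmed("demorep2 "));   // 2

        String line = "demoB\tdemorep2 \t1910";
        System.out.println(split(line, "\\t"));                // trailing blank is kept
        System.out.println(split(line, "\\s*\\t\\s*"));        // trailing blank is removed
    }
}
```

One caveat follows from how the regexp works: \s*\t\s* only removes whitespace that sits directly next to a tab. In a quoted value such as "demorep2 " the blank is inside the quotes, so the split alone leaves it in place. Whether the workaround helps therefore depends on when the quotes are stripped.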

Change History (8)

comment:1 by Jari Häkkinen, 18 years ago

Milestone: BASE 2.4 → BASE 2.3

Milestone BASE 2.4 deleted

comment:2 by Jari Häkkinen, 18 years ago

Description: modified (diff)

comment:3 by Nicklas Nordborg, 18 years ago

Description: modified (diff)
Priority: minor → major
Status: new → assigned

comment:4 by Nicklas Nordborg, 18 years ago

(In [3468]) References #574, #573 and #469. Created a simple test case with dirty data file.

comment:5 by Nicklas Nordborg, 18 years ago

Resolution: fixed
Status: assigned → closed

(In [3471]) Fixes #574 and #573.

comment:6 by Nicklas Nordborg, 17 years ago

(In [3560]) References #573. Must also trim annotation values or the filter will not display the selected value

comment:7 by Nicklas Nordborg, 17 years ago

Resolution: fixed
Status: closed → reopened

As it turns out, BASE 1 only removed trailing whitespace, not leading whitespace. The migration from the demo server fails because there are reporters with the IDs '25' and ' 25'.

What do we do about this, since only one of them can be migrated to BASE 2? Can we safely assume that it is a mistake and that both reporters are actually the same? In BASE 2 they would be imported as the same reporter.

  1. Ignore the second one and map all references to the first?
  2. Rename the second one? How do we make sure that the new name is unique? What if there are also '  25' (two leading spaces) or '   25' (three leading spaces)?
  3. Other ideas...

Since BASE 2 would map both entries to the same reporter if we exported from BASE 1 and then imported into BASE 2, I think the migration should work the same way. I am going for option 1 unless someone objects in the next hour or so...
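Option 1 can be sketched roughly as follows. This is not the migration code from BASE. The class `MigrationMergeSketch`, the method `mergeByTrimmedId`, and the use of integer database IDs are assumptions made for illustration. The sketch keeps the first reporter seen for each trimmed ID and maps every later reporter with the same trimmed ID to it.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of option 1; not taken from the BASE migration code.
class MigrationMergeSketch {

    // Input: BASE 1 database ID -> reporter ID as stored in BASE 1, in migration order.
    // Output: BASE 1 database ID -> database ID of the reporter that is kept.
    static Map<Integer, Integer> mergeByTrimmedId(Map<Integer, String> base1Reporters) {
        Map<String, Integer> firstSeen = new HashMap<>();
        Map<Integer, Integer> remap = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> entry : base1Reporters.entrySet()) {
            String trimmedId = entry.getValue().trim();
            // putIfAbsent returns the earlier database ID if this trimmed ID was already seen.
            Integer keeper = firstSeen.putIfAbsent(trimmedId, entry.getKey());
            remap.put(entry.getKey(), keeper == null ? entry.getKey() : keeper);
        }
        return remap;
    }

    public static void main(String[] args) {
        Map<Integer, String> reporters = new LinkedHashMap<>();
        reporters.put(1, "25");
        reporters.put(2, " 25");     // one leading space
        reporters.put(3, "   25");   // three leading spaces
        reporters.put(4, "26");

        // Prints {1=1, 2=1, 3=1, 4=4}: references to 2 and 3 are mapped to 1.
        System.out.println(mergeByTrimmedId(reporters));
    }
}
```

Because the key is the trimmed ID, any number of leading spaces collapses to the same reporter, so the uniqueness question raised under option 2 does not come up.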

comment:8 by Nicklas Nordborg, 17 years ago

Resolution: fixed
Status: reopened → closed

(In [3634]) Fixes #573: Trim whitespace when checking for unique values
