Forum OpenACS Q&A: comparing two columns of different tables...

1: comparing two columns of different tables...

Posted by David Kuczek on 11/13/02 08:04 PM

I have two HUGE tables and would like to automatically merge table_2 with table_1:

table_1 -- name_1, many valuable_columns_1, one_valueable_column_2

table_2 -- name_2, one_valuable_column, one_valueable_column_2

The problem is that name_1 and name_2 are sometimes equal but mostly differ just a little bit.

Does anyone have an idea how to automatically update table_1 with the most probable value of one_valuable_column_2?

2: Re: comparing two columns of different tables... (response to 1)

Posted by Jeff Davis on 11/13/02 08:10 PM

With magic?

Seriously, without more information I think it would not possible to answer this question. Some things to consider would be soundex or other means of making things more canonical (like case insensitive, whitespace removed, vowels removed, common words removed etc)

What are the keys?

3: Re: comparing two columns of different tables... (response to 1)

Posted by David Kuczek on 11/13/02 09:53 PM

Magic sounds good.... Harry Potter II will be out soon 😉

I thought about some kind of search like the old bboard search. I would copy the result with the highest percentage to table_1. Maybe a combination would be best. Taking out special characters etc and then comparing name_2 with name_1 via search...

The problem in a little more detail:

table_1 is my main table. I will add a new column to table_1 which will hold the valuable information from table_2 (one_valueable_column_2). After I matched the tables I will drop table_2 and have the information from table_2 in the new column of table_1.

By the way... a little bit off topic but I've just had a problem with a regexp:

regexp -nocase {Information</b></td>\s*<td>(.*?)</td>} $html_to_search match information

This regexp doesn't search to then *next* </td>, but to the *last* </td> of the page... How can I regexp to the *next* </td>???

4: Re: comparing two columns of different tables... (response to 1)

Posted by Michael Hinds on 11/14/02 01:24 PM

On the regexp, you're talking about greedy vs. non-greedy. Have a look at this.

5: Re: comparing two columns of different tables... (response to 3)

Posted by Jeff Davis on 11/14/02 01:35 PM

If you put a ? after \s* it should work right. I don't understand why it matters that \s* be non-greedy here but it does. Maybe someone can actually explain it to me...

% set x { 
Information</b></td>
<td>Blah blah Blah</td>
<td>more blah</td>
}
% regexp -nocase {nformation</b></td>\s*<td>(.*?)</td>} $x match foo
1
% set foo
Blah blah Blah</td>
<td>more blah
% regexp -nocase {nformation</b></td>\s*?<td>(.*?)</td>} $x match foo
1
% set foo
Blah blah Blah
%

6: Re: comparing two columns of different tables... (response to 1)

Posted by Peter Marklund on 11/14/02 02:06 PM

Jeff,
this seems totally random, could it be that non-greedy regexps in tcl are not quite reliable. I remember having a problem getting them to work once and couldn't track down what the reason was. Seems like the non-greediness of the white-space quantifier carries over on the quantifier between the tags. I experimented and noticed that it's ok to have a non-greedy quantifier after the tag, but not before, regardless of whether it's whitespace or not.

7: Re: comparing two columns of different tables... (response to 1)

Posted by Dan Wickstrom on 11/14/02 03:28 PM

Another way to get this is to just match to the next '<':

% regexp -nocase {nformation</b></td>\s*<td>([^<]+)</td>} $x match foo
1
% set foo
Blah blah Blah

Generally, I try to avoid re's that have .* or .+ expressions in them, since it really slows down the evaluation. In this case, since you're doing a non-greedy evaluation, it shouldn't be a problem, but the interaction between greedy and non-greedy expressions can be tricky.