Forum OpenACS Q&A: comparing two columns of different tables...

I have two HUGE tables and would like to automatically merge table_2 with table_1:

table_1 -- name_1, many valuable_columns_1, one_valueable_column_2

table_2 -- name_2, one_valuable_column, one_valueable_column_2

The problem is that name_1 and name_2 are sometimes equal but mostly differ just a little bit.

Does anyone have an idea how to automatically update table_1 with the most probable value of one_valuable_column_2?

Collapse
Posted by Jeff Davis on
With magic?

Seriously, without more information I think it would not possible to answer this question. Some things to consider would be soundex or other means of making things more canonical (like case insensitive, whitespace removed, vowels removed, common words removed etc)

What are the keys?

Collapse
Posted by David Kuczek on
Magic sounds good.... Harry Potter II will be out soon 😉

I thought about some kind of search like the old bboard search. I would copy the result with the highest percentage to table_1. Maybe a combination would be best. Taking out special characters etc and then comparing name_2 with name_1 via search...

The problem in a little more detail:

table_1 is my main table. I will add a new column to table_1 which will hold the valuable information from table_2 (one_valueable_column_2). After I matched the tables I will drop table_2 and have the information from table_2 in the new column of table_1.

By the way... a little bit off topic but I've just had a problem with a regexp:

regexp -nocase {Information</b></td>\s*<td>(.*?)</td>} $html_to_search match information

This regexp doesn't search to then *next* </td>, but to the *last* </td> of the page... How can I regexp to the *next* </td>???

Collapse
Posted by Michael Hinds on
On the regexp, you're talking about greedy vs. non-greedy. Have a look at this.
Collapse
Posted by Jeff Davis on
If you put a ? after \s* it should work right. I don't understand why it matters that \s* be non-greedy here but it does. Maybe someone can actually explain it to me...
% set x { 
Information</b></td>
<td>Blah blah Blah</td>
<td>more blah</td>
}
% regexp -nocase {nformation</b></td>\s*<td>(.*?)</td>} $x match foo
1
% set foo
Blah blah Blah</td>
<td>more blah
% regexp -nocase {nformation</b></td>\s*?<td>(.*?)</td>} $x match foo
1
% set foo
Blah blah Blah
% 
Collapse
Posted by Peter Marklund on
Jeff,
this seems totally random, could it be that non-greedy regexps in tcl are not quite reliable. I remember having a problem getting them to work once and couldn't track down what the reason was. Seems like the non-greediness of the white-space quantifier carries over on the quantifier between the tags. I experimented and noticed that it's ok to have a non-greedy quantifier after the tag, but not before, regardless of whether it's whitespace or not.
Collapse
Posted by Dan Wickstrom on
Another way to get this is to just match to the next '<':

% regexp -nocase {nformation</b></td>\s*<td>([^<]+)</td>} $x match foo
1
% set foo
Blah blah Blah

Generally, I try to avoid re's that have .* or .+ expressions in them, since it really slows down the evaluation. In this case, since you're doing a non-greedy evaluation, it shouldn't be a problem, but the interaction between greedy and non-greedy expressions can be tricky.