Forum OpenACS Q&A: what is the best practive for extracting xml like info from variables

Hi, I am getting pretty frustrated with this, so i figure i might was well ask the community since somebody is bound to know the best way to do this. I have a variable like this which is retrieved from the database via a particular view of the postal_addresses table:

<delivery_address>1234 Main Street</delivery_address>
<postal_code>12345</postal_code>
<municipality>MyCity</municipality>
<region>CA</region>
<country_code>US</country_code>
<additional_text>C/O Somebody Else</additional_text>

I would like to parse this variable and extract the individual items. I.e. i want to say, give me what is between <region> and </region> and put it into the variable called "region". The thing is, i want to be able to specify what variable i want to extract based on another variable. I.e. i want to be able to say:
set extract_this_item "region"
and have it extract whatever is set as "extract_this_item". I have tried this: but its not working...
regsub -all {\t<$extract_this_item>([^\t]*)<\/$extract_this_item>\t} $value {\1} extracted_item
My reason for doing this is that it will significantly reduce the number of trips to and from the database that will be needed for the contacts application i am working on - which should speed things up quite a bit... Also, i'm going to use this for other more complex implementations - not just this simple 5 line value. Thanks for the help
Matthew,

I can't understand why you would think this is necessary. It seems like the kind of data that could be modeled easily in a relational database.

That said, if you want to format it like xml, why not put an <?xml version="1.0"?> at the top and parse it with ns_xml or tDOM?

I agree - if you make it clean xml, you can use the ns_xml libraries, if you just want it to persist, how about using the -column_set flag on the select and you can store it as a TCL native hash.

Look at the example on this page:
https://openacs.org/doc/openacs-4-6-3/db-api-detailed.html

In the section called Placing Column Values in Arrays and Sets

db_foreach users_select "select first_names, last_name from users" -column_set columns {
    # Now $columns is an ns_set.
    doc_body_append "<li>"
    for { set i 0 } { $i < [ns_set size $columns] } { incr i } {
        doc_body_append "[ns_set key $columns $i] is [ns_set value $columns $i]. \n"
    }
}

I know it sounds like you have been second guessed twice, but without more insight its hard to give good advice. Was your real question - "What is wrong with my regex?" :)

the data can easily be modeled in the relational database. and that's how I had had it. But, if I do this it requires a number of extra trips to the database. If I could parse this I could reduce the number of trips to and from the DB to 3 - with a separate trip to the DB to get these variables it will require up to 15 trips for displaying just one page (which will substantially slow things down compared to the 3 trips). the thing is that most attributes only have one variable. or value, so my query has only the attribute_id and the value returned. For example, in my contacts database, if it asks what gender you are it returns either male of female. However, things like addresses are more complex, and  have multiple fields. thus if i get all the answers in a query where one column stays the attribute_id and the other the value i won't work for those - multi-value attributes... i don't know if this is making any sense to any of you ... i think i'll dicuss this in detail with the people i will be developing this contacts app with - it would take too explain the exact reasoning... other than to say that to make it work would improve performance substantially. I’ll take a look at ns_xml and tDom - i haven't used it yet and guess i should learn about them. I figured my regexp would work.

So I’ll take Alfred's suggestion: does anybody know of the correct this regexp?

The only reason I would see to use XML is if you need to store multiple addresses per 'record', in a nested format, or if you are gonna further pass that data outside to some other machine that wants XML data.

Here's what I mean:

<contact>

<home>
<address>
<street>1 Main St</street>
<city>Plainville</city>
<state>New York (or NY)</state>
</address>
<phone>
<number>123-4567</number>
<type>mobile</type>
</phone>
</home>

<work>
<address>
<street>1 Main St</street>
<city>Plainville</city>
<state>New York (or NY)</state>
</address>
<phone>
<number>123-4567</number>
<type>mobile</type>
</phone>
</work>

</contact>

I'm ignoring that tags don't need to balance in XML, (e.g. <number/>123-4567 could be valid )..

If it's nesting, the regex won't help with balanced tags or not, because you'll have two that match <street>xxx</street> or <street/> .. You really need to parse the tree, which is what ns_xml and similar tools are designed to do.

Matthew,

Sorry for not answering your real question :)

Tcl won't evaluate the variable inside of  {}. Try putting the regexp inside double quotes.

A fast way to test regular expressions is to use tclsh and experiment.

I really dislike the notion of pretending you're not using an RDBMS by using XML to hide structure from the datamodel.  Among other things, having much of the toolkit use the RDBMS as its designed to be used while other parts hide the datamodel as an XML string makes interoperability between pieces of the toolkit more difficult, increases the learning curve needed to grasp it, and the amount of trivia one needs to remember to hack in it ("gee, does this package stuff XML strings or model data directly?  I don't remember...")

Of course if this package isn't meant to ever be a candidate for inclusion in the toolkit distribution feel free to do whatever you want.

Having said this ... if you insist on this approach, you really should properly parse the document using ns_xml or (preferably, since we're switching) tDOM.  Generalized XML is not parsable by simple regexps, if at some point your XML description crosses that threshold you're going to have to switch to a proper parser anyway.

Is the data being stored as xml, or is it already in the database as individual fields? It seems unlikely that a single page would ask for 15 independent pieces of data to present to a user. Usually a query can be structured to bring all the data needed from the database in a few queries, then your tcl script can choose what to actually display.

More info on your problem would be helpful.

I have an example of exactly this sort of thing in my "Upgrading ACS to OpenACS" document.

I consider it to be hack-level, but for a one-time upgrade, it seems like it will do the work.

I'll ignore whether or not this is a good idea, or whether it would be better to do it with tDOM. If you're looking for the regexp, see:

http://rubick.com/openacs/acs_to_openacs

thanks for the replies - i'll scratch the idea as i certainly do want this included in the regular toolkit... and apart from this example i'm pretty sure it will be very useful to a number of openacs users. i guess i was over complicating things :)