Forum OpenACS Q&A: Would copy speed PG installs of reference data?

Some of the ref-* modules contain a lot of reference data, as SQL
INSERT statements.  A couple have 30000 or 40000 INSERTs.  As a
result, these modules take several minutes to load, and generate a LOT
of screen output.

I'm wondering whether storing the data as delimited text files and
using the psql copy command would speed things up significantly? It
seems to in my ad hoc testing... 42K records are copied into
us_zipcodes in under ten seconds from a file, but 42K INSERT INTOs
take 2 or 3 minutes (and this generates 42K lines of junk to the
screen in the process, unless you use the -q switch, which the
OpenACS4 package installer does not seem to do).  Even with -q it
takes 100 seconds,
ten times slower than the copy command.

Is there any reason not to switch to using copy for reference data
files with more than (say) 1000 records in them?  The equivalent bulk
load utility for Oracle could also be used, unless Oracle is plenty
fast enough already doing this sort of thing?

Collapse
Posted by Don Baccus on
The "insert" format's portable so we don't need to massage the data twice when we update it.  For instance during the so-called "energy crisis" last summer there was talk in California of going to double-daylight savings time.  That would've required new files to be generated (and screwed up all timezone data in the world which uses Los Angeles as a surrogate for the Pacific Timezone!)

So we're trading off speed of loading for easier maintainability and of course an easier porting task if we adopt a third RDBMS in the future.

I understand (and agree with) the reluctance to store the data in multiple forms.  The comma delimited data format can be generated on the fly from the SQL, at least for the case I tested with, just using grep and sed.  The timezone data is perhaps a little more awkward, but surely not impossible.

So there is (I think) no need to store the comma delimited form in CVS or tarballs or RPMs.  Just generate it at install time, before doing the copy or Oracle equivalent.  That way the editable, developer-preferred, format of the data remains the SQL statements as they are now.

It's not the end of the world if we can't do this, but if I automate
things further, so that non-core packages get installed automatically by my RPMs, the delay during the install will grow, and may worry some novice installers enough that they abort it thinking something "must have gone wrong".  I noticed that just a very few packages were responsible for the bulk of the package install time, so I took a quick look, and this seems like a useful way to shorten that delay.

Collapse
Posted by Don Baccus on
Generating the COPY format from the source is an interesting idea.  The tools should be available in any Unix environment ... how about Cygwin?  Do such tools come standard in the Cygwin environment?
sed and grep both exist in a standard Cygwin install, so that should be fine.  I don't know if there would be any line ending issues (LF vs CRLF) between the Unix psql and the Cygwin one, though?
Collapse
Posted by Don Baccus on
I know for a fact that has bitten people in the past ...