Forum OpenACS Q&A: What is the best way to store long texts in PostgreSQL?

1: What is the best way to store long texts in PostgreSQL?

Posted by Peter Marklund on 09/05/02 10:54 AM

So here is another question in my "What is the best way" series. I
need to store patches for the Bug Tracker, i.e. I need to store a
potentially large chunk of text (although in my experience patches are
usually not too big).

As a first simple solution I stuff the patch file into a text column.
I tested this approach with a huge patch file (36,000 lines, 11 MB)
and it worked ok although it took quite a while to upload. Doing
select * from the table in psql caused a segmentation fault though...

What do you guys think - should I be using Don's storage mechanism
provided by acs-kernel/sql/postgresql/lobs.sql for patches?

I experimented with using File Storage for patches, but that approach
is a little involved. It causes a dependency on the File Storage app,
requires a root folder and also, there is no TCL proc for adding a
file to file storage (although one could easily be extracted from the
file-add-2.tcl file in file storage, I noticed that attachments has
duplicated this file, I don't know why).

2: Response to What is the best way to store long texts in PostgreSQL? (response to 1)

Posted by Jun Yamog on 09/05/02 12:30 PM

Hi Peter,

I think the best way to make a content type on the Content Repository (CR). Have this content type store the data into either file or lob. I am not yet familiar with Bug Tracker but I think it would best if Bug Tracker made use of CR to store the patches and even the description etc. File-Storage, ETP, CMS are just UI and extensions of CR.

In a nutshell here is how I will approach the problem:

- create a bugtacker_item and bugtracker_patch content type. See content_type__create

- for each bug tracker instance create a cr_folder and register the custom content types into this folder. This is optional since you may not need a folder paradigm anyway. You can also make use of your own table and point the bugtracker_item parent_id to its primary key.

- A bug will be stored in bugtracker_item and the patch files on the bugtracker_patch content type. The bugtracker_patch will point its parent_id to the bugtracker_item.

Content Repository is part of the core anyway so this should not add any dependency to the bug tracker. If the bugtracker is more or less finished you can just create a bugtracker_patch content type and store the patch files on the CR. Store them in either file or lob storage type. And point the parent_id of the bugtracker_patch to the object_id of the bug entry. I hope this helps.

For a sample code of this I put in a feature on ticket-tracker lite to attach files. The files are stored in CR. Of course this code was made about a year ago. So some approaches are not done properly. If you still want to make use of file-storage you may want to look at the attachment package at dotLRN. Attachment makes us of the file storage as its backend.

3: Response to What is the best way to store long texts in PostgreSQL? (response to 1)

Posted by Don Baccus on 09/05/02 04:01 PM

Adding to what Jun said file storage is just a relatively lightweight wrapper around the CR. Check out cr_import_content in acs-content-repository/tcl/revision-procs.tcl. It only handles stuffing in file content into the file system or a large object (LOB in Oracle, PG driver hack in PG). It is assumed the file content is of arbitrary type so can't be stored directly as text or CLOB.

There is no corresponding utility to import a known text file or form entry as CLOB or text (the CR allows for storing text directly in PG). There should be, though. The cr_import_content utility itself should handle known text files (i.e. pass in a switch that tells it the file's known to be text). And an additional utility should be written to take a Tcl string and stuff it in the CR.

In general I'd say store known text as text unless you store items in the file system rather than the db. In Oracle, it's still stored as a LOB. In PG storing known text in the revision's text column should be considerably more efficient than using the driver LOB hack. The LOB hack uuencodes content (which is considered to be arbitrary binary content) on input and uudecodes on output, which is relatively expensive. The encoded data is 33% longer than the original, too. It does this because there's no general PG API for inserting arbitrary binary data outside of the rather strange large object implementation which had lots of problems when we started this project (at the time you couldn't make a consistent dump of your db if you used PG large objects, sort of a killer and that was just one problem).

As far as the psql seg fault ... please report that to the PG folks!

I'd recommend making a folder for the bug tracker instance (perhaps every project in the bugtracker instance?). Why? We've been talking about hooking webdav up to the CR, for one thing, and this would allow for remote browsing and viewing and (most importantly) editing of content outside of a browser.

The CR is admittedly awkward to use but isn't that bad once you play with it some. General routines like cr_import_content help a lot.

In 4.7 we should remove fs_root_folders and move the package id into cr_folders. If not objects themselves. The question "which package created this folder?" or "which package created this object?" are increasingly common ones (notifications needs the latter, for instance) ...

In 4.7 I also want to expose a bunch of the existing CMS "build content" Tcl API at a higher level and Lars has talked about even more ambitious efforts to make the CR (and objects in general) easier to use.