Forum OpenACS Q&A: Insert Performance Issues - PG 7.2.3/OpenACS 4.5/perl/dbi

Greetings,

I am currently developing a data aggregation system running on PostgreSQL 7.2.3 and OpenACS 4.5 that takes input from external sources (subscription business databases).  I decided to go with OpenACS because of its user interface capabilities, which I felt would come in handy down the line.

At this point, I have:

a inherits from acs_objects
x inherits from a
y inherits from a

using the example of the Notes application.

My issue is that when I do big data acquisitions, I drop the indices and vacuum the database (with no switches), and it still takes 6-7 seconds to do an add (i.e. x__new), which I think can be improved upon.

I am currently pulling in data using perl/dbi on a RedHat 8 machine, P3/600 with 384MB and an old 10GB 5400rpm drive.

I was thinking...

1. When I did the install, I was only running with 128MB RAM, it is now 384MB.  Does Postgres require reconfiguration of some sort to take advantage of this?  I skimmed the PostgreSQL admin manual and didn't really find anything interesting about caching.

2. When I do a 'ps', it tells me that postgres is using roughly 82% of the processor.  Is it time to upgrade the CPU?  Is it time to go dual processor, possibly?

3. The hard drive is older and I just ordered a 7200rpm 80GB drive.  Will that help me significantly?

4. x__new calls a__new, which calls acs_object__new, which is fairly expensive, I would imagine.  Would it be wise to consider using PostgreSQL inheritance instead?

Anybody have any experience with this type of thing?

Thanks.

Here are my suggestions:

1.  Increase the WAL files setting.  I think the default is 0; I usually increase it to 3 or 4.

2.  Increase sort_mem from 512 to 2048; this means "use up to 2MB of RAM for in-memory sorting before dumping the data to disk and using the slower disk-based sorting routine".  512KB is not enough.

3. Check the fsync parameter - this varies by platform.  Turning it off is less safe but faster.

4.  Increase the shared_buffers setting.

5.  Make sure you have some checkpoint_segments (a number greater than 0).

Remember to restart the server after each change and test.  If you use the pg_ctl program, check the actual logfile: it has a bug where, if the server exits because of a misconfigured .conf file, pg_ctl will still report success.
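Taken together, the suggestions above amount to a postgresql.conf fragment along these lines.  The exact numbers are illustrative assumptions for a 7.2-era box with a few hundred MB of RAM, not recommendations - tune and re-test after each change:

```
# postgresql.conf - illustrative values only (PG 7.2 era)
wal_files = 4             # pre-allocated WAL files (default is 0)
sort_mem = 2048           # KB per sort before spilling to disk-based sort
fsync = true              # leave on unless you can afford to lose data
shared_buffers = 4096     # 8KB pages, so 4096 pages = 32MB shared cache
checkpoint_segments = 3   # must be greater than 0
```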

Hi John - One good question to ask is why you are making every entry an acs_object.  I'm writing a similar application that also uses perl/dbi to load very large amounts of data into a couple of tables, and so far I have no big performance issues.  I would suggest that you may need to rethink your data model; from what you relay here, I think that's where you'll get your best performance gains.  There are some good threads on the subject of the abuse of acs_objects - just do a search via google if you have a problem bringing them up via the search engine.  Unfortunately, I don't have the links handy...
Sorry - forgot to mention the following.  Consider upgrading to 4.6.2 on PG 7.2.4, which is the current best stable oacs version.  If you are running 4.5, you are missing out on some excellent performance enhancements made by Don and other members of the core development team.  I just completed a full test of 4.6.2/7.2.4 and it works beautifully.  The next step is to bring the platform to compatibility with PG 7.3.2.
Don't turn off fsync if you value your data!  It's very necessary, and if you turn it off you're giving up the crash-safety guarantees that are a large part of the point of using a proper RDBMS.

With 384MB of memory you can increase sort_mem well above 2MB even, though that suggestion is already much better than the default.

But the shared buffer pool must be greatly increased.  It's mentioned above, but its importance isn't stressed: by default Postgres caches only 64 blocks in shared memory, and with that much RAM you should increase shared_buffers to a much larger value.

Randy's suggestions are good, too - don't make things objects unless they need to be, which normally means you need the auditing information built into objects (creation time, user, etc.), or you need to explicitly manage permissions on the object, and so on.

We don't use Postgres inheritance because it is, in essence, broken.  There's a document lying around in file storage that explains why; the reasons are many.

Wow, thanks for all the quick replies.

After thinking about it some, I think that at least one of my data types doesn't need to be an acs_object.  I appreciate the warning regarding inheritance.

As I was trying to build a new data model to shift everything over, I decided to look up the acs_object in the schema browser.

The excerpt:

---
create sequence t_acs_object_id_seq;
create view acs_object_id_seq as
select nextval('t_acs_object_id_seq') as nextval;

create table acs_objects (
    object_id        integer not null
                constraint acs_objects_pk primary key
...
---

Indicates that a sequence is created by hand when, supposedly, it is possible to do this using the "serial" or "bigserial" datatype.  Do the serial/bigserial datatypes work?

I tried to create a table:

create table test (
  a      serial,
  b      integer
);

and then attempted an

insert into test values (null, 2);

which fails.  I had been most familiar with mysql and its "integer auto_increment" specification, where that generally works.

Can anybody briefly describe why using serial/bigserial is a bad idea (ie. like using PG inheritance?)

Serial's not a bad thing, but Oracle doesn't have a serial datatype; it provides sequences (PG's sequences are modelled after Oracle's).

We use sequences in our PG code mostly because we support both Oracle and PG, and using sequences lets us share many more queries between the two than we could if we used the serial datatype in PG.  Our data models for Oracle and PG are more closely alike, too, as a consequence of this decision.

Sequences are also more flexible in practice, and PG's serial datatype is implemented using sequences.
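As an illustration of that sharing (a sketch, not code lifted from the toolkit): Oracle reads a sequence through a pseudo-column on its dummy table, and the view shown in the excerpt earlier gives PG a query of nearly the same shape:

```sql
-- Oracle: read the next value via the sequence pseudo-column
select acs_object_id_seq.nextval from dual;

-- PostgreSQL: the acs_object_id_seq view wraps nextval() on the real
-- sequence, so the query keeps the same select-list shape
select nextval from acs_object_id_seq;
```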

In regard to your example, it has been so long since I've experimented with the serial datatype in PG that I don't remember the details of how it works ...
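For what it's worth, a serial column is just shorthand for an integer with a NOT NULL constraint and a default of nextval() on an implicitly created sequence, which is why the insert in the earlier post fails - an explicit NULL bypasses the default and then violates NOT NULL.  A sketch:

```sql
create table test (
  a  serial,    -- roughly: integer not null default nextval('test_a_seq')
  b  integer
);

-- Omit the serial column (or write DEFAULT) and the sequence fills it in;
-- "insert into test values (null, 2)" fails on the NOT NULL constraint.
insert into test (b) values (2);
select currval('test_a_seq');  -- the id just assigned in this session
```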

This optimization stuff is good - enough detail to put into the docs.  Could someone take a minute to outline a few ways to test success of optimization?  Obviously if the database server crashes on startup you failed.  But what are some easy stress tests?  What instrumentation is needed to monitor variables?  And what values for the variables are good or bad?  (I.e., if I change config param X and run query Y, then while watching variable Z I should see it stay below W.)
In an attempt to take the advice of Randy, I am trying to wean myself off of the motherly love of acs_object.  There were some things that were extremely nice about acs_object that I'm trying to replicate, most importantly creating an auto-incrementing index for a table and being able to return that index from a new function.

I've created a table:

create table test2 (
  i        serial,
  txt      varchar(20)
);

I'm trying to build a new function, "test2__new" which will insert a new string into the table and automatically give it a new i value.

I have:

create function test__new (varchar)
return integer as '
declare
  new__text    alias for $1;
begin
  select test2_i_seq.nextval
  into v_index_id;

insert into test2
  (i, txt)
values
  (v_index_id, new__text);

return v_index_id;
end; ' language 'plpgsql';

---

Whenever I try to load this new function, it always gives me "parse error at or near "return"".  I am not sure what I'm missing.

Also, I'm trying to find some code that exports an SQL query to Excel/CSV via the OpenACS interface.  I did some searching and found that it is supposed to be a feature of the simple-survey module of dotLRN, but after downloading dotLRN 1.0 I haven't been able to isolate it.  Can anyone suggest where I might find this type of functionality?

Thanks.

John

Declare "i" to be an integer, not serial, in your test2 table, since you're grabbing the sequence value by hand with nextval.

You need to declare v_index_id in your function, like this:

v_index_id test2.i%TYPE;

(picks up the type from test2's i column automatically)

Other than that I don't see any error in your function but I could be wrong, of course ...
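Putting those fixes together - plus two smaller ones the thread didn't call out: the header needs "returns" rather than "return" (which is what triggers the parse error), and PG's nextval('seq') call replaces Oracle's seq.nextval - a corrected sketch might look like the following.  (The function name is also made to match the test2__new the post describes; the original created test__new.)

```sql
create function test2__new (varchar)
returns integer as '
declare
  new__text   alias for $1;
  v_index_id  test2.i%TYPE;   -- picks up the type of test2.i
begin
  -- PG syntax: nextval(''seqname''), not seqname.nextval;
  -- quotes are doubled because the body is a quoted string literal
  select nextval(''test2_i_seq'') into v_index_id;

  insert into test2 (i, txt)
  values (v_index_id, new__text);

  return v_index_id;
end;' language 'plpgsql';
```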