Forum OpenACS Q&A: referencing a specific db pool

Posted by Mark Aufflick on
Hi all,

I have a data processing function that runs at the end of every month. It has a single level of nested queries (i.e. a db_foreach with multiple queries inside it).

I want to set autocommit to off on the second pool to speed things up (I believe it will?).

That's the easy part - I can set it to off on every loop iteration if I like.

The hard part is committing (which I obviously have to do) and setting autocommit back to on (or will that just go away when the session closes?).

Once I'm out of the db_foreach, how do I run a db_dml against a specific pool?

I guess I could do a:

# Hack: the outer db_foreach holds a handle from the first pool, so the
# nested db_dml gets handed a connection from the second pool.
db_foreach dummy "select 1" {
    db_dml foo ""
}

but that makes me feel dirty...

Posted by Mark Aufflick on
OK - I avoided the multiple-pools issue by slurping the whole loop query into a multirow and using the useful (but cryptic) template::util::multirow_foreach instead.

But SET AUTOCOMMIT is not supported by Postgres anymore...

Which kills my idea, since I want to continue even if some rows fail to insert due to constraint violations (I capture each failure inside a catch and log it to an issues_log table).
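
The catch-and-log bit looks roughly like this (a sketch - the insert and the issues_log columns here are simplified/illustrative):

if { [catch {
    db_dml insert_row "
        insert into eom_summary (order_id, amount)
        values (:order_id, :amount)"
} errmsg] } {
    # A constraint violation (or any other error) lands here:
    # log it and carry on with the next row.
    db_dml log_issue "
        insert into issues_log (order_id, message)
        values (:order_id, :errmsg)"
}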

But inside a transaction, any such failure kills the whole transaction.

Does anyone have any smart ideas?

Posted by Tom Jackson on

Have you tried this? An error shouldn't automatically roll back the transaction, I don't think. Can't you catch the error inside the transaction and continue - or is that wrong?

Posted by Mark Aufflick on
Here's what you can do at run-time these days:

set commit_delay to 1000 : wait up to 1000 microseconds before calling fsync, in case someone else commits in that window and we can fsync together
set commit_siblings to 1 : but only delay if at least one other transaction is open at the time (you can't set this to 0)

So, to ensure at least one other transaction is always running, I put the whole darn thing back inside the db_foreach and wrapped the whole shebang inside a db_transaction.

I put the SET commands inside the db_foreach to make sure they hit the same db connection (pool/session) as the inserts.

Then I need to restore the previous values, because they stick to the pooled connection and will be seen by later requests that reuse it, and I like to sleep at night.
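
Roughly, the shape of the thing (a sketch - the query names and the source query are made up, and I haven't shown the per-row munge/insert/catch):

db_transaction {
    db_foreach eom_rows "select order_id, amount from sales_raw" {
        # Nested db calls inside the db_foreach are served from the
        # second pool, so these SETs hit the same backend as the
        # inserts do (re-running them each iteration is redundant
        # but harmless).
        db_dml tune_delay    "set commit_delay to 1000"
        db_dml tune_siblings "set commit_siblings to 1"

        # ... munge the row and insert, catching and logging failures ...
    }
}
# The settings still have to be restored on that same backend afterwards,
# which is the original which-pool problem all over again.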

I would rather do this in a dedicated pool, but it's a lot of overhead to keep a pool around just to be used once a month. Let's hope we don't crash - or if we do, nothing important should be using the second pool...

Posted by Mark Aufflick on
Tom: we can only catch the Tcl exception raised by db_string, but it's the Postgres driver that decides the transaction has failed.

I don't believe there is a way to "catch" an exception in SQL/Postgres.

Posted by Jeff Davis on
Tom, if you get an SQL error in a transaction on Postgres, you cannot continue. This is possible in Oracle, but Postgres does not support those semantics.

You could catch Tcl errors and continue with the transaction, but that's not what's going on here.

Of course I could be wrong here (and I wish I were, since what Mark is trying to do is an immensely useful idiom - for mapping tables, for example)...

Posted by Mark Aufflick on
Having the settings roughly as above halved the execution time to select, munge, and insert ~1200 rows, from 4 minutes to 2.

Playing with the delay figure (even by factors of 10 or 100) made no noticeable difference; it seems that as long as you wait a bit, you get the improvement.

Posted by Mark Aufflick on
Jeff: I thought I could do it somewhere - must be an old Oracle memory!

The old Postgres SET AUTOCOMMIT TO 'off' would have had the same effect - I really don't care about rollback ability, I just don't want to fsync on every insert since I'll be doing ~8000 of them!

Of course I could manually check all the constraints on the input data and filter the bad rows out first, but this way is so elegant, and it ensures there is no mismatch in the way the business rules are enforced.

The sync options I came up with above are still significantly slower than running inside a transaction, so I'm still very open to ideas. In fact, sometimes they don't speed things up much at all - I suspect the more loaded the database, the fewer fsyncs will occur in this case.

Posted by Tom Jackson on

Hmm, that is good to know. I guess I somehow misread the comment in the pg driver:

        /* We'll kick out with an NS_ERROR if we're in a transaction.
         * The queries within the transaction up to this point were
         * rolled back when the transaction crashed or closed itself
         * at the request of the postmaster.  If we were to allow the
         * rest of the transaction to continue, you'd blow transaction
         * semantics, i.e. the first portion of the transaction would've
         * rolled back and the rest of the transaction would finish its
         * inserts or whatever.  Not good!   So we return an error.  If
         * the programmer's catching transaction errors and rolling back
         * properly, there will be no problem - the rollback will be
         * flagged as occuring outside a transaction but there's no
         * harm in that.
         *
         * If the programmer's started a transaction with no "catch",
         * you'll find no sympathy on my part.
         */

It seems a catch only makes things worse. Why have a catch if the result is that page processing isn't aborted and the transaction handle might be used again, which will just lead to another error?

The original question, however, seems to suggest that each insert should be in a separate transaction anyway, since the desired result is precisely 'not' to tie one query to another with transaction semantics. Should a transaction block be used just because the result is a faster insert?

If you are doing the same insert over and over, would it be worth writing a special function to do the job, then calling that:

select my_insert(...)
Posted by Mark Aufflick on
That comment is interesting.

You are right that it's a bit bogus to use a transaction just for performance - but I'm talking 10 minutes down to 4 for a batch end-of-month process.

The PG documentation also talks about things being faster inside transactions due to this very issue.

And the only reason I tried it is that there is no other way (that I can see) to turn off autocommit mode.

If you can't turn it off, is it a mode or a state of being? ;)

The pl/pgsql function won't help, since PG will call an fsync straight after it as well - and that is the only speed issue, not query planning etc.

Ah well. Now I'm going to try parallelizing it (I write to two tables) to see if that speeds things up...

Posted by Tom Jackson on

This is something that is just crying out to be done in pl. I moved a select-heavy procedure from Tcl to PL/SQL and saw the time go from 8 hours to 8 minutes on a process we wanted to run every two hours, and it also greatly increased the functionality. It really lowers the load on the server, too.

But there isn't anything really bad about a ten-minute job if it only runs once a month and you can only get it down to four minutes anyway.

Posted by Mark Aufflick on
Yes - but in Postgres a pl/pgsql function implicitly runs in a transaction, thus hitting the problems mentioned above...
Posted by Jeff Davis on
If in the end you are just doing a collection of inserts you might consider doing it all with a COPY command (you can look at db_load_sql_data and ref-timezones for examples of loading data with copy).
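
A rough sketch of the idea (assuming AOLserver and the Postgres backend can see the same filesystem and that the db user is allowed to COPY from a file; the table and path names are just illustrative):

# Dump the pre-munged rows as tab-separated text...
set f [open /tmp/eom-load.dat w]
foreach row $rows {
    puts $f [join $row \t]
}
close $f

# ...then load them in one statement.  COPY ... FROM 'file' is read by
# the Postgres backend itself, hence the filesystem/permission caveats.
db_dml bulk_load "copy eom_summary from '/tmp/eom-load.dat'"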
Posted by Tom Jackson on

I think what I did was set up a controller function and a separate worker function that ran each operation. It had to be done this way for similar reasons: user data might crash the process, but that way only a single record would be affected. So in the worker function, catch the exception and return 0 or 1 based on success/failure. On failure, run another function to do your logging.

This is always the case in pl: even if you can catch an exception, you cannot continue in the body of the function - you have to clean up and return.

Moving stuff around inside the database is always going to be much quicker than going back and forth to the webserver for control.

Posted by Mark Aufflick on
Ah - so you're saying that if I have a pl/pgsql function that iteratively calls another pl/pgsql function, and a query inside the inner function fails (and rolls back the query/queries in that function), the OUTER function continues?
Posted by Tom Jackson on

I know it works that way in PL/SQL, and pl/pgsql is very similar: you add an exception block to the bottom of the function, something like:

   EXCEPTION
     WHEN OTHERS
     THEN
        return 1;
   END;

The WHEN OTHERS clause handles all exceptions not otherwise handled. If you expect certain types of errors and want to distinguish between them, you need to list them first.

If you have several dml statements that need to be executed, and you want as many of them to succeed as possible, you can follow the advice of _Oracle PL/SQL Programming_ (pages 266-267) and put each statement in its own block:

 BEGIN
  BEGIN
   DELETE FROM employee WHERE ...;
  EXCEPTION
   WHEN OTHERS THEN NULL;
  END;

  BEGIN 
   UPDATE company SET ...;
  EXCEPTION
   WHEN OTHERS THEN NULL;
  END;
 END;

Using this method, you could do away with the second function, and use a sub-block to do the work:


 LOOP
   BEGIN
     ...;
   EXCEPTION
    WHEN OTHERS THEN NULL;
   END;
 END LOOP;

This works in PL/SQL; you might try a simple example in pl/pgsql.

Posted by Mark Aufflick on
Postgres doesn't support subtransactions, so any error anywhere between the start and end of a transaction will roll right back to the start.

Postgres also has no exception handling - you can log the failure or do something similar when the exception occurs, but you cannot prevent the rollback.

Now that I have added the second part of the end-of-month process (simpler, but more data), processing time has blown out to 30 minutes - and the amount of data will only grow as sales increase.

I am going to try parallelizing the two, which should reduce the overall execution time.

Or I guess I just abandon my idea and manually enforce the business rules, as well as letting the constraints do their thing and catching the errors.

Such a nice idea though...

Posted by Jeff Davis on
There was a post by Markus Schaber on the pgsql-performance list benchmarking various insert methods. The very short answer is that COPY is the fastest by a large margin and individual inserts are very slow:
Here are the results of some benchmarks we did inserting 30k rows into a
table, using "time psql -f blubb.sql -q dbname":

File             kingfisher (time, rate)     skate (time, rate)
30kinsert.sql    39.359s    762 r/s          335.024s     90 r/s
30kcommonce.sql  11.402s   2631 r/s            7.086s   4233 r/s
30kwithids.sql   10.138s   2959 r/s            6.936s   4325 r/s
30kprepare.sql    8.173s   3670 r/s            5.189s   5781 r/s
30kdump.sql       1.286s  23328 r/s            0.785s  38216 r/s
30kdumpseq.sql    1.498s  20026 r/s            0.927s  32362 r/s


Kingfisher is the single processor machine I mentioned yesterday night,
skate the SMP machine.

The table has five columns (bigint, bigint, double, double, timestamp
without time zone). The first of them has a "default nextval('sequence'
::text)" definition, and there are no further constraints or indices.

The 30kinsert.sql uses simple insert statements with autocommit on, and
we insert all but the first column which is filled by the default
sequence. With this test, kingfisher seems to have an unrealistically
high value of commits (syncs) per second (see what I wrote yesterday) [1],
skate has a more realistic value.

30kcommonce.sql, as suspected, gives a rather high boost by
encapsulating all into a single commit statement.

30kwithids gives a small boost by inserting pre-calculated sequence
numbers, so it seems not worth the effort to move this logic into the
application.

30kprepare prepares the insert statement, and then issues 30k EXECUTE
statements within one transaction, the speedup is noticeable.

30kdump simply inserts the 30k rows as a dump via COPY FROM STDIN. (as
with 30kwithids, the first column is contained in the insert data, so
the default value sequence is not used). Naturally, this is by far the
fastest method.

30kdumpseq.sql uses COPY, too, but omits the first column and such
utilizes the sequence generation again. This gives a noticeable 15%
slowdown, but seems to be still fast enough for our purposes. Sadly, it
is not available within jdbc.

Thanks for your patience.

Footnotes: 
[1] We suspect this to be some strange interaction between ide,
cryptoloop and ext3fs, so that the sync() call somehow does not really
wait for the data to be physically written to the disk. (I really can't
imagine a crypto-looped notebook harddisk to do more syncs/second than a
SCSI-Based RAID in a server machine. We did some small benches on the
sync() / fsync() calls that seem to prove this conclusion.)
-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich