Forum OpenACS Q&A: Re: referencing a specific db pool

Posted by Jeff Davis on
Tom, if you have an SQL error in a transaction on postgres, you cannot continue. That is possible in oracle, but postgres does not support those semantics.

You could catch tcl errors and continue with the transaction but that's not what's going on here.

Of course I could be wrong here (and I wish I were, since what Mark is trying to do is an immensely useful idiom, for mapping tables for example)...
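
At the psql level it looks roughly like this (foo is just a throwaway example table):

   -- foo is a made-up table for illustration
   CREATE TABLE foo (id integer PRIMARY KEY);

   BEGIN;
   INSERT INTO foo (id) VALUES (1);
   INSERT INTO foo (id) VALUES (1);   -- fails: duplicate key
   INSERT INTO foo (id) VALUES (2);   -- ERROR: current transaction is aborted,
                                      -- commands ignored until end of transaction block
   COMMIT;                            -- postgres actually performs a ROLLBACK here

Once one statement errors out, postgres refuses everything else until the transaction ends, so there is nothing left for a catch to resume inside that transaction.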

Posted by Tom Jackson on

Hmm, that is good to know. I guess I somehow didn't understand the comment in the pg driver:

        /* We'll kick out with an NS_ERROR if we're in a transaction.
         * The queries within the transaction up to this point were
         * rolled back when the transaction crashed or closed itself
         * at the request of the postmaster.  If we were to allow the
         * rest of the transaction to continue, you'd blow transaction
         * semantics, i.e. the first portion of the transaction would've
         * rolled back and the rest of the transaction would finish its
         * inserts or whatever.  Not good!   So we return an error.  If
         * the programmer's catching transaction errors and rolling back
         * properly, there will be no problem - the rollback will be
         * flagged as occurring outside a transaction but there's no
         * harm in that.
         *
         * If the programmer's started a transaction with no "catch",
         * you'll find no sympathy on my part.
         */

It seems a catch only makes things worse. Why have a catch if the result is that page processing isn't aborted and the transaction handle might be used again, which will just lead to another error?

The original question, however, seems to suggest that each insert should be in a separate transaction anyway, since the desired result is to 'not' tie one query to another using the semantics of a transaction. Should a transaction block be used just because the result is a faster insert?

If you are doing the same insert over and over, would it be worth writing a special function to do the job, then calling that:

select my_insert(...)
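
The function itself would be something along these lines (the orders table and the argument list are made up for the example):

   -- orders(order_id, amount) is a hypothetical table
   CREATE OR REPLACE FUNCTION my_insert (integer, numeric)
   RETURNS integer AS '
   BEGIN
       INSERT INTO orders (order_id, amount) VALUES ($1, $2);
       RETURN 1;
   END;
   ' LANGUAGE plpgsql;

Each call is still its own transaction though, unless you wrap the whole loop in one.
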
Posted by Mark Aufflick on
That comment is interesting.

You are right, it's a bit bogus to do a transaction just for performance - but I'm talking 10 minutes down to 4 for a batch end-of-month process.

The PG documentation also talks about things being faster inside transactions due to this very issue.

And the only reason I tried it is that there is no other way (that I can see) to turn off autocommit mode.

If you can't turn it off, is it a mode or a state of being? ;)
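
In other words what I'm doing amounts to this (just a sketch, the table name is invented):

   BEGIN;
   INSERT INTO month_end_summary (order_id, amount) VALUES (1, 10.00);
   INSERT INTO month_end_summary (order_id, amount) VALUES (2, 12.50);
   -- ...tens of thousands more...
   COMMIT;   -- one sync to disk here, instead of one after every INSERT

rather than letting every INSERT commit (and fsync) on its own.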

The pl/pgsql function won't help since PG will call an fsync straight after it as well - and that is the only speed issue, not any query planning etc.

Ah. Well, now I'm going to try parallelizing it (I write to two tables) to see if that speeds it up...

Posted by Tom Jackson on

This is something that is just crying to be done in pl. I moved a select-heavy procedure from tcl to plsql and saw the time go from 8 hours to 8 minutes on a process we wanted to run every two hours, while also greatly increasing the functionality. It also really lowers the load on the server.

But there isn't anything really bad about a ten-minute query if it is only run once a month and you are only able to lower it to four minutes.

Posted by Mark Aufflick on
Yes - but in postgres a pl/pgsql function is implicitly a transaction, thus running into the problems mentioned above...
Posted by Jeff Davis on
If in the end you are just doing a collection of inserts you might consider doing it all with a COPY command (you can look at db_load_sql_data and ref-timezones for examples of loading data with copy).
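
Roughly speaking (the table and file names below are invented):

   -- one COPY replaces the whole batch of row-by-row INSERTs;
   -- the default format is tab-separated columns, one line per row
   COPY month_end_summary (order_id, amount) FROM '/tmp/month_end.dat';

COPY ... FROM STDIN works too if you would rather feed the data over the client connection, and since it is a single statement it is also a single transaction, so you get the one-commit behaviour for free.
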
Posted by Tom Jackson on

I think what I did was set up a controller function and then another worker function which ran each operation. It had to be done this way for similar reasons: user data might crash the process, but in that case only a single record would be affected. So in the worker function, catch the exception and return 0 or 1 based on success/failure. On failure, run another function to do your logging.

This is always the case in pl: even if you can catch an exception, you cannot continue in the body of the function; you have to clean up and return.

Moving stuff around inside the database is always going to be much quicker than going back and forth to the webserver for control.
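
A stripped-down sketch of the worker, in the oracle pl/sql style I was using (all of the table and function names here are invented):

   CREATE OR REPLACE FUNCTION process_one_record (p_id IN NUMBER)
   RETURN NUMBER
   IS
   BEGIN
      INSERT INTO clean_data (id, val)
        SELECT id, val FROM raw_data WHERE id = p_id;
      RETURN 1;             -- success
   EXCEPTION
      WHEN OTHERS THEN
         RETURN 0;          -- bad record: the controller logs it and moves on
   END process_one_record;
   /

The controller just loops over the records, calls this for each one, and calls the logging function whenever it gets back a 0.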

Posted by Mark Aufflick on
Ah - so you're saying that if I have a plpgsql function that iteratively calls another plpgsql function, and the query inside the inner function fails (rolling back the query or queries in that function), the OUTER function continues?
Posted by Tom Jackson on

I know it works that way in plsql, and plpgsql is very similar. You add an exception block to the bottom of the function, something like:

   EXCEPTION
     WHEN OTHERS
     THEN
        return 1;
   END;

The WHEN OTHERS clause handles all exceptions not otherwise handled. If you expect certain types of errors and want to distinguish between them, you need to include them first.

If you have several dml statements that need to be executed, and you want as many of them to succeed as possible, you can follow the advice of _Oracle PL/SQL Programming_, pages 266-267, and have each statement in its own block:

 BEGIN
  BEGIN
   DELETE FROM employee WHERE ...;
  EXCEPTION
   WHEN OTHERS THEN NULL;
  END;

  BEGIN 
   UPDATE company SET ...;
  EXCEPTION
   WHEN OTHERS THEN NULL;
  END;
 END;

Using this method, you could do away with the second function, and use a sub-block to do the work:


 LOOP
   BEGIN
     ...;
   EXCEPTION
    WHEN OTHERS THEN NULL;
   END;
 END LOOP;

This works in pl/sql; you might try a simple example in pl/pgsql.

Posted by Mark Aufflick on
Postgres doesn't support subtransactions, so any error anywhere between the start and end of a transaction will roll right back to the start.

Postgres also has no exception handling - you can run a transaction on exception to log it or similar, but you cannot prevent the rollback.

Now that I have added the second (simpler but more data) part of the end of month process, processing time has blown out to 30 mins - and the amount of data will increase as sales increase.

I am going to try parallelizing the two parts, which should reduce the overall execution time.

Or, I guess, I just abandon my idea and manually enforce the business rules, as well as letting the constraints do their thing and catching the errors.

Such a nice idea though...

Posted by Jeff Davis on
There was a post by Markus Schaber on benchmarking various insert methods on the pgsql-performance list. The very short answer is that copy is the fastest by a large margin and individual inserts are very slow:
Here are the results of some benchmarks we did inserting 30k rows into a
table, using "time psql -f blubb.sql -q dbname":

File              kingfisher (time, rows/s)    skate (time, rows/s)
30kinsert.sql     39.359s      762 r/s         335.024s      90 r/s
30kcommonce.sql   11.402s     2631 r/s           7.086s    4233 r/s
30kwithids.sql    10.138s     2959 r/s           6.936s    4325 r/s
30kprepare.sql     8.173s     3670 r/s           5.189s    5781 r/s
30kdump.sql        1.286s    23328 r/s           0.785s   38216 r/s
30kdumpseq.sql     1.498s    20026 r/s           0.927s   32362 r/s


Kingfisher is the single processor machine I mentioned yesterday night,
skate the SMP machine.

The table has five columns (bigint, bigint, double, double, timestamp
without time zone). The first of them has a "default nextval('sequence'
::text)" definition, and there are no further constraints or indices.

The 30kinsert.sql uses simple insert statements with autocommit on, and
we insert all but the first column which is filled by the default
sequence. With this test, kingfisher seems to have an unrealistically high
value of commits (syncs) per second (see what I wrote yesterday) [1],
skate has a more realistic value.

30kcommonce.sql, as suspected, gives a rather high boost by
encapsulating all into a single commit statement.

30kwithids gives a small boost by inserting pre-calculated sequence
numbers, so it seems not worth the effort to move this logic into the
application.

30kprepare prepares the insert statement, and then issues 30k EXECUTE
statements within one transaction; the speedup is noticeable.

30kdump simply inserts the 30k rows as a dump via COPY FROM STDIN. (as
with 30kwithids, the first column is contained in the insert data, so
the default value sequence is not used). Naturally, this is by far the
fastest method.

30kdumpseq.sql uses COPY, too, but omits the first column and thus
utilizes the sequence generation again. This gives a noticeable 15%
slowdown, but seems to be still fast enough for our purposes. Sadly, it
is not available within jdbc.

Thanks for your patience.

Footnotes: 
[1] We suspect this to be some strange interaction between ide,
cryptoloop and ext3fs, so that the sync() call somehow does not really
wait for the data to be physically written to the disk. (I really can't
imagine a crypto-looped notebook harddisk to do more syncs/second than a
SCSI-Based RAID in a server machine. We did some small benches on the
sync() / fsync() calls that seem to prove this conclusion.)