Forum OpenACS Q&A: Re: referencing a specific db pool

11: Re: referencing a specific db pool (response to 10)

Posted by Tom Jackson on 05/18/04 05:56 AM

This is something that is just crying to be done in pl. I moved a select heavy procedure from tcl to plsql and saw the time go from 8 hours to 8 minutes on a process we wanted to run ever two hours, and also greatly increased the functionality. It also really lowers the load on the server.

But there isn't anything really bad about a ten minute query if it is only run once a month, and you are only able to lower it to four minutes.

12: Re: referencing a specific db pool (response to 11)

Posted by Mark Aufflick on 05/18/04 11:55 AM

Yes - but in postgres a pl/pgsql function is implicitly a transaction, thus running into the problems mentioned above...

13: Re: referencing a specific db pool (response to 12)

Posted by Jeff Davis on 05/18/04 12:04 PM

If in the end you are just doing a collection of inserts you
might consider doing it all with a COPY command (you can
look at db_load_sql_data and ref-timezones for examples
of loading data with copy).

14: Re: referencing a specific db pool (response to 12)

Posted by Tom Jackson on 05/18/04 05:21 PM

I think what I did was setup a controller function and then another worker function which ran each operation. It had to be done this way for the similar reasons: user data might crash the process, but in that case only a single record would be affected. So in the worker function, catch the exception, and return 0 or 1 based on success/failure. On failure, run another function to do your logging.

This is always the case in pl, even if you can catch an exception, you cannot continue in the body of the function and you have to cleanup and return.

Moving stuff around inside the database is always going to be much quicker than going back and forth to the webserver for control.

15: Re: referencing a specific db pool (response to 14)

Posted by Mark Aufflick on 05/18/04 05:35 PM

ah - so you're saying that if i have a plpgsql function that iteratively calls another plpgsql function, and the query inside the inner function fails (and rolls back the querie(s) in that function, the OUTER function continues?

16: Re: referencing a specific db pool (response to 15)

Posted by Tom Jackson on 05/18/04 06:23 PM

I know it works that way in plsql, and plpgsql is very similar, you add an exception block to the bottom of the function, something like:

   EXCEPTION
     WHEN OTHERS
     THEN
        return 1;
   END;

The WHEN OTHERS clause handles all exceptions not otherwise handled. If you exect certain types of errors and want to distinguish between them, you need to include them first.

If you have several dml statements that need to be executed, and you want as many of them to succeed as possible, you can follow the advice of _Oracle PL/SQL Programming, page 266-267 and have each statement in its own block:

 BEGIN
  BEGIN
   DELETE FROM employee WHERE ...;
  EXCEPTION
   WHEN OTHERS THEN NULL;
  END;

  BEGIN 
   UPDATE company SET ...;
  EXCEPTION
   WHEN OTHERS THEN NULL;
  END;
 END;

Using this method, you could do away with the second function, and use a sub-block to do the work:


 LOOP
   BEGIN
     ...;
   EXCEPTION
    WHEN OTHERS THEN NULL;
   END;
 END LOOP;

This works in pl/sql, you might try a simple example in pl/pgsql.

17: Re: referencing a specific db pool (response to 15)

Posted by Mark Aufflick on 05/18/04 06:35 PM

Postgres doesn't support subtransactions, so any error anywhere between the start and end of a transaction will roll right back to the start.

Postgres also has no exception handling - you can run a transaction on exception to log it or similar, but you cannot prevent the rollback.

Now that I have added the second (simpler but more data) part of the end of month process, processing time has blown out to 30 mins - and the amount of data will increase as sales increase.

I am going to try parallelizing the two which should reduce the overall execution time.

Or, I guess I just abandon my idea and manually enforce the business rules as well as allowing constraints to do their thing and catch the errors.

Such a nice idea though...

18: Re: referencing a specific db pool (response to 17)

Posted by Jeff Davis on 06/07/04 10:35 AM

There was a post by Markus Schaber on benchmarking various insert methods on the pgsql-performance list. The very short answer is that copy is the fastest by a large margin and individual inserts are very slow.:

Here are the results of some benchmarks we did inserting 30k rows into a
table, using "time psql -f blubb.sql -q dbname":

File            kingfisher              skate
30kinsert.sql   39.359s    762r/s       335.024s           90r/s
30kcommonce.sql 11.402s   2631r/s         7.086s         4233r/s
30kwithids.sql  10.138s   2959r/s         6.936s         4325r/s
30kprepare.sql   8.173s   3670r/s         5.189s         5781r/s
30kdump.sql      1.286s  23328r/s         0.785s        38216r/s
30kdumpseq.sql   1.498s  20026r/s         0.927s        32362r/s


Kingfisher is the single processor machine I mentioned yesterday night,
skate the SMP machine.

The table has five rows (bigint, bigint, double, double, timestamp
without time zone). The first of them has a "default nextval('sequence'
::text)" definition, and there are no further constraints or indices.

The 30kinsert.sql uses simple insert statements with autocommit on, and
we insert all but the first column which is filled by the default
sequence. With this test, kingfisher seems to have an irrealistic high
value of commits (syncs) per second (see what I wrote yesterday) [1],
skate has a more realistic value.

30kcommonce.sql, as suspected, gives a rather high boost by
encapsulating all into a single commit statement.

30kwithids gives a small boost by inserting pre-calculated sequence
numbers, so it seems not worth the effort to move this logic into the
application.

30kprepare prepares the insert statement, and then issues 30k EXECUTE
statements within one transaction, the speedup is noticeable.

30kdump simply inserts the 30k rows as a dump via COPY FROM STDIN. (as
with 30kwithids, the first column is contained in the insert data, so
the default value sequence is not used). Naturally, this is by far the
fastest method.

30kdumpseq.sql uses COPY, too, but omits the first column and such
utilizes the sequence generation again. This gives a noticeable 15%
slowdown, but seems to be still fast enough for our purposes. Sadly, it
is not available within jdbc.

Thanks for your patience.

Footnotes: 
[1] We suspect this to be some strange interaction between ide,
cryptoloop and ext3fs, so that the sync() call somehow does not really
wait for the data to be physically written to the disk. (I really can't
imagine a crypto-looped notebook harddisk to do more syncs/second than a
SCSI-Based RAID in a server machine. We did some small benches on the
sync() / fsync() calls that seem to prove this conclusion.)
-- 
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich