Forum OpenACS Q&A: Postgres feature questions
- Materialized views? Does Postgres have them?
- In Oracle, you can have a PL/SQL function return a REF cursor,
lets the function's caller manipulate a multi-row result set. I'm
pretty sure Postgres (or at least pgPL/SQL) does not support REF
cursors. True? But, just being able to transparently return rows
from a function would be much nicer than REF cursors, anyway. Can
Postgres do that? (Oracle can't. I hear MS SQL Server can.)
- Any experience with how Postgres compares to Oracle for analytic
and statistical functions? I mean stuff like averages, moving
averages, standard deviation, etc.
For example, PostgreSQL's support for user-defined SQL aggregate functions sounds nicer than having to shove anything like that into Oracle PL/SQL. And Bell Labs has experimented with embedding the R statistical language inside Postgres. (R is currently not thread safe, so this clearly isn't production-worthy. In the long run, I suspect that tight integration between a language like R and an RDBMS like Postgres would prove hugely useful, in much the same way that the tight integration between AOLserver and Tcl has proven useful. But that's blue-sky speculation on my part.)
- What sort of kludges/hacks do PostgreSQL users use for
manipulating data that has an inherent time-series order to it?
For example, in Oracle you can use the RowNum pseudo-column to do some ugly but potentially useful stuff. E.g., say I have a table of stock prices, and I want to calculate (today's stock price) / (yesterday's stock price). To do this in SQL means doing a self-join and shifting the second table one day forward in time. Here's an ugly Oracle hack to do exactly that:
select d1.id ,d1.row_num ,d2.row_num as prev_row_num ,d1.day_date ,d2.day_date as prev_day_date ,d1.price ,d2.price as prev_price ,(d1.price / d2.price) -- Note that below, test_data_view is a view which is already doing an -- order by - this is IMPORTANT. "Oracle8 The Complete Reference" -- claims you can't use an order by in a view like that, but in -- practice it works fine - the RowNum is applied as the rows come out -- of the view, AFTER the order by has been applied. from ( select rownum as row_num ,d.* from test_data_view d where d.id = 0 and d.day_date >= '2000-01-01' and d.day_date <= '2000-03-31' ) d1, ( select rownum as row_num ,d.* from test_data_view d where d.id = 0 and d.day_date >= '2000-01-01' and d.day_date <= '2000-03-31' ) d2 where d1.id = d2.id and d1.row_num = (d2.row_num + 1) ;
Any and all war stories, sage advice, or other info will be much appreciated. :)
1. No, PG does not have materialized views, but you can sort of
simulate them with triggers (probably not what you want to do,
but if you really need the efficiency of materialized views, that's
really what you're going to have to do).
2. You are right on. PG does not currently allow you to return REF
cursors, but will eventually allow you to return an arbitrary
relation, which is effectively what you're talking about, and brings
a certain amount of closure to plpgsql that LISP lovers dream
about every night.
3. you got me. Sounds like you know as much as I do on that
4. uggh. That's ugly. So what you're asking is, in PG, can you do
something as ugly as what you just did in Oracle? :) There is no
pseudo-column like rownum in PG, so I'd be tempted to say that
you can't do that. However, there are some monster SQL
hackers in this community (DanW? Don?) who might be able to
pull some amazing trick to do just this... anyone?
3. You're right about aggregates being very cool. I've not played with them other than test cases because there's no equivalent in Oracle or any other RDBMS I'm aware of, so they're not portable and therefore I've avoided looking for uses in OpenACS. But they're still awfully cool!
4. There's no direct way to do this in PG, same as in standard SQL. It's only working in Oracle due to sheer luck, after all...
2. PG 7.1 and below can't return cursors. As Don noted, 7.2 allows you to do that.
3. Don't know much in that area. Just one thing, just because something is not 'thread-safe' does not mean it's not production-worthy. Threads con be good (AOLserver) and can be bad, depending on the situation. PostgreSQL isn't threaded, for example. It's process-based.
In the past, I've had to do queries of the sort you mention. I've used a query like the following to do what you describe:
SELECT today.id, today.price AS todays_price, yesterday.price AS yesterdays_price, (today.price - yesterday.price) AS price_diff, DECODE( yesterday.price, 0, NULL, today.price/yesterday.price) as pct_change, FROM stocks today, stocks yesterday WHERE today.id = yesterday.id AND TRUNC(today.day_date) = TRUNC(:incoming_date) AND TRUNC(yesterday.day_date) = TRUNC(:incoming_date) - 1 AND today.id = :incoming_id
The values prefixed with colons are the input values (bind variables). If I were writing it for a production system, I'd modify the yesterday.day_date comparison to look for the most recent date before the :incoming_date rather than simply subtracting a day. The DECODE will deal with the possibility of a division by zero. There are some other things I'd consider doing, but it'd very dependent on what I needed to for the particular application. In any case, it sure seems cleaner than the rownum version. I must also say that I didn't try the above query directly (it's from memory), but I think the spirit of it is right.
I'd also be very surprised if Postgres couldn't do this query, making only minor changes to function names.
The primary cost of this is that connection startup is slow. In our case, running under AOLserver, we don't care 'cause we're pooling persistent connections and only pay the price once.
Daryl's query would run nearly unchanged in PG, as he suspects. You could minimize the change necessary by using SQL92-standard CASE rather than DECODE, because PG and other SQL92-compliant RDBMS's support it (as does Oracle as of 8.1.6 or so).
the job, but it is not as general as the nasty hack I posted above.
You're doing the join with date arithmetic, by subtracting 1 day from
the date. That works fine as long as the previous date is always
exactly 1 day in the past. But if you have a price on a Monday, and
the previous price was on the Friday before, that's not going to do
Hm, maybe a true geek would come up with some cool mathematical
transform to map the dates you actually have to a series of integers
without any gaps, kind of like the way the Julian Day Numbers are
series of integers without gaps, and each integer maps 1-to-1 to each
day in the calendar.
Don, that's good to know about non-thread-safe languages being ok to
embed in Postgres - I didn't realize that. I saw comments in the Docs
about how the libpq C library is thread safe as of Postgres 7.0, and
the guy who embedded R (Duncan Lang) seemed to highlight its
non-thread-safety as an issue, so I just assumed it was a problem.
(Now I'll have to put embedding language interpretors into Postgres on
my list of Cool Things to Learn...)
If the guy embedding this language felt that this indicated problems with thread-safety on the server side, he's wrong.
On the other hand, if he's providing client as well as server side services then he needs to make sure he's calling the thread-safe subset of PG lib.
My guess, though, is that he's muddling potential issues on the client side with the restrictions placed on server-side embedded languages.
to compute the comparison between the current row and the previous
row based on some variables you set at the end of each iteration in
your program instead of in the database.
However, if you want or need to do nearly the same thing in postgres
you do in oracle you could try using a postgres sequence to number
create sequence test_data_view_order_seq;
create table test_data_view_order (row_num int, whatever_field
insert into test_data_view_order (row_num,whatever_field)
from (select id from test_data_view order by whatever_field)
Then, in your ugly query, join test_data_view_order to
test_data_view based on whatever_field where whatever_field is the
field you are ordering by. This would be the uglier postgres hack
to replace to ugly oracle hack.
Actually we'll want to support 8.1.6 for a long time so unfortunately we'll not be making any use of either new feature for some time. But it is still good to know that aggregates are there.
Maybe the addition of aggregates is why installation now supposedly requires 512 MB RAM rather than only 128 MB!
The query you originally posted also fails to be sufficiently general to adapt to the situation where the previous day might be more than one day off. (Your day_date comparisons in the pseudo-tables only look for adjacent days.) That's why I indicated I would make adjustments depending on the particulars of the situation.
For example, one could make changes like the following to get the previous date for which a row exists:
SELECT today.id, today.price AS todays_price, yesterday.price AS yesterdays_price, (today.price - yesterday.price) AS price_diff, DECODE( yesterday.price, 0, NULL, today.price/yesterday.price) as pct_change, FROM stocks today, stocks yesterday WHERE today.id = yesterday.id AND TRUNC(today.day_date) = TRUNC(:incoming_date) AND TRUNC(yesterday.day_date) = (SELECT MAX(inner.day_date) FROM stocks inner WHERE inner.id = today.id AND stocks.day_date < TRUNC(:incoming_date) ) AND today.id = :incoming_id
or, to get data for the last business day, including accounting for holidays (stored in a table named holiday, something like:
SELECT today.id, today.price AS todays_price, yesterday.price AS yesterdays_price, (today.price - yesterday.price) AS price_diff, DECODE( yesterday.price, 0, NULL, today.price/yesterday.price) as pct_change, FROM stocks today, stocks yesterday WHERE today.id = yesterday.id AND TRUNC(today.day_date) = TRUNC(:incoming_date) AND TRUNC(yesterday.day_date) = ( SELECT MAX(dt) AS last_business_day FROM ( SELECT DECODE( TO_CHAR( :incoming_date, 'dy'), 'fri', :incoming_date, (NEXT_DAY(:incoming_date - 7 ,'fri') - ROWNUM + 1) AS dt FROM all_objects WHERE ROWNUM <= 25 ) WHERE TO_CHAR(dt,'dy') NOT IN ( 'sat', 'sun' ) AND NOT EXISTS ( SELECT NULL FROM holiday WHERE holiday_date = TRUNC(dt) ) ) ) AND today.id = :incoming_id
The last-business-day calculation could easily be moved into a view or stored procedure, in both Oracle and PostgreSQL.
(Note: Regarding the last solution, I can't take credit for the date calculation structure. I got it from Tom Kyte's AskTom web site. The rownum/all_objects combination is purely a row-generator to get a list of dates. The page I just referenced gives more details. I'll also plug Tom's new book. It's the best thing I have yet to read about Oracle, and I own some 15 books on Oracle.)
I have concerns, Andrew, about the solution you initially gave simply because it seems both fragile and non-portable, at a conceptual level. It's also using a behavior that Oracle doesn't guarantee (ORDER BY within a view). The solution built on date arithmetic, while syntactically proprietary, moves somewhat easily from database to database. Additionally, for a database like Oracle, the queries I've just given should make good use of the optimizer, and it should recognize that the date arithmetic subquery is invariant for the outer query and need only be done once.
The query you originally posted also fails to be sufficiently general to adapt to the situation where the previous day might be more than one day off. (Your day_date comparisons in the pseudo-tables only look for adjacent days.)
Daryl, I don't think so. Well, it wouldn't be the first time I've been wrong, but I'm pretty sure my ugly Oracle query works exactly as I originally described. Since it assigns a continuous sequence of integers to the days actually present, and joins on those integers, not on dates, it always gets the previous day, no matter how long ago the previous day was. E.g.:
ROW_NUM PREV_ROW_NUM DAY_DATE PREV_DAY_D PRICE PREV_PRICE ------- ------------ ---------- ---------- ----- ---------- 5 4 2000-01-07 2000-01-06 101 101 6 5 2000-01-10 2000-01-07 101 101
Now, the rest of your comments and concerns are quite valid. I came up with that ugly query while experimenting, I would think more than twice before actually putting something like that into production use.
I've also learned to be wary of pushing pieces of queries into PL/SQL procedures. That often makes the query clearer, so I like doing it during development. But I've seen that result in queries that are slower by an order of magnitude or more, I think because PL/SQL is completely opaque to the query optimizer.
Hm, Daryl, could you expand on your comment about "It's also using a behavior that Oracle doesn't guarantee (ORDER BY within a view)."? Do you know any more specifics about that?
I think Oracle does guarantee that "order by" works when defining a materialized view, so it would be pretty weird if it didn't work when defining an on the fly view. Then again, assuming that normal and materialized views always work the same way would be pretty foolish on my part...
Yeah, the holiday/calendar stuff you posted is smart. I was actually doing something similar inside my "test_data_view", in order to filter out any bogus dates (prices when the market was not open), and optionally fill in values for any missing days. Compared to my RowNum hack, that sort of thing is downright elegant. :)
But, gosh, you're right. I can use that same holiday/calendaring strategy to replace my RowNum hack - thanks for pointing that out. E.g.:
-- We can do the same query using max() instead of the RowNum -- pseudo-column. This might be considered clearer, and more correct. select d.id ,d.day_date ,dp.day_date as prev_day_date ,d.price ,dp.price as prev_price ,(d.price / dp.price) from test_data_view d -- Today. ,test_data_view dp -- Previous day. where d.id = 0 and d.id = dp.id and d.day_date between '2000-01-01' and '2000-03-31' and dp.day_date between '2000-01-01' and '2000-03-31' and dp.day_date = ( select max(d2.day_date) from test_data_view d2 where d.day_date > d2.day_date ) ;
But guess what? The above query is much slower - about 75 vs. 3 seconds. I didn't look at the query plans, but I'd guess that this lousy performance is due to one or both of these factors:
We are joining views which themselves are doing joins, and
Oracle is known to sometimes be very bad at optimizing this.
Flattening everything out into a single SQL query or view
definition would probably help.
- Joining on "max(d2.day_date)" is less specific than joining on "(d2.row_num + 1)", so it might be a lot slower unless we've got an index on day_date (which we do), and a really smart query optimizer.
Oh well. You know, all these solutions look like pretty brittle hacks to me. I suspect what we're really learning here, is that Oracle and standard SQL just don't support ordered data very well at all. As David W. pointed out, in a lot of these cases it would probably be better just to do calculations like (price / previous price) outside the database.
> slower by an order of magnitude or more, I think because PL/SQL is
> completely opaque to the query optimizer.
Yep. Also, with pl/sql functions, you're crossing boundry from sql query processing to pl/sql function execution. PL/Sql is fundamentally an interpreted language - the code gets tokenize and the tokens interpreted, so you have that overhead as well.
Tables were analyzed and all that. The query optimizer was just picking a very, very bad execution plan for the Grand Monolithic Version of the query.
I'm sure I could've gotten equivalent execution times with the single query approach with the right set of optimizer hints but splitting out pieces into PL/SQL functions was easier and they were useful elsewhere, too so what the heck...
PL/Sql is fundamentally an interpreted language - the code gets tokenized and the tokens interpreted, so you have that overhead as well....but that's all cached, right? And with 9i, you can natively compile PL/SQL too, just flick the switch...
Steven Feuerstein, the author of numerous PL/SQL books (the Ants book, for one, as well as his PL/SQL Best Practices book) from O'Reilly, heavily advocates explicit cursors for maintainability reasons. Most of the cursors I've written are for queries that are a bit more than the trivial ones Tom Kyte usually shows in his examples. For such queries, I find the code to be more readable if the query is relocated to a declared cursor in the package/procedure header, with a cursor name that really gives one an idea of what's being returned. This is especially so with a parameterized cursor. I don't really want to have to create views for each of these.
However, Tom Kyte makes some valuable points, so I'm sure I'll be re-evaluating Feuerstein's recommendations in light of Kyte's arguments.
I just use Vim's split screen feature to leave the cursor definitions open in the top window, while scrolling the code at the bottom. I'd have to split the screen anyway if the query in question spans lots of line (as many of mine do) -- having the LOOP six or seven lines away from the FOR is a REAL readability disaster for me.
The use of explicit cursors also allows me to define variables and parameters of the row type. For example:
CREATE PACKAGE BODY mypackage IS CURSOR mycur IS SELECT col1, col2, col3 FROM mytable; PROCEDURE cleanup_row( p_row IN mycur%ROWTYPE ) IS BEGIN -- do something with p_row END cleanup_row; BEGIN FOR myrec IN mycur LOOP cleanup_row( myrec ); END LOOP; END mypackage;
As another point, on a recent code audit resulting from several key changes in our data model, I had to look through all our PL/SQL procedures to find the affected queries. Those that were declared as parameterized cursors in the header of each program block were LOTS easier to find and fix than those that made use of program variables (e.g., SELECT ... FROM ... WHERE colname = varname).
Basically, I guess I like the fact that explicit, parameterized cursors allow me to treat queries as functions that return result sets.
As you say, a lot of it's a matter of taste. However, I'm not convinced that performance should be the sole criterion here. I really value Mr. Kyte's thoughts on the matter, but he only rarely uses queries of any significant size in his examples. That, and I don't want to create views for queries that are used in exactly one program unit.
For some reason, your original query just wouldn't leave me alone. Before I go on, I owe you an apology. What I didn't realize in your query is that you're actually producing a table of values. In that event, your self-join is hardly "ugly". In fact, it qualifies as elegant.
However, looking at a couple of Oracle books for some brain-joggers led me to some functions I had not used before: LEAD and LAG . These are intended to do exactly what you need to do.
I created a table:
CREATE TABLE dlb_stocks ( id NUMBER NOT NULL, day_date DATE NOT NULL, price NUMBER(18,3) NOT NULL, PRIMARY KEY (id, day_date ) );
and populated it with 720,000 rows...
BEGIN FOR i IN 1..10000 LOOP INSERT INTO dlb_stocks (id, day_date, price) SELECT i, TRUNC(SYSDATE) - (ROWNUM - 1), MOD( object_id, 9 ) * 10 FROM all_objects WHERE rownum < 100; END LOOP; END; / -- get rid of Saturday and Sunday values DELETE from dlb_stocks WHERE TO_CHAR( day_date, 'dy') IN ('sat','sun'); COMMIT;
And analyzed it to let the cost-based optimizer "be all it can be":
ANALYZE TABLE dlb_stocks COMPUTE STATISTICS FOR TABLE FOR ALL INDEXES FOR ALL INDEXED COLUMNS;
And ran the new and improved query:
column id format 99999 heading 'ID' column todays_date format a9 heading 't_date' column todays_price format 999.9 heading 't_price' column yesterdays_date format a9 heading 'y_date' column yesterdays_price format 999.9 heading 'y_price' column price_diff format 999.9 heading 'Diff' column price_change format 0.90 heading 'Ratio' SELECT id, day_date as todays_date, price AS todays_price, LEAD( day_date, 1, NULL) OVER (ORDER BY day_date DESC) AS yesterdays_date, LEAD( price, 1, 0 ) OVER (ORDER BY day_date DESC) AS yesterdays_price, (price - LEAD( price, 1, 0) OVER (ORDER BY day_date DESC)) AS price_diff, TO_CHAR(DECODE( LEAD( price, 1, 0) OVER (ORDER BY day_date DESC), 0, NULL, price / ( LEAD( price, 1, 0) OVER (ORDER BY day_date DESC) ) ), '90.00') AS price_change FROM dlb_stocks WHERE id = 1 ORDER BY day_date DESC;
And got the following output
ID t_date t_price y_date y_price Diff Ratio ------ --------- ------- --------- ------- ------ ------------------ 1 15-OCT-01 50.0 12-OCT-01 70.0 -20.0 0.71 1 12-OCT-01 70.0 11-OCT-01 80.0 -10.0 0.88 1 11-OCT-01 80.0 10-OCT-01 20.0 60.0 4.00 ...
With some gorgeous statistics:
71 rows selected. Elapsed: 00:00:00.11 Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=4 Card=71 Bytes=852) 1 0 WINDOW (BUFFER) 2 1 TABLE ACCESS (BY INDEX ROWID) OF 'DLB_STOCKS' (Cost=4 Ca rd=71 Bytes=852) 3 2 INDEX (RANGE SCAN DESCENDING) OF 'SYS_C004382' (UNIQUE ) (Cost=3 Card=71) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 4 consistent gets 0 physical reads 0 redo size 6444 bytes sent via SQL*Net to client 1207 bytes received via SQL*Net from client 9 SQL*Net roundtrips to/from client 1 sorts (memory) 0 sorts (disk) 71 rows processed
My only response was sweeeet. I suspect you'd see even better results by specifying ORGANIZATION INDEX when building the table or by constructing an index on id/price/day_date so that the index alone is sufficient to read the data you want.
I know you didn't really intend for your query to generate so much interest, but this particular solution seems to perform VERY well. Do you agree?
Although I know little about it, I suspect the vector-oriented kdb database has similar order-aware functionality, although derived from the APL world, rather than from SQL extensions hacked onto the fundamentally un-ordered relational model. Wippler's Vlerq may be going in a similar direction, although from yet another background.
I've heard that there is some academic work out there on database models which natively understand order, and form more powerful supersets of Codd's relational model. I would have called that a "vector relation model" or an "ordered relation model", but bizarrely, apparently what it's actually called is a "set model"! I have not read it, but The Set Model for Database and Information Systems by Mikhail M. Gilula was recommended to me on that subject.