Forum OpenACS Q&A: something like vacuum analyze or compact database for Oracle?

Hi!
I would like to know if there is a command or some instructions to
follow for Oracle in order to do something like VACUUM ANALYZE for PG
or "compact database" for Access.

Thanks

As far as I understand (and I use Oracle regularly) there is no equivalent. The reason is that Oracle doesn't need to clean itself up -- you might say that its design is slightly superior in this respect.
Jade's more or less right ... Oracle uses an overwriting storage manager, which means that when you update a row it is physically overwritten.  This complicates concurrency stuff (queries have to read the REDO log to get the old version, as is necessary when another transaction started reading the table before yours started updating it) but keeps tables from growing as they're updated.

However, there's still the deletion problem.  Oracle will find free "holes" but a massive delete won't shrink your tablespace files, as I understand it.  You have to do something special and there is a way to do it but I don't know it :)

But I know there are people here who do ...

Jade and Don are right. There's no such thing as a "vacuum analyze" in Oracle.

As for Don's observation regarding the redo log, yes, Oracle's design means that readers asking for a row since modified by another user "read around" the altered row to get the original. However, the benefits far outweigh the expense: readers never wait for writers, and vice versa. This has a lot of implications, especially in that locks no longer need to be avoided as an "expense". One allows transactions to be as long or as short as they need to be, without worrying about blocking other users. For full details, see the Oracle 8i Concepts manual (the whole book is required reading for any Oracle developer). Alternatively, buy a copy of Thomas Kyte's Expert One-on-One Oracle. The latter is worth every cent.

Don is also right that big chunks of space resulting from a DELETE are not reclaimed. However, in a typical system, you'll be re-inserting those rows at some point. Trying to keep up with reclaiming excess space, only to watch it get allocated again later, is a pointless exercise in most shops. The underlying issue is that extents (the physical chunks of space allocated to a database object -- table, index, etc.), once allocated to an object (a.k.a. a segment), remain allocated, even if they're not used.

To get rid of those not-currently-used extents, you can either DROP or TRUNCATE the table to reclaim the space. So, one might try:

  CREATE TABLE my_temporary_holding_table AS SELECT * FROM mypurgedtable;

  TRUNCATE TABLE mypurgedtable;

  INSERT INTO mypurgedtable SELECT * FROM my_temporary_holding_table;

  DROP TABLE my_temporary_holding_table;

If the purged table is the parent table for any referential integrity constraints, you'll need to disable the constraints before the TRUNCATE and re-enable them following the INSERT. See the syntax for ALTER TABLE.
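
For instance, something along these lines (the child table and constraint names here are made up for illustration):

  ALTER TABLE child_table DISABLE CONSTRAINT fk_child_parent;
  -- ... TRUNCATE the parent and re-INSERT its rows as above ...
  ALTER TABLE child_table ENABLE CONSTRAINT fk_child_parent;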

Finally, it is possible to resize a datafile in Oracle. (Any single tablespace is composed of one or more datafiles.) If memory serves, the syntax is ALTER DATABASE DATAFILE 'filename' RESIZE newsize. This statement assumes the used extents in the datafile all fit within the new size. I don't do this very often, though.
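
So, hedging on my memory being right, something like this (the path and size are invented):

  ALTER DATABASE DATAFILE '/u02/oradata/mydb/users01.dbf' RESIZE 500M;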

Actually, PG's non-overwriting storage manager gives exactly the same benefits as Oracle's REDO-log reading of old tuple contents, but at much lower cost.  MVCC (writers don't block readers; writers only block the rows they're writing to other writers) is almost a trivial feature of PG, whereas it is complex in the overwriting storage manager scenario.

The downside is that PG suffers when tables are subjected to frequent updates and deletes - tables grow, and VACUUM ANALYZE is the remedy.

The PG group are working on two fronts:

1. Investigating overwriting storage manager algorithms (with the additional complexities implemented by Oracle to maintain proper row-level locking behavior)

2. Looking into what might be called hybrid techniques.  PG 7.2 actually includes a first cut at a "low-cost" VACUUM command.  There are a bunch of avenues still to explore here.

I look at #1 as being a tried-and-proven approach that sacrifices much of PG's inherent simplicity.  Not necessarily a bad thing, but the current simplicity is awfully attractive.

Option #2 involves creative work which, AFAIK, has not been fully explored in academia or industry. It's more innovative, which means that grabbing the best of both worlds may take a relatively long time.

So the future will be interesting ...

The interesting thing is that whichever approach they take, they have a good chance of overtaking Oracle at this level.  In a sense it's not fair - early RDBMS folks really had no choice but to implement overwriting storage managers, because storage was *expensive*.  Now, all that one worries about with PG is whether or not the cost of a VACUUM once a night is too expensive for a 24*7 website.  Disk storage for most of us is as nearly zero-cost as one can imagine (compared to my days purchasing DEC 200 MB washing machine drives for *choke* $36K.  OK I never paid that, but it was list and many did).

Don, can you provide a pointer to a discussion of how PG accomplishes this?  I may be interested in replicating a bit of that technology in a project I am working on.

Thanks,

As for Don's observation regarding the redo log, yes, Oracle's design means that readers asking for a row since modified by another user "read around" the altered row to get the original.
I think you mean "rollback segment" instead of redo log in the paragraph I'm quoting. Oracle uses rollback segments for transactional behavior (the "A" and "I" in ACID), and redo logs to ensure that committed transactions do not get lost (the "D" in ACID).
Sebastiano, thanks for mentioning that. About 30 minutes after I posted that, I slapped my head and yelled "Doh!" for mentioning redo logs instead of the more correct rollback segment. Like Jerry, I'd like to find out more about the underlying PostgreSQL algorithms for this. In particular, what is PostgreSQL's behavior on something like this:
SQL> DROP TABLE t;

Table dropped.

SQL> CREATE TABLE t AS SELECT table_name FROM user_tables
  2  WHERE table_name like 'DR$DOCUMENT%';

Table created.

SQL>
SQL> SELECT COUNT(*) FROM t;

  COUNT(*)
----------
        12

SQL>
SQL> set serveroutput on size 1000000
SQL> DECLARE
  2   CURSOR v_cur IS
  3    SELECT table_name from t;
  4   v_row v_cur%ROWTYPE;
  5   v_count NUMBER DEFAULT 0;
  6  BEGIN
  7   OPEN v_cur;
  8
  9   DELETE FROM t;
 10   COMMIT;
 11
 12   LOOP
 13    FETCH v_cur INTO v_row;
 14    IF v_cur%NOTFOUND THEN
 15     EXIT;
 16    END IF;
 17    dbms_output.put_line( v_row.table_name );
 18    v_count := v_count + 1;
 19   END LOOP;
 20   dbms_output.put_line( TO_CHAR(v_count) || ' rows retrieved');
 21
 22   CLOSE v_cur;
 23  END;
 24  /
DR$DOCUMENT_BODY$I
DR$DOCUMENT_BODY$K
DR$DOCUMENT_BODY$N
DR$DOCUMENT_BODY$R
DR$DOCUMENT_SUBJECT$I
DR$DOCUMENT_SUBJECT$K
DR$DOCUMENT_SUBJECT$N
DR$DOCUMENT_SUBJECT$R
DR$DOCUMENT_URL$I
DR$DOCUMENT_URL$K
DR$DOCUMENT_URL$N
DR$DOCUMENT_URL$R
12 rows retrieved

PL/SQL procedure successfully completed.

SQL>

If you're wondering why this code is special, notice the COMMIT prior to retrieving anything from the cursor.

I don't know PG, so I can't answer Daryl's question. However, regarding the Oracle code, note that the cursor's query is executed when you open the cursor, not when you fetch from it (FYI, this is documented in Oracle's PL/SQL User's Guide and Reference). So it's normal that the cursor retrieves the rows from the table even if they've been deleted immediately after the open statement.
Sebastiano, according to the Oracle PL/SQL documentation, Oracle only identifies the rows that meet the criteria when the OPEN is executed; it does not retrieve them until the FETCH. In the example above, the OPEN is performed, then the same rows are deleted and committed. Where is the FETCH getting its data?

The following sample highlights that the retrieval of data is only done at the FETCH:

SQL> set serveroutput on size 1000000
SQL> DECLARE
  2   CURSOR v_cur IS
  3    SELECT name FROM all_source;
  4   v_rec v_cur%ROWTYPE;
  5   v_rowcount NUMBER DEFAULT 0;
  6  BEGIN
  7   dbms_output.put_line( 'Start time: ' || to_char(sysdate,'hh24:mi:ss') );
  8   OPEN v_cur;
  9   dbms_output.put_line( 'After OPEN: ' || to_char(sysdate,'hh24:mi:ss') );
 10  
 11   LOOP
 12    FETCH v_cur INTO v_rec;
 13    IF v_cur%NOTFOUND THEN
 14     EXIT;
 15    END IF;
 16    v_rowcount := v_rowcount + 1;
 17   END LOOP;
 18  
 19   dbms_output.put_line( 'After FETCHes: ' || to_char(sysdate,'hh24:mi:ss') );
 20  
 21   CLOSE v_cur;
 22  
 23   dbms_output.put_line( 'After CLOSE: ' || to_char(sysdate,'hh24:mi:ss') );
 24   dbms_output.put_line( 'Rows retrieved: ' || to_char( v_rowcount ) );
 25  
 26  END;
 27  /
Start time: 08:56:14
After OPEN: 08:56:14
After FETCHes: 08:56:46
After CLOSE: 08:56:46
Rows retrieved: 52517

PL/SQL procedure successfully completed.

In this case, Oracle identified the rows at the OPEN, but didn't retrieve them. The DELETE/COMMIT removed them from the database. Yet the FETCH is able to retrieve rows that no longer exist. The timings highlight where the work is being done. There's no way OPEN could move 50,000+ rows in about a second (not on my system, anyway).

To quote from the Oracle docs: "Rows in the result set are not retrieved when the OPEN statement is executed. Rather, the FETCH statement retrieves the rows."

Even though it fetches them after they've been deleted, it still fetches them as of
their contents before the delete.  Oracle picks a moment in time for the SELECT
query, and all rows are returned with the contents they had at that time.

A separate query run just after the commit will show that all the rows are deleted.

Sebastiano's correct.  As to how it gets them: the same way it would get them for a concurrent session that started a SELECT before your DELETE/COMMIT statements - the rollback segment.

This is where the complexity lies in implementing an overwriting storage manager, and why there's a certain appeal in the non-overwriting storage manager used by PostgreSQL (offset by the need to reclaim space).

As to how PG implements transaction semantics, each tuple is tagged with transaction id information that identifies the transaction "window" (my term) within which the tuple is visible.  The transaction id is bumped for each new transaction.  Deleting a tuple involves setting its "window" such that the tuple is no longer visible after the transaction zapping it.  If you think about this implementation you can see why high concurrency and row-level locking was fairly easy to implement on top of it.
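
You can actually peek at this bookkeeping from SQL: PostgreSQL exposes the transaction ids that bound each tuple's visibility window as the system columns xmin and xmax. A quick look, with a made-up table name:

  -- xmin: id of the transaction that inserted this version of the row
  -- xmax: id of the transaction that deleted it (0 while the row is live)
  SELECT xmin, xmax, * FROM mytable;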

VACUUM runs through the table looking for "dead" tuples, i.e. tuples that can't be seen by any currently running or future transaction.  In  older PG (and PG 7.2's VACUUM FULL command) the table's compacted and unneeded blocks thrown away.  PG 7.2's new, lighter-weight VACUUM just frees up space within blocks rather than compact the entire table, and maintains a free list for each table which allows it to reuse the space when future tuples are inserted.  It only throws away totally empty blocks at the end of the table and only if no other backend has locked the block.  At least that's my understanding, I haven't studied it in detail.
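
In PG 7.2 terms the choices look roughly like this (mytable is a placeholder):

  VACUUM mytable;          -- new lightweight form: frees space within blocks for reuse
  VACUUM FULL mytable;     -- old-style behavior: compacts the table, discards empty blocks
  VACUUM ANALYZE mytable;  -- also refreshes the planner's statistics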

David, your conclusion is my point. Even though the rows are gone, the cursor is still able to get the rows.

Taking my previous example as a starting point, I broke it up into two separate sessions. (Session 1 and Session 2 are completely different connections to the database. Oracle sees them as two concurrent users.)


--================
-- Session 1
--================

SQL> CREATE TABLE t AS SELECT table_name FROM user_tables
  2  WHERE table_name like 'DR$DOCUMENT%';

Table created.

SQL> 
SQL> SELECT COUNT(*) FROM t;

  COUNT(*)
----------
        12


--================
-- Session 2
--================

SQL> variable x refcursor
SQL> begin
  2  open :x for select * from t;
  3  end ;
  4  /

PL/SQL procedure successfully completed.


--================
-- Session 1
--================

SQL> delete from t;

12 rows deleted.

SQL> commit;

Commit complete.

SQL> 


--================
-- Session 2
--================

SQL> print x

TABLE_NAME
--------------------------------
DR$DOCUMENT_BODY$I
DR$DOCUMENT_BODY$K
DR$DOCUMENT_BODY$N
DR$DOCUMENT_BODY$R
DR$DOCUMENT_SUBJECT$I
DR$DOCUMENT_SUBJECT$K
DR$DOCUMENT_SUBJECT$N
DR$DOCUMENT_SUBJECT$R
DR$DOCUMENT_URL$I
DR$DOCUMENT_URL$K
DR$DOCUMENT_URL$N

TABLE_NAME
--------------------------------
DR$DOCUMENT_URL$R

12 rows selected.

The above steps were performed in the order indicated. Databases that offer transaction control but don't perform some kind of multi-versioning (as Oracle does, and, as I'm starting to think, PG does) would not have allowed Session 1 to delete the rows, as the shared locks held by Session 2 would have prevented it. Alternatively (assuming a lower isolation level in the same non-versioning database), Session 1 might have been able to delete the rows, but then what would Session 2 see? For non-versioning databases, where would the data in the final action come from? The rows are gone (the delete having been committed), but non-versioning databases don't keep track of what the data looked like at the start of the query. (That is, they don't attempt to get things out of a rollback segment.) That's why they block in these situations at normal isolation levels (repeatable read in particular).

None of this really affects PG in any way. I posted my first comment to shed some light on Oracle's behavior. (Many people still work with an understanding of transactions and concurrency that is the textbook "shared locks block the acquisition of exclusive locks by other sessions" model.) The original code sample I posted is exactly the behavior expected of Oracle (and perhaps of PG, but no one's answered that question). But it is emphatically not the behavior one expects out of, say, Sybase (circa System X anyway). Most transactional databases still use non-versioning concurrency control. (PG increasingly looks to be the exception, a prospect I find exciting. Sadly, I've not found a PostgreSQL Concepts manual comparable to Oracle's.)

I'm looking forward to seeing how PG handles some of these samples. Don's right in that multi-versioning is what dramatically increases concurrency.

But at this point, I'm getting confused as to what issue, if any, there is with what I said. Sebastiano's tone is one of correction, but the statement "So it's normal that the cursor retrieves the rows from the table even if they've been deleted immediately after the open statement" doesn't apply to all databases universally. That fact is why I went on to explain Oracle's reasoning behind using the rollback segment in the first place.

Yes, PG implements multi-versioning, and the semantics are very similar to Oracle's though the implementation is very, very different.

It was the introduction of multi-versioning in PG 6.5 and the removal of fsynching the transaction log on read-only transactions that in combination convinced me that PG was usable in reasonably busy web/db environments (I'd rejected 6.4 and earlier versions as being unusable in practice).

As far as what was "wrong" with your statement(s), I think you misunderstood the meaning of Sebastiano's statement.  The cursor query is indeed executed when you open it, not when you fetch from it, and your timings showing where the "work" is done don't counter that fact.

This isn't necessarily a high-cost step in Oracle because it is essentially just marking a transaction ID of some sort that is used to figure out what data to return when the fetch is executed.  I don't know exactly how Oracle does this internally (i.e. the actual form the transaction ID takes) but conceptually it is easy to understand.

Don, whether or not the query is executed, the rows have to be retrieved. If the rows are deleted, then the database must pull them from somewhere, as the rows identified at query "execution" haven't been retrieved at the OPEN. (Retrieval only happens at the FETCH.) The whole point of my original statement is that the database must either refuse the delete or invalidate the cursor if no versioning is supported. Oracle's implementation is a versioning one built around its use of rollback segments.

My final point here is that the behavior I've demonstrated is in fact only possible if a versioning database is used. I'm glad to hear that PG is doing versioning, albeit with a different (and possibly better) approach.

My final point here is that the behavior I've demonstrated is in fact only possible if a versioning database is used.

Not at all. The straightforward implementation is to materialize the rowset generated by the query at cursor OPEN time, then FETCH rows from that. In fact the semantics of CURSOR, OPEN and FETCH demand that the implementation act as though this is true (A Guide to the SQL Standard, Date & Darwen).

So Oracle's implementation - "lazy evaluation" of CURSORs, which delays the work until FETCH time - is simply an optimization. The advantage of Oracle's approach is that you don't pay the cost of fetching the full rowset associated with the CURSOR query unless you actually use all the rows. If you just FETCH a few rows then close the cursor, Oracle's approach saves you a lot of machine cycles.

And it will save a lot of RAM, too, if the query returns a lot of rows since the rowset doesn't need to be generated and stored in the db server, then FETCH'd a row at a time into your local variables.

PG implements MVCC but does not implement "lazy evaluation" of CURSOR select statements. This is one area where Oracle is superior to PostgreSQL, though it's an area that's not of much importance to most web/db hackers. Cursors in the toolkit are used to traverse the entire rowset in every case I've looked at. In PostgreSQL, the "right" way to get just a few rows from a query is to use the non-standard OFFSET and LIMIT clauses rather than manipulate a cursor.
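
For example, something along these lines is the usual PG idiom (the table and column names are made up for illustration):

  -- grab rows 21-30 of the result, no cursor needed
  SELECT name FROM accounts ORDER BY name LIMIT 10 OFFSET 20;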

I think Don's right about cursors being used to get the whole rowset, but there was one pl/sql function I ported in the CR that used two cursors to walk up and down the folder structure to compute a relative path.  That was the only function in the openacs-4 kernel where having a cursor implementation in pg would have been nice - implementing it without cursors was a royal PIA.  In general the use of cursors in openacs-4 turned out to be a problem because most of the packages that used them forgot to close them, resulting in leaked cursors.
PG 7.2 has explicit cursors ... another reason to switch!
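
For reference, a minimal PL/pgSQL sketch of what the 7.2 cursor support looks like (the function and table names are made up, and this assumes plpgsql is installed in the database):

  CREATE FUNCTION count_my_tables() RETURNS integer AS '
  DECLARE
      cur CURSOR FOR SELECT table_name FROM my_tables;
      rec RECORD;
      n   integer := 0;
  BEGIN
      OPEN cur;
      LOOP
          FETCH cur INTO rec;
          EXIT WHEN NOT FOUND;
          n := n + 1;
      END LOOP;
      CLOSE cur;  -- close it, or you leak the cursor
      RETURN n;
  END;
  ' LANGUAGE 'plpgsql';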

I forgot about all the leaked cursors ... I wonder how well that's been cleaned up?

At one point, I went through all of the oracle packages in the kernel and fixed all of the cursors that were left open.  I think we had a bboard discussion about this, so I assume that the same was done for all of the packages outside of the kernel, but I'm not sure.
But at this point, I'm getting confused as to what issue, if any, there is with what I said. Sebastiano's tone is one of correction, but the statement "So it's normal that the cursor retrieves the rows from the table even if they've been deleted immediately after the open statement" doesn't apply to all databases universally.
Daryl,
it's possible that, English not being my native language, I injected into my words a tone and a meaning that I didn't really want to convey. From reading your previous post, I got the impression that you were surprised by Oracle's behavior, and I wanted to explain that what you saw is normal in Oracle. Sorry for any misunderstanding.

You are perfectly right, of course, in saying that not all databases behave this way; I use Oracle every day, so I'm familiar with it, and I have used InterBase in the past, but apart from these two I don't have experience with other DBMSs. And, come to think of it, I don't really have a clue about how InterBase behaves in this kind of situation, either.
InterBase has a storage manager that's conceptually similar to PostgreSQL's, and implements multi-versioning and row-level granularity for locks.

I don't know if they do "lazy evaluation" of cursors or not, but regardless, as I mentioned above, SQL92-compliant RDBMS implementations must behave as in Daryl's example. The standard leaves no wiggle-room on this one. The easiest way to implement the standard semantics is to just grab the rowset at cursor OPEN time - which is what PG does.

Don, do you know how PG behaves if rows change while it is retrieving them during the OPEN? For example, say I OPEN a query like SELECT name, balance FROM accounts while someone takes $500 from the balance on row 1 (after I read it) and adds it to the balance on row 5000000 (before I read it). Will PG see the change on row 5000000? Will it lock others out until it's done getting all 5000000 rows?