Search · Index

VI.3 Data modeling

Data modeling is the hardest and most important activity in the RDBMS world. If you get the data model wrong, your application might not do what users need, it might be unreliable, it might fill up the database with garbage. Why then do we start a SQL tutorial with the most challenging part of the job? Because you can't do queries, inserts, and updates until you've defined some tables. And defining tables is data modeling.

When data modeling, you are telling the RDBMS the following:

  • what elements of the data you will store
  • how large each element can be
  • what kind of information each element can contain
  • what elements may be left blank
  • which elements are constrained to a fixed range
  • whether and how various tables are to be linked

Three-Valued Logic

Programmers in most computer languages are familiar with Boolean logic. A variable may be either true or false. Pervading SQL, however, is the alien idea of three-valued logic. A column can be true, false, or NULL. When building the data model you must affirmatively decide whether a NULL value will be permitted for a column and, if so, what it means.

For example, consider a table for recording user-submitted comments to a Web site. The publisher has made the following stipulations:

  • comments won't go live until approved by an editor
  • the admin pages will present editors with all comments that are pending approval, i.e., have been submitted but neither approved nor disapproved by an editor already
Here's the data model:

create table user_submitted_comments (
comment_id integer primary key,
user_id not null references users,
submission_time date default sysdate not null,
ip_address varchar(50) not null,
content clob,
approved_p char(1) check(approved_p in ('t','f'))
Implicit in this model is the assumption that approved_p can be NULL and that, if not explicitly set during the INSERT, that is what it will default to. What about the check constraint? It would seem to restrict approved_p to values of "t" or "f". NULL, however, is a special value and if we wanted to prevent approved_p from taking on NULL we'd have to add an explicit not null constraint.

How do NULLs work with queries? Let's fill user_submitted_comments with some sample data and see:

insert into user_submitted_comments
(comment_id, user_id, ip_address, content)
(1, 23069, '', 'This article reminds me of Hemingway');

Table created.

SQL> select first_names, last_name, content, user_submitted_comments.approved_p
from user_submitted_comments, users
where user_submitted_comments.user_id = users.user_id;

------------ --------------- ------------------------------------ ------------
Philip Greenspun This article reminds me of Hemingway
We've successfully JOINed the user_submitted_comments and users table to get both the comment content and the name of the user who submitted it. Notice that in the select list we had to explicitly request user_submitted_comments.approved_p. This is because the users table also has an approved_p column.

When we inserted the comment row we did not specify a value for the approved_p column. Thus we expect that the value would be NULL and in fact that's what it seems to be. Oracle's SQL*Plus application indicates a NULL value with white space.

For the administration page, we'll want to show only those comments where the approved_p column is NULL:

SQL> select first_names, last_name, content, user_submitted_comments.approved_p
from user_submitted_comments, users
where user_submitted_comments.user_id = users.user_id
and user_submitted_comments.approved_p = NULL;

no rows selected
"No rows selected"? That's odd. We know for a fact that we have one row in the comments table and that is approved_p column is set to NULL. How to debug the query? The first thing to do is simplify by removing the JOIN:

SQL> select * from user_submitted_comments where approved_p = NULL;

no rows selected
What is happening here is that any expression involving NULL evaluates to NULL, including one that effectively looks like "NULL = NULL". The WHERE clause is looking for expressions that evaluate to true. What you need to use is the special test IS NULL:

SQL> select * from user_submitted_comments where approved_p is NULL;

---------- ---------- ------------ ----------
------------------------------------ ------------
1 23069 2000-05-27
This article reminds me of Hemingway
An adage among SQL programmers is that the only time you can use "= NULL" is in an UPDATE statement (to set a column's value to NULL). It never makes sense to use "= NULL" in a WHERE clause.

The bottom line is that as a data modeler you will have to decide which columns can be NULL and what that value will mean.

Back to the Mailing List

Let's return to the mailing list data model from the introduction:

create table mailing_list (
email varchar(100) not null primary key,
name varchar(100)

create table phone_numbers (
email varchar(100) not null references mailing_list,
number_type varchar(15) check (number_type in ('work','home','cell','beeper')),
phone_number varchar(20) not null
This data model locks you into some realities:
  • You will not be sending out any physical New Year's cards to folks on your mailing list; you don't have any way to store their addresses.
  • You will not be sending out any electronic mail to folks who work at companies with elaborate Lotus Notes configurations; sometimes Lotus Notes results in email addresses that are longer than 100 characters.
  • You are running the risk of filling the database with garbage since you have not constrained phone numbers in any way. American users could add or delete digits by mistake. International users could mistype country codes.
  • You are running the risk of not being able to serve rich people because the number_type column may be too constrained. Suppose William H. Gates the Third wishes to record some extra phone numbers with types of "boat", "ranch", "island", and "private_jet". The check (number_type in ('work','home','cell','beeper')) statement prevents Mr. Gates from doing this.
  • You run the risk of having records in the database for people whose name you don't know, since the name column of mailing_list is free to be NULL.
  • Changing a user's email address won't be the simplest possible operation. You're using email as a key in two tables and therefore will have to update both tables. The references mailing_list keeps you from making the mistake of only updating mailing_list and leaving orphaned rows in phone_numbers. But if users changed their email addresses frequently, you might not want to do things this way.
  • Since you've no provision for storing a password or any other means of authentication, if you allow users to update their information, you run a minor risk of allowing a malicious change. (The risk isn't as great as it seems because you probably won't be publishing the complete mailing list; an attacker would have to guess the names of people on your mailing list.)

These aren't necessarily bad realities in which to be locked. However, a good data modeler recognizes that every line of code in the .sql file has profound implications for the Web service.

To get some more information on how a simple datamodel for a Discussion Forum can evolve, read en:sql-wn-data_modeling-philip

Representing Web Site Core Content

Free-for-all Internet discussions can often be useful and occasionally are compelling, but the anchor of a good Web site is usually a set of carefully authored extended documents. Historically these have tended to be stored in the Unix file system and they don't change too often. Hence I refer to them as static pages. Examples of static pages on the photo.net server include this book chapter, the tutorial on light for photographers at http://www.photo.net/making-photographs/light.

We have some big goals to consider. We want the data in the database to

  • help community experts figure out which articles need revision and which new articles would be most valued by the community at large.
  • help contributors work together on a draft article or a new version of an old article.
  • collect and organize reader comments and discussion, both for presentation to other readers but also to assist authors in keeping content up-to-date.
  • collect and organize reader-submitted suggestions of related content out on the wider Internet (i.e., links).
  • help point readers to new or new-to-them content that might interest them, based on what they've read before or based on what kind of content they've said is interesting.
The big goals lead to some more concrete objectives:
  • We will need a table that holds the static pages themselves.
  • Since there are potentially many comments per page, we need a separate table to hold the user-submitted comments.
  • Since there are potentially many related links per page, we need a separate table to hold the user-submitted links.
  • Since there are potentially many authors for one page, we need a separate table to register the author-page many-to-one relation.
  • Considering the "help point readers to stuff that will interest them" objective, it seems that we need to store the category or categories under which a page falls. Since there are potentially many categories for one page, we need a separate table to hold the mapping between pages and categories.

create table static_pages (
page_id integer not null primary key,
url_stub varchar(400) not null unique,
original_author integer references users(user_id),
page_title varchar(4000),
page_body clob,
obsolete_p char(1) default 'f' check (obsolete_p in ('t','f')),
members_only_p char(1) default 'f' check (members_only_p in ('t','f')),
price number,
copyright_info varchar(4000),
accept_comments_p char(1) default 't' check (accept_comments_p in ('t','f')),
accept_links_p char(1) default 't' check (accept_links_p in ('t','f')),
last_updated date,
-- used to prevent minor changes from looking like new content
publish_date date

create table static_page_authors (
page_id integer not null references static_pages,
user_id integer not null references users,
notify_p char(1) default 't' check (notify_p in ('t','f')),

Note that we use a generated integer page_id key for this table. We could key the table by the url_stub (filename), but that would make it very difficult to reorganize files in the Unix file system (something that should actually happen very seldom on a Web server; it breaks links from foreign sites).

How to generate these unique integer keys when you have to insert a new row into static_pages? You could

  • lock the table
  • find the maximum page_id so far
  • add one to create a new unique page_id
  • insert the row
  • commit the transaction (releases the table lock)
Much better is to use Oracle's built-in sequence generation facility:

create sequence page_id_sequence start with 1;
Then we can get new page IDs by using page_id_sequence.nextval in INSERT statements (see the Transactions chapter for a fuller discussion of sequences).


Here is a summary of the data modeling tools available to you in Oracle, each hyperlinked to the Oracle documentation. This reference section covers the following:


based on  SQL for Web Nerds