94.76%
Search · Index

VI.3.1 The Discussion Forum -- philg's personal odyssey

Back in 1995, I built a threaded discussion forum, described ad nauseum in http://philip.greenspun.com/wtr/dead-trees/53013.htm. Here's how I stored the postings:

create table bboard (
msg_id char(6) not null primary key,
refers_to char(6),
email varchar(200),
name varchar(200),
one_line varchar(700),
message clob,
notify char(1) default 'f' check (notify in ('t','f')),
posting_time date,
sort_key varchar(600)
);
German order reigns inside the system itself: messages are uniquely keyed with msg_id, refer to each other (i.e., say "I'm a response to msg X") with refers_to, and a thread can be displayed conveniently by using the sort_key column.

Italian chaos is permitted in the email and name columns; users could remain anonymous, masquerade as "president@whitehouse.gov" or give any name.

This seemed like a good idea when I built the system. I was concerned that it work reliably. I didn't care whether or not users put in bogus content; the admin pages made it really easy to remove such postings and, in any case, if someone had something interesting to say but needed to remain anonymous, why should the system reject their posting?

One hundred thousand postings later, as the moderator of the photo.net Q&A forum, I began to see the dimensions of my data modeling mistakes.

First, anonymous postings and fake email addresses didn't come from Microsoft employees revealing the dark truth about their evil bosses. They came from complete losers trying and failing to be funny or wishing to humiliate other readers. Some fake addresses came from people scared by the rising tide of spam email (not a serious problem back in 1995).

Second, I didn't realize how the combination of my email alert systems, fake email addresses, and Unix mailers would result in my personal mailbox filling up with messages that couldn't be delivered to "asdf@asdf.com" or "duh@duh.net".

Although the solution involved changing some Web scripts, fundamentally the fix was add a column to store the IP address from which a post was made:


alter table bboard add (originating_ip varchar(16));
Keeping these data enabled me to see that most of the anonymous posters were people who'd been using the forum for some time, typically from the same IP address. I just sent them mail and asked them to stop, explaining the problem with bounced email.

After four years of operating the photo.net community, it became apparent that we needed ways to

  • display site history for users who had changed their email addresses
  • discourage problem users from burdening the moderators and the community
  • carefully tie together user-contributed content in the various subsystems of photo.net
The solution was obvious to any experienced database nerd: a canonical users table and then content tables that reference it. Here's a simplified version of the data model, taken from a toolkit for building online communities, describe in http://philip.greenspun.com/panda/community:

create table users (
user_id integer not null primary key,
first_names varchar(100) not null,
last_name varchar(100) not null,
email varchar(100) not null unique,
..
);

create table bboard (
msg_id char(6) not null primary key,
refers_to char(6),
topic varchar(100) not null references bboard_topics,
category varchar(200), -- only used for categorized Q&A forums
originating_ip varchar(16), -- stored as string, separated by periods
user_id integer not null references users,
one_line varchar(700),
message clob,
-- html_p - is the message in html or not
html_p char(1) default 'f' check (html_p in ('t','f')),
...
);

create table classified_ads (
classified_ad_id integer not null primary key,
user_id integer not null references users,
...
);
Note that a contributor's name and email address no longer appear in the bboard table. That doesn't mean we don't know who posted a message. In fact, this data model can't even represent an anonymous posting: user_id integer not null references users requires that each posting be associated with a user ID and that there actually be a row in the users table with that ID.

First, let's talk about how much fun it is to move a live-on-the-Web 600,000 hit/day service from one data model to another. In this case, note that the original bboard data model had a single name column. The community system has separate columns for first and last names. A conversion script can easily split up "Joe Smith" but what is it to do with William Henry Gates III?

How do we copy over anonymous postings? Remember that Oracle is not flexible or intelligent. We said that we wanted every row in the bboard table to reference a row in the users table. Oracle will abort any transaction that would result in a violation of this integrity constraint. So we either have to drop all those anonymous postings (and any non-anonymous postings that refer to them) or we have to create a user called "Anonymous" and assign all the anonymous postings to that person. The technical term for this kind of solution is kludge.

A more difficult problem than anonymous postings is presented by long-time users who have difficulty typing and or keeping a job. Consider a user who has identified himself as

  1. Joe Smith; jsmith@ibm.com
  2. Jo Smith; jsmith@ibm.com (typo in name)
  3. Joseph Smith; jsmth@ibm.com (typo in email)
  4. Joe Smith; cantuseworkaddr@hotmail.com (new IBM policy)
  5. Joe Smith-Jones; joe_smithjones@hp.com (got married, changed name, changed jobs)
  6. Joe Smith-Jones; jsmith@somedivision.hp.com (valid but not canonical corporate email address)
  7. Josephina Smith; jsmith@somedivision.hp.com (sex change; divorce)
  8. Josephina Smith; josephina_smith@hp.com (new corporate address)
  9. Siddhartha Bodhisattva; josephina_smith@hp.com (change of philosophy)
  10. Siddhartha Bodhisattva; thinkwaitfast@hotmail.com (traveling for awhile to find enlightenment)
Contemporary community members all recognize these postings as coming from the same person but it would be very challenging even to build a good semi-automated means of merging postings from this person into one user record.

Once we've copied everything into this new normalized data model, notice that we can't dig ourselves into the same hole again. If a user has contributed 1000 postings, we don't have 1000 different records of that person's name and email address. If a user changes jobs, we need only update one column in one row in one table.

The html_p column in the new data model is worth mentioning. In 1995, I didn't understand the problems of user-submitted data. Some users will submit plain text, which seems simple, but in fact you can't just spit this out as HTML. If user A typed < or > characters, they might get swallowed by user B's Web browser. Does this matter? Consider that "<g>" is interpreted in various online circles as an abbreviation for "grin" but by Netscape Navigator as an unrecognized (and therefore ignore) HTML tag. Compare the meaning of

"We shouldn't think it unfair that Bill Gates has more wealth than the 100 million poorest Americans combined. After all, he invented the personal computer, the graphical user interface, and the Internet."
with
"We shouldn't think it unfair that Bill Gates has more wealth than the 100 million poorest Americans combined. After all, he invented the personal computer, the graphical user interface, and the Internet. <g>"

It would have been easy enough for me to make sure that such characters never got interpreted as markup. In fact, with AOLserver one can do it with a single call to the built-in procedure ns_quotehtml. However, consider the case where a nerd posts some HTML. Other users would then see

"For more examples of my brilliant thinking and modesty, check out <a href="http://philip.greenspun.com/">my home page</a>."
I discovered that the only real solution is to ask the user whether the submission is an HTML fragment or plain text, show the user an approval page where the content may be previewed, and then remember what the user told us in an html_p column in the database.

Is this data model perfect? Permanent? Absolutely. It will last for at least... Whoa! Wait a minute. I didn't know that Dave Clark was replacing his original Internet Protocol, which the world has been running since around 1980, with IPv6 (http://www.faqs.org/rfcs/rfc2460.html). In the near future, we'll have IP addresses that are 128 bits long. That's 16 bytes, each of which takes two hex characters to represent. So we need 32 characters plus at least 7 more for periods that separate the hex digits. We might also need a couple of characters in front to say "this is a hex representation". Thus our brand new data model in fact has a crippling deficiency. How easy is it to fix? In Oracle:


alter table bboard modify (originating_ip varchar(50));
You won't always get off this easy. Oracle won't let you shrink a column from a maximum of 50 characters to 16, even if no row has a value longer than 16 characters. Oracle also makes it tough to add a column that is constrained not null.