Data Modeling Considerations

Overview

This discussion reflects the ongoing evolution of ideas developed through comments here, email and chat discussions among a group within the OpenACS community interested in this project. It draws from stuff written by Malte, Timo, and Kolja here. It incorporates stuff from chats logged 2003-11-26 (and more to come -- next one scheduled for 2 Dec).

First is a pseudo-quasi-ERD/UML graphic of the overall structure of the Assessment package datamodel in its current state. This model began with the current "complex survey" datamodel. Second is a discussion of some general nomenclature issues, and third are more specific topics regarding the datamodel.

Draft Pseudo-quasi-ERD/UML

Well this looks rather confusing, as DB schemas always do, and it also isn't a real Peter Chen-style ERD but rather a quasi-UML-ERD hybrid (my apologies to the offended purists out there). Still, this notation rather more compactly tries to display the whole thing in one place. Refer to the following Discussion section for elaboration of specific issues raised in this model. The OmniGraffle file from which it was created is available here.

Here's a link to the current schema.

Nomenclature

Quite a few different terms have been used (survey, questionnaire, exam, study, etc). To make this as generic as we can (so presumably the package could be used to deploy polls, incorporated into ecommerce packages, etc), here are some suggested terms:

Prefix all tables in the package with "as_". Currently "survey_" prepends everything, and this uses up too many of the 32 char limit in Postgres identifier names (though I think PG 7.3+ increases this to 64 so that's not as big a deal, since presumably this Assessment package will require PG 7.3+).
Use "assessments" instead of "surveys" as the basic container in this package. This is just a name change, partly to help mark the evolution from the prior packages, and partly to reflect the more general purpose of this package than simply a "survey" which connote a particular type of assessment (and thus helping avoid confusion such as the difference between a "test" and a "survey" etc).
Use "items" instead of "questions" as the basic unit of a form. There a couple of reasons: one is the trivial and probably specious observation that many items literally are not questions (no question mark is used in the imperative: "Select your gender:"). The more important reason is that this term is used by some standards like CDISC. Since we're probably not going to adopt any standard's nomenclature in whole, this may be an unimportant justification. But items seems like a more generic and agreeable term to me.
Earlier I had proposed a four-level containment hierarchy, with a "forms" construct thus: Assessment->Form->Section->Item, but we've decided that is unnecessary and unwieldy. The basic containment hiearchy then is the same as it was in survey: Assessment->Section->Item; all that has changed is some naming: Assessment for Survey; Item for Question. Of course, it will be possible to present different naming schemes in the UIs for creating Assessments as well as deploying them to users; this can be handled via the internationalization tools.
Distinguish between metadata entities and collected data entities. This has always been implicit in the generations of this package: "questions" that define how the question is rendered in the form, and "question_responses" that store the data returned by the user from the form. For most of the as_form elements, there are parallel entities: one that defines the "blank fields" of a blank form, and one that represents the data that the user fills into those blank fields. The CDISC standard maintains this distinction by the use "_def" suffixes for the former and "_data" suffixes for the latter. This gets a bit cumbersome, so here we're suggesting just using the "_data" suffixes for "real" study data and leaving implicit the fact that similar entities without a suffix are in fact the metadata entities.

Discussion

Here are a variety of questions/issues:

Version management for all Assessment components
The requirements for this package stipulate that there be flexible mechanisms for reuse of all elements of the Assessment component hierarchy. It should be possible to include an entire Assessment as a section in another Assessment; a Section from one Assessment in a different Assessment; and an Item in n different Sections of m different Assessments.
If each time components were reused in this fashion, a new "copy" of each were added to the system, version control would be a moot issue since there would only be single versions of each component. (That is, if a question with item_id = 42 is "What is your gender: Male | Female", and we want to ask this question in a new Assessment, we could retype the whole question or insert a copy, either way getting a new question with item_id = 4242, say. There are now single versions of two questions that are identical semantically but completely distinct in the DB.)
However, the requirements for this package stipulate that it should be possible to aggregate user responses to a given item even if that item is asked in multiple assessments. Further, if an item is altered (say, by adding or removing choices) or a section is altered (say by adding or removing an item), it still should be possible to aggregate user responses to this assessment even if these alterations are performed during data collection, so that the first user responses reflect the initial state of the assessment, while subsequent user responses reflect the altered state of the assessment.
By using the Content Repository (by creating all these components as subtypes of cr_revisions), we get some automagic versioning help. By using the item_revision_id and section_revision_id keys in the as_section_item_map table and inserting a new row each time we create a new version of either the item or the section, we should be able to maintain a consistent record of the state of the entire assessment.
Put another way, we need to be able to generate easily and efficiently the entire component tree of an Assessment through all versions of that Assessment. How do we accomplish this? Basically we need to map every version of every component to the version of the Assessment that contains it. Put another way, we need to associate the Assessment's revision_id with each Section/Item/Item_data etc revision_id.
Originally I had suggested that a single flag (mdv_id for metadataversion id) be stored in each row of each of the datamodel entities as well as the collected data tables. This approach follows the CDISC standard in a direct fashion, but does not leverage the Content Repository model in OpenACS. This "metadataversion id" really just captures what an Assessment's "revision_id" would capture when the Assessment is a cr_revision.
The cr_revision table doesn't have a column for a container object's revision_id, however. Either we'd need to subtype this table to create a common type for all other object tables (ie for Sections, Items, etc). Or else we could accomplish this with a simple mapping table having at least two columns: assessment_revision_id and component_revision_id. This latter approach seems far superior, and I've added it to the datamodel graphic. Does this approach make sense?
One other issue that remains unclear to me is whether the CR's assumption of only a single "live" revision will cause problems here. Since different users may want to deploy different versions of a single Assessment, this package needs to accommodate in effect multiple "live" revisions. Does the CR prevent this?
Object types for these entities
We've clearly established at this point that most of these entities should be in the content repository (CR). Given this, we'll need to determine fairly early on what performance issues are there and how do they impact how we write our code? It's going to be crucial that a "get_assessment_info" procedures return quickly, since they'll get called when the assessment is "written to html" when a user requests the assessment, and also when the user submits data that needs to be checked against validation criteria, etc. It is essential that the user experience be snappy here with reasonable hardware.
One other issue is whether the various mapping tables (that we use rather extensively here) can be plain old mapping tables or whether there is value in using the acs_rels style of mapping. Since acs_rels are themselves acs_objects, we could use them to control permissions, but we don't think this is necessary since access control happens elsewhere and the performance slowdown (even a minor one) wouldn't provide any benefit. So the decision is: plain old mapping tables.
Data validation and Navigation Control
In our current implementation of a "questionnaire" package, we placed range-checking code in the question entities, permitting quick and easy but limited types of checks that can be done. In "complex survey", the basic section and branch constructs use responses to single choice selections as the hinge point for branching. How "navigation control" interacts with "data validation" to define "sequencing" is probably the biggest, most complex aspect of this Assessment package; we take that on here.
Annotations, signatures, audit records
These are entities that have particular importance in clincial trial areas, and presumably other areas like finance.
- The as_annotations table could be where teachers can append grading comments to students' exams, also. It is basically a "comments" table and could actually be implemented by a generic comment facility if one existed (similar to general_comments in 3.2.5).
- The as_signatures table is where hashes of submitted user data could be stored for "nonreputability" purposes.
- The as_audit_records table could be where a record (including some reason for the action) could be made of any user update of any data item. Note that the as_audit_record table is not an audit table of the type that ecommerce used extensively in 3.2.5. Rather it is a table that captures timestamps, reasons, and other information when a revision is made. In other words, it is a helper table that augments cr_revisions (though since cr_revisions already has title, description and publish_date columns, there actually may be no need for this separate as_audit_record table; this will depend on how much stuff needs to be recorded about each revision).
Question types
In addition to the various question types identified here and here, we need a format that allows a user to add free text as an alternative to a set of choices (radio buttons or a select). This is most commonly needed when the question poses a list of alternatives plus a final "other" choice, which if selected, requires the user to type into the textbox. When this is built into a select, this is often called a "combo-box" (maybe this term is also used in the radio button case as well; I'm not sure).
Obviously, this case differs from other question formats because the user submits two responses (the choice plus the text) and not just one. However, it appears that the combination of attributes in the current model provide adequate expressiveness as long as "combo_box" is added to the constraints for presentation_type.