Forum OpenACS Development: complex survey datamodel suggestion/question

Hello folks,

My partner and I are doing some hacking on Malte's excellent survey package to have it check answers for our chemistry students.

So here's some commentary from a chemist - it may generalize to what's needed for an assessment package that serves science students:

1- storing numbers as numbers in postgresql doesn't seem to work as I need it to.  If the student enters 2.00e4 as the answer, I don't want it stored as 20000, because that implies a different degree of precision.  Instead, it needs to be stored as 2.00e4.  ("Significant figures" are a big deal.)  It also makes more sense to store scientific notation as entered so that the student can edit his/her response without converting to float notation first.

2- The datamodel needs to be able to handle really big (and really small) numbers.  If the correct answer is 6.02e23, I need to be able to do some math with that number, for instance, to be able to see if the student's answer is within 1%.  I can do that if the value in the db is the string 6.02e23 (tcl understands exponential notation).  I can't do it if the database has 60200000000000000...0.

I've addressed my issue by switching the 'numeric_answer' column to a varchar in my hacked survey.  That preserves strings as entered and prevents postgresql from eating scientific notation or messing up tcl's ability to parse big/small numbers.

If anyone working on assessment wants them, I can contribute some procs for parsing numeric answers for correct significant digits.

Collapse
Posted by Stan Kaufman on
Cathy, good point. We already had separate integer and numeric data types in the metadata model for as_item_types; I've added exponential also; see https://openacs.org/projects/openacs/packages/assessment/design/as_items#item_types

In the table where we store returned data (as_item_data -- see https://openacs.org/projects/openacs/packages/assessment/design/data), we use a single numeric_answer column to stuff the actual values for integer and numeric data, but we know from the corresponding as_item_types value which type it really is. Similarly, we can stuff an "exponential" value in the varchar_answer column of as_item_data but we'll know that it's a "number string" from as_item_types.

Thanks for your offer of the procs. Hopefully we'll start cranking out code for Assessment soon.

Collapse
Posted by Malte Sussdorff on
In some other systems which had to deal with dynamic data, all information has been stored in one "value" column of type text. Wouldn't this be the better solution and have the TCL API make sure that one valid values get stored in the column?

Rephrased: Is there a compelling need for the database to make sure of integrity, thereby adding additional columns to the table and confusing things, instead of having the API make sure that only items that have been checked for validity (pending on the item_type as Stan pointed out) are stored.

Collapse
Posted by Stan Kaufman on
Malte, here's a reason to put integers into an integer column, and real numbers into a numeric column: all the db's built-in aggregate functions (avg, stddev, sum, variance, etc) and mathematical functions and operators (add, substract, etc etc) won't work. If we do this, all data extraction procs for data in the Assessment system will have to select data, drag it back into tcl, and do the math there. This violates the "push everything back into the database that can be pushed there" principle.

This would be exactly like relying on the tcl to enforce uniqueness or any other kind of constraint. Sure, it could be done, but that's not considered best practice as far as I understand.

What would be the benefit of storing everything (numbers, strings, etc) as a string and then figuring out after you get the data what it is? Surely the amount of disk space saved by not having empty integer and numeric columns is trivial. And we need the other kinds of *_answer columns for holding references to cr_revisions (content_answer) etc etc anyway.

Seems to me that rather than pushing us to conflate all data into a single omnibus string type, Cathy's requirement just means we should add something to handle the exponential data she mentions. And in PG, it looks to me as if the "real" or "double" data type should work fine, in which case we should use it instead of the hackish "exponential type (stored as a varchar)" I suggested.

From http://www.postgresql.org/docs/7.3/static/datatype.html#DATATYPE-FLOAT :

"Normally, the real type has a range of at least -1E+37 to +1E+37 with a precision of at least 6 decimal digits. The double precision type normally has a range of around -1E+308 to +1E+308 with a precision of at least 15 digits. Values that are too large or too small will cause an error. Rounding may take place if the precision of an input number is too high. Numbers too close to zero that are not representable as distinct from zero will cause an underflow error."

Cathy, am I wrong here? Can't "real" or "double" work in your situation?

Dunno what Oracle-equivalent there is, but surely there is one. (Though since Oracle still doesn't even have a boolean, maybe not.)

Collapse
Posted by Cathy Sarisky on
Real or double can satisfactorily hold the student's answer, provided we don't care about being able to echo back the student's answer exactly as typed.  If, however, we want to know whether the student typed 6.02x10<sup>23</sup> or 6.020x10<sup>23</sup>, stuffing it into either type of column will result in it being converted to 602000000000000000000000.  And, in the event that we want to do some math on it in tcl despite plans to push everything to the database:

% set big 6.02e23
6.02e23
% set answer 6.01e23
6.01e23
% expr $answer/$big
0.998338870432

% set big 6020000000000000000000
6020000000000000000000
% set answer $big
6020000000000000000000
% expr $big/$answer
integer value too large to represent

Collapse
Posted by Guan Yang on
Use Mpexpr? 😉
Collapse
Posted by Cathy Sarisky on
*blush* ok, ignore the tclsh stuff i just posted.  *IF* I write the numbers as non-integers, I can still do math...
Collapse
Posted by Malte Sussdorff on
Stan, this is a very good reason. I had this argument with some friends and needed canon fooder which you brilliantly gave. I still think that most of the checking should be done by the check_item functions, as this is the only way to give a nice error message back (or am I mistaken and postgres throws up with a reason why it did not like an insert?).
Collapse
Posted by Stan Kaufman on
Cathy, it appears to me that your example question really doesn't ask for a "number" from the student but a "string representation of a number". If I were setting up such an as_item, I'd set the data_type as varchar and then set an as_item_check to specify the user response to be "string eq $user_response $correct_answer" (or something like that -- we haven't actually worked this out yet). I wouldn't even handle this situation as a real number since the point of the question is whether the student types in the correct number of significant digits and order of magnitude in the right format.

In contrast, for data-driven apps like I need, the point is to collect various kinds of numbers upon which we can do various kinds of statistics. Of course, all that can be done in SPSS or SAS once a bulk data extraction is accomplished, but eventually it would be great to build in R via PG so that lots of cool analytic and graphing stuff can be done within the Assessment package. If we munge everything that Assessment collects into strings, then this becomes impossible. And in the near term, the ability to be able to do means, stddevs, and other simple descriptive statistics on the fly is an absolute requirement for the Assessment package. For me anyway. 😉

Collapse
Posted by Dave Bauer on
I think the best solution would be to define a scientific notation form datatype to do the checking. Then you could store it in the database however the database liked in a number column, and transform it on display if necessary.

I guess this sort of breaks the "store what the user gave you in the database" concept we use for textareas.

Collapse
Posted by Stan Kaufman on
Malte, you're right that we need the tcl check to generate the user-friendly errors before we hit the db for inserts. The real benefit of strong-typing the data is when we are writing the procs that extract the data. Granted, this is more important for the "generic data collection tool" uses of Assessment than for the "academic exam" uses.