Forum OpenACS Q&A: DB Smarties Question :) - Inverted Index / Concordance of Unique Words in a Field
Basically I have been asked to implement a spell checker - I was planning on just generating a list of unique words, and then visually inspecting the list for variances, since this is more of a controlled vocabulary field than anything else.
Procedurally it is easy - is anyone aware of a silver bullet in the form of a query or stored procedure that would accomplish this? Curiousity demands I ask ..
Obvious extensions to this would be - count of unique words, index into rows containing those words, etc. I AM NOT asking about those solutions, but this thread would be a good place to discuss them since inverted index and concordance both drew blanks when I did a search :)
check section 19.6.1 of the PG 7.3 manual (Procedural Languages / PLPGSQL / Control Structures)
Given a field in a table that has multiple words separated by spaces, you'd create a stored procedure that enumerates that field in each row and splits it's contents based on whitespace. The procedure would insert each word component into a ?cursor? or temporary table. Then maybe you could run a 'group by' select call to get a distinguished list of words with their respective occurrences; e.g.,
Select Count(*), word, from my_temp group by word;
I have better knowledge of Interbase procs and can visualize a solution. I'm not that strong with respect to PostgreSQL.
In terms of the split function, PostgreSQL may have a default function to handle it:
Or, you can create your own and bind it to the pg service.
I wouldn't mind a spellchekker utility within this forum.
I am working on a spell-checker for ad_form that uses ispell/aspell, and have uploaded a (more or less working) patch against 4.7: http://openacs.org/bugtracker/openacs/bug?bug%5fnumber=165
Maybe this is something you could use/improve. I have made a few improvements in the handling of HTML and will upload a new patch in the near future.
This version, however, does not work with the form-builder alone - it needs ad_form, but I have plans on investigating how it could be refactored and implemented as a datatype/widget combo (making it a cleaner implementation and making it available without the need for ad_form).
If you want to help out with this, let me know.