Forum OpenACS Q&A: DB Smarties Question :) - Inverted Index / Concordance of Unique Words in a Field

I tried googling and didn't find any results - maybe someone here has done this in the past - what is a good way of just generating a list of unique words (using whitespace as boundary, nothing fancier) in one field of a table?

Basically I have been asked to implement a spell checker - I was planning on just generating a list of unique words, and then visually inspecting the list for variances, since this is more of a controlled vocabulary field than anything else.

Procedurally it is easy - is anyone aware of a silver bullet in the form of a query or stored procedure that would accomplish this? Curiousity demands I ask ..

Obvious extensions to this would be - count of unique words, index into rows containing those words, etc. I AM NOT asking about those solutions, but this thread would be a good place to discuss them since inverted index and concordance both drew blanks when I did a search :)

PG 7.3 supports functions that return a rowset, so you can define a PL/PGSQL function that scans the input text and returns SETOF varchar and then join against that...

check section 19.6.1 of the PG 7.3 manual (Procedural Languages / PLPGSQL / Control Structures)

I'll take a stab at your question. I may be totally off base; though, here it is.

Given a field in a table that has multiple words separated by spaces, you'd create a stored procedure that enumerates that field in each row and splits it's contents based on whitespace. The procedure would insert each word component into a ?cursor? or temporary table. Then maybe you could run a 'group by' select call to get a distinguished list of words with their respective occurrences; e.g.,

Select Count(*), word, from my_temp group by word;

I have better knowledge of Interbase procs and can visualize a solution. I'm not that strong with respect to PostgreSQL.

In terms of the split function, PostgreSQL may have a default function to handle it:

Or, you can create your own and bind it to the pg service.

I wouldn't mind a spellchekker utility within this forum.


I am working on a spell-checker for ad_form that uses ispell/aspell, and have uploaded a (more or less working) patch against 4.7:

Maybe this is something you could use/improve. I have made a few improvements in the handling of HTML and will upload a new patch in the near future.

This version, however, does not work with the form-builder alone - it needs ad_form, but I have plans on investigating how it could be refactored and implemented as a datatype/widget combo (making it a cleaner implementation and making it available without the need for ad_form).

If you want to help out with this, let me know.