In response to your last point about Word documents, Don: swish++ comes with a program called
extract, which can be used to help index
binary document types such as Word documents. From the README file:
6. Index non-text files such as Microsoft Office documents
A separate text-extraction utility "extract" is included to
assist in indexing non-text files. It is essentially a
more sophisticated version of the Unix strings(1) command,
but employs the same word-determination heuristics used for
indexing.
It's not the most elegant solution, but it seems workable to me. I would think you could also use something like antiword to convert a Word document to text, and then process the result with the normal text indexing features.
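For what it's worth, here is a rough sketch of the antiword route in Python. It assumes antiword is installed and on your PATH, and relies only on the fact that `antiword file.doc` prints the document's text to standard output. The function name `convert_word_docs` and the directory layout are made up for illustration; nothing here comes from swish++ itself.

```python
import subprocess
from pathlib import Path


def convert_word_docs(src_dir, out_dir, converter=("antiword",)):
    """Run `converter <file>` on every .doc file under src_dir and save
    whatever it prints to stdout as a .txt file under out_dir.

    `converter` defaults to antiword (assumed to be installed); it is a
    parameter only so that another text extractor can be swapped in.
    """
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    written = []
    # Note: the "*.doc" match is case-sensitive on most Unix systems.
    for doc in sorted(src_dir.rglob("*.doc")):
        # Mirror the source tree so file names stay recognisable.
        target = (out_dir / doc.relative_to(src_dir)).with_suffix(".txt")
        target.parent.mkdir(parents=True, exist_ok=True)
        result = subprocess.run([*converter, str(doc)], capture_output=True)
        if result.returncode != 0:
            continue  # skip files the converter could not read
        target.write_bytes(result.stdout)
        written.append(target)
    return written
```

You would call something like `convert_word_docs("docs", "docs_txt")` and then point the normal swish++ indexer at the resulting directory of text files, exactly as you would for any other plain-text collection.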