I need to make documents in the content repository searchable even when
they are not plain text, such as PDFs, HTML or Word documents.
According to Oleg from the OpenFTS mailing list there won't be such a
feature in OpenFTS, so I guess I'll have to add some code to the
OpenACS search package to call the necessary conversion programs from
the indexing procedure.
The obvious place to put this would be the search_content_filter
proc in search/tcl/search-procs.tcl:
ad_proc search_content_filter {
    _txt
    _data
    mime
} {
    @author Neophytos Demetriou
} {
    upvar $_txt txt
    upvar $_data data
    switch $mime {
        {text/plain} {
            set txt $data
        }
        {text/html} {
            set txt $data
        }
    }
}
I guess the reason it doesn't do anything special with HTML, not even
stripping the tags, is simply that this detail isn't finished (or am I
looking in the wrong place?).
So I am thinking of adding an exec call to the proc above, depending
on the MIME type of the content. It must of course be wrapped in a
catch statement, and there should be safeguards so that one faulty
conversion does not stop the whole indexing process (a timeout for the
scheduled proc, maybe?). Spawning an external process for each row to
index is not very efficient, but I can't think of another solution.
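To make the idea concrete, here is a rough sketch of what the catch-wrapped exec could look like. The helper name convert_to_text is made up for illustration and is not part of the search package. It also assumes the content is available as a file on disk, that the converter writes its text to stdout, and that the Tcl version is 8.5 or later (for the {*} expansion):

```tcl
# Hypothetical helper, not part of the OpenACS search package.
# command_list: the converter and its arguments as a Tcl list
# file:         path of the document to convert (assumed to be on disk)
# Returns the converter's stdout, or "" if the conversion fails.
proc convert_to_text {command_list file} {
    # catch traps a missing program, a non-zero exit status and similar
    # errors, so one faulty document does not abort the indexing run.
    if {[catch {exec -- {*}$command_list $file 2>/dev/null} result]} {
        return ""
    }
    return $result
}
```

Note that catch does not help against a converter that hangs. Where GNU coreutils is installed, prefixing the command with its timeout utility might be one way to bound the run time, but I have not tried that here.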
To make this generally usable there should be some kind of
admin-configurable mapping of MIME types to conversion program calls.
It would probably make sense to add a table to the search package for
that.
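As a sketch of the mapping, with a Tcl array standing in for the proposed table: the names converter_for and converter_command and the %f placeholder are made up for illustration, and pdftotext and catdoc are only examples of converters that can write plain text to stdout, assumed to be installed:

```tcl
# Hypothetical in-memory stand-in for the proposed mapping table.
# Each value is a command template; %f marks where the file path goes.
array set converter_for {
    application/pdf    {pdftotext %f -}
    application/msword {catdoc %f}
}

# Return the converter command for a MIME type as a Tcl list,
# or an empty list if no converter is configured for it.
proc converter_command {mime file} {
    global converter_for
    if {![info exists converter_for($mime)]} {
        return {}
    }
    set cmd {}
    foreach word $converter_for($mime) {
        # Substitute per word so a path containing spaces stays one argument.
        lappend cmd [string map [list %f $file] $word]
    }
    return $cmd
}
```

The resulting list could be handed to exec inside a catch. With a real table the array lookup would become a database query, and the admin UI would only have to edit rows.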
Any comments / objections / hints?