Forum OpenACS Development: document to text conversion in search indexer

I need to make documents in the content repository searchable that are not plain text, such as PDFs, HTML or Word documents.

According to Oleg from the OpenFTS mailing list there won't be such a feature in OpenFTS, so I guess I'll have to add some code to the OpenACS search package to call the necessary conversion programs from the indexing procedure.

Obviously the place to put this would be the search_content_filter proc in search/tcl/search-procs.tcl


ad_proc search_content_filter {
    _txt
    _data
    mime
} {
    @author Neophytos Demetriou
} {
    upvar $_txt txt
    upvar $_data data

    switch $mime {
        {text/plain} {
            set txt $data
        }
        {text/html} {
            set txt $data
        }
    }
}
I guess the reason it doesn't even do anything special with HTML is simply that this detail isn't finished yet (or am I looking in the wrong place?).

So I am thinking of adding an exec call to the proc above, depending on the MIME type of the content. It must of course be wrapped in a catch statement, and some safeguards should be taken so that one faulty conversion does not stop the whole indexing process (a scheduled proc timeout maybe?). Forking an exec process for each row to index is not very efficient, but I can't think of another solution.
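A minimal sketch of the catch wrapper, assuming pdftotext as the converter; the command and temp file names are just placeholders:

```tcl
# Hedged sketch: wrap the external conversion call in catch so a
# single broken document cannot abort the indexing run.
# "pdftotext" and $tmp_orig / $tmp_txt are placeholders here.
if { [catch {
    exec pdftotext -enc UTF-8 $tmp_orig $tmp_txt
} errmsg] } {
    # log and skip this document instead of propagating the error
    ns_log error "search indexer: conversion failed: $errmsg"
    set txt ""
}
```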

To make this generally usable there should be some kind of admin-configurable mapping of MIME types to conversion program calls - probably it would make sense to add a table to the search package for that.
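Until such a table exists, the mapping could be sketched as a Tcl array keyed by MIME type. The commands shown are only examples; in a finished version they would come from the admin-configurable table or a package parameter:

```tcl
# Hypothetical mapping of MIME type -> conversion command template.
# In a real setup this would be loaded from a database table or a
# package parameter rather than hard-coded.
array set conversion_code {
    application/pdf    {exec pdftotext -enc UTF-8 $tmp_orig $tmp_txt}
    application/msword {exec catdoc -d utf-8 $tmp_orig > $tmp_txt}
}

if { [info exists conversion_code($mime)] } {
    # $tmp_orig and $tmp_txt are substituted in the caller's scope
    # at eval time
    eval $conversion_code($mime)
}
```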

Any comments / objections / hints?

Posted by Dave Bauer on
I think that somewhere in here we need a service contract of some sort.

I always suspected that the applications providing the content would deliver the converted text to the indexer.

So the service contract implementation for the content repository would check the mime-type of the object and convert it to text, then provide that text to the indexer.

I am not sure which way is better. It seems easier to provide a mime-type to the indexer which can then process content from any package correctly.

I would rather put the handling of conversion of other mime-types into the search package than into the sc implementors. What would be the gain of de-centralizing it? Also, the service contract already implements passing on the mime-type, so it seems perfectly prepared for that.

Packages are always free to convert to text themselves anyway and set the mime-type in their service contract implementation to text/plain, as for example notes does (in notes__datasource). They should be able to safely assume that no further conversion of their content is going to happen when they set this mime-type.
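A hypothetical sketch of such a datasource implementation, modeled loosely on notes__datasource; the proc name, query, and exact keys expected by the FtsContentProvider contract are assumptions and may differ:

```tcl
# Hypothetical datasource callback for a package that converts its
# own content: it hands plain text to the indexer and declares the
# mime-type as text/plain, so search_content_filter passes it through.
ad_proc mypackage__datasource { object_id } {
    Return the indexable representation of the object.
} {
    # made-up table and query for illustration only
    db_1row get_item {
        select title, body from mypackage_items where item_id = :object_id
    }
    return [list object_id $object_id \
                title $title \
                content $body \
                mime {text/plain} \
                storage_type text \
                keywords {}]
}
```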

> I would rather put the handling of conversion of other mime-types into the search package than into the sc implementors. What would be the gain of de-centralizing it? Also the service contract already implements passing on the mime-type, so it seems perfectly prepared for that.

Right. search_content_filter was designed with document conversion in mind.
After adding document conversion for html, msword and pdf files to search_content_filter I realized that this proc is not only called from the indexer but also when displaying the results, to produce an excerpt of the matching document with the matches highlighted (which looks very good btw).

So this results in exec'ing the external conversion programs for every matching document every time the results page is displayed, which is of course unacceptable.

Any suggestions on how to deal with this? Saving a text version of each document that has to be converted, in parallel to the actual content, is unavoidable, at least if we don't want to lose the nice abstracts in the search results.

Where / how should that be done?

Posted by Dave Bauer on
Til,

Possibly add a field to the txt index table that specifies the location of the plain text version of the content. Would we just store it in one database table? Or would another optional method, to be determined, be useful?

txt is postgresql specific, but this is something that would be needed for the oracle version as well, no? Assuming there will ever be an FtsContentProvider for oracle of course ...

Well, maybe FtsContentProvider for oracle could somehow make use of intermedia's own INSO filter stuff instead of replicating functionality that's already there, but that would still not solve the problem where to get the text for generating the abstract from. So we need a solution that works for both databases I think.

Also it should be considered that the text version might become huge - imagine someone uploading a book in pdf format to file storage. Some people might not want to store the text of this in an additional postgresql table. Ideally the text version would use the same storage method as the original content (I don't know if that's possible, just thinking out loud).

Another option might be to simply not show the abstract for documents that require expensive transformation. Which would be a pity. At least an alternative should then be provided in the form of a description, e.g. the first paragraph of the document or something like that. Which in turn would, among other changes, require an addition to the content provider service contract.

Or a parametrizable limit on the size of the text version?
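Such a limit could be sketched as follows; the parameter name MaxTextVersionSize is made up here and would have to be defined in the search package:

```tcl
# Hypothetical size cap on the stored text version.
# "MaxTextVersionSize" is an invented parameter name.
set max_size [parameter::get -parameter MaxTextVersionSize -default 1000000]
if { [string length $txt] > $max_size } {
    # keep only the first $max_size characters for abstract generation
    set txt [string range $txt 0 [expr {$max_size - 1}]]
}
```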

Posted by Roger Metcalf on
Have you come up with a solution for the problem of re-converting documents during display?  I'm in need of document conversion for the purpose of indexing esp. for msword and pdf files.  Can you share your search_content_filter for doing these conversions?
Posted by Tilmann Singer on
No solution for that problem yet, sorry. As mentioned in the discussion above it would be necessary to store a text version in parallel to the binary somewhere, and the project I would have needed it for went another route before that was sorted out.

Below is my version of search_content_filter, which does pdf and word conversion, but it is just in a testing state. Among other things, the executables for the conversion need to be parametrized. Also, you obviously need to switch off the abstracts display in search results, otherwise the conversion is triggered on every results page for each pdf and word document found.

ad_proc search_content_filter {
    _txt
    _data
    mime
} {
    @author Neophytos Demetriou
} {
    upvar $_txt txt
    upvar $_data data

    ns_log notice "!> search_content_filter +++ $mime"

    set file_ending(application/msword) doc
    set conversion_code(application/msword) {exec catdoc -d utf-8 $tmp_orig > $tmp_txt}

    set file_ending(application/pdf) pdf
    set conversion_code(application/pdf) {exec pdftotext -enc UTF-8 $tmp_orig $tmp_txt}

    switch $mime {
        {text/plain} -
        {text/html} {
            set txt $data
        }
        {application/pdf} -
        {application/msword} {
            # convert to text via an external program

            # get temp file names
            set tmpnam [ns_tmpnam]
            set tmp_orig "$tmpnam.$file_ending($mime)"
            set tmp_txt "$tmpnam.txt"

            # write original data to the temp file in binary mode;
            # -nonewline so no extra newline is appended to the data
            set tmp_orig_fp [open $tmp_orig w]
            fconfigure $tmp_orig_fp -translation binary
            puts -nonewline $tmp_orig_fp $data
            close $tmp_orig_fp

            # call conversion program; one faulty document must not
            # abort the whole indexing run
            if { [catch $conversion_code($mime) errmsg] } {
                ns_log error "search_content_filter: conversion of $mime failed: $errmsg"
                file delete -- $tmp_orig
                set txt ""
                return
            }

            # read temporary text file
            set tmp_txt_fp [open $tmp_txt]
            fconfigure $tmp_txt_fp -encoding utf-8
            set txt [read $tmp_txt_fp]
            close $tmp_txt_fp

            # delete temp files
            file delete -- $tmp_orig $tmp_txt
        }
    }
}