Forum OpenACS Q&A: Full Text Search of uploaded PDFs, .DOCs, XLS, etc?

Hi
I'm evaluating tools for a corporate Intranet site, and basically have come down to OpenACS or Drupal.

OpenACS seems better in most regards, but there is one thing I cannot find out how to do:
- Is it possible to do a full-text search on words in uploaded PDF, DOC, etc. files? I can't see any modules or docs on how to do this...

Not a show stopper, but its presence would likely make OpenACS what we end up with.
Thanks!

Posted by Dave Bauer on
Dirk Gomez and I are working on this feature for the next version of the search package. Basically it will call an external program to convert the uploaded files to text for indexing. There is already a hook in the search package for this, but no external programs are defined to be called.

I will have more information on the exact implementation soon. It will support Word, PDF, and Excel formats at a minimum, and it should be relatively easy to add new formats.
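
To give a rough idea of the approach (this is only a sketch to illustrate it, not the actual search package code; the proc name and parameters are invented), the hook would map MIME types to external converter command lines, something like:

ad_proc -private search_converter_command {
    mime
    filename
} {
    Hypothetical sketch only. Return the external command used to
    extract plain text from the file, or "" if no converter is known
    for this MIME type.
} {
    switch -glob -- $mime {
        {application/pdf*}    { return [list pdftotext $filename -] }
        {application/msword*} { return [list catdoc $filename] }
        default               { return "" }
    }
}

The indexer would then run something like set txt [eval exec $cmd] on the returned command and hand the result to the full-text engine.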

Posted by Steve Francis on
Great, thanks!

Any idea of a rough timeline for when it might be done?

Posted by Frank Bergmann on
Hi Dave,

we would also be interested in this information. We are evaluating the HtDig suite of document converters.

Best,
Frank

Posted by Ryan Gallimore on
Hi Steve,

I recently added this functionality for PDFs, DOCs, and TXT files by adding a few lines to search-procs.tcl, shown below. You should be able to apply a similar method for Excel files (see the sketch after the code).

You'll need to download and install XPDF (http://www.foolabs.com/xpdf/download.html) and catdoc (http://www.45.free.net/~vitus/ice/catdoc/).
I retyped my changes below from memory, so I think they're correct!
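
Before touching search-procs.tcl, you can quickly confirm that both binaries are on the server's PATH (just a convenience check, assuming a Unix "which" is available; not part of the change itself):

# run from a scratch Tcl page to verify the external converters
# are reachable by the server process
foreach prog {pdftotext catdoc} {
    if {[catch {exec which $prog} path]} {
        ns_log Warning "search: $prog not found on PATH"
    } else {
        ns_log Notice "search: found $prog at $path"
    }
}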

ad_proc search_content_get {
    _txt
    content
    mime
    storage_type
} {
    @author Neophytos Demetriou

    @param content
    holds the filename if storage_type=file
    holds the text data if storage_type=text
    holds the lob_id if storage_type=lob
} {
    upvar $_txt txt

    set txt ""

    switch $storage_type {
        text {
            set data $content
        }
        file {
            # get filename instead of file contents.
            set data $content
            #set data [db_blob_get get_file_data {}]
        }
        lob {
            db_transaction {
                set data [db_blob_get get_lob_data {}]
            }
        }
    }

    # Pass $storage_type along so the filter can distinguish
    # text/plain stored as text from text/plain stored as a file
    search_content_filter txt data $mime $storage_type
}

ad_proc search_content_filter {
    _txt
    _data
    mime
    storage_type
} {
    @author Neophytos Demetriou
} {
    upvar $_txt txt
    upvar $_data data

    switch -glob -- $mime {
        {text/plain*} {
            if {$storage_type == "file"} {
                # $data holds a path on disk, so read the text
                # file contents with the system cat command
                set txt [exec cat $data]
            } else {
                set txt $data
            }
        }
        {text/html*} {
            set txt $data
        }
        {application/pdf*} {
            #use pdftotext to convert PDF file to text
            set txt [exec pdftotext $data -]
        }
        {application/msword*} {
            #use catdoc to convert Word file to text
            set txt [exec catdoc $data]
        }
    }
}
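
For Excel files, a similar branch should work inside the same switch (untested sketch; it assumes xls2csv, which ships with catdoc, is installed, and that uploads arrive with the application/vnd.ms-excel MIME type):

        {application/vnd.ms-excel*} {
            # untested: use xls2csv (part of catdoc) to dump the
            # spreadsheet cells as comma-separated plain text
            set txt [exec xls2csv $data]
        }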

Restart the server, reload your content into search_observer_queue, and you're ready to go.
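
If you need to requeue existing documents by hand, something like this should do it (only a rough sketch; it assumes the stock search_observer_queue table with object_id, date and event columns, that 'UPDATE' is an accepted event, and that your files live in the content repository):

# rough sketch: requeue every content repository item so the
# indexer picks it up again with the new converters in place
db_dml requeue_all_items "
    insert into search_observer_queue (object_id, date, event)
    select item_id, current_timestamp, 'UPDATE'
    from cr_items
"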

Hope someone can use this. It works great!

Posted by David Ghost on
Ryan,
I've just applied your changes successfully.
Thank you, Ryan.
But I still have some problems.
Most of my documents are written in Korean,
and as a result the search indexer builds its index with broken characters.
So my question is: how do I handle MS documents written in languages other than English?
I'm afraid this is not the right place to ask.
I hope to get some useful advice...