Forum OpenACS Q&A: Handling of Word upload & general question

Hello. I'm evaluating a number of open source community/CMS packages on which to base a system for creative writing workshops. After finishing a simple ground-up PHP test for 2 workshops, I've looked at some frameworks: PHP solutions (Xaraya, XOOPS, Drupal), Zope X3, and just started looking at OpenACS.

One of the key functions for my site will be the ability to upload Microsoft Word documents (or RTFs) and have them auto-converted to HTML. I don't like the idea of having writers share Word files (file size, macro viruses, inability to print stories directly from any browser). In my test system, I had authors "Save As" HTML and upload the result, then ran it through a filter that cleaned up unwanted tags. But with WebDAV-enabled directories, it might make more sense to allow direct editing of Word files and do an on-demand HTML conversion.
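For illustration only, a tag-cleaning filter of the kind described above might look roughly like this in Tcl. The proc name strip_word_markup and the choice of which tags and attributes to drop are assumptions, not the filter from the PHP test system, and the sketch only handles quoted attribute values:

```tcl
# Hypothetical sketch: strip common Word "Save As HTML" clutter.
proc strip_word_markup {html} {
    # drop Office namespace tags such as <o:p> and </o:p>
    regsub -all -nocase {</?[a-z]+:[^>]*>} $html {} html
    # drop <font> and <span> wrappers but keep their contents
    regsub -all -nocase {</?(font|span)(\s[^>]*)?>} $html {} html
    # drop quoted class and style attributes from the remaining tags
    regsub -all -nocase {\s+(class|style)="[^"]*"} $html {} html
    return $html
}
```

Regular expressions are a blunt tool for HTML, so a pass like this is probably best followed by a real cleaner such as tidy.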

I've done some OpenACS searching and found "Word uploading and HTML conversion" as a wish-list item, but there's also a reference to "content repository uses the INSO libraries included with Intermedia to support conversion of binary files such as Microsoft Word documents to HTML." Initially, I thought I'd have to figure out how to patch in OpenOffice filters. Have others come up with a solution or thought about using OpenOffice components for conversions?

Posted by Mark Aufflick on
The INSO libraries are for the Oracle version only, and you need the Intermedia search license, which adds even more cost to a basic Oracle setup.

I have previously used the wvWare command line tools with success (http://wvware.sourceforge.net/), though not with OpenACS.

There are some useful-looking openacs.org threads mentioning wvWare:

http://www.google.com/search?q=site%3Aopenacs.org+wvware

Interesting side note: Google's "I'm Feeling Lucky" for "word to html" gives you a page on philip.greenspun.com :)

Posted by Dave Bauer on
I wrote some code (it's in CVS at /packages/xcms-ui/tcl/mime-procs.tcl) to convert Word to HTML. It runs the document through wvWare and then Tidy.

It seems to be pretty effective. It was used to convert a few thousand Word documents to be inserted into the content repository.

I want to finish this feature to allow definition of filters for more types of conversion.
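A minimal sketch of what such filter definitions might look like, assuming a plain Tcl dispatch table keyed by MIME type. The convert_sketch namespace and its procs are hypothetical and are not part of xcms-ui; each registered filter is assumed to be a proc that takes a file path and returns HTML:

```tcl
# Hypothetical registry mapping a source MIME type to a conversion proc.
namespace eval convert_sketch {
    variable filters
    array set filters {}
}

# Associate a MIME type with the name of a proc that converts a file to HTML.
proc convert_sketch::register {mime_type proc_name} {
    variable filters
    set filters($mime_type) $proc_name
}

# Look up the filter for the given MIME type and run it on the file.
proc convert_sketch::to_html {mime_type path} {
    variable filters
    if {![info exists filters($mime_type)]} {
        error "no conversion filter registered for $mime_type"
    }
    return [$filters($mime_type) $path]
}

# Example registration; the named proc is only looked up when to_html is called.
convert_sketch::register application/msword mime_type_convert::doc_to_html
```

Under these assumptions, supporting another format would only mean writing one more proc and registering it for its MIME type.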

Posted by Alfred Werner on
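ad_proc -public mime_type_convert::doc_to_html {
    filename
} {
    Convert a Word document to tidied HTML using wvHtml and tidy.

    @param filename full path to file to be converted
    @return the converted HTML
} {
    # let's just start with inputting a Word file and outputting some xhtml
    set tmpdir [lindex [parameter::get -package_id [ad_conn subsite_id] -parameter TmpDir -default "/tmp"] 0]
    # generate a unique temporary file name to hold the converted data
    set new_filename [file tail [ns_mktemp "${tmpdir}/fileXXXXXX"]]
    # convert to HTML; wvHtml writes ${tmpdir}/${new_filename}
    exec /usr/bin/wvHtml --targetdir=${tmpdir} ${filename} ${new_filename}
    # clean up Word HTML; tidy exits non-zero on warnings, so /bin/true forces
    # success and the tidied markup is read from stdout into msg
    set err [catch {exec sh -c "(tidy -c -i -q ${tmpdir}/${new_filename} 2> /dev/null ;/bin/true)" } msg]
    # remove the intermediate file so converted documents do not pile up in tmpdir
    file delete -- ${tmpdir}/${new_filename}
    return $msg
}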
Posted by Bill Katz on
Thanks for the info. Just went out and bought a Tcl book and plan on playing with OpenACS. It's been a while since I've done shell-like programming, but the conciseness of the conversion code (using wvWare and tidy) is a nice contrast with the PHP and C++ programming I've done more recently.

If the suggested conversion routines work well for the new Word formats, I'll go that route. If not, I'll take a look at OpenOffice (or the TOM project) and how to graft that work into a general conversion utility.
Regards.