Forum OpenACS Q&A: server side document to .pdf conversion

I would like to enable my server to automatically convert files
uploaded to my file-storage (mostly .doc, .xls, .ppt etc.) into .pdf
whenever a user clicks on a button "convert this document to pdf".
Additionally, I would like to add a password to every pdf that has
been created. The password will consist of the "user_id +
version_id"...

Does anyone have experience with such a conversion? What are the costs
concerning Adobe? (I am looking for a low-budget solution.)

Posted by David Kuczek on
I made a little mistake...

Actually I want to convert the document to .pdf first, and whenever a user wants to download that document, a password (his user_id + version_id) is added to the existing .pdf!

The user downloading the .pdf will have to type in his user_id + version_id in order to open the .pdf...
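
A minimal sketch of that download-time step, assuming the password is the plain concatenation of the two ids and that Ghostscript's pdfwrite device is used to rewrite the stored PDF with encryption. Neither assumption comes from the post: the class and method names are hypothetical, the owner password is a made-up server-side secret, and the Ghostscript switches should be checked against the documentation of the installed version.

```java
import java.util.List;

final class PdfPasswordSketch {
    private PdfPasswordSketch() {}

    /** Concatenates the two ids as decimal strings (one assumed reading of "user_id + version_id"). */
    static String openPassword(long userId, long versionId) {
        return Long.toString(userId) + Long.toString(versionId);
    }

    /** Ghostscript arguments that copy inPdf to outPdf and require userPassword to open it. */
    static List<String> encryptCommand(String inPdf, String outPdf,
                                       String userPassword, String ownerPassword) {
        return List.of(
                "gs", "-q", "-dBATCH", "-dNOPAUSE",
                "-sDEVICE=pdfwrite",
                "-dEncryptionR=3", "-dKeyLength=128",   // 128-bit encryption
                "-sOwnerPassword=" + ownerPassword,      // hypothetical server-side secret
                "-sUserPassword=" + userPassword,        // what the downloading user types
                "-sOutputFile=" + outPdf,
                inPdf);
    }
}
```

Passing passwords as command-line arguments exposes them in the process list, and plain concatenation is ambiguous (ids 1 and 23 give the same password as 12 and 3), so a real setup might prefer a separator and a safer hand-off.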

Posted by Bruno Mattarollo on

I don't know about .doc and other files, but if you have an XML one you could use FOP. It's simple to use and extremely low budget :)

Posted by Bjorn Thor Jonsson on
This article might be helpful:
Create a PDF Service with Samba
Use GhostScript to create a PDF document out of any PostScript printer job
http://www.planetpdf.com/mainpage.asp?webpageid=1736
It describes how to set up a Samba pdf file printer and how to use it with the lpr command (I don't know how lpr handles .doc, .xls etc.; a specific driver or converter might be needed to send .ps to the printer). I'm too new to Tcl and the ns API to tell how to run a command line from a web script, but it must be possible.
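
For the PostScript-to-PDF half, running the command line from server code could look roughly like this. It is a sketch in Java rather than Tcl, assuming a `gs` binary on the PATH; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.util.List;

final class PsToPdf {
    private PsToPdf() {}

    /** Builds the Ghostscript argument list; pdfwrite is Ghostscript's PDF output device. */
    static List<String> command(String psFile, String pdfFile) {
        return List.of(
                "gs",                          // assumed to be on the PATH
                "-q", "-dBATCH", "-dNOPAUSE",  // quiet, exit when done, no page prompts
                "-sDEVICE=pdfwrite",
                "-sOutputFile=" + pdfFile,
                psFile);
    }

    /** Runs the conversion and returns Ghostscript's exit code (0 on success). */
    static int convert(String psFile, String pdfFile) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(psFile, pdfFile))
                .inheritIO()   // pass Ghostscript's messages through to the server's output
                .start();
        return process.waitFor();
    }
}
```

Passing the arguments as a list avoids shell quoting problems with uploaded file names.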
Posted by Sam Snow on
What kind of volume are you thinking about?

Adobe does have such a thing as the Acrobat Distiller Server: http://www.adobe.com/products/acrdis/main.html
but it is looking to convert a postscript file, not start with the raw document. And yes, it is pricey: $5,000 US for 100 registered users, and $10,000 for unlimited.

Getting into postscript is going to be the kicker. I don't know of anything that can take all those formats and export them into postscript.

Maybe you should make your users do that part, and upload the postscript document? From there it looks easy and cheap.

Generating a postscript file on windows is as easy as having a postscript printer driver installed (even if you don't have a printer for it) and then printing to a file!

Posted by C. R. Oldham on
I don't see how you are going to achieve this on the server without actually invoking a copy of Word/Excel/PowerPoint on a Windows box somewhere.  And then, aren't those files going to need the fonts embedded in them in case the machine that you are running Word/XL/PP doesn't have them installed?

Lastly (sorry to be a wet blanket--it really is a great idea), there are tons of job options to be set in Distiller, and sometimes you want different options depending on the contents of the file.

This is a tough problem--it seems like it would be easier just to convert the file to PDF before you upload it.

Posted by David Walker on
This article mentions briefly one man's coping with this problem on the desktop.

http://desktoplinux.com/articles/AT5096230660.html
Posted by John Sequeira on
Doc2Pdf looks pretty close to what you want.

Doc2pdf is an email robot that converts Microsoft Office attachments (.doc, .ppt and .xls) to PDF files. All you need do is carbon-copy (CC) doc2pdf when you email a Microsoft Office document. Doc2pdf converts the attachment to a PDF file and sends the PDF file, as an attachment, in a reply to all recipients.

It's an email robot, but you could probably hack it to work serverside.

9: Nevermind (response to 1)
Posted by John Sequeira on
I posted without looking at the software -> it requires a dedicated windows box to run the conversion using MS Office Viewers.  You could run VMWare on your server (ugh),  but short of that it's probably not what you're looking for.
Posted by David Kuczek on
This is part of the link that David posted. It looks close to the solution I need:

"PDF Creation

Unfortunately I have not been able to get OpenOffice's print-to-PDF option to work for as long as I have been using it. In the meantime, the absolutely excellent KDE printing architecture (see below) allows one to print to a PDF. So, we simply print to a Postscript (.ps) file, click on that to open it in KGhostView, and then click "Print" and choose "Print to PDF." Voila, you have a gorgeous PDF.

Prior to the KDE print architecture, we used createpdf.adobe.com . For $10/month you get to create an unlimited number of PDFs from many file formats, including Postscript (which all UNIX apps with printing capability can export). It also converts .doc and .xls files as well, so it's just as useful for those who use MS Office but would like to produce open, professional looking documents for export to your customers or business partners."

Adobe's createpdf.adobe.com actually has everything that I would need: you can upload a file in almost any format, choose the appropriate security options (in my case: password), and let them send it to you via email, link or download in browser...

It would be nice if I could automate this procedure, but I doubt it will work... (I haven't checked the Terms of Use either.) It would be great if my server could log into Adobe's service, convert the file and receive the pdf automatically. Any suggestions?

Does anyone have experience with converting a document with KDE and KGhostView as described in the first part?

Thanks

Posted by David Kuczek on
Sam,

at the beginning the volume will not be that high (20/month), but I can't imagine myself doing all of this manually when the volume passes 100/month... So I am trying to find solutions for the time when the volume gets high!

Posted by David Walker on
It is possible to create a script that uploads a file to the remote site and then
downloads the converted version.  You might have to learn some more about
RFC 1867 (form-based file uploads) first.  You may have to program some authentication
headers or a login process into your script as well.

Since the Adobe site takes 15 minutes when you're a subscriber, you would
have to schedule a task to retrieve the completed pdf 15 minutes later and
figure out what to do with it next.
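
The upload half of such a script comes down to sending a multipart/form-data request body, the encoding browsers use for file upload forms. A minimal sketch of building that body follows; the field name, boundary and class name are hypothetical, since the actual form fields on Adobe's page are not known here, and any login or authentication step is left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class MultipartUpload {
    private MultipartUpload() {}

    /** Encodes one file as a multipart/form-data body containing a single part. */
    static byte[] body(String boundary, String fieldName, String fileName, byte[] content) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        String partHeader = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/octet-stream\r\n"
                + "\r\n";
        append(out, partHeader.getBytes(StandardCharsets.UTF_8));
        append(out, content);
        // Closing delimiter: the boundary followed by two extra dashes.
        append(out, ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    /** Value for the request's Content-Type header; it must name the same boundary. */
    static String contentType(String boundary) {
        return "multipart/form-data; boundary=" + boundary;
    }

    private static void append(ByteArrayOutputStream out, byte[] bytes) {
        out.write(bytes, 0, bytes.length); // this overload declares no checked exception
    }
}
```

The boundary must be a string that does not occur inside the file content; a long random token is the usual choice.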

Posted by David Kuczek on
Where could I check for existing scripts (examples) that do this kind of procedure on other sites, in order to see exactly how they proceed?

Adobe either sends you a link to a download page or directly attaches your new .pdf to their email...

This would be the best procedure:

1. A person uploads a .doc, .xls etc. to an OpenACS webservice (file-storage)

2. OpenACS connects to createpdf.adobe.com, sets the security settings (they are on the same page as the upload button etc.) and lets Adobe deliver the new .pdf to the OpenACS webservice's email address.

3. An email handler automatically gets the email, regexps the title of the attachment (i.e. document.doc?version_id=100) and inserts the new .pdf as a new version of our document into the database...

That would be perfect and a nice service for OpenACS 3.x 4.x too...
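
The regexp part of step 3 could be sketched as follows, assuming the attachment title comes back in the form given above (document.doc?version_id=100). Whether Adobe would preserve such a title is not known; the class and method names are hypothetical.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttachmentTitle {
    private AttachmentTitle() {}

    // version_id must follow '?' or '&' and be a run of at most 18 digits,
    // so the value always fits in a long.
    private static final Pattern VERSION_ID =
            Pattern.compile("[?&]version_id=(\\d{1,18})(?:&|$)");

    /** Returns the version_id embedded in the title, or empty if there is none. */
    static OptionalLong versionId(String title) {
        Matcher m = VERSION_ID.matcher(title);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }
}
```

Returning an empty result for titles without a usable id lets the email handler skip or quarantine unexpected mail instead of inserting a version under the wrong document.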

Posted by Mike Monette on
I've had some limited experience with this, specifically html to PDF conversion. I ended up using a product called htmldoc, which gave me reasonable results and speed. I recall using ghostscript in conjunction with another parser which produced postscript, but if I recall correctly, the ghostscript execution was rather slow (of course, I was converting >100 pages with tables in one shot). A potential solution for you would be to use wvWare and friends in conjunction with ghostscript.
Posted by Nathan Carter on
David,

When I was at my last job we solved the same problem using a combination of AbiWord/Ghostscript to perform .doc --> .pdf conversion.  As several posters have mentioned, the difficult part is not .ps to .pdf - Ghostscript handles this nicely.  Getting .docs into .ps is an utter nightmare, though.  At our request, one of the developers of Abiword added a command-line "print-to-ps" function that works on *nix  builds of Abiword - it's still there as far as I know.

In the end, we weren't particularly happy with Abiword import filters - they don't handle tables, among other things.  Toward the end of my time there we were investigating OpenOffice as well:
http://api.openoffice.org/source/browse/api/odk/examples/java/DocumentConverter/

I've done pretty exhaustive research in this area, and basically you're limited to:
1) a windows box running a closed solution - activepdf is one of the better known players, but you can check out several on pdfzone.com
2) one of the hacks I mentioned above - I don't know of any better solutions running on *nix.

Posted by David Kuczek on
Hello Nathan,

the openoffice solution sounds and looks to be pretty superior... I heard that SUN will start *licensing* StarOffice from version 6.0 on! OpenOffice will remain opensource though I believe.

Do you know what kind of people are/were hacking around with an openoffice conversion solution? Which community could I bother with this? I already talked to some developers at openoffice.org, but they all seem to work for SUN and told me that they are developing a closed source solution for conversion. (What a great opensource community 😉...) That was ~4 months ago, so things might have changed?!

Posted by Radamanthus Batnag on
I ran into a similar problem - converting M$ Word .doc files to text, and the OpenOffice solution seems to have matured a lot since the start of this thread.

Here's a Java example program for batch conversion of text files from any supported OpenOffice format to any other supported OpenOffice format:

http://api.openoffice.org/source/browse/api/odk/examples/java/DocumentConverter/Attic/

Posted by zet ucu on
How to do it with Java:

---------------------------------------------------
import officetools.OfficeFile;
...
FileInputStream fis = new FileInputStream(new File("test.doc")); // works with xls also
FileOutputStream fos = new FileOutputStream(new File("test.pdf"));
OfficeFile f = new OfficeFile(fis,"localhost","8100", true);
f.convert(fos,"pdf");
---------------------------------------------------

All possible conversions:
doc --> pdf, html, txt, rtf
xls --> pdf, html, csv
ppt --> pdf, swf
html --> pdf

Maybe useful: http://dancrintea.ro/html-to-pdf/
HTML to PDF with PHP, Java or ASP

Posted by Malte Sussdorff on
Hey David, just use JODconverter. You can read how to integrate it with OpenACS at http://cognovis.de/developer/en/openoffice.

In your case you can make your life pretty easy and just get the openoffice wrapper procs using

svn co https://svn.cognovis.de:/projop/packages/intranet-contacts/tcl/oo-procs.tcl

In them you will find a procedure called contact::oo::convert_to_pdf_using_jooconverter and this is what you need to do the conversion. Everything else is "leftover" from tries we did when the jodconverter wasn't working as expected.

If you have further questions or need help, contact me please.

Posted by eliza sahoo on
Recently I came across a requirement to convert a wide range of file formats (.doc, .docx, .xls, .rtf, .odt, .ods, .ppt etc.) to pdf in PHP. I found a very good utility to do this, which I would like to share with you all.
You need to have OpenOffice and unoconv installed for this.
unoconv is a command line utility that can convert any file format that OpenOffice can import to any file format that OpenOffice is capable of exporting. Some of the supported document formats are Open Document Format (.odt), MS Word (.doc), MS Office Open/MS OOXML (.xml), Portable Document Format (.pdf), HTML, XHTML, RTF, Docbook (.xml), and more.
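
Calling unoconv from server code could look roughly like the sketch below. It is written in Java to match the earlier snippet in this thread rather than PHP, and assumes `unoconv` is on the PATH; the class name, method names and timeout handling are hypothetical. As far as I know unoconv writes the result next to the input file with the new extension, but that should be checked for the installed version.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class UnoconvSketch {
    private UnoconvSketch() {}

    /** Builds the argument list: -f selects the output format. */
    static List<String> command(String inputFile) {
        return List.of("unoconv", "-f", "pdf", inputFile);
    }

    /** Returns true if unoconv exited with code 0 within the timeout. */
    static boolean convert(String inputFile, long timeoutSeconds)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(inputFile)).inheritIO().start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();   // give up on a conversion that takes too long
            return false;
        }
        return process.exitValue() == 0;
    }
}
```

The timeout is there because the conversion happens in an external office process, which a web request should not wait on indefinitely.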