Forum OpenACS Q&A: Parsing an incoming multi-part e-mail message

The facility for handling forum replies by e-mail currently assumes that they are all plain text, so when one comes in that's in HTML the entire source of the message gets inserted into the database.  Oops.

I'm trying to avoid reinventing the wheel and am looking for existing code that will parse the message and let me pull out just the text/html piece.  Unfortunately, most things like TclMime seem to be oriented around creating an outgoing message, not processing an incoming one.

Any pointers?  Thanks!

Posted by Matthew Walker on
You should be able to pass the MIME message to initialize and have it parse out all the parts, then use getproperty to find the one you want and getbody to retrieve it. You may also want to look at the mime module in tcllib, which has a few more capabilities.
Posted by Jonathan Ellis on
I wrote a Perl mail parser in 2000 where CPAN did 90% of the work.  I would try to dig up the code, but it looks like Rocael found a module that makes it even easier.
Posted by Matthew Walker on
I read my reply from earlier and it didn't make much sense to me, so here's some code that works with the mime module from tcllib:

set mime [mime::initialize -string $msg]

set content [mime::getproperty $mime content]

if { [string first "multipart" $content] != -1 } {
    set parts [mime::getproperty $mime parts]
} else {
    set parts [list $mime]
}

foreach part $parts {
    switch [mime::getproperty $part content] {
        "text/plain" {
            set plain [mime::getbody $part]
        }
        "text/html" {
            set html [mime::getbody $part]
        }
    }
}

It will descend one level deep into multipart messages and try to find both the plain and HTML versions. A message can be nested deeper than one level, but I wouldn't expect that for a reply to a forum posting (I've mainly seen it in spam trying to avoid filters).
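If a message does turn out to be nested more deeply, the same idea can be applied recursively. What follows is only a minimal sketch, assuming the tcllib mime package is installed. The proc name collect_text_parts and the sample message are made up for illustration and are not part of the library. It uses the same mime::initialize, mime::getproperty and mime::getbody calls as the code above, plus mime::finalize to release the parsed message.

```tcl
package require mime

# Hypothetical helper (not part of tcllib): walk a parsed message and keep
# the first text/plain and text/html leaf bodies found at any depth.
proc collect_text_parts {token resultVar} {
    upvar 1 $resultVar result
    set content [mime::getproperty $token content]
    if {[string match "multipart/*" $content]} {
        # Container part: descend into each child.
        foreach part [mime::getproperty $token parts] {
            collect_text_parts $part result
        }
        return
    }
    # Leaf part: remember it only if it is one of the two types we want
    # and we have not already seen that type.
    if {[lsearch -exact {text/plain text/html} $content] >= 0
        && ![info exists result($content)]} {
        set result($content) [mime::getbody $token]
    }
}

# Made-up sample: multipart/alternative nested inside multipart/mixed.
set msg [join {
    "MIME-Version: 1.0"
    "Content-Type: multipart/mixed; boundary=outer"
    ""
    "--outer"
    "Content-Type: multipart/alternative; boundary=inner"
    ""
    "--inner"
    "Content-Type: text/plain; charset=US-ASCII"
    ""
    "Hello in plain text."
    "--inner"
    "Content-Type: text/html; charset=US-ASCII"
    ""
    "<p>Hello in <b>HTML</b>.</p>"
    "--inner--"
    "--outer--"
    ""
} "\n"]

set tok [mime::initialize -string $msg]
collect_text_parts $tok found
mime::finalize $tok

foreach type [lsort [array names found]] {
    puts "${type}: $found($type)"
}
```

Which version to store is a separate decision. For a forum that expects plain text, preferring found(text/plain) when it exists and falling back to found(text/html) otherwise seems the simplest choice.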

Posted by Bruno Mattarollo on

You might also want to take a look at the email library from Python 2.2.3. I am using it to generate emails, and it can also be used to parse them.

Hope this helps.

Posted by Janine Ohmer on
Matthew, thanks, this is a big help. I had played around with initialize but it didn't seem to be working for me; the problem was that I was trying to use mime::parsepart instead of mime::getproperty (the Sloanspace version of mime.tcl is different from, and appears to be newer than, the one in the official TclMime distribution, and has extra functions).

Unfortunately, using your code I'm getting the same result as I was getting with my tests - it still treats the whole thing as plain text. My test message body was captured from within the proc that handles incoming e-mail so it "should" be an accurate representation of what I need to handle. Here is my test script:

ReturnHeaders

set r_dir [acs_root_dir]
source $r_dir/tcl/base64.tcl
source $r_dir/tcl/mime.tcl
package require mime

set body "

> This message is in MIME format. Since your mail reader does not understand
this format, some or all of this message may not be legible.

--B_3138013439_31893367
Content-type: text/plain; charset=\"US-ASCII\"
Content-transfer-encoding: 7bit

Ok, this should be a real HTML message.

On 6/9/03 1:22 PM, \"system.mit.edu mailer\" <system-40576-450@system.mit.edu>
wrote:

> Forum: testgroup Forum
> Thread: HTML reply testing
> Author: Janine Sisk (jsisk@mit.edu)
>
> This is a message


--B_3138013439_31893367
Content-type: text/html; charset=\"US-ASCII\"
Content-transfer-encoding: quoted-printable

<HTML>
<HEAD>
<TITLE>Re: HTML reply testing</TITLE>
</HEAD>
<BODY>
<FONT FACE=3D\"Verdana\">Ok, <B>this</B> should be a <I>real</I> HTML message.<=
BR>
<BR>
On 6/9/03 1:22 PM, \"system.mit.edu mailer\" <system-40576-450@system.mit.edu> wrote:<BR>
<BR>
<FONT COLOR=3D\"#0000FF\">> Forum: testgroup Forum<BR>
> Thread: HTML reply testing<BR>
> Author: Janine Sisk (jsisk@mit.edu)<BR>
> <BR>
> This is a message<BR>
</FONT></FONT>
</BODY>
</HTML>


--B_3138013439_31893367--

"

set mime [mime::initialize -string $body]

set content [mime::getproperty $mime content]

if { [string first "multipart" $content] != -1 } {
    set parts [mime::getproperty $mime parts]
} else {
    set parts [list $mime]
}

foreach part $parts {
    switch [mime::getproperty $part content] {
        "text/plain" {
            set plain [mime::getbody $part]
        }
        "text/html" {
            set html [mime::getbody $part]
        }
    }
}

ns_write "
plain: |$plain|
"

(I left out $html because it's not being set.)

The output of this is:
plain: | > This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. --B_3138013439_31893367 Content-type: text/plain; charset="US-ASCII" Content-transfer-encoding: 7bit Ok, this should be a real HTML message. On 6/9/03 1:22 PM, "system.mit.edu mailer" wrote: > Forum: testgroup Forum > Thread: HTML reply testing > Author: Janine Sisk (jsisk@mit.edu) > > This is a message --B_3138013439_31893367 Content-type: text/html; charset="US-ASCII" Content-transfer-encoding: quoted-printable Ok, this should be a real HTML message.<= BR>
On 6/9/03 1:22 PM, "system.mit.edu mailer" <system-40576-450@system.mit.edu> wrote:

> Forum: testgroup Forum
> Thread: HTML reply testing
> Author: Janine Sisk (jsisk@mit.edu)
>
> This is a message
--B_3138013439_31893367-- |
I will keep looking at this as well, but if you can spot my error please let me know what it is! Thanks.
Posted by Janine Ohmer on
Well... the problem is that the message body does not start with a content-type of multipart/* or message/*.  If I'm reading the code right either of those would cause the rest of the message to be parsed, but because they aren't there it just stops after processing the first couple of header lines (and this only works if I remove the comment and the top boundary line, which the parser doesn't know what to do with).

This message was sent by Microsoft Entourage on a Mac, but I have another message that was sent by Outlook XP and although it looks a bit different it also lacks a content-type of multipart or message.

The version of mime.tcl I have says that it provides version 1.3.2.  I don't know where this file came from, and it's possible that it's bogus.  I would go back and try 1.2, but the only one I can find out there is 1.1, which is awfully old now.

Suggestions?  Am I on the right track here, at least?

Posted by David Walker on
Your test message appears to be missing the actual headers of the e-mail which should contain the content-type header and the boundary.
Posted by Janine Ohmer on
Ah...

If I add a line like so:

Content-Type: multipart/alternative; boundary=\"----=_NextPart_000_005F_01C32E8B.2C7DCE40\"

to the start of my message, Matthew's example code works just fine.  And in looking at a message I sent to myself in Apple Mail, I see that that content type *is* included but it's up higher in the message, in the headers.  So it's missing from my sample message but if I'm parsing a real message I'll have it.  *phew*!

Thanks for the tip, Matthew!

Posted by Matthew Walker on
You should be passing the complete message, including headers (in particular the Content-Type header), to mime::initialize. Given the full message, it can also work out that a plain text message is just that, even when there are no MIME-related headers, so my code will still work and pass back the plain text.
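To see the difference the headers make, here is a small sketch, again assuming the tcllib mime package. The sample body, the boundary b1 and the variable names are invented for illustration. It parses the same two-part body once with its top-level headers and once without, and reports the content type mime::initialize settles on in each case.

```tcl
package require mime

# Made-up two-part body, as it might look with the top-level headers
# stripped off.
set body [join {
    "--b1"
    "Content-Type: text/plain"
    ""
    "plain version"
    "--b1"
    "Content-Type: text/html"
    ""
    "<p>html version</p>"
    "--b1--"
    ""
} "\n"]

# The headers that declare the message as multipart and name the boundary.
set headers "MIME-Version: 1.0\nContent-Type: multipart/alternative; boundary=b1\n"

# Parse once with the headers in front and once with only a blank line
# (that is, an empty header section) in front.
foreach {label msg} [list full "$headers\n$body" bodyonly "\n$body"] {
    set tok [mime::initialize -string $msg]
    set seen($label) [mime::getproperty $tok content]
    puts "${label}: $seen($label)"
    mime::finalize $tok
}
```

With the headers present the top-level type should come back as multipart/alternative, so the parts can be walked. Without them the parser has nothing to go on and should fall back to text/plain, treating the boundary lines as ordinary text.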

In terms of the versions of mime.tcl, I believe TclMime was merged into tcllib. You can get tcllib from <http://sourceforge.net/projects/tcllib/>. I'd suggest actually using the latest version of mime.tcl from the SourceForge CVS; there have been a few bug fixes recently. It will probably require a couple of lines of editing to work in AOLserver/OpenACS to remove some dependencies (on Trf and something I can't remember). I'll dig these out if you like.