We use wvWare to parse Word docs.
It's pretty decent. It will convert to HTML or text, and also rip out embedded pictures. I had to hack on the config files of the earlier version we use, but it's been updated since then and seems well featured.
For what it's worth, the parsing instructions/configs are in XML.
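
If you want to drive it from a script, here's a rough sketch of how that might look in Python. It assumes the `wvHtml` and `wvText` wrapper commands that ship with wvWare are on your PATH and take an input file followed by an output file name. The function names (`build_wv_command`, `convert_doc`) are just made up for illustration, and this isn't our actual setup.

```python
import shutil
import subprocess


def build_wv_command(doc_path, out_name, as_text=False):
    """Build the argv list for a wvWare conversion (hypothetical helper).

    Assumes the wvHtml / wvText wrappers take: <input.doc> <output name>.
    """
    tool = "wvText" if as_text else "wvHtml"
    return [tool, str(doc_path), str(out_name)]


def convert_doc(doc_path, out_name, as_text=False):
    """Run the conversion; raises if the tool is missing or exits non-zero."""
    cmd = build_wv_command(doc_path, out_name, as_text)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH; is wvWare installed?")
    subprocess.run(cmd, check=True)
    return out_name


if __name__ == "__main__":
    # Only prints the command, so this runs even without wvWare installed.
    print(build_wv_command("report.doc", "report.html"))
```

Call `convert_doc("report.doc", "report.html")` to actually run it, or pass `as_text=True` for plain text. Check the man pages for your installed version before relying on the argument order, since older releases may differ.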