I think that somewhere in here we need a service contract of some sort.
I always assumed that the applications providing the content would deliver the converted text to the indexer.
That is, the service contract implementation for the content repository would check the mime-type of the object, convert it to text, and then provide that text to the indexer.
I am not sure which way is better. It seems easier to pass the mime-type along with the content to the indexer, which can then process content from any package correctly.
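
To make the two options concrete, here is a rough Python sketch of both. Every name in it (Indexer, index_text, index_content, the converter functions) is hypothetical and only for illustration. It is not an existing API, and a real implementation would live behind the service contract. index_text is the first option, where the providing package has already converted its content. index_content is the second, where the indexer picks a converter based on the mime-type.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, List


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def html_to_text(raw: bytes) -> str:
    """Hypothetical converter for text/html (assumes UTF-8 input)."""
    parser = _TextExtractor()
    parser.feed(raw.decode("utf-8"))
    parser.close()
    # Collapse runs of whitespace left behind by the removed tags.
    return " ".join(" ".join(parser.parts).split())


def plain_to_text(raw: bytes) -> str:
    """Hypothetical converter for text/plain (assumes UTF-8 input)."""
    return raw.decode("utf-8")


@dataclass
class Indexer:
    """Toy indexer that just stores the text for each object id."""

    converters: Dict[str, Callable[[bytes], str]] = field(default_factory=dict)
    documents: Dict[str, str] = field(default_factory=dict)

    def index_text(self, object_id: str, text: str) -> None:
        # Option 1: the providing package already converted its content.
        self.documents[object_id] = text

    def index_content(self, object_id: str, mime_type: str, raw: bytes) -> None:
        # Option 2: the indexer converts, dispatching on the mime-type.
        try:
            convert = self.converters[mime_type]
        except KeyError:
            raise ValueError(f"no converter registered for {mime_type!r}")
        self.index_text(object_id, convert(raw))


if __name__ == "__main__":
    indexer = Indexer(converters={
        "text/html": html_to_text,
        "text/plain": plain_to_text,
    })
    indexer.index_content("page-1", "text/html", b"<p>Hello <b>world</b></p>")
    indexer.index_text("note-1", "already converted by the package")
    print(indexer.documents)
```

With the second option the converters are registered once in the indexer, so a new package only has to report a mime-type. With the first, each package carries its own conversion code, but the indexer stays simple.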