Here is what Dirk, Jeff, and I discussed today regarding indexing of CR content.
Right now content revisions are indexed, so you can have multiple versions of the same item in the index. This really isn't how we expect it to work, since you should only see the live revision in the search results.
This can be simplified: a trigger on cr_items adds the item_id to the search_observer_queue for indexing when the item is created, edited, or deleted. Changes to latest_revision or live_revision will cause the item to be queued for indexing.
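To make the queueing rule concrete, here is a rough Python model of what the trigger would decide. It is only a sketch: the real thing would be a database trigger on cr_items, and the function name, the in-memory queue, and the event labels below are made up for illustration.

```python
from typing import List, Optional, Tuple

# Stand-in for the search_observer_queue table: (item_id, event) rows.
# The event labels are hypothetical.
search_observer_queue: List[Tuple[int, str]] = []

# Columns whose change should queue the item for reindexing.
WATCHED_COLUMNS = ("latest_revision", "live_revision")


def on_cr_items_change(old: Optional[dict], new: Optional[dict]) -> None:
    """Model of the proposed trigger; old/new mimic the row before and after."""
    if old is None and new is not None:
        # Row was inserted: the item was created.
        search_observer_queue.append((new["item_id"], "INSERT"))
    elif old is not None and new is None:
        # Row was deleted: the item must be removed from the index.
        search_observer_queue.append((old["item_id"], "DELETE"))
    elif old is not None and new is not None:
        # Row was updated: queue only when a revision pointer changed.
        if any(old.get(c) != new.get(c) for c in WATCHED_COLUMNS):
            search_observer_queue.append((new["item_id"], "UPDATE"))
```

In this model an update that touches neither revision pointer (for example, a rename) does not queue anything. Whether the real trigger should be that selective is still open.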
In the content_item datasource callback, the item will be indexed only if it has a live revision and its publish_status is "live". CR-based applications will need to set these attributes correctly for search indexing to work. This may require changes to packages that do not currently set live_revision or publish_status.
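The check itself is small. A minimal sketch in Python, assuming the item is available as a dict with live_revision and publish_status keys (the function name is hypothetical):

```python
from typing import Optional


def revision_to_index(item: dict) -> Optional[int]:
    """Return the revision_id to index, or None if the item should not be indexed."""
    live_revision = item.get("live_revision")
    if live_revision is not None and item.get("publish_status") == "live":
        return live_revision
    return None
```

A package that sets live_revision but never sets publish_status would get None here, which is exactly the case that may need package changes.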
The datasource procedure for content_item will find the revision to index and call either a content_type-specific callback, if one exists, or the default content_revision callback.
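The dispatch could look roughly like this. This is a Python sketch only; the registry dict, the function names, and the datasource keys are placeholders, not the actual callback API.

```python
from typing import Callable, Dict

Datasource = Dict[str, str]
Callback = Callable[[dict], Datasource]

# Hypothetical registry: content_type -> type-specific datasource callback.
type_callbacks: Dict[str, Callback] = {}


def content_revision_datasource(revision: dict) -> Datasource:
    """Default callback: index the title and main content of the revision."""
    return {
        "title": revision.get("title", ""),
        "content": revision.get("content", ""),
    }


def content_item_datasource(content_type: str, revision: dict) -> Datasource:
    """Use the content_type-specific callback if one exists, else the default."""
    callback = type_callbacks.get(content_type, content_revision_datasource)
    return callback(revision)
```

A package would register its own callback under its content type, and every other type would fall through to the default.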
The main content of a revision may be binary, such as a Word document or a PDF. A callback for converting the binary content to text will be called, if one exists. Additional attributes of the revision may also be added to the content for indexing.
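As a sketch of how the conversion step might fit in, here is a Python model. Looking up converters by MIME type is an assumption, as is what happens when no converter exists (here only the extra attributes are indexed); all names are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry: MIME type -> binary-to-text converter.
converters: Dict[str, Callable[[bytes], str]] = {}


def build_index_text(mime_type: str, data: bytes, attributes: Dict[str, str]) -> str:
    """Combine the converted body with extra revision attributes for indexing."""
    converter = converters.get(mime_type)
    if converter is not None:
        text = converter(data)
    elif mime_type.startswith("text/"):
        # Plain text needs no converter; decode it directly.
        text = data.decode("utf-8", errors="replace")
    else:
        # No converter registered: the binary body contributes nothing.
        text = ""
    extra = " ".join(value for value in attributes.values() if value)
    return " ".join(part for part in (text, extra) if part)
```

With this shape a PDF or Word converter can be plugged in without touching the datasource procedure.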
At this point the datasource will be returned to the search indexer procedure and the data will be sent to the search engine for indexing.
I will be posting information on the callback signatures for the binary-to-text conversion.