Forum OpenACS Development: Updating the Search service contract to enable new ways to search

I am looking into updating the search FTSEngineDriver service contract to enable new ways of searching. Right now, you can only do a full text search. There is no way to restrict results by additional metadata such as package_id or object_type.

I am bringing this up here to get feedback on different ways to restrict search. What applications can you think of that might need this extended search feature and what attributes would be good to add to the search?

General attributes I can think of: package_id (parent_id), object_type, creation_date or last_modified date, and categories.

More application-specific attributes are also possible: author, for a CMS, for example.
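
To make the idea concrete, here is a rough sketch of how such restrictions might be passed alongside a full-text query and applied to the hits. This is not the FTSEngineDriver contract itself; `SearchFilter`, `matches`, and the dictionary keys are hypothetical names chosen for illustration, and a real implementation would more likely push these conditions into the SQL query.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class SearchFilter:
    """Hypothetical optional restrictions; any field left unset is ignored."""
    package_id: Optional[int] = None
    object_type: Optional[str] = None
    modified_since: Optional[datetime] = None
    category_ids: List[int] = field(default_factory=list)
    author_id: Optional[int] = None


def matches(obj: dict, f: SearchFilter) -> bool:
    """Return True if a full-text hit passes every restriction that is set."""
    if f.package_id is not None and obj.get("package_id") != f.package_id:
        return False
    if f.object_type is not None and obj.get("object_type") != f.object_type:
        return False
    if f.modified_since is not None:
        last_modified = obj.get("last_modified")
        if last_modified is None or last_modified < f.modified_since:
            return False
    # A hit passes the category restriction if it shares at least one category.
    if f.category_ids and not set(f.category_ids) & set(obj.get("category_ids", [])):
        return False
    if f.author_id is not None and obj.get("author_id") != f.author_id:
        return False
    return True
```

An empty `SearchFilter()` lets every hit through, so existing plain full-text searches would behave as before.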

Is there anything else?

Hi Dave,

I am not sure this is the right place to post this, but from what I understood it is not possible to enable FTS for any acs object, right? What exactly needs to change to allow FTS for acs objects?

Another feature I once discussed with Malte was the ability to store the content of text-based objects (doc, pdf, html, rtf, ppt...) as text and allow FTS for that.

Do you know what the status or roadmap on these two issues is?

Greetings,
Nima

Oh! Something interesting I found for the latter:

Tom Server
http://tom.library.upenn.edu/convert/tom.html

And the allowed conversions:
http://tom.library.upenn.edu/convert/sofar.html

Basically a Perl application that wraps other apps like lynx, wvWare, pdf2html, acroread, ps2pdf, txt2html, txt2tex, convert, ppmtogif, xml2html...

Nima,

Right, not all objects should be searchable, but I think it is useful to query the acs_objects table for additional attributes for objects that are searchable. On HEAD there are more object_types that have been search enabled.

On indexing additional formats, that is also on my list of things to do. The easiest thing to do is register a command to extract text from various mime types. There are utilities to do this for most common formats.
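
As a rough illustration of the "register a command per mime type" idea, the sketch below keeps a registry mapping mime types to extractor functions. All names here (`EXTRACTORS`, `register_extractor`, `extract_text`) are hypothetical and not part of OpenACS; only plain text and HTML are handled, using the Python standard library, and extractors for other formats would presumably wrap an external utility.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, Optional

# Hypothetical registry: mime type -> function turning raw bytes into text.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {}


def register_extractor(mime_type: str):
    """Decorator that registers a text extractor for one mime type."""
    def decorator(func: Callable[[bytes], str]) -> Callable[[bytes], str]:
        EXTRACTORS[mime_type] = func
        return func
    return decorator


@register_extractor("text/plain")
def extract_plain(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


class _TextCollector(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""
    def __init__(self) -> None:
        super().__init__()
        self.chunks = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


@register_extractor("text/html")
def extract_html(data: bytes) -> str:
    parser = _TextCollector()
    parser.feed(data.decode("utf-8", errors="replace"))
    parser.close()
    # Collapse runs of whitespace so the indexer sees clean text.
    return " ".join(" ".join(parser.chunks).split())


def extract_text(mime_type: str, data: bytes) -> Optional[str]:
    """Return indexable text, or None if no extractor is registered."""
    extractor = EXTRACTORS.get(mime_type)
    return extractor(data) if extractor is not None else None
```

With this shape, objects whose mime type has no registered extractor are simply skipped by the indexer rather than causing an error.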

It seems like the easiest way to do all this would be to use the Google API to enable search. It wouldn't work for non-public addresses, but it's fast, reliable, and easy to set up. It would also be far less work on your part. And as a plus, whenever they index new types of content, we would get it for free.

Jade,

What do you mean?

By using search that can search metadata within the OpenACS database you can definitely give more interesting searches than just crawling the actual web pages.

Also, I believe there is a limit to the number of searches you are allowed to perform with the Google API.

It would be awfully nice if whatever service contract extension you come up with were to be finally implemented for Oracle, too...

Don,

Yes, an Oracle implementation was also on my mind, but not my first priority. Any volunteer familiar with or willing to learn Oracle full text indexing would be most appreciated. I looked into this two years ago, and could not quite figure it out myself. I believe it involves PL/SQL service contract implementations, which are possible but not quite straightforward with the current acs-service-contract.

So anyone who is interested in Oracle search is welcome to contact me or post to the forums. I would be happy to coordinate with anyone who can help. For that matter, anyone interested in PostgreSQL-based search is also welcome to comment or assist in any way. Any input is appreciated so that we can get it right (or close enough) on the first try.