Forum OpenACS Development: How to make an object type searchable?

I have checked in a new document describing the basic steps of how to make an object type searchable. It is in no way complete, so I would appreciate your feedback. The document is available under packages/search/www/doc/guidelines.html but I am also including it below so that it is easy for you to contribute your comments.

How to make an object type searchable?

Making an object type searchable involves three steps:
  • Choose the object type
  • Implement FtsContentProvider
  • Add triggers

Choose the object type

In most cases, choosing the object type is straightforward. However, if your object type uses the content repository, make sure that it is a subclass of the "content_revision" class. Also make sure that all content is created using that subclass, rather than with the plain "content_revision" type.

Implement FtsContentProvider

FtsContentProvider consists of two abstract operations, namely datasource and url. The specification for these operations can be found in packages/search/sql/postgresql/search-sc-create.sql. You have to implement these operations for your object type by writing concrete functions that follow the specification. For example, the implementation of datasource for the object type note looks like this:
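ad_proc notes__datasource {
    object_id
} {
    Implements the datasource operation of FtsContentProvider for notes.

    @author Neophytos Demetriou
} {
    # Fetch the note and expose it under the column names
    # that the datasource operation is expected to return.
    db_0or1row notes_datasource {
        select n.note_id as object_id,
               n.title as title,
               n.body as content,
               'text/plain' as mime,
               '' as keywords,
               'text' as storage_type
        from notes n
        where note_id = :object_id
    } -column_array datasource

    # Return the row as a flat key/value list.
    return [array get datasource]
}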
When you are done with the implementation of FtsContentProvider operations, you should let the system know of your implementation. This is accomplished by an SQL file which associates the implementation with a contract name. The implementation of FtsContentProvider for the object type note looks like:
select acs_sc_impl__new(
           'FtsContentProvider',                -- impl_contract_name
           'note',                              -- impl_name
           'notes'                              -- impl_owner_name
);
You should adapt this association to reflect your implementation. That is, change impl_name to your object type and impl_owner_name to your package key. Next, you have to create associations between the operations of FtsContentProvider and your concrete functions. Here's what an association between an operation and a concrete function looks like:
select acs_sc_impl_alias__new(
           'FtsContentProvider',                -- impl_contract_name
           'note',                              -- impl_name
           'datasource',                        -- impl_operation_name
           'notes__datasource',                 -- impl_alias
           'TCL'                                -- impl_pl
);
Again, you have to make some changes. Change the impl_name from note to your object type and the impl_alias from notes__datasource to the name that you gave to the function that implements the operation datasource. Then create a second association in the same way for the url operation.
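
To make the relationship between contract, implementation, and alias more concrete, here is a minimal Python model of the idea. It is a sketch, not OpenACS code: the dictionary registry and the helpers impl_alias_new and invoke are hypothetical stand-ins for acs_sc_impl_alias__new and the service-contract dispatcher, and the URL layout returned by notes_url is invented for the example.

```python
# Toy model of a service-contract registry (illustrative only, not OpenACS code).
# Keys are (contract, impl_name, operation); values are the concrete functions.
registry = {}

def impl_alias_new(contract, impl_name, operation, func):
    """Hypothetical stand-in for acs_sc_impl_alias__new."""
    registry[(contract, impl_name, operation)] = func

def invoke(contract, impl_name, operation, *args):
    """Look up the concrete function bound to an operation and call it."""
    try:
        func = registry[(contract, impl_name, operation)]
    except KeyError:
        raise LookupError(f"no alias for {operation} in {impl_name}") from None
    return func(*args)

# In-memory stand-in for the notes table.
NOTES = {42: {"title": "Shopping", "body": "milk, eggs"}}

def notes_datasource(object_id):
    note = NOTES[object_id]
    return {"object_id": object_id, "title": note["title"],
            "content": note["body"], "mime": "text/plain",
            "keywords": "", "storage_type": "text"}

def notes_url(object_id):
    # The URL layout is an assumption made for this example.
    return f"/notes/note-edit?note_id={object_id}"

# One alias per operation of the contract.
impl_alias_new("FtsContentProvider", "note", "datasource", notes_datasource)
impl_alias_new("FtsContentProvider", "note", "url", notes_url)

result = invoke("FtsContentProvider", "note", "datasource", 42)
url = invoke("FtsContentProvider", "note", "url", 42)
print(result["title"])  # Shopping
print(url)              # /notes/note-edit?note_id=42
```

The point of the indirection is that the caller only needs the contract name, the object type, and the operation name; it never refers to notes_datasource directly.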

Add triggers

If your object type uses the content repository to store its items, then you are done. If not, an extra step is required to inform the search_observer_queue of new content items, updates, and deletions. We do this by adding triggers on the table that stores the content items of your object type. Here's what that part looks like for note:
create function notes__itrg ()
returns opaque as '
begin
    perform search_observer__enqueue(new.note_id,''INSERT'');
    return new;
end;' language 'plpgsql';

create function notes__dtrg ()
returns opaque as '
begin
    perform search_observer__enqueue(old.note_id,''DELETE'');
    return old;
end;' language 'plpgsql';

create function notes__utrg ()
returns opaque as '
begin
    perform search_observer__enqueue(old.note_id,''UPDATE'');
    return old;
end;' language 'plpgsql';


create trigger notes__itrg after insert on notes
for each row execute procedure notes__itrg (); 

create trigger notes__dtrg after delete on notes
for each row execute procedure notes__dtrg (); 

create trigger notes__utrg after update on notes
for each row execute procedure notes__utrg (); 
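
If you want to see the effect of such triggers without an OpenACS installation, the following is a minimal sketch using Python's sqlite3 module. It assumes a simplified search_observer_queue table (the column name event_date is an assumption) and writes to the queue directly from the trigger body instead of calling search_observer__enqueue, since SQLite has no stored functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table notes (note_id integer primary key, title text, body text);

-- Simplified queue: object id, event name, and a timestamp.
create table search_observer_queue (
    object_id  integer,
    event      text,
    event_date text default current_timestamp
);

create trigger notes_itrg after insert on notes
begin
    insert into search_observer_queue (object_id, event)
    values (new.note_id, 'INSERT');
end;

create trigger notes_utrg after update on notes
begin
    insert into search_observer_queue (object_id, event)
    values (old.note_id, 'UPDATE');
end;

create trigger notes_dtrg after delete on notes
begin
    insert into search_observer_queue (object_id, event)
    values (old.note_id, 'DELETE');
end;
""")

# Each statement below fires one trigger and queues one event.
conn.execute("insert into notes (note_id, title, body) values (1, 'hello', 'first draft')")
conn.execute("update notes set body = 'second draft' where note_id = 1")
conn.execute("delete from notes where note_id = 1")

queued = conn.execute(
    "select object_id, event from search_observer_queue order by rowid"
).fetchall()
print(queued)  # [(1, 'INSERT'), (1, 'UPDATE'), (1, 'DELETE')]
```

Note that the triggers only record what happened; nothing is indexed until the queue is processed.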
Posted by Jowell Sabino on
If I make an object in a package searchable, e.g. File Storage objects, which search-related packages should File Storage depend on? What should I put in the .info file for the package dependencies that APM should check? Or is this even necessary? I don't see any package dependencies in the demo notes.info file.

I'm trying to understand how openfts-driver, acs-service-contract, search, and the package being made searchable relate to each other.

Posted by Neophytos Demetriou on
No dependency is required. Actually, the triggers require that the search package is installed but the search package is part of the core, so you don't actually need to add any dependencies.
Jowell wrote: I'm trying to understand how openfts-driver, acs-service-contract, search, and the package being made searchable relate to each other.
The search package supports two contracts, namely FtsEngineDriver and FtsContentProvider.
  • FtsEngineDriver contains abstract descriptions of operations (search, index, unindex, update_index) that are common to search engines (openfts, htdig, swish, intermedia). The search package uses only one implementation of the FtsEngineDriver contract. So far, only openfts-driver provides an implementation of FtsEngineDriver, but efforts are under way to support other search engines as well. The chosen implementation of FtsEngineDriver performs all indexing and searching done by the search package.

  • FtsContentProvider consists of two operations, namely datasource and url. The former returns information about a content item given its id (title, content, mime type, etc.) and is used both for indexing and for displaying the results. The latter returns the URL of a content item given its id and is used for displaying the search results.
When a content item that is being "observed" (that is, triggers have been added) is inserted, updated, or deleted, a record composed of the object_id, the event (INSERT, UPDATE, DELETE), and the timestamp is added to the search_observer_queue table. Depending on the event type, we call the appropriate operation from FtsEngineDriver, for example "index". Since operations are nothing more than abstract descriptions, we are actually calling an implementation chosen in advance (in our case openfts-driver). In the case of indexing, we first select the function that does the indexing from the FtsEngineDriver implementation. Next, we retrieve the content for the object_id using the FtsContentProvider implementation whose name is the object's type. Finally, the chosen indexing function is called with the content passed as an argument.
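
The flow described above can be sketched as a small Python model. Everything here is illustrative: ToyDriver stands in for an FtsEngineDriver implementation, the providers dictionary stands in for the FtsContentProvider lookup by object type, and the mapping of INSERT, UPDATE, and DELETE to index, update_index, and unindex is an assumption based on the operation names.

```python
# Toy model of the indexing flow (illustrative only, not OpenACS code).

class ToyDriver:
    """Stand-in for an FtsEngineDriver implementation such as openfts-driver."""

    def __init__(self):
        self.index_data = {}  # object_id -> indexed text

    def index(self, object_id, datasource):
        self.index_data[object_id] = f"{datasource['title']} {datasource['content']}"

    def update_index(self, object_id, datasource):
        self.index(object_id, datasource)

    def unindex(self, object_id):
        self.index_data.pop(object_id, None)

# In-memory stand-in for the notes table.
NOTES = {1: {"title": "hello", "body": "world"},
         2: {"title": "todo", "body": "ship it"}}

def notes_datasource(object_id):
    note = NOTES[object_id]
    return {"object_id": object_id, "title": note["title"], "content": note["body"]}

# FtsContentProvider implementations are found by object type.
providers = {"note": notes_datasource}
object_types = {1: "note", 2: "note"}  # stand-in for looking up an object's type

def process_queue(queue, driver):
    """Drain the queue, dispatching each event to a driver operation."""
    for object_id, event in queue:
        if event == "DELETE":
            # The item is gone, so there is no content to retrieve.
            driver.unindex(object_id)
            continue
        datasource = providers[object_types[object_id]](object_id)
        if event == "INSERT":
            driver.index(object_id, datasource)
        elif event == "UPDATE":
            driver.update_index(object_id, datasource)
    queue.clear()

driver = ToyDriver()
queue = [(1, "INSERT"), (2, "INSERT"), (2, "DELETE")]
process_queue(queue, driver)
print(driver.index_data)  # {1: 'hello world'}
```

Swapping in a different search engine only means passing a different driver object; the content providers stay untouched.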
Posted by carl garland on
Kinda weird: I wanted to respond to the message by K2pts, but it isn't showing up in the forum yet, even though I received the posting by email. Anyway, I recommend that anyone who is interested in his post check out the book Design Patterns, in particular the Abstract Factory pattern. It is a very good book, and OpenACS is becoming more and more of an engineering feat vs. the original web hack ... good work guys.
Posted by Clay Gordon on
Neophytos: I have a question about fts that I can't seem to find
an answer to, namely, how are indexes and unindexes
scheduled?

In the first instance, an object added to the collection is not
locatable by fts until it is indexed and the indexes updated. Does
this happen in real time (or real-enough time)?

In the second instance, an object deleted from the collection is
findable until the indexes are updated and all references are
deleted. In Verity, the index updating is a form of garbage
collection that, because it is slow, usually is scheduled
overnight. This means that queries will find objects that are no
longer in the collection for hours -- or longer, depending on when
the next update is scheduled.

Are you familiar with Texis from Thunderstone? It is an ANSI
SQL-compliant RDBMS that has been optimized for full text
applications. It is capable of running real-time news services,
with real time indexing and updating when items are added OR
deleted. This is an ideal situation. How close can we come with
fts?

Thanks,
Clay

Posted by Jowell Sabino on
I need a bit more clarification on what "datasource" is (implemented as a tcl procedure above). I can't find the specification of what datasource is supposed to provide, but judging from the code, datasource seems to extract information 1) about the content, and 2) the content itself presumably for actual indexing by the search engine.

Pardon me if this question is silly, but if content is some binary file (like a pdf file stored in file storage, for example), will the content still be indexable/searchable? Or is it expecting too much for binary files (or in general, blobs) to be searchable, too? If binary file contents are not searchable, then "datasource" is limited to information about the binary file, and not the contents of the binary file. In other words, I could search for "files whose filename contains foo", but I cannot search for "files that contain the word foo".

If binary files/blobs are searchable, that is really cool. The implementation of datasource will probably be messy though when the CR is used, since content can be stored in three ways...

Posted by David Walker on
I'm thinking that each type of binary file would require some kind
of handler.  For example, if you had a pdf2txt handler then you
could index the content, otherwise you could only index information
about the content.
Posted by Neophytos Demetriou on
Jowell wrote: I need a bit more clarification on what "datasource" is (implemented as a tcl procedure above). I can't find the specification of what datasource is supposed to provide, but judging from the code, datasource seems to extract information 1) about the content, and 2) the content itself presumably for actual indexing by the search engine.
The specification for datasource is in packages/search/sql/postgresql/search-sc-create.sql. You didn't have to know that, but your description is correct: the datasource operation provides information about the content and the content itself. The datasource operation is used both for indexing and for returning search results (constructing the summary of each result). In the future, the specification of contracts, operations, etc. will be available through the acs-service-contract package.
Jowell wrote: if content is some binary file (like a pdf file stored in file storage, for example), will the content still be indexable/searchable?
The search package expects one of the following:
  • content holds the filename if storage_type='file'
  • content holds the text data if storage_type='text'
  • content holds the lob_id if storage_type='lob'
The search package uses these cases to retrieve the content of, and information about, an object. Next, the content is filtered according to the given mime type. Currently, only two mime types are supported, namely text/plain and text/html [but your implementations of datasource should provide info about other mime types as well; we will support them in the future]. Content with an unsupported mime type is not indexed by the indexer. So for a pdf file we might only index its title.
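
As a rough illustration of the three storage cases, here is a minimal Python sketch. The function name resolve_content and the lobs dictionary are hypothetical; a real implementation would read large objects from the database rather than from a dictionary.

```python
import os
import tempfile

SUPPORTED_MIME_TYPES = {"text/plain", "text/html"}

def resolve_content(datasource, lobs):
    """Return the text to index, or None if the mime type is unsupported."""
    if datasource["mime"] not in SUPPORTED_MIME_TYPES:
        return None
    storage_type = datasource["storage_type"]
    content = datasource["content"]
    if storage_type == "text":
        return content              # content is the text itself
    if storage_type == "file":
        with open(content, encoding="utf-8") as f:
            return f.read()         # content is a filename
    if storage_type == "lob":
        return lobs[content]        # content is a lob_id
    raise ValueError(f"unknown storage_type: {storage_type}")

lobs = {7: "text kept in a large object"}  # hypothetical lob_id -> data

# Create a small file to stand in for file-based storage.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("text kept in a file")
path = tmp.name

as_text = resolve_content(
    {"content": "inline text", "mime": "text/plain", "storage_type": "text"}, lobs)
as_file = resolve_content(
    {"content": path, "mime": "text/plain", "storage_type": "file"}, lobs)
as_lob = resolve_content(
    {"content": 7, "mime": "text/html", "storage_type": "lob"}, lobs)
as_pdf = resolve_content(
    {"content": path, "mime": "application/pdf", "storage_type": "file"}, lobs)
os.remove(path)

print(as_text, as_file, as_lob, as_pdf, sep=" | ")
```

The pdf case returns None, which corresponds to indexing only the metadata (such as the title) and not the body.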

As David suggests, for each mime type we require some type of handler/filter. Once a handler is available, e.g. pdf2txt, it is very easy to incorporate it into the search package (by adding one line to search_content_filter). I need to do a survey of available filters/converters before we add any as part of the package, but individuals can already use this functionality in their projects.
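
The "one line per handler" idea can be pictured as a lookup table of filters keyed by mime type. The sketch below is a Python model under that assumption; it borrows the name search_content_filter for readability but does not reproduce the actual Tcl proc, and fake_pdf2txt is a placeholder rather than a real converter.

```python
import html
import re

def filter_plain(raw):
    return raw

def filter_html(raw):
    # Crude tag stripping, good enough for an illustration only.
    text = html.unescape(re.sub(r"<[^>]+>", " ", raw))
    return " ".join(text.split())

# One entry per supported mime type.
content_filters = {"text/plain": filter_plain, "text/html": filter_html}

def search_content_filter(raw, mime):
    """Return indexable text, or None when no handler exists for the mime type."""
    handler = content_filters.get(mime)
    return handler(raw) if handler else None

def fake_pdf2txt(raw):
    # Placeholder: a real handler would call an external converter.
    return "text extracted from pdf"

before = search_content_filter("%PDF-1.4 ...", "application/pdf")

# Supporting a new mime type is a single registration.
content_filters["application/pdf"] = fake_pdf2txt

after = search_content_filter("%PDF-1.4 ...", "application/pdf")
cleaned = search_content_filter("<p>Hello <b>world</b></p>", "text/html")
print(before, after, cleaned, sep=" | ")
```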

Update: (Dan wrote) For object types that don't use the CR developers can use acs_object_type__create_type, but those that do use the CR need to use content_type__create_type. content_type__create_type overloads acs_object_type__create_type and provides two views for inserting and viewing content data, and the CR depends on these views.

Posted by Gilbert Wong on
Neophytos,

The documentation you provided was very helpful.  I hooked a test package into the search engine and it works beautifully.

Is there any way to manually add and remove objects from the indexed search or prevent the indexer from indexing certain objects?  I have several tables which allow content owners to make objects invisible to users.  For instance, the users can toggle whether or not they will release an object for general viewing by setting a column in the table (release_p).  If release_p = f, then I don't want it to be indexed.

What would the best way be to restrict the search engine from finding unreleased objects?

Thanks!

Posted by Don Baccus on
General permissions would be the general way to implement visibility restrictions.  Of course we have legacy packages that use ad hoc means to control visibility - Gilbert, are you talking about ecommerce3 by any chance?
Posted by Gilbert Wong on

Don,

Yeah, in the ec_products table, there is a column called active_p. Are you saying that if I don't want to release the object for general viewing, all I need to do is to make sure that I only allow the creator to view the object? After the creator is ready to release the object, then I grant permission to the_public or registered_users to read that object. Is that correct?

If so, I see two possibilities:
1. Remove the column active_p and replace it with a permissions check.
2. Leave active_p as is and add the permissions check to make sure that searches only pick up active products.

#1 would require me to rewrite some queries in the ecommerce package to do permissions checks to see if the product is viewable. #2 would require me to write another dml query to set the permission for the product. Both would require about the same amount of work. Any suggestions as to which one I should implement?

Hello Gilbert, as Don says, using general permissions is the general way to implement visibility restrictions on search results (permission checking is on by default). On the other hand, I have been thinking for the past few days that it would be more flexible if I included a searchable_p attribute in the output of datasource. The observer would then index only those content items where searchable_p is true. If there are no objections, I will add this attribute to the datasource operation.
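
A minimal sketch of how such a flag could be honored, in Python and purely hypothetical since searchable_p is only a proposal here. Treating a missing flag as "searchable" is an assumption made so that existing datasource implementations would keep working.

```python
def items_to_index(datasources):
    """Keep only items whose datasource marks them as searchable."""
    # A missing searchable_p is treated as true (assumption, see above).
    return [d for d in datasources if d.get("searchable_p", True)]

# A datasource for products could derive searchable_p from release_p or active_p.
products = [
    {"object_id": 1, "title": "Released widget", "searchable_p": True},
    {"object_id": 2, "title": "Unreleased widget", "searchable_p": False},
    {"object_id": 3, "title": "Legacy item"},  # no flag set
]

ids = [d["object_id"] for d in items_to_index(products)]
print(ids)  # [1, 3]
```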
Clay Gordon wrote: How close can we come with fts?
Unfortunately, I don't know much about Texis. OpenFTS can be used to index content items in real time. However, in OpenACS we choose to use a scheduled proc instead. This approach causes the problems that you have listed above but it is required in order to support multiple search engines. Of course
Posted by Bart Teeuwisse on
Neophytos, could you elaborate on the use of '' as keywords in notes_datasource? I'm continuing the work Gilbert Wong did on searching products in the ecommerce package. The ec_products table has a keywords field that could be used if it applies to this context.
Even though keywords are not currently indexed by the search package, I have included that field in the datasource operation so that it will be easier to enhance the search package in future versions. I suggest that you use the ec_products keywords field in the ec product datasource anyway (the functionality is going to be available in upcoming versions of the search package). I am going to make some improvements to the search package during the xmas holidays (including keywords, object type, mime type and package-specific search).
Posted by Roger Metcalf on
Are these keywords that can be passed to the search package still not indexed?  What would be involved in making them so?  Is that OpenFTS work, or OpenACS work?  Thanks.
Posted by tammy m on
I don't know about the OpenFTS work involved but I can say that static-pages  don't have any keywords associated with them so there would likely be OpenACS work involved for most packages already implementing Search to provide keywords.
Posted by Bart Teeuwisse on
Thanks Neophytos, I'm looking forward to the enhancements. In the meantime I'll be adding the search to more packages.