Forum OpenACS Q&A: .LRN Search thoughts and architecture - Please give feedback

(This is the result of an electronic discussion in between Dave Bauer
and me. I wrote the summary)

The PostgreSQL search package works fine, so for the implementation of
.LRN search, we want to leverage what is there.

The operations which need to be supported for this release are:

* retrieve items for a list of given search words
* Retrieve html markup for a given id
* index an item: insert, update, delete items

Currently the service-contracts in search provide these contracts:

* FtsEngineDriver
* FtsContentProvider

FtsEngineDriver has these operations

* search
* index
* unindex
* update_index
* summary
* info

FtsContentProvider has these operations

* datasource
* url

Currently these service-contracts are implemented with the
acs-service-contracts package, but we want to port them to the
callbacks provided by OpenACS 5.2 To achieve this under .LRN 2.1.1 the
callbacks need to be backported to .LRN 2.1.1. The backported code
will be isolated in a package e.g. dotlrn-backports.

Intermedia has its own indexing queue, this code can be safely removed
from the Oracle search package.

The search operation needs to be a bulk operation, you should be able
to pass in words (and a range) and you'll be returned items ready to
be displayed. There should be a "return marked up HTML" operation
called from within the search procedure. Whereas search definitely
needs to be a bulk operation, this may be a few single queries - it
should be tested on realistic data.

The "index an item" operations should not be run within the database I
think, but with callbacks. In the long run (and mid and short run)
this is probably a lot more extensible and maintainable.

Here are the initial ideas I had on callbacks:
ad_proc -callback search::datasource {
    -object_id:required
} {
    This callback is invoked by the search indexer when and object is
    indexed for search. The datasource implementation name should be
    the object_type for the object.
} -

# define for all objects, not just search?

ad_proc -callback search::index {
    -object_id:required
    -datasource:required
} {
    This callback is invoked by the search indexer. It will dispatch
    to the full text search engine. Additional optional paramters may
    be added to support additional features of full text search engines.
    @param object_id
    @param datasource A list in array get format of 
    content,title,description,keywords (a list of keywords or
    category names), syndication (list in array get format of 
    information suitable for XML syndication)
} -

ad_proc -callback search::search {
    -query:required
} {
    This callback is invoked when a search is to be performed. Query
    will be a list of lists. The first list is required and will be a
    list of search terms to send to the full text search
    engine. Additional optional lists will be a two element list. THe
    first element will be the name of an advanced search operator. The
    second element will be a list of data to restrict search results
    based on that operator.
} -

ad_proc -callback search::unindex {
    -object_id:required
} {
    This callback is invoked to remove and item from the search index.
} -

#####

ad_proc -callback object::url {
    -object_id:required
    -operation
} {
   This callback is invoked when a URL needs to be generated for an
   object. Usually this is called from /o.vuh which defers URL
   calculation until a link is actually clicked, so generating a list
   of URLs for various object types is quick.
   @param object_id
   @param operation (optional) Defaults to standard "view" 
    url for the object within the current subsite. 
    Operation to generate the URL for. Examples
    "view","edit", "admin" 
} -
I am not sure how exactly the -operation switch to the url callback should behave. Should it have a list of predefined valid operations? This list should allow full use of all existing features of the search package for PostgreSQL. We probably should try to imagine future requirements and plan for them in the callback inteface to allow for easier development.
I forgot to mention the additional features I want to support n the future.

Most important is allowing search on additional columns besides the full text index. Right now search supports searching on the creation_date of an object (number of days old) and Jeff added search based on the subsite where the containing package is installed.

I think we should support arbitrary additional columns through an optional parameter. I have implemented this for a couple of client projects and it seems to work reasonably well. I think added a callback definition of the additional columns would be approppriate, then the package that needs the additional column support could implement the callback and pass the require query information to the search driver.

One example would be searching an individual forum in the forums package, or underneath a specific sub-folder in file storage or any CR-based package.

Another place I have used this type of feature is searching a package specific category tree.

Hi,

here is a summary from the issues we found developing a search archtiecture for the guys at the HP professional printer division.

Permissions:

Only a fraction of the document result set from a search query may be visible to a particular user. In the HP case we had to consider search result sets with a million to a billion document IDs. What's important in such a context is a fast way to discard documents that a user can't see to avoid slow searches (hierarchical or via wildcard) over document permission hierarchies.

To tackle this issue we've come up with several options. The best one for the HP context was to include a fast (denormalized and precalculated) index on "projects" (the topmost containers of documents and document trees). So checking whether a user has read permissions on the document's project allows to discard 95% of all non-allowed documents in the result set.

Stemming and Normalization:

Big issue outside of the US. "Bäume" is the plural of "Baum" in German ("trees" and "tree"). Intermedia solves this issues by allowing the user to specify a stemming table that would translation "Bäume" back to "Baum". Works pretty simple during indexing time.

Search by Proximity:

Looking for "search engine" should return documents first where "search" and "engine" are following each other, and not a document where somebody talks about an "engine" and later about how to "search about an expert". => You'd need to include the position of words in a document in the search index.

Ranking in General:

The current OpenACS search is extremely poor, because it doesn't do any ranking of documents. I use Google to look for stuff in www.openacs.org...
=> You'd need to provide means for users to plug-in their own ranking algorithms.

Just some thoughts. Maybe you want to checkout
http://www-db.stanford.edu/%7Ebackrub/google.html
in order to learn about some _real_ stuff...

Bests,
Frank

Frank

These are interesting issues, but really unrelated to the definition of an interface for the search package. These are important issues to consider for the implementation of a specific search driver.

Re: OpenaACS.org, due to a silly bug on my part in developing the tsearch2-driver, the results were ordered in reverse relevance order, with the least relevant result first. This has been fixed for months now thanks to Jeff Davis.

Tsearch2 by default does stemming, and support multiple language dictionaries, but does not do proximity, I still think it returns pretty good results now. They results on OpenACS.org are much better than with the OpenFTS based search. One thing that would make it better is forums only search, for example.

THanks for your insights, we will definitely keep this in mind. The reason to support multiple search drivers is to allow more flexibility in how search is done.

Hi,

Dave wrote:

These are important issues to consider for the
implementation of a specific search driver.

We've just finished ahead of time a timesheet billing system for Project/Open, so that we are going forward with the implementation of a search engine now. So I've looked around a bit more to come up with a concrete architecture.

Here is what I've found:

- The "search" module together with the "tsearch2-driver" module contain a total number of 695 lines of TCL and SQL code.
- It was surprisingly difficult to enable searching for a simple Project/Open object such as a "Project".
- The "txt" table has a really ugly name and only contains two columns. However, we would need a lot more search-relevant information
- The current API is not sufficient for our purposes. We would need to pass around a lot more information between the application modules and the search module.
- Returning a list of lists for search results doesn't suit the needs of P/O, in particular because permission checking is extremely performance critical. However, permission checking is specific to each business object, so we will need to mix ("tightly integrate") application specific code with code from the "generic search engine driver". Creating a generic API for this purpose will be very complex.
- We will need support for "popularity" or objects (for example by analyzing web server logs or by counting the number of "hits" of the object). I haven't seen any support for this in the current code.
- I don't really understand why search needs the "observer queue", separating (in terms of execution time and application context) the search packages from the business logic. Updates of the "txt" table should be pretty fast, right?

Conclusions:

- We will need to add search performance optimization based on each object's "business object container" (typically the object's project_id). The reasoning goes like that: If the user doesn't have permissions on the project then he won't have permissions on the "contained" objects such as discussions and documents. For this purposes we will have to add an additional "container_object_id" column in the table "txt".
- "Filestorage documents" and "forum items" in P/O are not OpenACS objects (for performance reasons). However, they are also the most interesting objects to search for. So we will need a "txt" table with a primary key composed of object_id and object_type.

I hope that it becomes clear from the above reasons that we are probably not going to go with the "search" and "tsearch2-driver" packages for P/O searching. Instead, we are probably going to build a P/O specific search package based on the code of these two modules, but implementing all the additional stuff that we need.

The resulting search package is - obviously - going to be GPLed, so I hope that it may serve as an input to a future (generic?) OpenACS search package.

I'm not at all sure about these conclusions. I'm just writing in a relatively "provicative" style (you may already have noted it from previous postings...). So please argue with me and prove that I'm wrong.

I would be delighted... :)

Bests,
Frank

Hi,

just got the first prototype running. Here is the table definition for the main table:

create table im_search_objects (
        object_id               integer
                                constraint im_search_objects_object_id_fk
                                references acs_objects
                                on delete cascade,
                                -- may include "object types" outside of OpenACS
                                -- that are not in the "acs_object_types" table.
        object_type_id          integer
                                constraint im_search_objects_object_type_id_fk
                                references im_search_object_types
                                on delete cascade,
                                -- What is the topmost container for this object?
                                -- Allows to speed up the elimination of objects
                                -- that the current user can't access
        biz_object_id           integer
                                constraint im_search_objects_biz_obj_id_fk
                                references acs_objects
                                on delete cascade,
                                -- counter for number of accesses to this object
                                -- either from the permission() proc or from
                                -- reading in the server log file.
        hit_count               integer,
                                -- Full Text Index
        fti                     tsvector,
                                -- For tables that don't respect the OpenACS object
                                -- scheme we may get "object_id"s that start with 0.
        primary key (object_id, object_type_id)
);

create index im_search_objects_fti_idx on im_search_objects using gist(fti);
create index im_search_objects_object_id_idx on im_search_objects (object_id);

The following code adds a new object to the search index:


create or replace function users_tsearch ()
returns trigger as '
declare
        v_string        varchar;
begin
        select  coalesce(email, '''') || '' '' ||
                coalesce(url, '''') || '' '' ||
                coalesce(first_names, '''') || '' '' ||
                coalesce(last_name, '''') || '' '' ||
                coalesce(username, '''') || '' '' ||
                coalesce(screen_name, '''') || '' '' ||
                coalesce(username, '''')
        into    v_string
        from    cc_users
        where   user_id = new.user_id;

        perform im_search_update(new.user_id, ''user'', 0, v_string);
        return new;
end;' language 'plpgsql';

CREATE TRIGGER users_tsearch_tr
BEFORE INSERT or UPDATE
ON users
FOR EACH ROW
EXECUTE PROCEDURE users_tsearch();


I can't really see why there should be something "bad" about it right now...

Bests,
Frank
Frank,

This table looks interesting. I think figuring out which parent object to check for permissions is very helpful. I just talked to some people at the meeting in Madrid who used the same strategy. The biz_object_id, which is the topmost container would be useful. In OpenACS this would probably be a content folder, a package instance, or a subsite in most cases.

The search observer queue is to keep search indexing running in a seperate thread to not slow down object creation. It works fine and the abstraction is not a problem. I also think the new callback based procedures for datasource make it more flexible and much easier to define a new object type for searchability.

We had a meeting on search in Madrid, I will be posting the notes in the next day or two.

Hi,

which parent object to check for permissions

Unfortunately it's not that easy in the case of Project/Open. We can conclude from (no permission to see the parent) => (no permission to see the child), but not the other way around. We will have to check all objects individually afterwards. Anyway, most users will only be able to see a small number of containers, so it will be a good selection criterium.

Also, there are exceptions in P/O. There may be files in a project that are stored in a special "public" folder that should be available to all "Registered Users", "Employees" or "Sales Staff" for Knowledge Management reaons. So we'll have to extend our initial design above with a bitset for these profile-based permissions.

These are the reason why we can't just return an array of data from the search module to the display screen. It would just crash the system if we'd check potentially millions of entries. This also means a "no" for any service contract based architecture, if I see that right.

HP's professional printer division had a similar problem with a past version of their Intranet. This issue finally led them to abandon the project alltogether and to go with M$-Sharepoint. So permission checking and its performance implications must be integrated into the search system right from the beginning...

Bests,
Frank

Frank, permission checking has been an issue with OpenACS4X right from its inception (as you know)

In intranet search and in site design in general you need to consider whether your content is public or private by default. OpenACS is a general framework for building internet and intranet sites, there is (maybe unfortunately) not a builtin assumption on whether content is rather private or rather public. Even for a production .LRN site I cannot say that yet.

Now the rationale behind my approach is that the most restrictive query clause be applied first. Which one is the most restrictive clause? Depends on the query terms and on the content. In case of the HP Printer division, if you search for "HP Printer" *all* documents may be relevant, right? I think in terms of query tweaking there is no good starting point, so you may want to change the user interface, so it is much harder to run a query "HP Printer" on the complete content. E.g. if you search by default "in this community" you have a much, much smaller potential result set. (That'll be turned on in .LRN search by default)

Also we are not going to show the number of documents that match your query. You'd need to permission all the items in your result set regardless of whether you've reached your "limit" (e.g. the first 10).

What made the the search in the HP printer division intranet so complex?

What made the the search in the HP printer division
intranet so complex?

The quick answer: The fact that permissions were inherited from super-folders, so that you'd need a hierarchical query in Oracle.

Hierarchical queries are basicly executed iteratively one-by-one, even if they look like normal SQL queries.

Even for a production .LRN site I cannot say that yet.

Even the different .LRN sites may want to use different security policies. Is the course documentation going to be publicly searchable? And: There may be more then a black & white distinction between public and private. What about "inscribed alumni" against "registered users"? Or teachers? Or Directors/Managers?

The case with Project/Open is so complex that we will need to check the permission of each element using the standard TCL security checks at the end (the final results screen) to ensure watertight permissions. Its a software consistency issue.

And we can do this, if we are only showing 10 or 20 results for each screen, if the "list of candidates" (the results where the user has access to the container objects) is reasonalbly and if these results are reasonably ranked. So that is what we're going for.

However, I repeat again, you won't be able to do all of this via a service contract architecture. The final security check on the result screen needs to be application or even implementation dependent.

Bests,
Frank

Frank,

If we run acs_permission__perission_p on 10 objects, the results should be fast enough. I don't see any reason to add any additional security checks. Right now the code in tsearch2-driver does the OpenACS standard "if exists" where clause that checkes the acs_object_party_privilege_map and that seems to perform for the number of objects on a site such as OpenACS.org.

OpenACS itself does not need to support permissions systems that do not exist in OpenACS. Checking if the viewing user has read on the actual object returned in the results will always be sufficient.

I don't think there is any plan to defer permissions checking to a callback procedure, although in the case of project open, if you are doing the check in the Tcl layer, and need addtional flexibility beyond the OpenACS permissions system, a callback would seem appropriate.

There are .LRN installations with a greater number of objects than OpenACS.org and we will be testing the performance on that size system, so as we are developing we will have more accurate performance statistics to base our decisions.

Hi Dave,

I don't see any reason to add any additional security checks

That's correct. It's only that P/O has several additional security features that are not know available in the general OpenACS platform. Obviously these features result in custom code.

However, there is one main feature that you might consider in .LRN: "Profile-based permissions". For example if you want to grant "University Employees" permission to access all course material. Or to "Staff Professors".

"Profile based" permissions work really differently from "Object based permissions" (the only way to express security in OpenACS):
- Object based permission require a specific entry in a database table for every group-object relationship. Or an equivalent recursive lookup query. (This is a little bit like forward against backwards inference in artificial intelligence reasoners. Do you remember the 80'? 😊 )
- Profile based permissioin work without explicit reference to an object (there is no "object_id" in the permission tables). You can say: Every "Employee" should have access to all course material.

The "generic implementation school" (I'm not going to name its members in the OpenACS community... 😊 ) will argue: "You can implement all profile-based permission using object-based permissions". That is right. But granting object-based based permission for 100,000 documents to 10,000 users will create some 1G records in the denormalized permission cache or the equivalent number of recursive queries if the relationship is not denormalized.

If we run acs_permission__perission_p on 10 objects,
the results should be fast enough

That's OK and that's great. However, that means that you have to run acs_permission__perission_p at the "result page" and not in the "tsearch2_driver" in the service contract routine. (Because otherwise you would have to run the security check on _all_ results).

The HP printer guys just ran into that trap: They had to run the permission check on the entire result set of 10 million records, resulting in about 20-40 CPU seconds of a _seriously_ big Oracle server...

The situation in Project/Open is even a bit more complext, because we have different "container rules" for different object types. So we need to be flexible even in the main TSearch2 SQL query in order to eliminate entries as early as possible in the database execution path.

Bests,
Frank

One note on displaying items. You should use listbuilder and make use of pagination where the user already has the option to change the number of objects displayed.

Furthermore, you should add the permission check to the pagination query.

Last but not least, I thought that acs_permission__perission_p was deprecated and we should use

and exists (select 1 from acs_object_party_privilege_map ppm
where ppm.object_id = :project_id
and ppm.privilege = 'read'
and ppm.party_id = :user_id)

instead.

Hi Malte,

Listbuilder may be OK to render the list of 20 or so results for each page, but I would highly disencourage the use of the pagination feature of listbuilder. That's the main point in the last part of this discussion: You don't want to spend time checking the permissions of items that you are not going to display in the first place.

Project/Open has its standard way of pagination using "limit" and "offset", which would avoid the issue above. But I could even imagine a different approach with eliminating the object_ids of the previous pages by passing them around on the HTML level... (well, not that in serious, but you get the point with my preocupations in terms of performance?)

Bests,
Frank

Frank, it becomes obvious you haven't worked with listbuilder before :). Listbuilder's pagination query is using "limit" to limit the number of object id's that are returned. So you are *not* checking permissions for all items even those you are not displaying in the first place. A snippet from list-procs-postgresql.xql:

  <fullquery name="template::list::prepare.pagination_query">
    <querytext>
      $list_properties(page_query_substed) offset [expr $first_row - 1]
      limit [expr $last_row - $first_row + 1]
    </querytext>
  </fullquery>

And therefore the Listbuilder Approach is in my opinion more flexible than yours, but who am I to argue 😊.

Hi Malte,

thanks for your comprehension... I'm from East Westphalia region, we are know as the most subborn Germans around. Anybody else to compete?

However:

- Listbuilder _does_ establish the number of all elements in the query, doesn't it?

- Listbuilder probably doesn't allow to plug-in some custom TCL routines that do a final check if an item should be displayed, does it?

We're using listbuilder in P/O in several areas now, in particular for maintenance screens where the specific layout doesn't matter very much and when there is no need for application specific tweaks...

Frank

I am pretty sure that on pg the whole result set is constructed before you apply limit and offset (certainly it is when you sort).
Hi,

I am pretty sure that on pg the whole result set is
constructed before you apply limit and offset (certainly
it is when you sort).

Thanks a lot, I'll check. The main point is probably that there are no "iterative" operations (which include PLpgSQL statements, recursive queries or wildcard searches) before the application of limit & offset. However, a hash-join with the list of accessible business object containers should be OK and should discard > 80% of all entries for any user except for some senior management guys or administrators who can see all projects...

Nice playground for database afficionados, though :)

Frank

I posted discussion from Madrid. https://openacs.org/forums/message-view?message_id=291914

We want to think about how we can add in support in the callback definitions for these future capabilities.

Just to le you know:

We've got our first "Intranet" search engine running. You can see the online demo at http://ptdemo.dnsalias.com/. Log in as "Ben Bigboss".

The permission system finally uses a multi-stage filter architecture:

1. The SQL query uses a "join" to discard items that belong to "business objects" that are not visible to the current user. This is a relatively cheap procedure that can be applied to a huge result set effectively.

2. The remaining objects are joined and ranked by TSearch2

3. What's coming out of this is "paginated" by a "limit - offset". PostgreSQL may still generate the whole result set, but atleast that stays inside the DB.

4. The TCL "db_foreach" look contains object specific calls to the (TCL) permission routines for each object iteratively. We actually receive "limit"*10 rows from the DB in order to be sure to return the whole "limit" number of rows.

We still need to test this with larger result sets, but it already looks quite ok.

Bests,
Frank

Fast indeed. Something's funny though, only the trimmed index terms are searchable. E.g. search "activ" versus "active".
only the trimmed index terms are searchable

Yes, there are some things still going wrong. I believe the TSearch2 configuration is still for Russion or something like that. There are funny words appearing in the full text index ...

Fast indeed

It's running on a 512MByte 1.5HGz AMD box, together with 4 other virtual servers. The box is swapping a lot, so I hope it's going to become considerably faster when there is more memory...

Bests,
Frank