Forum OpenACS Development: Response to Enhanced KM in OpenACs 4.0 -- interface to OpenCyc

Michael:

Let's start with your first two ideas:
1) Query optimization
2) Improved presentation of query result sets

It seems to me that these two are the simplest to think about first, since they need only to interface with search and not any other part of OpenACS.

Query optimization requires taking the search request and passing it through OpenCyc to try to figure out the context of the request. Java is a good example: does the requestor want to know about the programming language, coffee, or the island in Indonesia?

One way to assist this contextualization is to localize on the originating context, meaning the site the search comes from. This can help with point 2. On the OpenACS site, for example, it is most likely that people are interested in the programming language. On Photo.Net, newbies are probably interested in the island, but people who are familiar with Philip and Alex's Guide and its ties to photo.net might be interested in the programming language. In neither case is coffee a likely context.
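A minimal sketch of that kind of localization, in plain Python rather than anything OpenCyc provides. The site names come from the examples above, but the weights and the `rank_senses` helper are assumptions made up purely for illustration:

```python
# Hypothetical per-site priors over the senses of "Java".
# The weights are illustrative guesses, not measured data.
SITE_SENSE_PRIORS = {
    "openacs.org": {"programming language": 0.90, "island": 0.05, "coffee": 0.05},
    "photo.net": {"island": 0.60, "programming language": 0.35, "coffee": 0.05},
}


def rank_senses(site, senses):
    """Order candidate senses by how likely they are on the originating site.

    Unknown sites (or unknown senses) get a weight of 0, and because sorted()
    is stable, ties keep the order in which the senses were passed in.
    """
    priors = SITE_SENSE_PRIORS.get(site, {})
    return sorted(senses, key=lambda sense: priors.get(sense, 0.0), reverse=True)


senses = ["coffee", "island", "programming language"]
print(rank_senses("openacs.org", senses))
print(rank_senses("photo.net", senses))
```

In a real system the priors would presumably come from what is actually in each site's collection, not from a hand-written table.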

Step 1 requires no knowledge of the contents of the collection. Localization, as described above, does. In order to be able to localize, it's necessary to do Step 3, which is to have OpenCyc process the entire contents of a collection so that it "understands" what is in the collection.

In 1998 and 1999 I was looking for money to build a search engine with some very unusual characteristics -- the main ones were collaboration, the inclusion of business rules engines to drive processes like spider scheduling, and incorporation of WordNet to establish context. One example I used was Java. After entering the search request, the user would be presented with something like:

Found 3 different meanings for Java:

  1. a computer programming language (2,000,000 pages, 5967 sites, 19 categories)
  2. a synonym for coffee (116,000 pages, 423 sites, 3 categories)
  3. an island in Indonesia (23,000 pages, 111 sites, 9 categories)

The pages, sites, and categories references were links to presentations of results. The order of the presentation above depended on the number of pages returned. With knowledge of the context of the query (for example, that the user is searching for Java from a travel site), results about the island are almost certainly the ones wanted, and these would be presented first or, perhaps, exclusively.
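Here is a rough sketch of that ordering rule: sort by page count by default, and move the sense matching the query's context to the top when a context is known. The `Sense` class, the `topic` labels, and the counts are all placeholders I am assuming for the example, not output from a real engine:

```python
from dataclasses import dataclass


@dataclass
class Sense:
    gloss: str
    pages: int
    sites: int
    categories: int
    topic: str  # hypothetical context label attached to each meaning


# Illustrative counts only.
JAVA = [
    Sense("an island in Indonesia", 23_000, 111, 9, "travel"),
    Sense("a computer programming language", 2_000_000, 5967, 19, "computing"),
    Sense("a synonym for coffee", 116_000, 423, 3, "food"),
]


def order_senses(senses, context_topic=None):
    """Senses matching the context come first; the rest go by page count."""
    return sorted(senses, key=lambda s: (s.topic != context_topic, -s.pages))


def render(term, senses, context_topic=None):
    """Build the 'Found N different meanings' listing as plain text."""
    ordered = order_senses(senses, context_topic)
    lines = [f"Found {len(ordered)} different meanings for {term}:", ""]
    for number, s in enumerate(ordered, start=1):
        lines.append(
            f"  {number}. {s.gloss} "
            f"({s.pages:,} pages, {s.sites} sites, {s.categories} categories)"
        )
    return "\n".join(lines)


print(render("Java", JAVA))            # ordered by page count
print(render("Java", JAVA, "travel"))  # island promoted to the top
```

Presenting the matching sense "exclusively" would just mean filtering on `topic` instead of sorting on it.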

A simple category structure would enable localization, too. If the user searched from within the "travel" category, then the scope of the search would automatically be localized to travel. Category organizers, in my way of thinking, should spend as much time thinking about metadata as they do about the category structure. For example, I have a personal interest in chocolate. If I were responsible for the chocolate category at a site that incorporated both categories and a page-based search engine, I would want to create a thesaurus of concepts that describe chocolate to aid a classification engine. With OpenCyc, I would use the OpenCyc tools to do this work using assertions like:

is a type of: white, ivory, milk, dark, bittersweet, truffle, bonbon, ganache
is a manufacturer of: Callebaut, Valrhona, Nestle, Hershey
has(?) holidays: Easter, Valentine's Day
is not related: labrador retriever
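To make the idea concrete, here is a very rough sketch of how a thesaurus like that could feed a classification engine. This is plain Python, not CycL or any OpenCyc API; the dictionary layout and the `score_page` helper are my own assumptions, and the substring matching is deliberately crude:

```python
# The chocolate thesaurus from the assertions above, as a plain dictionary.
CHOCOLATE = {
    "types": {"white", "ivory", "milk", "dark", "bittersweet",
              "truffle", "bonbon", "ganache"},
    "manufacturers": {"callebaut", "valrhona", "nestle", "hershey"},
    "holidays": {"easter", "valentine's day"},
    "not_related": {"labrador retriever"},
}


def score_page(text, thesaurus):
    """Count thesaurus terms found in the text.

    Any 'not_related' term vetoes the page outright (score 0). Matching is
    naive substring search, so "milk" would also match "milky".
    """
    text = text.lower()
    if any(term in text for term in thesaurus["not_related"]):
        return 0
    return sum(
        term in text
        for relation, terms in thesaurus.items()
        if relation != "not_related"
        for term in terms
    )


print(score_page("Valrhona dark ganache for Easter", CHOCOLATE))
print(score_page("My chocolate labrador retriever loves milk", CHOCOLATE))
```

The "is not related" assertion is what keeps a page about a chocolate labrador out of the chocolate category even though it mentions milk.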

I think that this is enough to give you a glimpse of some of the ways I have been thinking about using this.

Likewise, how do you picture the "chunking" of results into "contexts"? How is that like or unlike search results pages that are out there now? For example, I'm not really clear yet on how, from a user's perspective, "a collection of phrases and keywords grouped into the smallest possible number of headings" differs from Yahoo! or from Northern Light.

One of the challenges with taxonomies like Yahoo!'s is that there is a tendency to want to describe things using a single word. When that happens, related words tend to get separated and entries naturally get segregated. For example: Associations, Organization, & Clubs. To my way of thinking, the distinctions between these concepts are not useful when organizing the contents of the Internet, at least to most users. To professional categorizers, however, they are different, and the resulting distinctions, while linguistically correct and precise, are a nightmare from a usability standpoint.

My main problem with Northern Light is that, after using it for a while, you'll notice that the range of concepts it uses for its categories is quite limited, is often self-referential (e.g., there is a chocolate "custom search folder" within the chocolate results set), and the confidence ratings are often nonsensical (e.g., the first result within the chocolate custom search folder for chocolate has a confidence rating of only 85%).

From my POV the purposes of chunking are to quickly eliminate the contexts you know don't apply, and to present contexts you may not have been aware of (serendipity). NL doesn't do this for me, nor does Teoma or WiseNut.