Forum OpenACS Development: OpenACS wish-list

Posted by Chuck Rouzer on
Here is a small wish list for OpenACS:

Primary request: works flawlessly with PostgreSQL v7 and Apache v2.

Optional: easily integrates with applications like GNUenterprise
  (www.gnue.org).  I've even considered working on a CMS for GNUe.

I look forward to hearing any comments.  The current Ars Digita
situation is unfortunate; hopefully something good will come out of it.

Posted by Don Baccus on
By "works flawlessly" I assume you mean that no unported queries leak through our sieve?  I think the query dispatcher and query extractor tools will help with this, along with the fact that we have more people resources available, including folks who are willing to help with testing.  A very small number of folks did nearly all of the first port, and while we did test as much as we could given the amount  of time we had to give to the project (and, I might add, fixed a lot of aD bugs on the way) some queries did leak through.

Particularly in some of the larger, more complex modules like e-commerce and intranet.

4.x will only work on PG 7.1, BTW.  Given the amount of code in the PL/SQL (a.k.a. PL/pgSQL) layer, we need NULLs to work correctly in parameter lists; the fact that PG 7.1 also has outer joins is important, too.

As far as Apache goes, Petru Paler is picking up mod_aolserver and is going to work on integrating it with Apache 2.0.  In fact, we've set the sources up in the CVS tree; you can pick them up via anonymous pserver at openacs.org:/cvsroot by doing a "co mod_nsd".

Integration with other applications is a far more difficult technical issue.  People interested in such integration are probably going to need to form a subgroup and study the problem.  How do you integrate two unrelated, complex data models (the ACS data model is complex, and I presume GNUenterprise's is as well) and make them play together?  That's pretty much the major issue, I think (just off the top of my head).

Posted by Chuck Rouzer on
Yes!  Query tools, people, v7.1 or greater, mod_nsd, sounds good.

Well, you can hack together centralized user administration between applications, but it's nice to have options provided for initial installations.  Maybe this is partly already available with OpenACS?  For example, via system/LDAP, database, database table, etc.

Moving to further integration requires another level of communication, such as CORBA or SOAP.  The GNUenterprise Application Server uses CORBA to communicate with clients.  It could be interesting to combine the two when both are ready.

4: Gnu Enterprise (response to 1)
Posted by Albert Langer on
Chuck, thanks for the reminder about Gnu Enterprise.

Seems they have got a *lot* further than last time I looked, though
they are still nowhere near ready yet. That may actually
make it easier to pay some attention to future integration
issues. Their approach has been to do a lot more specification
of an integration architecture up front before delivering
working code than is usual for free software projects. A fair
bit of detail concerning the complex problem of integrating
separate data models is in their docs:

http://gnue.org/index.cgi/docs?package=

Actually doing much about it would of course end up, as Don
pointed out, needing a separate sub-project. But perhaps, since
Chuck is familiar with GnuE, something could be started now:
Chuck and others who are interested but unable to help with the
intensive core port could analyse the requirements and issues
while others work on the port.

I'm rather worried about the possibility of the ecommerce
side becoming stuck in limbo for a while: a working version
of ecommerce 3.x that can't be taken much further, while what
can be done with the skeleton for ecommerce 4 remains unclear
for a period that could be prolonged in view of developments
at aD.

A multiple-vendor version of ecommerce will certainly be
needed, and it will also need to define interfaces for
working with internal and external fulfilments, accounting, and inventory.

The skeleton for ecommerce 4 provides a good basis for
major aspects of the multiple-vendor side, and the Gnu
Enterprise work looks like a good approach to such
internal and external interfaces. It includes specifications
for things like handling multiple currencies, which can
come back to haunt you later when not taken into account
from the beginning. Related to that is the very nice
separation of the geographic tables in the ACS 4 version of
ecommerce 3, which could probably plug right into GnuE as is.

As far as I can see, integration with GnuE only relates to
the narrowly defined ecommerce aspects and not to anything
in the core (though of course it has to take into account the
parties model in the core). I'm assuming that there's no point
attempting to integrate at the Web UI level for end users, though
some admin pages might be just as much part of an "enterprise"
system as they are part of a "web" system.

Perhaps some liaison could be established between such an
ecommerce subgroup and the GnuE project.

Chuck - any chance of you kicking this off with a more detailed
analysis of GnuE issues in a separate thread and/or file-storage
document, e.g. including specific extracts and references to GnuE
docs?

Posted by Simon Carstensen on
Request: a better search engine.

I know this has nothing to do with porting ACS to OpenACS (since this is an upgrade instead of a port), but the current search capabilities of ACS are pathetic!

like upper('%$QQsearch_query%') just isn't good enough.
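
For concreteness, the current approach boils down to something like this (a minimal sketch; the table and column names are illustrative, not the actual ACS schema):

<blockquote><pre>
# Tcl sketch of the naive LIKE-based search the current pages do.
# $QQsearch_query is the user's query with quotes doubled for SQL.
set db  [ns_db gethandle]
set row [ns_db select $db "
    select message_id, subject
    from   bboard_messages
    where  upper(message) like upper('%$QQsearch_query%')"]
while {[ns_db getrow $db $row]} {
    # no ranking, no phrases, no stemming -- just a sequential scan
    ns_write "<li>[ns_set get $row subject]\n"
}
</pre></blockquote>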

Yours,

Posted by Don Baccus on
There are several possible paths to a better search engine (we know the current one isn't adequate).  Two possibly synergistic efforts come to mind.  The first is an in-database search engine being developed by the folks who run a Russian portal.  If someone's interested in tracking this down and evaluating it both for usability and completeness (i.e., are they done yet?), e-mail me.  I know that to make it work well on their portal site they had to make a change to the PG core's optimization and evaluation of LIMIT that Tom Lane didn't like all that well.  They were going to work together on a reasonable 7.2 solution, last I heard.  Might be worth checking into.

The other possibility is swish++ or something similar.  There was someone working on a swish++ module for OpenNSD, but I don't know how far they got.  That wouldn't be the hard part; the hard (but not terribly hard) part would be to rewrite swish++ so it could take and index dynamic content generated via URL as well as from files.

Swish++ has some phrase-based search capability, unlike the older swish.  The older swish wouldn't really be significantly better than the current search function in terms of results; it would return equally bad results much more quickly, though.

7: Searching using Swish++ (response to 1)
Posted by hafeez bana on
Posted by Don Baccus on
That looks like it...anyone want to play?  It looks like you'd have to index static content and dynamic content in two separate steps.

I have no idea how well this works in practice, BTW.

9: Swish++ (response to 1)
Posted by Michael A. Cleverly on
I'm just finishing up work on moving a site from AOLserver 2.3 & Solid to AOLserver 3, ACS 4 (Tcl), and Oracle. Rather than mess with Intermedia, we've decided to use Swish++. (Once our two-year Oracle license expires we hope to have moved to Postgres, and we didn't want to get locked into too many Oracleisms. Also, it looks like people have to fight lots of battles with Intermedia to get it working, and I've never been particularly impressed with the search results it returns on ArsDigita's site--and I assume they know how to configure Intermedia better than I would...)

Now that it (Swish++) can support tcp/ip, there's no need to patch AOLserver to use Unix domain sockets. Both indexing and searching are extremely fast. We have written a cron job to periodically dump stuff out of Oracle and into the file system, where we feed it to Swish++; a sketch of that job follows.
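
In AOLserver Tcl the dump step is roughly this (a sketch only; the table, columns, and dump directory are illustrative stand-ins, and the scheduling could equally live in cron):

<blockquote><pre>
# Periodically dump searchable content to files for the Swish++ indexer.
proc dump_content_for_swish {} {
    set db  [ns_db gethandle]
    set row [ns_db select $db "select doc_id, title, body from documents"]
    while {[ns_db getrow $db $row]} {
        set fd [open "/web/swish-dump/doc-[ns_set get $row doc_id].html" w]
        puts $fd "<html><head><title>[ns_set get $row title]</title></head>"
        puts $fd "<body>[ns_set get $row body]</body></html>"
        close $fd
    }
    ns_db releasehandle $db
}
# Re-dump (and then re-index) every six hours.
ns_schedule_proc 21600 dump_content_for_swish
</pre></blockquote>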

If you wanted, you could use wget & the Swish++ indexer to spider your site, indexing dynamic and static content at the same time.

Posted by Don Baccus on
Michael - you should write a HOWTO on how you've done this.

I have no doubt that I can figure this out on my own (and many here probably share that feeling) but I do have doubts about having time to do so in the near future.

So a simple step-by-step HOWTO would be great.

How good is SWISH++'s phrase searching capability?  What are the performance considerations involved in having it index Oracle (and later PG)-generated content?

We know that InterMedia is butt-awful slow at updating indices, and real sites do so periodically rather than on each content insertion...so periodic SWISH++ index updates are acceptable.

Lastly ... what about indexing non-text, non-HTML content?  Reaching down into Word documents, etc.?  Is there any work done using open-source conversion code to enable this?

Posted by Louis Zirkel on
In response to your last point about Word documents, Don: swish++ comes with a program called extract, which can be used to index binary documents such as Word files. From the README file:

  6. Index non-text files such as Microsoft Office documents
     A separate text-extraction utility "extract" is included to
     assist in indexing non-text files. It is essentially a
     more sophisticated version of the Unix strings(1) command,
     but employs the same word-determination heuristics used for
     indexing.

It's not the most elegant solution, but it seems to me it would be workable. I would think you could also use something like antiword to convert a Word document to text and then process it using the normal text-indexing features.

12: word doc parsing (response to 1)
Posted by Erik Rogneby on
We use wvWare to parse Word docs.

It's pretty decent. It will convert to HTML or text, and also rip out embedded pictures. I had to hack on the config files of the earlier version we used, but it's been updated since then and seems well featured.

For what it's worth, the parsing instructions/configs are in XML.

Posted by Jerry Asher on
Michael,

Did you evaluate SWISH-E? If so, why did you choose SWISH++ over SWISH-E?

I must admit that though SWISH-E is rumoured to die during indexing at times, I am nonplussed by SWISH++'s requirement of C++, which makes me (still running RH 6.1) have to upgrade various gcc libs.

Beyond that however, SWISH-E looks to have better search capabilities and better indexing features, including an Intermedia-ish way to index right out of the database.

Posted by Don Baccus on
SWISH-E 2 does seem to be a lot better than the earlier SWISH I compared to SWISH++.  At that time it was missing phrase-search functionality, the ability to take data generated by an external program (which is how they're hooking to databases, apparently), and the ability to use filters.
Posted by Michael A. Cleverly on
Jerry, before reading the fifthgate.org article above, my indexing was very cave-manish: a database table containing a document id and a word, with queries to retrieve the list of documents built up as

<blockquote><pre>
select document_id from search_index where search_term = 'foo'
intersect
select document_id from search_index where search_term = 'bar'
</pre></blockquote>

etc.

Sounds like Swish-E could be a lot better. Compiling Swish++ initially was a pain (RH 6.2). Thanks for the link.

Posted by Jerry Asher on
After poking at it during the weekend while not playing with the kids, I found one reason to prefer SWISH++ over SWISH-E as it exists today.  Phew as in stinky code! As in, I completely understand the comments from the SWISH++ author as to why he needed to rewrite the original SWISH.

It may be that the past year's experience with AOLserver has left me with high expectations regarding open-source coding, but the basic SWISH-E code is poorly commented, poorly structured, and has no separation of anything.  No separation of search client from index client (it's the same program with different parameters), no separation of library interface headers from actual headers.  There's no encapsulation.

Maybe I'm missing the point, but their "search" interface is evidently not a read-only operation, making the developers believe that an opened index can't be searched simultaneously from two different threads.

I'm going to spend another day or so playing with it and will most likely use Rob's threadpool services to make an AOLserver-friendly, stable, encapsulated interface.  Why?  Because if their doc is to be believed, the basic system does appear pretty powerful.

If worse came to worst, and search was good but crashed AOLserver a few times per day, I would consider just creating an aolserver/swish standalone webserver.  That is, isolate it completely from the aolserver/acs server, use a variety of techniques to communicate with it (ns_httpget or XML), and use other techniques to restart it when it croaks.
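
Talking to such an isolated search server would be simple enough; something like this (a sketch; the host, port, and page name are made up):

<blockquote><pre>
# Ask the standalone search server for results over plain HTTP.
set query [ns_urlencode "openacs postgres"]
set page  [ns_httpget "http://localhost:8081/search.tcl?q=$query"]
</pre></blockquote>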

Well, it's open source, right?  (I.e., I won't complain too much; I'll see if I can make it better with a reasonable effort.)

Posted by Talli Somekh on
There has been no mention of htDig in this thread. Is there a reason? Do people simply prefer not to use it, or is there a concrete, technical reason?

talli

Posted by Jerry Asher on
Within my time constraints I was planning on evaluating both.  Having run across Krzysztof's article on fifthgate, I started out looking at SWISH++ (why reinvent the horse?).  That's taken me to SWISH-E (since I may be idiot enough to play with dubious-quality code before undertaking a gcc/glib/all-that-stuff upgrade).

Next up, I hear I can get some of google's original code at a warez site, and somewhere along the line I'll take a look at htDig.

No dig was intended by its absence. (And if "dig" might mean "htDig" then that sentence has two completely opposite meanings.)

Posted by Kapil Thangavelu on
i was curious about the possible reasons for wanting to use swish-e
vs. swish++. afaics swish++ features more modular code, faster
indexing and searches, better file filters, and a daemon search
mode. what are the compelling reasons to use swish-e?

looking over the non-features list at the bottom of
http://homepage.mac.com/pauljlucas/software/swish/features.html

i don't see anything obvious.

Posted by Jerry Asher on
As I mentioned, I am taking a look at SWISH-E in my spare time because, at one time, I didn't want to go through the hassle/risk of upgrading gcc et al. on my machine; but after looking through the code in SWISH-E, I am rethinking my priorities.

I would like to have a search page that returns either (a) an abstract of each document, or (b) a piece of the document containing the search words.

Perhaps the most compelling reason for using SWISH-E is that
returning an abstract (a) appears to be a relatively straightforward operation (you do have to create the abstract as part of a document's meta tags and store it within the index), while it is undoable without a great deal of energy at the moment with SWISH++.  (b) appears undoable with both SWISH-E and SWISH++.

While I am not fond of the code I've seen in SWISH-E, I find the documentation on how to run and set up the system much more complete.

A brief glance suggests that SWISH-E appears to be "one genuine open source project," complete with a developers' mailing list and CVS.  While it's GPL and there is a Yahoo mailing list, SWISH++ appears to be much more of a one-man show.  Frankly, that may be better!

With no other knowledge, it's a crap shoot for me to tell where the time is best spent.  Which project will do better in the long run?  The one with the most features?  The best code?  The open source methodology?  The one man visionary?  Dunno.  What are your thoughts?

Posted by Krzysztof Kowalczyk on
I'll chime in as I love philosophical questions.

Ultimately the best solution is something integrated with the database (I don't like the idea of setting up and maintaining yet another program). Of the out-of-database solutions, I found SWISH++ to be not perfect but good enough to make it work (and the competition is not any better). PJ Lucas (SWISH++'s author) is a reasonable fellow; he incorporated my patches for daemon search mode, and that's why I went with it (I believe swish-e has to either be exec'ed or Tcl language bindings would have to be developed).

At the time I was looking at swish-e (a year ago) development seemed very slow, and I believe they haven't made a new version since.

I was not thrilled with SWISH++'s use of C++, but it has an amazingly compact index (around 10% of the original text, while it's not uncommon for this to be >50%), and I was even entertaining the idea of writing C code that would search the index (== reimplementing). Not rocket science, but ultimately upgrading libc and gcc seemed like less work.

Anyway, in the context of OpenACS, what I consider the challenge is designing a good integration of search with the ACS framework: a clean interface that could be implemented with different search engines, plus one good, documented implementation with an arbitrarily chosen engine. My view is that as long as it does the job adequately, the search engine doesn't matter. Unfortunately, at the time I was looking at this I attacked the problem from the wrong side (the tool) and lost impetus before attacking the real issue.

Posted by Jerry Asher on
Some notes from a few hours of evaluating htDig....

htDig development cycles seem pretty long too. 3.2.0b2 in April 2000, and 3.2.0b3 in February 2001.  I can't figure out which one of these three projects doesn't need CPR.

htDig is CGI- and C++-based.  It builds under Red Hat 6.1 without needing the upgraded libraries that Swish++ needs.

It does excerpts automatically, so in that respect it may be a better choice than either SWISH project for folks who want excerpts.  You can feed it from external processes, but at first glance that may make for very, very slow indexing.  I may experiment with that.

It does stemming, as do the SWISH engines, but has no explicit wildcarded searches.

Regarding a flexible, driver-based search interface for AOLserver and the ACS: I agree, that would be a very nice thing.  I was thinking last night that it would be pretty easy to hook up the SWISH engines with a database driver, but that doesn't really get anyone anything.  It would be an easy thing to make a variant of the database-driver interface fit any search engine, and the initial SWISH APIs look like a good model: with ns_search instead of ns_db, we might have openindex, search, getmatch (a looping construct like ns_db's getrow), closeindex, and maybe geterror and some others such as getMatchDetail, which might take a document id that has been matched and return an ns_set of properties.
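
Purely hypothetical (none of these commands exist yet; the names just mirror ns_db):

<blockquote><pre>
# Strawman ns_search API, modeled on the ns_db loop.
set idx   [ns_search openindex /web/indexes/site.idx]
set match [ns_search search $idx "swish and postgres"]
while {[ns_search getmatch $idx $match]} {
    # $match is an ns_set of properties for one hit
    ns_write "<li>[ns_set get $match url]\n"
}
ns_search closeindex $idx
</pre></blockquote>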

But that appears to be only one side of the coin.  The other side is an ACS module API.  It would be great if an interface/program could be made specifying module callbacks (much like ad_new_stuff does).

I don't have enough experience with these search programs yet to know the right form of the interface, but presumably each module could provide a callback function that would take a package id, a date, and optionally a "cursor" cookie (returned from the prior call).

Posted by Kapil Thangavelu on
hi jerry,

i'm just wondering if direct access to the index is the best
approach; swish++ does a bit of optimization in its treatment of
indexes that would have to be duplicated in aolserver. keeping with
the db analogy, might it not be better to connect to the swish socket
search daemon via an ns_search interface, along with patching swish++
to use persistent socket connections?

most of the useful docs i found for swish++ were the man pages,
which are pretty exhaustive; i've also noticed that paul tends to be
quick to respond to questions on the mailing list.

i haven't used this feature, but swish++ also does meta
extraction/searching, and i think you could define the abstract as
the title (stored in the index), but it would require jumping through
hoops for non-html/xml docs or writing a custom filter.

cheers

Posted by Krzysztof Kowalczyk on
This is at least the second time I've seen people suggest persistent connections to SWISH++. Persistent connections are needed to make things fast where opening a new connection is slow. This might be the case for a database, but not for SWISH++. I believe you can bombard SWISH++ on a local machine faster than anyone on the internet can bombard your site (argument 1: SWISH++ is fast enough), and you probably wouldn't even notice the speedup if you used persistent connections (argument 2: premature optimization is the root of all evil).

In short: persistent connections would most likely be wasted effort.

Posted by Jerry Asher on
Kapil, I'm not exactly sure what you mean by "direct access to the index."

I think it's very important to have an interface where the indexing program has direct access to the databases.  As an experiment, last night I directed htDig to index the openacs bboards.  While I left all the spidering parameters at their defaults, it took an hour and ten minutes to index 2898 documents.  So not only was spidering slow, but a spidering solution doesn't make it easy to perform incremental indexing.  I think a db-driven search filter would make incremental indexing much, much easier.  (The one difficulty with incremental indexing in all three of these programs is that none of them makes it possible to delete from the index, so periodically the entire index must be rebuilt to eliminate false positives.)

The htdig spidering solution is fine for sites where the searchable content doesn't change frequently, but I would probably want something better for a site with active bboards or wps.

(Other statistics from that experiment: using 20% of the CPU (one PIII 500), it looked through a million words, finding about 60,000 unique words.  It built a 700K index and a 1.2M excerpts database.)  When I get a chance, I will open up that htDig index so that everyone can play with it; with no disparagement toward Don, it is a much more powerful interface than the standard out-of-the-box search solution.  (For example, I can easily search for threads containing adida but not mello.)

That said, I don't care whether the searching program runs embedded in AOLserver or as an external process.  I started cobbling together a SWISH-E AOLserver module, but my intention was STILL to run that as a separate AOLserver.  Mainly I was using AOLserver to get a cheap and quick daemon/chroot/threading solution that can be controlled with Tcl scripts and scheduled procs.  (I also figure that a nice extension to SWISH-E, SWISH++, or htDig would be to expose the search interface as either a) a separate webserver, b) a socket-driven service, or c) an XML-RPC search engine.)

I believe it is possible to design a process in which authors create abstracts and store them in meta keywords that any of these three search engines can get to.  htDig will make excerpts on the fly, meaning that what it returns as the excerpt contains at least one of the search terms.  Again, user preference as to which is better.

Posted by Kapil Thangavelu on
jerry,

when i was talking about direct access to the index, i was referring
to the 'openindex' command from the api sketch in your previous
message, from which i assumed you were trying to interface directly
to the index files (via some sort of wrapping of the various search
libraries). it's clear from your response that you are more
interested in an external search daemon with rich exposed interfaces
(possibly another aolserver).

re incremental indexing - i ran into this same problem when i was
interfacing zope to swish++, the solution i came up with was to
store ids/urls of 'live' documents in a persistent btree and filter
search results based on that and have a cron job periodically
rebuild the index based on documents in the btree. i'm not sure how
applicable this solution would be to an acs/aolserver integration.
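
a rough aolserver translation of that filter might be (hypothetical:
an nsv array stands in for the btree, and the variable names are
made up):

<blockquote><pre>
# keep only hits whose ids are in the live-document set
set live [list]
foreach hit $raw_hits {
    if {[nsv_exists live_docs [ns_set get $hit id]]} {
        lappend live $hit
    }
}
</pre></blockquote>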

regarding choosing among the various search engines, i think there
should be some consideration of the various formats that the engines
can index. i think this becomes more relevant when cms integration
is considered and there exist possibilities of pdfs, ps, word docs,
etc. that should be integrated with the search. from a casual look
at the other referenced engines, only swish++ seems to support
indexing these types of docs without the aid of external (outside of
the distribution) programs and hacking. i might well be wrong about
this; if anyone knows otherwise i'd like to know.

i'm curious about what kind of integration you envision for an
external search engine and the database? seems to me a lot of the
proper indexing behavior is very application specific.

krzysztof,

you're right about adding persistent sockets to the swish++ search
daemon; the overhead in the system is very low for a new connection,
esp. for a number of clients <= the number of preallocated threads.
my informal stress testing indicated that the bottleneck was not
swish++ but the webserver.

Posted by Jerry Asher on
Hmm, my casual reading of the docs was that each of the three indexing programs indexed objects like Word docs, PDFs, or dbs in a remarkably similar way; that is, you create some sort of external program to feed the index.  And each provided links to, or distributed, perl scripts that worked with docs and pdfs.  But as I haven't gotten around to setting that up just yet, I don't know for sure.
<p>
I believe that the appropriate indexing behavior is ACS module specific (to a first approximation).  I.e. the bboard module knows best what constitutes a bboard message, and similarly with faqs, file storage, etc.
<p>
My first thoughts are to use a mechanism similar to the new-stuff system: each interested module provides a routine that accepts a date, and when called with a given date, the module returns lists of the content (either text, HTML, or XML).
<p>
To perhaps better support incremental indexing, it might be good to make that date into a date range.  Because some content to be indexed can be huge, it might be better to give each module the option of returning lists of ALL the content within the date range, or just SOME of the content and an indicator that there is more.  And what about PDFs and docs stored in the db, where you would like the indexing engine to extract the text from the PDF?  I would think this can be handled if the module can return the mime/type of each piece of content.
<p>
Finally, it looks as though some of these indexing programs are easiest to set up if they have some ability to spider the site, so the site admin may want to direct each module to return the content itself, or just the URL of the content for the spider to fetch.  For example:
<blockquote><pre>
ad_proc moduleSearchCallback { begin end content_or_url_p more } {
    Returns [list more described_content_list], where:
      more :== an empty string if there is no more content,
               or a cookie to pass back to this callback
               to fetch the next piece of content
      described_content_list :==
          [list mime/type title author date keyword-list url content]
} {
    # each module supplies its own body
}
</pre></blockquote>
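
A driver for such callbacks might then look like this (hypothetical; feed_to_indexer is an assumed helper that hands one described-content item to whatever indexer is in use):
<blockquote><pre>
# Drain a module's content between two dates, one chunk at a time.
set more ""
while {1} {
    set result [moduleSearchCallback $begin $end 1 $more]
    set more [lindex $result 0]
    foreach item [lindex $result 1] {
        feed_to_indexer $item    ;# assumed helper
    }
    if {$more == ""} { break }
}
</pre></blockquote>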

An alternative approach would be to make it more table driven.  In this approach, a generic indexing routine might be made to fit many modules if it could work off a table:
<blockquote>
<pre>
create table IndexDescriptors (
    module_name       varchar,  -- name of module
    table_name        varchar,  -- table containing date and unique id
    unique_id_column  varchar,  -- column holding integer uid of content
    date_column       varchar,  -- column holding date of content
    title_query       varchar,  -- query returning title
    author_query      varchar,  -- query returning author
    mime_query        varchar,  -- query returning mime/type
    content_query     varchar,  -- query returning content
    url_query         varchar   -- query returning URL
);
</pre>
</blockquote>
In this case, for modules that have an obvious date column and use integers as their unique ids, it would be pretty easy to have a generic routine go through and index them.
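<p>
For instance, something like this (a hypothetical sketch; dump_for_indexer is an assumed helper, and the per-column queries are collapsed into content_query for brevity):
<blockquote><pre>
# Walk the descriptor table and index each registered module's content.
set db  [ns_db gethandle]
set row [ns_db select $db "select module_name, content_query
                           from IndexDescriptors"]
set work [list]
while {[ns_db getrow $db $row]} {
    lappend work [ns_set get $row module_name] \
                 [ns_set get $row content_query]
}
foreach {module content_query} $work {
    set row [ns_db select $db $content_query]
    while {[ns_db getrow $db $row]} {
        dump_for_indexer $module $row    ;# assumed helper
    }
}
ns_db releasehandle $db
</pre></blockquote>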
<p>
Uh, gotta run to swim class with the kids. What are your thoughts?

Posted by Krzysztof Kowalczyk on
I think a string token would be better than a date. Given a token, the indexing module could ask each module: give me the number of new things to index; and: give me the next n things to index, plus a new token (this way the indexing module could control the indexing load, much as a date range could).

Internally, modules could use dates as tokens if that makes sense to them, but there's no need to force that on them.

Posted by Jerry Asher on
I thought date ranges would help with the "incremental index rebuild" situation.

Most of these indexing programs don't let you delete content from the indexes, but do let you merge indexes together.  Also, most of the search programs can't work with the index while it is being rebuilt, but need to be restarted when the new index is ready.

The reason I suggested date ranges is that I was thinking one way to implement the "big reindex" of a site is to do lots of "little reindexes" that then get merged into the current index.  I don't think that would necessarily cut the overall work down, but it would let web administrators schedule the work better.

For instance, with a site that had five years' worth of content, where stuff that was added in the past might change or get deleted, you could implement several different strategies depending on your site's needs:

1.  Every night, reindex the whole site.

2.  Keep yearly indexes, this year's index, and today's index.  Every hour, rebuild today's index with the new stuff, and every night, index one year's worth of old stuff, creating a new index by merging the latest index with the other indices (minus the one that just got rebuilt).  (See the sketch after this list.)

3.  Keep indexes on a monthly basis.  Every night, index the last month's worth of stuff, and every Sunday night, reindex the whole site.
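
In AOLserver terms, strategy 2 might be scheduled like this (hypothetical; index_today and merge_yearly_indexes are assumed helpers wrapping the indexer and the merge step):

<blockquote><pre>
# Hourly: re-index just today's new content.
ns_schedule_proc 3600 index_today
# Nightly at 03:30: fold today's index into the yearly indices.
ns_schedule_daily 3 30 merge_yearly_indexes
</pre></blockquote>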

Anyway, I think this could be done with a date range, but I don't see how a more opaque token string would help the incremental-merge situation.

Posted by Jerry Asher on
I always love comments like this one, taken from an htdig mailing list:
> 3.2.0b3 now for a couple of months with no problems - it's working
> great... why's it still beta? have people had problems?

There are quite a few bugs that remain to be resolved, plus some
remaining performance issues. The latter will require some work to
overcome and will probably involve rewriting large parts of htsearch
(this will also give additional features).
The emphasis is mine, with regard to a large rewrite that will also give additional features; but, um, given their timetable so far (ten months between point releases of their beta), it could be quite a while before a non-beta htDig is released.
Posted by Talli Somekh on
Does it seem worth starting a new discussion about searching? The header of this one, "OpenACS wish list," is a bit misleading since the thread has become almost exclusively about search.

Jerry, since you seem to be taking the lead on this, would it be too much to ask you to put together a quick little WimpyPoint presentation about the strengths and weaknesses of htdig and the SWISHes vis-a-vis OpenACS? I know that you're very busy, but given the amount of information and knowledge you've gathered, I bet it would be very helpful for the community in providing you with more support.

thanks.

talli

Posted by Jerry Asher on
(the topic hasn't drifted so far, just from wish to swish.)

I have been planning on putting up a more formal discussion about search when I actually have something to show for my efforts....

Posted by Kapil Thangavelu on
more thoughts on text indexing.

- it's probably important to differentiate between aolserver and acs
solutions.

- an acs solution should probably have some integration with the cms,
as this seems to be the common store for application content data
and has a lot of 'free' info there (re mime/types); plus cms
integration might lower the application programmer's burden for
adding search capabilities + maintenance.

- conversion of non-text to text: while there are a lot of 3rd-party
tools to do conversions from any particular document format, as has
already been mentioned in this thread, swish++ comes with a tool to
extract text from binary data. i think this offers a great deal of
convenience, esp. since 3rd-party tools like wvware can be a pain to
compile on a server since they have lots of nested depends.

- aaron swartz suggested lucene.sourceforge.net as a possible
indexing mechanism, and i'm pretty impressed by its capabilities: 1mb
indexing heap, fast indexing, updates on indexes while they are being
searched, merged searches of multiple indexes, flexibility in
document definition, path-limiting queries (in cvs). it would need a
socket server interface or perhaps xml-rpc to be useful from
aolserver.

- also, it's probably worthwhile to check out the acs5 take on
searching and search metadata:
http://developer.arsdigita.com/acs-java/doc/services/search/doc/index.html

Posted by Jerry Asher on
I've been playing with htDig and AOLserver/OpenACS integration. You can see a work in progress at http://www.theashergroup.com/demo/openacs. For the purpose of this demo, I occasionally index https://openacs.org/bboard, https://openacs.org/doc, and https://openacs.org/wp.

Ah, the cruft, the cruft, the cruft.

This demo shows: exact matches, fuzzy matches (stemming, wildcarding (the wildcard char is *), synonym lookups, speling correction), site and subsite searches, and date-range searches.

This demo doesn't yet implement incremental updates, and because I am indexing OpenACS rather than my own site, the indexing is done via crawling and not through any OpenACS search/db integration.

So far to get this to work, I've

  • used htDig 3.2.0b3
  • made one patch to AOLserver (to "understand" semicolons in query strings); alternatively, you can make a simple patch to htDig to use ampersands
  • installed xpdf so I could index pdf
  • installed catdoc so I could index doc
The integration is "interesting": htDig has none of the interface we discussed above; it's pretty monolithic and likes to be a CGI program. So what I did was to take its template HTML wrappers and turn them into template ADP wrappers. My search page execs out to htsearch, captures the result, and turns it over to ns_adp_parse, returning the results of that to the user.
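
The glue amounts to a few lines (roughly; the htsearch path here is a made-up stand-in, and the argument handling is simplified):

<blockquote><pre>
# Run htsearch as a subprocess and render its output through ADP.
# (htsearch accepts its CGI query string as a command-line argument.)
set raw [exec /opt/htdig/bin/htsearch "words=[ns_urlencode $query]"]
ns_return 200 text/html [ns_adp_parse -string $raw]
</pre></blockquote>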

Future work:

  • incremental indexing
  • db indexing - htDig wants to be a crawler, and that's not great for our OpenACS kind of pages. For example: the main /bboard page has all these wonderful links like "sort by activity" which can take you to a "statistics page". Naively, htDig spiders it correctly, but this is both inefficient and leads to search results where the bboard table of contents (listing every bboard message) often comes to the top of the results. htDig can easily be configured to avoid those links, and I've done that. But it's clear that better htdig/db integration would lead to better results, quicker results, and leave your log files usable.

    So one goal of better db integration with htDig would be to create an API (like the newstuff system) where a module could quickly hook into htDig (or whatever).

    The other thing db integration can provide is much better date-range matching. Right now, htDig thinks any tcl or adp page was just created today, so almost everything but html pages comes up in the "last week" filter.

  • stability - htDig has worked very well for me so far, but there are reports on the net of various bugs: crashes and occasional db corruption.
  • xml generation and an xml-rpc or soap interface: imagine an OpenACS search solution that let any OpenACS site request a search from any other OpenACS site. Do you use Michael Cleverly's community ubersearch? With an xml-rpc or soap interface this sort of search (and more powerful ones) can be an easy and efficient OpenACS module.
Lucene looks interesting. Would that I had seen that two weeks ago, eh?

I am not sure what to make of the ACS 5 search solution. I glanced at it briefly, and while the ACS and OpenACS certainly could use a search capability like that, it appears to be completely RDBMS-centric. I.e., it doesn't appear to be implemented using Intermedia or any other specialized indexer, so I am left wondering what its performance will be. I gave it just a brief glance, so I am not certain that they don't plan to rely on Intermedia or some other search-engine technology.

Let me know what you think of my search implementation. I will try to make it available to the community soon, and I would really like some community help in discussing htdig/db integration issues.

Time now for the Simpsons and some more cold medicine: summer colds are just the worst.

Posted by Jerry Asher on
Another search solution might be "mnoGoSearch" at search.mnogo.ru, of course.  It claims out-of-the-box support for PostgreSQL, Oracle, ODBC, and yes, even MySQL.  It's not at all clear what that means, though: are they storing their index in the backend db, or do they let you write modules to index a db?

Gotta run.  Going to find out who shot Mr. Burns.

Posted by Michel Henry de Generet on
<p>I liked the implementation of Illustra, where several kinds of indexes could be linked to a table, especially a free-text index. But it seems quite a big job to implement that kind of index plugin in a current database.
<p>I also like the idea of Software AG's Tamino, where each XML file is stored and indexed by a free-text index. This is conceptually much more attractive and open than decomposing the XML structure into a forest of tables.
Posted by Janine Ohmer on
FWIW, I tried installing swish++ for a non-ACS client.

It didn't compile with gcc 2.96.  The author claimed that was due
to bugs in gcc, and he was quite rude about the whole thing.

I switched to the released version of htDig and had no problem
compiling it with the same "broken" compiler.  I also found that
the site had a friendly, we're-here-to-help tone, in marked
contrast to the swish++ site, which has more of a "don't bother
me, riffraff" tone.

This has nothing to do with either one's appropriateness for
OpenACS, which I have not attempted to evaluate, but I would
choose htDig any time it would serve the purpose, just because I
found the attitude of the swish++ author so off-putting.  Life's
too short to deal with people with bad attitudes!

Now, another search suggestion - I see no mention of wwwDB
(http://www.wwwdb.org).  It supports both Postgres and Oracle
and that's all I know about it;  I saw it mentioned in a list of open
source tools for Oracle.  Might be worth someone's while to
check out, though.

Posted by Jerry Asher on
wwwDB appears to be a nice tool, but not a search engine per se.  Reading the overview (http://wwwdb.org/wwwdb/0000000000000000/WWWdb/WWWdb:Tools:ShowDoc;id=9), it is more a CGI-based SQL DB "navigator" and FORMs tool.

Having never used AOLserver 2.33 much, I think it had something similar built into it.  There is more to wwwDB than that, though; they built their own website using it.

Posted by Talli Somekh on
i know i asked this before, but i really do think it's time for another bboard...

talli