Forum OpenACS Development: Semantic Search in OpenACS

Posted by Neophytos Demetriou on
Hi everyone,

I'm planning to work on a semantic search package that can perform natural language queries over a collection of documents. My plan is to use pgvector, solr, and faiss as different options for this package.

Pgvector is a PostgreSQL extension that enables fast vector similarity search using indexes. Solr is a popular open source search platform that supports various features such as faceting, highlighting, and spell checking. Faiss is a library for efficient similarity search and clustering of dense vectors.

The idea is to use pgvector, solr or faiss to store and index the document embeddings, which are generated by a pre-trained language model. This is the tricky part, i.e. generating the document embeddings. I did the exact same thing in Python and it is very easy. For TCL/OpenACS we will need a C-based module e.g. for naviserver.

I would appreciate any comments or suggestions on this project. Thanks for reading!

Posted by Gustaf Neumann on
I also turned my attention to vector search in PostgreSQL around the beginning of the year (version 0.3.*), but there is so much left to do, and so little time to do it. I would appreciate a joint effort. I'm not sure we need a C-based module, but you have probably looked deeper into it than I have.
Posted by Neophytos Demetriou on
Hi Gustaf,

Looking forward to a joint effort. The C module is needed to produce the document embeddings. I'm on my way back from holidays. I'll post the details as soon as I am back home. In short, I plan to use https://github.com/skeskinen/bert.cpp as the underlying library. This will allow me to load an existing huggingface model e.g. all-mpnet-base-v2. Storing and indexing with pgvector and solr should be simple. faiss might need its own TCL binding. We'll see.

I'll explain why this is a superior approach to plain search when I get home.

Posted by Neophytos Demetriou on
I have shared an initial implementation of the TCL/C extension with Gustaf. In short, it provides three commands: load_model, unload_model, and ev (standing for embeddings vector). I'm going to use this module tomorrow to compute the vectors to store in pgvector (and later on in solr and faiss).

In short, an embedding vector is a list of numbers that captures some of the semantics of the input by placing semantically similar inputs close together. This list of numbers depends on the language model and how it was trained. Embedding vectors help us find phrases that are relevant to a query, even if they have different words.

Here's an example of what the results might look like for a given query (note that the first result contains only one of the user's keywords, yet it is the most similar according to the language model used):

Search query: "Should I get health insurance?"

Search results:
1. Should I sign up for Medicare Part B if I have Veterans' Benefits?
(similarity score: 0.5152)
2. Can I sign up for Medicare Part B if I am working and have health insurance through an employer?
(similarity score: 0.4782)
3. How can I get help with my Medicare Part A and Part B premiums?
(similarity score: 0.4490)
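
To make the ranking idea concrete, here is a minimal Python sketch. It is not the tbert code, and it assumes the score is a cosine similarity, which the post does not state. The three-dimensional vectors and the function names are invented for illustration; real embeddings come from the language model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, docs):
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, made up for this example (not produced by any model).
docs = {
    "Medicare Part B and Veterans' Benefits": [0.9, 0.1, 0.2],
    "Help with Medicare premiums": [0.6, 0.5, 0.1],
    "Best pizza places in town": [0.0, 0.2, 0.9],
}
# Pretend embedding of the query "Should I get health insurance?"
query = [0.8, 0.2, 0.1]

for title, score in rank_by_similarity(query, docs):
    print(f"{score:.4f}  {title}")
```

The point of the sketch is that ranking depends only on where the vectors point, not on shared keywords.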

If you have any questions, please do not hesitate to let me know.

Posted by Neophytos Demetriou on
Simplified install for TCL/C module.
The source code is here for now: https://github.com/jerily/tbert
I'm working on a pgvector-driver for OpenACS.
I will post when I am done.

PS. I will either have to post instructions on how to download the model files and convert them to ggml (requires having python installed) or upload the converted ones somewhere. OpenACS file storage limit (20 MB) is too low.

Posted by Neophytos Demetriou on
I've created a Dockerfile so you can try this out without much hassle. Here are the commands for your convenience:

git clone --recurse-submodules https://github.com/jerily/tbert.git
cd tbert
docker build . -t tbert:latest
docker run --rm -it --entrypoint bash tbert:latest

Then, inside the container:

cd tbert/build
tclsh8.6 ../example.tcl ../bert.cpp/models/all-MiniLM-L12-v2/ggml-model-q4_0.bin

You can even copy the model as follows:

id=$(docker create tbert:latest)
docker cp $id:/tbert/bert.cpp/models/all-MiniLM-L12-v2/ggml-model-q4_0.bin .
docker rm -v $id
ls -la ggml-model-q4_0.bin

This is likely the way to go with models, i.e. download a couple of models inside the docker container and then copy them over to the host machine to try them out on your openacs instance.

Posted by Neophytos Demetriou on
Here are a few notes about an early version of the pgvector-driver:

1. The source code lives here for now: https://github.com/jerily/openacs-packages

2. I had to use the same table name as tsearch2-driver i.e. txt because I was testing with xowiki that has the table name hardcoded in ::xowiki::datasource (xowiki-sc-procs.tcl).

3. I have to update the tbert C module to work with naviserver. For now, I was testing by simply loading the shared library i.e. load libtbert.so.

4. I will polish the package further tomorrow and, most likely, provide a Dockerfile so that people can test easily.

IMPORTANT NOTE: Not sure when tsearch2-driver changed but, to the best of my understanding, it is no longer taking ranking into consideration (ranking is lost after checking permissions and the distinct clause). Pretty sure the original version that I did some 20 years ago did not have that issue.
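
As a rough illustration of the ranking issue, here is a Python sketch of filtering and de-duplicating hits while keeping their rank order. It does not reflect the actual tsearch2-driver code, which is SQL and was not inspected for this example; the object ids, ranks, and function name are hypothetical.

```python
def filter_and_dedupe(ranked_hits, readable_ids):
    """Keep permitted hits, drop duplicates, and preserve best-first order."""
    seen = set()
    result = []
    for object_id, rank in ranked_hits:
        if object_id not in readable_ids or object_id in seen:
            continue
        seen.add(object_id)
        result.append((object_id, rank))
    return result

# Hypothetical hits, already sorted best-first by the search engine.
hits = [(42, 0.91), (17, 0.80), (42, 0.75), (99, 0.60), (5, 0.30)]
# Hypothetical set of objects the current user may read.
readable = {42, 17, 5}

ordered = filter_and_dedupe(hits, readable)

# Order-losing variant: a set removes duplicates but carries no rank,
# so whatever consumes it can no longer present results best-first.
unordered = {object_id for object_id, _ in hits if object_id in readable}

print(ordered)
print(unordered)
```

In SQL terms, the same idea is to carry the rank column through the permission check and the de-duplication step and to apply the final ordering on it afterwards.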

Posted by Neophytos Demetriou on
I will polish the pgvector-driver today and then check out the pg_embedding PostgreSQL extension that was just released: https://neon.tech/blog/pg-embedding-extension-for-vector-search
Posted by Neophytos Demetriou on
1. I have made the changes to tbert to produce a NaviServer module that can be used in your OpenACS/NaviServer config.

2. Added instructions to the pgvector-driver package readme.md on how to build and configure the tbert NaviServer module.

3. I'll provide a Dockerfile so that you can test with ease.

4. Once I'm done with the Dockerfile, I plan to do a pgembedding-driver package that also uses tbert to compute the vector embeddings. If there is enough time, I'll create a drop-in replacement for tsearch2-driver that fixes the ranking issue (no wonder openacs.org search results do not make sense).

Posted by Neophytos Demetriou on
Dockerfile for running OpenACS with pgvector-driver is ready. Here's what you have to do:

docker build . -t pgvector-driver:latest
docker run --network host pgvector-driver:latest

After you execute the above commands, all you have to do is point your browser to the following URL and create some content in xowiki. You can then try to search for it from search: http://localhost:8000/

Notes:

1. Please make sure nothing is running on port 5432 of your host machine. In other words, stop the PostgreSQL service on your host machine in order to try this.

2. It only indexes titles for now.

3. I will try to populate xowiki with some CSV data that I have and update the postgres dump that I'm using for this.

That's all for now. If there are any questions, please do not hesitate to contact me.

Posted by Neophytos Demetriou on
I forgot to ask you to check out the git repo first in my last message. Here are the commands for running the demo for pgvector-driver again:

git clone https://github.com/jerily/openacs-packages.git
cd openacs-packages/pgvector-driver
docker build . -t pgvector-driver:latest
docker run --network host pgvector-driver:latest

That's all!

Posted by Neophytos Demetriou on
I'll create another database dump and share the credentials so that you can try it out. It only has three pages in xowiki at the moment that you can search.
Posted by Neophytos Demetriou on
Should be fine now.

email: test at example.com
password: test

If you have already built the docker image, you might need to rebuild it with the following command (the rest of the steps are the same):
docker build . -t pgvector-driver:latest --no-cache

Posted by Neophytos Demetriou on
Added pgembedding-driver, which uses the pg_embedding PostgreSQL extension, to my openacs-packages repository (see above). It may be faster than pgvector but it seems to me that pgvector is more robust at the moment. I had to jump through some hoops to get it to compile on my system.

Anyway, the two vector similarity packages I provided together with tbert are put out there as proof of concept that TCL/OpenACS has these capabilities now.

Here's my plan for the next couple of days:

1. Resolve the issue with the current version of tsearch2-driver (see above: ranking is lost, so search results are NOT in order). Most likely I'll provide a drop-in replacement for it i.e. pgfts-driver.

2. Polish tbert, pgvector-driver, and pgembedding-driver. For example, cmake does not seem to be the preferred choice for NaviServer modules. I'll sort it out. Just wanted to have all dependencies installed from the same git repo. Furthermore, I need to add the index for the pgvector-driver package and add permission checking.

3. Ideally, the vector similarity packages should be used to provide additional search results (a "similar to your search" kind of section). This would require some changes to the search package. They are easy to make but they are out of scope for now.

4. I won't do solr and faiss unless someone really needs them. Both solr and faiss have the downside that they won't have the acs objects. So, searching will slow things.

PS. If there is no interest from the community about vector similarity search, I might as well turn my attention to a two-factor authentication solution for OpenACS. It would require a C-based module (maybe two naviserver modules) as well but I have that under control.

Posted by Neophytos Demetriou on
So, searching will slow things.

I meant retrieving the search results (acs objects) from the db after solr and faiss responded.

Posted by Neophytos Demetriou on
For example, cmake does not seem to be the preferred choice for NaviServer modules. I'll sort it out. Just wanted to have all dependencies installed from the same git repo.

This is done (added NaviServer module Makefile, thanks Gustaf), plus I've made the cmake installation more robust for TCL installation. If you have trouble installing tbert on your system, please do not hesitate to contact me via email or here. I'm using Ubuntu Linux and so are the docker images.

Posted by Neophytos Demetriou on
I had to use the same table name as tsearch2-driver

First thing today, after my morning walk, I will change this to use its own table and modify the search package to have an optional Related Searches section. I don't remember right now if all service contract implementations of FtsEngineProvider are used during indexing in the search observer. I think that was the case when I created the package. I will check and make any necessary changes. Who is the search package maintainer these days? Otherwise, please let me know where I should send the patch.

Next thing should be to add parameters to pgvector-driver to be able to customize the model to be used. Finally, the rest of the things I mentioned.

Posted by Neophytos Demetriou on
I'm updating the docker images. Will post as soon as I'm done. You can see the result here (see screenshot): https://github.com/jerily/openacs-packages/tree/main/pgvector-driver

You will be able to play with the docker image demo once I am done.

Posted by Neophytos Demetriou on
I'm done.

1. See screenshots here: https://github.com/jerily/openacs-packages/tree/main/pgvector-driver --- Pay special attention to the second screenshot in which PostgreSQL FTS did not return any results. The search query was "I hate this dish". Top result from related searches: "I don't like this dish. Taste is poor".

2. You can try it all with docker, including the food hunter reviews that I populated in xowiki and you can search them already. Here are the commands again:

git clone https://github.com/jerily/openacs-packages.git
cd openacs-packages/pgvector-driver
docker build . -t pgvector-driver:latest --no-cache
docker run --network host pgvector-driver:latest

Once you run the last command, you can point your browser to http://localhost:8000/ and log in with the following credentials:

email: test at example dot com
password: test

Any and all feedback is welcome.

Posted by Neophytos Demetriou on
You can try it all with docker, including the food hunter reviews that I populated in xowiki and you can search them already.

I forgot to mention that content is indexed both in pgvector-driver and pgembedding-driver and you can switch between the two in the demo by changing the new VectorDriver parameter of the search package (patch available in git repo). So, you can compare the results for yourselves.

That said, there's no point polishing these packages further until there is some sort of response from the community on where they want to take this. All the code is available in the aforementioned git repositories. Once I have some feedback, I'll improve them.

My time is limited as well starting next week as I'll be back to work but the OpenACS packages (pgvector-driver & pgembedding-driver) and tbert (the C-based module for vector embeddings) are in good shape and so I can make any kind of small changes or improvements to them based on the feedback I receive.

I'll switch to multi-factor authentication now in the hope that I can provide a proof of concept in the next couple of days.

Posted by Adrian Ferenc on
Hi Neophytos,

I'm running into an issue when running the docker build command. When trying to run the command on line 124 of the Dockerfile, I am seeing the error
"1.143 error: patch failed: distfunc.c:11
1.143 error: distfunc.c: patch does not apply"

Any suggestions on how to get this to work would be greatly appreciated. I tried adding the --reject flag when running apply, but that did not work either. If there's any other information I could provide that would help you, please don't hesitate to ask.

Posted by Neophytos Demetriou on
Hi Adrian, thanks for reporting this. I will check as soon as I am back home and let you know. Looks like it is related to pg_embedding.
Posted by Neophytos Demetriou on
You were right Adrian. There were new commits in pg_embedding git repo (the postgresql extension) and the patch could not be applied anymore. It is good that you caught this so early. Using the head branch was a recipe for disaster.

There are no releases for pg_embedding, so I created a tarball and I am using that in the Dockerfile now. For your awareness, the patch is no longer needed. Please give it a try and let me know if it worked or not. I will be available until around 5pm EST.

I will also change the Dockerfile to use stable releases of OpenACS and NaviServer just in case.

Posted by Neophytos Demetriou on
One thing to note here is that the quality of the results depends on the language model being used. I used a very simplistic language model in the demo because of its size (20MB). In production, I would use a better model like all-mpnet-base-v2 (200MB). Please let me know if you need instructions on how to download and convert it so that it can be used. I can also add a parameter to the two drivers so that you can set up a different model easily.

The other thing to note is that the results from pgvector-driver were poor in the demo after I added an index on the table (does approximate search in that case). So, I decided to switch the VectorDriver parameter in the search package to pgembedding-driver that produces better results now. In other words, the demo now uses pgembedding-driver for vector similarity search.
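
To show why an index can change the results, here is a small Python sketch that compares exact search with an approximate, cluster-based search. It assumes an inverted-file style index that only scans the list closest to the query; the thread does not say which index type was used, and the points, centroids, and the `probes` parameter name are made up for this example.

```python
import math

def nearest(query, points, k):
    """Exact top-k: compare the query against every point."""
    return sorted(points, key=lambda p: math.dist(query, p))[:k]

def build_lists(points, centroids):
    """Assign each point to its closest centroid (one list per centroid)."""
    lists = {i: [] for i in range(len(centroids))}
    for p in points:
        closest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        lists[closest].append(p)
    return lists

def approximate_nearest(query, lists, centroids, k, probes=1):
    """Approximate top-k: scan only the `probes` lists nearest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [p for i in order[:probes] for p in lists[i]]
    return nearest(query, candidates, k)

# Made-up 2-D points; real embeddings have hundreds of dimensions.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (6.0, 0.0), (10.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]
lists = build_lists(points, centroids)
query = (4.9, 0.0)

exact = nearest(query, points, 2)
approx = approximate_nearest(query, lists, centroids, 2, probes=1)
approx_all = approximate_nearest(query, lists, centroids, 2, probes=2)

print("exact:     ", exact)
print("1 probe:   ", approx)
print("all probes:", approx_all)
```

With one probe, a close neighbor that was assigned to the other list is never examined. Scanning more lists recovers it at the cost of speed, and that trade-off is what makes such a search approximate.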

Finally, you cannot switch language model after data has been indexed without migrating to the new language model. You have to make a choice from the beginning and go with it.
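
The reason is that vectors from different models live in different spaces and often have different lengths: for example, all-MiniLM-L12-v2 produces 384-dimensional vectors and all-mpnet-base-v2 produces 768-dimensional ones. Below is a minimal Python sketch of an index that records its model and of a migration that re-embeds every item. The class, its methods, and the stand-in embedding function are hypothetical and not part of either driver.

```python
class VectorIndex:
    """Tiny in-memory index that remembers which model produced its vectors."""

    def __init__(self, model_name, dimensions):
        self.model_name = model_name
        self.dimensions = dimensions
        self.rows = {}  # object_id -> vector

    def _check(self, model_name, vector):
        if model_name != self.model_name:
            raise ValueError(f"index was built with {self.model_name}, got {model_name}")
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")

    def add(self, object_id, model_name, vector):
        self._check(model_name, vector)
        self.rows[object_id] = vector

    def migrate(self, new_model_name, new_dimensions, embed, texts):
        """Re-embed every stored object with the new model; return a fresh index."""
        fresh = VectorIndex(new_model_name, new_dimensions)
        for object_id in self.rows:
            fresh.add(object_id, new_model_name, embed(texts[object_id]))
        return fresh

def fake_mpnet_embed(text):
    # Stand-in for a real embedding call; returns a 768-dimensional vector.
    return [float(len(text))] + [0.0] * 767

old = VectorIndex("all-MiniLM-L12-v2", 384)
old.add(1, "all-MiniLM-L12-v2", [0.0] * 384)

new = old.migrate("all-mpnet-base-v2", 768, fake_mpnet_embed, {1: "I hate this dish"})
print(new.model_name, len(new.rows[1]))
```

Migration needs the original text of every indexed object, since the old vectors cannot be converted into the new model's space.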

Posted by Adrian Ferenc on
Instructions on how to convert the better model would be great. Any kind of documentation about what is happening and where would be much appreciated.

Also, I finally got access to the docker container today. I was working with my colleague (Dr. Yuen). We were able to build the image after the change you made, but when trying to run the container we weren't able to access it outside of the container itself. We added this line:

RUN sed -i 's/127.0.0.1/0.0.0.0/g' /usr/local/ns/config-oacs-5-10-0.tcl

to the Dockerfile so it would listen on the IP address Docker assigns. That sed command, or its equivalent in the config file, may need to be refined. And then to run it, we used

docker run -d -p 8000:8000 pgvector-driver:latest

For reference, I am working with macOS and I believe my colleague is working on Windows.

Oh, also, at one point we accidentally were trying to build using the dockerfile in the pgembedding-driver directory and the build was failing. I'm not sure how concerned you are with that, but I thought I'd bring it to your attention just in case. Again, thanks for your help

Posted by Neophytos Demetriou on
Instructions on how to convert the better model would be great. Any kind of documentation about what is happening and where would be much appreciated.

I will try to write up something over the weekend. In the meantime, you might want to check out this video on how embeddings are trained as part of a language model: https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture

In short, embeddings encode semantic relations (they bring words with related meanings close together). So the vector of a word will be very close in distance (e.g. euclidean, cosine) to the vector of a similar word. For example, cat and dog are similar in at least one dimension i.e. they are both animals.

This is done when the language model is trained via a neural network. tbert is based on bert.cpp, which runs inference for the BERT neural net architecture with pooling and normalization from SentenceTransformers (https://www.sbert.net/ - this is what I used in Python). tbert computes the embeddings vector based on a language model. There are lots of them on huggingface.

When some title is indexed, pgvector-driver and pgembedding-driver ask tbert to compute the vector based on the language model that is used and the result is stored in pgvector or pgembedding columns in the database. Upon search, tbert again computes the vector of the query of the user and then asks pgvector or pgembedding to rank them by similarity (basically euclidean distance between the vectors in both of these drivers).
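
The index-then-search flow described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `embed` is a hypothetical stand-in for tbert that counts letters (it captures no semantics at all), and `TitleIndex` is an in-memory substitute for the vector column in the database.

```python
import math

def embed(text):
    """Hypothetical stand-in for tbert: letter-frequency vector, L2-normalized."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

class TitleIndex:
    """In-memory stand-in for a table with a vector column."""

    def __init__(self):
        self.rows = []  # (object_id, title, vector)

    def index(self, object_id, title):
        # Indexing: compute the vector once and store it next to the object.
        self.rows.append((object_id, title, embed(title)))

    def search(self, query, limit=3):
        # Searching: embed the query, then rank rows by Euclidean distance.
        query_vec = embed(query)
        scored = [(math.dist(query_vec, vec), object_id, title)
                  for object_id, title, vec in self.rows]
        scored.sort()
        return scored[:limit]

idx = TitleIndex()
idx.index(1, "Great pasta")
idx.index(2, "Terrible service")
idx.index(3, "Lovely dessert")

for distance, object_id, title in idx.search("Great pasta"):
    print(f"{distance:.4f}  {object_id}  {title}")
```

Because the vectors are normalized to unit length, squared Euclidean distance equals 2 minus twice the cosine similarity, so ranking by smallest distance gives the same order as ranking by largest cosine similarity.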

We were able to build the image after the change you made, but when trying to run the container we weren't able to access it outside of the container itself. We added this line: RUN sed -i 's/127.0.0.1/0.0.0.0/g' /usr/local/ns/config-oacs-5-10-0.tcl

Will check it out and make the change. Thanks.

Oh, also, at one point we accidentally were trying to build using the dockerfile in the pgembedding-driver directory and the build was failing.

I forgot I had it there as well. I was updating the one in openacs-packages and copying to pgvector-driver. Fixed in pgembedding-driver as well.

Posted by Adrian Ferenc on
Thank you. That video and your explanation were very helpful. I hope in what you write up you can also explain/point to the code of the implementation, for example the steps that go from making a query in openacs to creating an embedding with tbert to querying the db with the computed vector.
Posted by Neophytos Demetriou on
Here is the document I promised: Semantic Search with tBERT

Looking forward to improving it based on your feedback.

Posted by Adrian Ferenc on
Thank you so much! I am very excited to look through it
Posted by Neophytos Demetriou on
Hi Adrian, thanks for being so kind. If I can elaborate on anything either in the document or here, please do not hesitate to let me know.
Posted by Neophytos Demetriou on
Just a heads up that tbert now builds fine with the NaviServer Makefile - like all other modules in the NaviServer ecosystem. It took me a while to get it right as tbert contains some C++ code.

I am using Ubuntu Linux. I will gladly incorporate any changes to make it work on other platforms as well. Any and all feedback is welcome.

Posted by Neophytos Demetriou on
Just a quick message that I've simplified the installation instructions in readme.md to separate building the dependencies from building the TCL extension or NaviServer module:
https://github.com/jerily/tbert/blob/main/readme.md

As far as the NaviServer module is concerned, I think we are good as it uses the NaviServer Makefile.

The problem is the TCL extension that is currently being built with cmake. I have no way to check it on a Mac - it builds and installs fine on Ubuntu. Apparently, macOS distinguishes between shared libraries and loadable modules. So, I have added MODULE in CMakeLists.txt (line 14). If someone can get the latest and try building the TCL extension on a Mac and let us know of their findings, it would be great.

In other words, when you do "make install" for TCL build, it shows what libraries it installed. The question is if it tried to install a .dylib or a .so or both on a Mac.

Posted by Gustaf Neumann on
I can confirm that building/loading/testing the NaviServer module under macOS now works perfectly.
Posted by Neophytos Demetriou on
Thank you Gustaf (and for the PR/improvements in the code).