Forum OpenACS Development: Re: Semantic Search in OpenACS

Posted by Neophytos Demetriou on
I'm updating the docker images. Will post as soon as I'm done. You can see the result here (see screenshot): https://github.com/jerily/openacs-packages/tree/main/pgvector-driver

You will be able to play with the Docker image demo once I am done.

Posted by Neophytos Demetriou on
I'm done.

1. See screenshots here: https://github.com/jerily/openacs-packages/tree/main/pgvector-driver --- Pay special attention to the second screenshot in which PostgreSQL FTS did not return any results. The search query was "I hate this dish". Top result from related searches: "I don't like this dish. Taste is poor".

2. You can try it all with docker, including the food hunter reviews that I populated in xowiki and you can search them already. Here are the commands again:

git clone https://github.com/jerily/openacs-packages.git
cd openacs-packages/pgvector-driver
docker build . -t pgvector-driver:latest --no-cache
docker run --network host pgvector-driver:latest

Once you run the last command, you can point your browser to http://localhost:8000/ and login with the following credentials:

email: test at example dot com
password: test

Any and all feedback is welcome.

Posted by Neophytos Demetriou on
You can try it all with docker, including the food hunter reviews that I populated in xowiki and you can search them already.

I forgot to mention that content is indexed both in pgvector-driver and pgembedding-driver and you can switch between the two in the demo by changing the new VectorDriver parameter of the search package (patch available in git repo). So, you can compare the results for yourselves.

That said, there's no point polishing these packages further until there is some sort of response from the community on where they want to take this. All the code is available in the aforementioned git repositories. Once I have some feedback, I'll improve them.

My time is limited as well starting next week as I'll be back to work but the OpenACS packages (pgvector-driver & pgembedding-driver) and tbert (the C-based module for vector embeddings) are in good shape and so I can make any kind of small changes or improvements to them based on the feedback I receive.

I'll switch to multi-factor authentication now in the hope that I can provide a proof of concept in the next couple of days.

Posted by Adrian Ferenc on
Hi Neophytos,

I'm running into an issue when running the docker build command. When trying to run the command on line 124 of the Dockerfile, I am seeing the error
"1.143 error: patch failed: distfunc.c:11
1.143 error: distfunc.c: patch does not apply"

Any suggestions on how to get this to work would be greatly appreciated. I tried adding the --reject flag when running apply, but that did not work either. If there's any other information I could provide that would help you, please don't hesitate to ask.

Posted by Neophytos Demetriou on
Hi Adrian, thanks for reporting this. I will check as soon as I am back home and let you know. Looks like it is related to pg_embedding.

Posted by Neophytos Demetriou on
You were right Adrian. There were new commits in pg_embedding git repo (the postgresql extension) and the patch could not be applied anymore. It is good that you caught this so early. Using the head branch was a recipe for disaster.

There are no releases for pg_embedding, so I created a tarball and am using that in the Dockerfile now. As a result, the patch is no longer needed. Please give it a try and let me know whether it works. I will be available until around 5pm EST.

I will also change the Dockerfile to use stable releases of OpenACS and NaviServer just in case.

Posted by Neophytos Demetriou on
One thing to note here is that the quality of the results depends on the language model being used. I used a very simplistic language model in the demo because of its size (20MB). In production, I would use a better model like all-mpnet-base-v2 (200MB). Please let me know if you need instructions on how to download and convert it so that it can be used. I can also add a parameter to the two drivers so that you can set up a different model easily.

The other thing to note is that the results from pgvector-driver were poor in the demo after I added an index on the table (it does approximate search in that case). So I decided to switch the VectorDriver parameter in the search package to pgembedding-driver, which produces better results now. In other words, the demo now uses pgembedding-driver for vector similarity search.

Finally, you cannot switch the language model after data has been indexed without migrating to the new language model (that is, recomputing the stored embeddings with it). You have to make a choice from the beginning and go with it.
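
To illustrate why a model switch requires a migration, here is a minimal sketch in Python. Everything in it is hypothetical: `embed_with`, the model names "small-model" and "large-model", and the row layout are made up for illustration and are not part of tbert or the two drivers. The idea is only that stored vectors are comparable to a query vector when both come from the same model, so switching models means recomputing every stored vector.

```python
def embed_with(model_name, text):
    """Hypothetical stand-in for a language model.

    The two fake models return vectors of different sizes, which is one
    reason vectors from different models cannot be compared directly.
    """
    size = {"small-model": 4, "large-model": 8}[model_name]
    return [float((len(text) + i) % 5) for i in range(size)]


def index_rows(texts, model_name):
    """Store each text together with the model that produced its vector."""
    return [(text, model_name, embed_with(model_name, text)) for text in texts]


def search_allowed(rows, query_model):
    """A query vector is only comparable to vectors from the same model."""
    return all(model == query_model for _, model, _ in rows)


rows = index_rows(["Taste is poor", "Slow service"], "small-model")
can_search_before = search_allowed(rows, "large-model")

# Migration: recompute every stored vector with the new model.
rows = index_rows([text for text, _, _ in rows], "large-model")
can_search_after = search_allowed(rows, "large-model")

print(can_search_before, can_search_after)
```

Even when two models happen to produce vectors of the same size, their coordinates mean different things, so the same re-indexing step would still be needed.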

Posted by Adrian Ferenc on
Instructions on how to convert the better model would be great. Any kind of documentation about what is happening and where would be much appreciated.

Also, I finally got access to the docker container today. I was working with my colleague (Dr. Yuen). We were able to build the image after the change you made, but when trying to run the container we weren't able to access it outside of the container itself. We added this line:

RUN sed -i 's/127.0.0.1/0.0.0.0/g' /usr/local/ns/config-oacs-5-10-0.tcl

to the Dockerfile so it would listen on the IP that Docker assigns. That sed command, or its equivalent in the config file, may need to be refined. To run it, we used

docker run -d -p 8000:8000 pgvector-driver:latest

For reference, I am working with macOS and I believe my colleague is working on Windows.

Oh, also, at one point we accidentally tried to build using the Dockerfile in the pgembedding-driver directory and the build failed. I'm not sure how concerned you are with that, but I thought I'd bring it to your attention just in case. Again, thanks for your help.

Posted by Neophytos Demetriou on
Instructions on how to convert the better model would be great. Any kind of documentation about what is happening and where would be much appreciated.

I will try to write up something over the weekend. In the meantime, you might want to check out this video on how embeddings are trained as part of a language model: https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture

In short, embeddings encode semantic relations: they bring words with related meanings close together. So the vector of a word will be very close in distance (e.g. Euclidean, cosine) to the vector of a similar word. For example, cat and dog are similar in at least one dimension, i.e. they are both animals.
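
As a small illustration of "close in distance", here is a sketch in Python. The three-dimensional vectors are made up by hand for the example; they are not output from any real language model, and real embeddings have hundreds of dimensions.

```python
import math


def euclidean(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine similarity: closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption: not produced by a real model).
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = euclidean(vectors["cat"], vectors["dog"])
cat_car = euclidean(vectors["cat"], vectors["car"])

print(f"cat-dog distance: {cat_dog:.3f}")
print(f"cat-car distance: {cat_car:.3f}")
print(f"cat-dog cosine:   {cosine_similarity(vectors['cat'], vectors['dog']):.3f}")
print(f"cat-car cosine:   {cosine_similarity(vectors['cat'], vectors['car']):.3f}")
```

With these toy values, cat and dog end up much closer to each other than cat and car under both measures. For vectors normalized to length 1, the two measures rank results in the same order, because the squared Euclidean distance equals 2 minus twice the cosine similarity.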

This is done when the language model is trained via a neural network. tbert is based on bert.cpp, which does inference for the BERT neural network architecture with pooling and normalization from SentenceTransformers (https://www.sbert.net/ - this is what I used in Python). tbert computes the embeddings vector based on a language model. There are lots of such models on Hugging Face.

When some title is indexed, pgvector-driver and pgembedding-driver ask tbert to compute the vector based on the language model that is used, and the result is stored in pgvector or pgembedding columns in the database. Upon search, tbert again computes the vector for the user's query, and pgvector or pgembedding is then asked to rank the stored vectors by their similarity to it (basically Euclidean distance between the vectors in both of these drivers).
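
The index-then-search flow described above can be sketched in Python as follows. This is only an illustration under loud assumptions: `toy_embed` is a hypothetical stand-in for tbert (it counts words from a tiny fixed vocabulary, so it captures word overlap rather than meaning), and `ToyIndex` is an in-memory stand-in for the vector column in the database. Neither reflects the actual API of tbert, pgvector, or pg_embedding.

```python
import math

# Tiny fixed vocabulary for the toy embedding (assumption for illustration).
VOCAB = ["dish", "taste", "poor", "great", "service", "slow"]


def toy_embed(text):
    """Hypothetical stand-in for tbert: word counts over VOCAB, normalized."""
    words = text.lower().replace(".", " ").replace(",", " ").split()
    counts = [float(words.count(word)) for word in VOCAB]
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm else counts


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyIndex:
    """In-memory stand-in for a table with a vector column."""

    def __init__(self):
        self.rows = []

    def add(self, title):
        # Indexing step: compute the vector once and store it with the title.
        self.rows.append((title, toy_embed(title)))

    def search(self, query, limit=3):
        # Search step: embed the query, then rank stored vectors by distance.
        query_vector = toy_embed(query)
        ranked = sorted(self.rows, key=lambda row: euclidean(query_vector, row[1]))
        return [title for title, _ in ranked[:limit]]


index = ToyIndex()
for title in ["Taste is poor", "Great dish, great taste", "Slow service"]:
    index.add(title)

results = index.search("poor taste", limit=2)
print(results)
```

A real language model would also place a query like "I hate this dish" near "Taste is poor" even though the two share no words, which is the part this word-counting toy cannot reproduce.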

We were able to build the image after the change you made, but when trying to run the container we weren't able to access it outside of the container itself. We added this line: RUN sed -i 's/127.0.0.1/0.0.0.0/g' /usr/local/ns/config-oacs-5-10-0.tcl

Will check it out and make the change. Thanks.

Oh, also, at one point we accidentally were trying to build using the dockerfile in the pgembedding-driver directory and the build was failing.

I forgot I had it there as well. I was updating the one in openacs-packages and copying to pgvector-driver. Fixed in pgembedding-driver as well.

Posted by Adrian Ferenc on
Thank you. That video and your explanation were very helpful. I hope that in your write-up you can also explain or point to the code of the implementation, for example the steps that go from making a query in OpenACS to creating an embedding with tbert to querying the database with the computed vector.

Posted by Neophytos Demetriou on
Here is the document I promised: Semantic Search with tBERT

Looking forward to improving it based on your feedback.

Posted by Adrian Ferenc on
Thank you so much! I am very excited to look through it.

Posted by Neophytos Demetriou on
Hi Adrian, thanks for being so kind. If I can elaborate on anything, either in the document or here, please do not hesitate to let me know.