Forum OpenACS Development: Re: Semantic Search in OpenACS

Posted by Adrian Ferenc on
Hi Neophytos,

I'm running into an issue when running the docker build command. When trying to run the command on line 124 of the Dockerfile, I am seeing the error
"1.143 error: patch failed: distfunc.c:11
1.143 error: distfunc.c: patch does not apply"

Any suggestions on how to get this to work would be greatly appreciated. I tried adding the --reject flag when running apply, but that did not work either. If there's any other information I could provide that would help you, please don't hesitate to ask.

Posted by Neophytos Demetriou on
Hi Adrian, thanks for reporting this. I will check as soon as I am back home and let you know. It looks like it is related to pg_embedding.
Posted by Neophytos Demetriou on
You were right, Adrian. There were new commits in the pg_embedding git repo (the PostgreSQL extension) and the patch could not be applied anymore. It is good that you caught this so early. Using the head branch was a recipe for disaster.

There are no releases for pg_embedding, so I created a tarball and am using that in the Dockerfile now; for your awareness, the patch is no longer needed. Please give it a try and let me know whether it worked. I will be available until around 5pm EST.

I will also change the Dockerfile to use stable releases of OpenACS and NaviServer just in case.

Posted by Neophytos Demetriou on
One thing to note here is that the quality of the results depends on the language model being used. I used a very simplistic language model in the demo because of its size (20MB). In production, I would use a better model like all-mpnet-base-v2 (200MB). Please let me know if you need instructions on how to download and convert it so that it can be used. I can also add a parameter to the two drivers so that you can set up a different model easily.

The other thing to note is that the results from pgvector-driver were poor in the demo after I added an index on the table (it does approximate search in that case). So, I decided to switch the VectorDriver parameter in the search package to pgembedding-driver, which produces better results now. In other words, the demo now uses pgembedding-driver for vector similarity search.

Finally, you cannot switch language models after data has been indexed without migrating to the new model. You have to make a choice from the beginning and stick with it.

Posted by Adrian Ferenc on
Instructions on how to convert the better model would be great. Any kind of documentation about what is happening and where would be much appreciated.

Also, I finally got access to the docker container today. I was working with my colleague (Dr. Yuen). We were able to build the image after the change you made, but when trying to run the container we weren't able to access it outside of the container itself. We added this line:

RUN sed -i 's/127.0.0.1/0.0.0.0/g' /usr/local/ns/config-oacs-5-10-0.tcl

to the Dockerfile so it would listen on the IP Docker assigns. That sed command, or its equivalent in the config file, may need to be refined. And then to run it, we used

docker run -d -p 8000:8000 pgvector-driver:latest

For reference, I am working with macOS and I believe my colleague is working on Windows.

Oh, also, at one point we were accidentally trying to build using the Dockerfile in the pgembedding-driver directory and the build was failing. I'm not sure how concerned you are with that, but I thought I'd bring it to your attention just in case. Again, thanks for your help.

Posted by Neophytos Demetriou on
Instructions on how to convert the better model would be great. Any kind of documentation about what is happening and where would be much appreciated.

I will try to write up something over the weekend. In the meantime, you might want to check out this video on the way they are trained as part of a language model: https://developers.google.com/machine-learning/crash-course/embeddings/video-lecture

In short, embeddings encode semantic relations (they bring words that are related in meaning close together). So the vector of a word will be very close in distance (e.g. Euclidean, cosine) to the vector of a similar word. For example, cat and dog are similar in at least one dimension, i.e. they are both animals.
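To make the "close in distance" idea concrete, here is a toy sketch in plain Python. The three-dimensional vectors are invented for illustration only; a real model like all-mpnet-base-v2 produces vectors with hundreds of dimensions learned from data:

```python
import math

# Invented toy embeddings -- NOT output of any real model.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

def euclidean(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine similarity: larger (closer to 1) means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "cat" is closer to "dog" than to "car" under both measures.
print(euclidean(cat, dog) < euclidean(cat, car))                  # True
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True
```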

This is done when the language model is trained via a neural network. tbert is based on bert.cpp, which does inference of the BERT neural net architecture with pooling and normalization from SentenceTransformers (https://www.sbert.net/ - this is what I used in Python). tbert computes the embedding vector based on a language model. There are lots of them on huggingface.

When a title is indexed, pgvector-driver and pgembedding-driver ask tbert to compute the vector based on the language model in use, and the result is stored in a pgvector or pgembedding column in the database. Upon search, tbert again computes the vector of the user's query and then asks pgvector or pgembedding to rank the stored vectors by similarity (basically Euclidean distance between the vectors in both of these drivers).
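The index-then-search flow above can be sketched in plain Python. The embed() stub and the in-memory table stand in for tbert and the pgvector/pgembedding column; all the names here are hypothetical and the stub is not a real model, it just produces a deterministic small vector so the ranking step can be demonstrated:

```python
import math

def embed(text):
    # Stand-in for tbert: a real driver would run the language model
    # and get back a high-dimensional vector. This stub just hashes
    # characters into a tiny 4-d vector, then normalizes it.
    v = [0.0] * 4
    for i, ch in enumerate(text.lower()):
        v[i % 4] += ord(ch) % 17
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

# Indexing: store an embedding alongside each title, as the drivers
# do in a pgvector / pgembedding column.
table = []
for title in ["Semantic Search in OpenACS", "Docker build errors", "BERT embeddings"]:
    table.append({"title": title, "vec": embed(title)})

# Searching: embed the user's query, then rank rows by Euclidean
# distance to the query vector.
def search(query):
    q = embed(query)
    def dist(row):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, row["vec"])))
    return [row["title"] for row in sorted(table, key=dist)]

# An exact-match query has distance 0 to its own row, so it ranks first.
print(search("docker build errors")[0])
```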

We were able to build the image after the change you made, but when trying to run the container we weren't able to access it outside of the container itself. We added this line: RUN sed -i 's/127.0.0.1/0.0.0.0/g' /usr/local/ns/config-oacs-5-10-0.tcl

Will check it out and make the change. Thanks.

Oh, also, at one point we accidentally were trying to build using the dockerfile in the pgembedding-driver directory and the build was failing.

I forgot I had it there as well. I was updating the one in openacs-packages and copying to pgvector-driver. Fixed in pgembedding-driver as well.

Posted by Adrian Ferenc on
Thank you. That video and your explanation were very helpful. I hope that in your write-up you can also explain/point to the code of the implementation, for example the steps that go from making a query in OpenACS, to creating an embedding with tbert, to querying the DB with the computed vector.
Posted by Neophytos Demetriou on
Here is the document I promised: Semantic Search with tBERT

Looking forward to improving it based on your feedback.

Posted by Adrian Ferenc on
Thank you so much! I am very excited to look through it.
Posted by Neophytos Demetriou on
Hi Adrian, thanks for being so kind. If I can elaborate on anything, either in the document or here, please do not hesitate to let me know.