Forum OpenACS Development: String matching

1: String matching

Posted by Iuri Sampaio on 08/11/19 03:48 AM

After reading a tip to write string matching function using regexp, directly from postgresql.org, I wondered if that's exactly what I need.

I need to match a keyword with every record on table tag_tags (i.e. tags pkg)

POSIX-style regular expressions is the approach I've written so far.

Is there any better option?

https://www.postgresql.org/docs/9.3/functions-matching.html

Tip: If you have pattern matching needs that go beyond this, consider writing a user-defined function in Perl or Tcl.

There are three separate approaches to pattern matching provided by PostgreSQL: the traditional SQL LIKE operator, the more recent SIMILAR TO operator (added in SQL:1999), and POSIX-style regular expressions. Aside from the basic "does this string match this pattern?" operators, functions are available to extract or replace matching substrings and to split a string at matching locations.

2: Re: String matching (response to 1)

Posted by Patrick Giagnocavo on 08/11/19 04:18 AM

Do you need to match only part of the words, or the whole word? There is full-text searching available in Postgres and it works very well...

3: Re: String matching (response to 2)

Posted by Iuri Sampaio on 08/11/19 04:27 AM

It can be partial or full. The best would be to return by accuracy. For example, returns all item_ids, which are 80% similar to the keyword.

I know full-text searching works well with tsearch2 (oacs pkg). I'll need to build the integration though. https://openacs.org/doc/tsearch2-driver/

Is that path you were trying to go?

4: Re: String matching (response to 1)

Posted by Gustaf Neumann on 08/11/19 12:45 PM

First of all, when working with tags, using regexp and/or string match is crude and not very efficient. look e.g. how the tags are implemented in the xowiki package.

Secondly, you can use the PostgreSQL full text search no only via the OpenACS tsearch2 package, but for arbitrary tables. See e.g. [1] or [2] on tutorials on tsvector and tsquery.

Thirdly, there is a long history of research, how to implement "near matches" like soundex, Levenshtein distances [3], trigrams [4], which are supported directly by PostgreSQL, and there are many more like e.g. Jaro Winkler, cosine distances, Smith-Waterman-Gotoh (sequence alignment), and many more via the pg_similary [5] extension.

PostgreSQL has a great array of capabilities in this area. It is typically much better to select data from database with such filters than to load everything into memory and to filtering on the application layer.

-g

[1] https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
[2] https://linuxhint.com/postgresql-full-text-search-tutorial/
https://www.postgresql.org/docs/11/fuzzystrmatch.html
[3] https://www.postgresql.org/docs/11/fuzzystrmatch.html
[4] https://www.postgresql.org/docs/11/pgtrgm.html
[5] https://github.com/eulerto/pg_similarity