Introduction
Momen is a cloud-native no-code IDE (integrated development environment) that we at Momen have been developing since early 2020. It aims to significantly lower the barrier to building an application capable of serving requests from a non-trivial crowd. A very common requirement is text search, and I would like to share some of my thoughts on it.
TL;DR
Install the pg_trgm extension, use a GIN index, and add an exclude constraint using hash if needed. GIN with pg_trgm gets you indexed operation on 'LIKE', 'ILIKE', 'SIMILAR TO' and '~', while the exclude constraint using hash gives you both uniqueness and very fast point selects.
Breaking Down the Requirement
A lot is implied when our customers ask the question: "Do you support search?" After some digging, I narrowed the requirements down to the following.
· Exact matches. e.g. SELECT * FROM account WHERE phone_number = '12312345678'. This is the easiest type to satisfy.
· Prefix matches. e.g. SELECT * FROM post WHERE post.title ILIKE 'Momen%'. As we shall see later, these are quite easy as well.
· Suffix matches. e.g. SELECT * FROM post WHERE post.title ILIKE '%Momen'. These look similar to prefix matches, and indeed are quite similar.
· Infix matches. e.g. SELECT * FROM post WHERE post.title ILIKE '%Momen%'. These are quite a bit more difficult to deal with, and require a different strategy entirely.
· Regular-expression matches. e.g. SELECT * FROM post WHERE post.title ~ '(Momen|mirror)'. Yeah, a tough nut to crack.
· Full-text searches. e.g. SELECT * FROM post WHERE post.content @@ to_tsquery('Momen & mirror'). (Postgres syntax) This one is especially difficult to deal with, and we shall reserve this for a separate post.
In addition to those "search" requirements, our customers frequently use a "unique" constraint on the content of the text. Although this one seems innocuous on the surface, it can become a huge pain under certain circumstances.
Relevant Background Information
Our product is opinionated. We are a start-up team, very cognizant of the resources that we have, and being opinionated allows us not to get bogged down by the innumerable combinations of technology choices at every level of the stack. We also wanted as few components as possible while still meeting our customers' needs.
Our choice for database is Postgres, and our API was initially generated by Hasura. Hasura is a GraphQL backend that essentially provides a pass-through SQL interface and does not offer the ability to inject custom logic for specific types of queries. That means whatever we decide to do to optimize all those different kinds of "searches" has to happen at the database level, on the SQL queries Hasura generates.
Route to Optimization
Setup
It is not scientific, but should be representative.
Surface Book 2, i7-8650U + 16GB + 512GB.
Postgres 14 installed on Windows' NTFS (not WSL2).
Populated with 1 million rows of random text.
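The original population script wasn't preserved; here is a sketch of how such a table might be set up, assuming a hypothetical post table used throughout this post:

```sql
-- Hypothetical schema; column names are assumptions for illustration.
CREATE TABLE post (
    id      bigserial PRIMARY KEY,
    title   text,
    content text NOT NULL
);

-- 1 million rows of random hex text
INSERT INTO post (title, content)
SELECT md5(random()::text),
       md5(random()::text) || md5(random()::text)
FROM generate_series(1, 1000000);

ANALYZE post;
```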
Exact Matches
This one is simple enough. All it takes is to create any kind of index that supports equality test.
Before any optimization, let's establish the baseline.
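The baseline query looks something like this (the original plan output wasn't preserved; the plan summary in the comment reflects the behavior described below):

```sql
EXPLAIN ANALYZE
SELECT * FROM post
WHERE content = 'some-exact-value';
-- Plan: Parallel Seq Scan on post; execution time in the hundreds of ms
```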
Notice the parallel sequential scan. Now let's just create a vanilla B-tree index on content.
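A minimal sketch of the index creation and the re-run (index name is an assumption):

```sql
CREATE INDEX idx_post_content ON post (content);

EXPLAIN ANALYZE
SELECT * FROM post
WHERE content = 'some-exact-value';
-- Plan: Index Scan using idx_post_content; execution time in the hundreds of us
```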
We now have an index scan, and execution time drops from hundreds of milliseconds to hundreds of microseconds, about a 1000x difference on this 1-million-row dataset.
Prefix Matches
A bit more interesting obviously, so let's try these with the previously created B-tree index on content.
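A sketch of the attempt (plan summary per the observation below):

```sql
EXPLAIN ANALYZE
SELECT * FROM post
WHERE content LIKE 'Momen%';
-- Plan: Parallel Seq Scan, despite the B-tree index on content
```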
Odd, we are not going through the index at all; we ended up with a parallel sequential scan. One would think prefix matches should be indexable, since a prefix match can quite simply be expressed as a conjunction of two range conditions, and B-trees are perfect for those queries. So let's try that.
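The manual rewrite as two range conditions might look like this:

```sql
-- 'Momeo' is 'Momen' with its last character incremented,
-- so the half-open range covers exactly the 'Momen' prefix.
EXPLAIN ANALYZE
SELECT * FROM post
WHERE content >= 'Momen' AND content < 'Momeo';
```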
Amazing! The index is performing its magic tricks again! So it appears that Postgres wasn't smart enough to figure out this transformation on its own. What happened to SQL's promise of "let users express what data to retrieve, and let the engine underneath take care of seamlessly retrieving it"? It turns out that the reason behind Postgres' "inability" to translate "LIKE 'value%'" lies in the limitations of the >= and < operators. Both operate on a byte-value level, and once you have some collation other than "C", things can break down very quickly. So that "inability" was really Postgres' attempt to protect us from ourselves. The correct operators turn out to be "~>=~" and "~<~". Let's bask in the light of some magic:
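A sketch of the index that does the trick (index name is an assumption; the index condition in the comment matches the discussion below):

```sql
CREATE INDEX idx_post_content_pattern
    ON post (content text_pattern_ops);

EXPLAIN ANALYZE
SELECT * FROM post
WHERE content LIKE 'Momen%';
-- Index Cond: ((content ~>=~ 'Momen') AND (content ~<~ 'Momeo'))
```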
That "text_pattern_ops" is all the "nudge" Postgres needed. "text_pattern_ops" is an "opclass". "The operator class identifies the operators to be used by the index for that column," says Postgres' official documentation. Usually we do not specify one, which means Postgres chooses the default operator class for the column's data type. In addition to =, the default operator class supports the following operators: <, <=, >, and >=. That is why the "LIKE" operator, which operates on two text values and is implemented by Postgres' "~~", wasn't getting the proper treatment. In the plan's index condition, we can clearly see the text versions of ">=" and "<", namely "~>=~" and "~<~". And Postgres does indeed translate the "~~" condition for us!
A natural follow up would be, what are all the operators supported by "text_pattern_ops"?
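One way to answer that is to ask the system catalogs directly. A sketch, joining pg_opclass to pg_amop via the operator family:

```sql
SELECT amop.amopopr::regoperator AS operator
FROM pg_opclass opc
JOIN pg_amop amop ON amop.amopfamily = opc.opcfamily
WHERE opc.opcname = 'text_pattern_ops'
ORDER BY amop.amopstrategy;
-- ~<~, ~<=~, =, ~>=~, ~>~
```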
You might have guessed, since we dropped the index that supported <, <=, >, and >= operators, some queries now suffer.
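A sketch of the regression, assuming the original default-opclass B-tree has been dropped:

```sql
DROP INDEX idx_post_content;  -- the default-opclass B-tree from earlier

EXPLAIN ANALYZE
SELECT * FROM post
WHERE content > 'Momen';
-- Plan: back to a Parallel Seq Scan; text_pattern_ops does not support >
```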
And yes, now we are back to sequential scans. Though one could very reasonably argue that few requirements involve byte-wise string range lookups. And you can always choose to keep the original B-tree index as well.
Suffix Matches
Now that we know what operators "text_pattern_ops" supports, it is not difficult to conclude that we can't express LIKE '%something' solely in terms of the text versions of less-than, less-than-or-equal-to, equal-to, greater-than and greater-than-or-equal-to. (Give it a try if you don't believe me.)
And that is reflected in Postgres' query plan:
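A sketch of the suffix-match attempt:

```sql
EXPLAIN ANALYZE
SELECT * FROM post
WHERE content LIKE '%Momen';
-- Plan: Parallel Seq Scan
```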
As expected, we have reverted to a sequential scan. However, once we reverse the string, "%query" effectively becomes "query%". So let's go that route.
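A sketch of the reversed-string approach, using an expression index (the query must also be rewritten to filter on the reversed value):

```sql
CREATE INDEX idx_post_content_rev
    ON post (reverse(content) text_pattern_ops);

-- '%Momen' reversed is 'nemoM%', a prefix pattern again
EXPLAIN ANALYZE
SELECT * FROM post
WHERE reverse(content) LIKE 'nemoM%';
```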
And yes, as expected, we have the beautiful performance afforded to us by the lovely index.
Infix Matches
With infix matches, it is not possible to translate "LIKE '%query%'" into any combination of the "~>~" and "~<~" operators. What we now need is native "~~" operator support in the index. To find such an index type, we first find the opclasses that support the "~~" operator, and then find the index types compatible with those opclasses.
It turns out that none of Postgres' default opclasses support "~~(text, text)", and one can run the following sql to verify:
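A sketch of that catalog query, looking for any opclass whose operator family includes "~~(text, text)":

```sql
SELECT opc.opcname, am.amname
FROM pg_opclass opc
JOIN pg_am am ON am.oid = opc.opcmethod
JOIN pg_amop amop ON amop.amopfamily = opc.opcfamily
WHERE amop.amopopr = '~~(text, text)'::regoperator;
-- no rows before pg_trgm is installed
```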
However, an extension that is by default available in Postgres, called "pg_trgm", supports "~~(text, text)". As the name suggests, this extension gives Postgres some trigram-related abilities. So let us install that.
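Installation is a one-liner, assuming a standard distribution with the contrib modules available:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;
```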
And we have two newly available options: GIST + gist_trgm_ops, or GIN + gin_trgm_ops. Let's try out both.
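A sketch of both index builds and the infix query (index names are assumptions):

```sql
CREATE INDEX idx_post_content_gist
    ON post USING gist (content gist_trgm_ops);

CREATE INDEX idx_post_content_gin
    ON post USING gin (content gin_trgm_ops);

EXPLAIN ANALYZE
SELECT * FROM post
WHERE content ILIKE '%Momen%';
```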
It looks like GIST is slower to build than GIN, and much slower on shorter search patterns; so slow that GIST delivers less than a 2x speedup over a parallel sequential scan on this 1M-row table. The GIST index takes more space, too, due to its tree structure.
So why do we even bother with GIST? Insert/update speed. Due to GIN's "inverted index" nature, each insertion/update can trigger updates to a huge number of pages, significantly slowing writes. Further discussion of GIN indexes deserves its own post.
What we observe is somewhat consistent with Postgres 9.4's documentation on this topic: as the size of the text grows, GIN's performance degrades much more severely. So this becomes a trade-off: is your workload read-heavy or write-heavy?
Both GIN and GIST support a ton of text operations, and other than size and insert/update performance, there seems to be no particular reason to rely on B-tree indexes for text fields.
Regular-Expression Matches
Remember when we said "both GIN and GIST support a ton of text operations"? Well the payback is quick on this one.
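The regular-expression query from earlier, now backed by the trigram indexes:

```sql
EXPLAIN ANALYZE
SELECT * FROM post
WHERE content ~ '(Momen|mirror)';
```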
Again, GIST seems like a rather clear loser, though still performing better than sequential scan. So the answer would most likely be, "just use a GIN with trigrams".
Uniqueness Constraint
Neither GIN nor GIST indexes can be used to support a unique constraint. In fact, according to Postgres' own documentation, "currently, only B-tree indexes can be declared unique."
That is usually not too much of a problem, but what about a huge string? All B-tree entries must fit inside a page (8 kB by default), and in practice a single entry is limited to roughly a third of that. If we try to create a (unique) index on a column that contains strings larger than the limit, we are greeted with an "index row size exceeds maximum" error.
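A sketch of the failing attempt (the exact sizes in the error depend on the data):

```sql
CREATE UNIQUE INDEX idx_post_content_unique ON post (content);
-- ERROR:  index row size ... exceeds btree version 4 maximum ...
--         for index "idx_post_content_unique"
```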
But there are ways out. We can shorten the content inside a B-tree index by first hashing them.
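A sketch of the hash-then-index approach; digest() comes from the pgcrypto extension:

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- A sha512 digest is always 64 bytes, well within B-tree entry limits.
CREATE UNIQUE INDEX idx_post_content_digest
    ON post (digest(content, 'sha512'));
```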
The problem with this approach is that although we now have a B-tree index that is orders of magnitude better at point selects than GIN or GIST indexes (the reason is, again, worth its own post), we can't utilize it unless we rewrite our queries to operate on digest(content, 'sha512') rather than on the vanilla content. Doable, but not great.
Since we are already hashing with the "digest" function, why don't we just use a hash index? We can't directly create a unique hash index, as noted in the aforementioned documentation. However, for some reason that is yet unclear to me, we can create an exclusion constraint that automatically creates a hash index to support it. The same index can also be used for point selects.
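A sketch of that exclusion constraint, which enforces uniqueness (no two rows may have equal content) via an automatically created hash index:

```sql
ALTER TABLE post
    ADD CONSTRAINT post_content_unique
    EXCLUDE USING hash (content WITH =);
```

Because hash indexes store only the hash of the value, this works even for strings far larger than a B-tree entry could hold.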
Conclusion
The Simple Version
Same as TL;DR.
The Longer Version
TL;DR plus the following:
If you are storing long strings (greater than at least a few hundred characters on average) and/or searching with long patterns, use GIST instead of GIN, especially when your workload is comparatively write-heavy. If you know or can enforce the maximum length of the stored strings, a B-tree can be used for both uniqueness and point selects.
About Momen
Momen is a no-code web app builder that allows users to build fully customizable web apps, marketplaces, social networks, AI apps, enterprise SaaS, and much more. You can iterate and refine your projects in real time, ensuring a seamless creation process. Momen also offers powerful API integration capabilities, allowing you to connect your projects to any service you need. With Momen, you can bring your ideas to life, build remarkable digital solutions, and get your web app products to market faster than ever before.