Momen is a cloud-native no-code IDE (integrated development environment) that we've been developing since early 2020. It aims to significantly lower the hurdle one has to overcome to build an application capable of serving requests from a non-insignificant crowd. To that end, we first needed to decide how the shared state of that application was to be stored—in other words, what database technology we would use.
We chose (the somewhat unconventional) PostgreSQL due to cost and SQL compatibility. Our hard requirements were as follows:
Serve at least 1 million users per app: For a reasonably active user base, that translates to roughly 100 million API invocations per day, and peak traffic of around 2 million API invocations per minute.
ACID compliance: Given that our target users were not going to be experienced backend software engineers, having ACID guarantees would be a must for the behavior of the system to be understandable. Explaining to a person “this group of modifications failed completely and none of them had any effect on the system” is much easier than telling the same person “the first two modifications succeeded but the last one failed, now you need to both detect this situation and provide a rollback strategy to repair the system. Oh, by the way, the rollback itself may also fail, so you need to rollback the rollback…”.
Reasonably cheap: Every dollar we save on infrastructure is a dollar we can choose to either pass on to our customers or add to our bottom line. Since the database is usually the bottleneck of a system, we understood that a significant proportion of infrastructure spending would be allocated to it.
Team familiarity: The team should know the database well enough, or be able to ramp up on it quickly, so that when problems arose (especially performance-related ones) we would know how to resolve them.
No cloud-vendor lock-in: From the get-go, we had decided that we needed to be cloud-vendor independent. We did start with everything on Aliyun, but in hindsight, vendor independence quickly paid off as we encountered clients located on different continents and clients that needed on-premises deployment. Any vendor-specific technology would have been a great hindrance.
Support for secondary indexes: Without secondary indexes, the only option left for efficient filtering and sorting would be what I consider an excessive amount of caching, which brings extra resource consumption and the unavoidable problem of cache invalidation and consistency; essentially, one would be implementing a secondary index in the application layer. The exception is small datasets, and many no-code vendors do take this approach. For example, Mingdao, a fairly popular no-code platform, has tables whose search time scales linearly with the number of records in the table (unless one searches by the primary key, but that key is auto-generated and devoid of real-world meaning). As a result, for million-row tables, search time exceeds 10 seconds, effectively unusable for any consumer-facing application. To be fair, Mingdao never claimed to be consumer-facing in the first place.
Appropriate level of abstraction: It is technically possible to work with a simple KV (key-value) store, but it would be quite nice to have typing, for example, so that the database could catch some type-related bugs and aid in debugging.
Active development/community: So that we may benefit from future improvements.
SQL compatibility: SQL is a great declarative language that is also ubiquitous. It would give us at least three benefits.
Execution planning is deferred to a dedicated SQL engine.
A plethora of existing SQL-compatible tools can be leveraged.
A large portion of the workforce is trained in SQL, making onboarding easier.
The candidates roughly fell into two categories:
Relational databases (SQL):
MySQL family (MySQL, MariaDB)
PostgreSQL
Oracle
Amazon Aurora / Aliyun PolarDB
NoSQL databases:
MongoDB
HBase
Cassandra
Redis
DynamoDB (AWS) / TableStore (Aliyun)
We first ruled out the following databases for not meeting some hard requirements:
Amazon Aurora / Aliyun PolarDB: Vendor lock-in
Oracle: Unaffordable
HBase: No secondary index
Cassandra: No secondary index
Redis: Mostly in-memory and too expensive
DynamoDB / TableStore: Vendor lock-in
Between the SQL RDBMSs, the choice was somewhat easy for us. They are generally very similar, so we focused on some details that we thought were important.
Since version 9.4 (2014), Postgresql has had native JSONB support, with both B-tree indexes on the result of an extraction expression over a single attribute and the more generic GIN index (especially when combined with jsonb_path_ops) for efficient searches on non-predetermined keys and key/value pairs. Since 5.7.8 (2015), MySQL has relied on creating virtual columns and then indexing those columns to achieve a similar effect. Only in MySQL 8.0 was support for indexing JSON-array-contains introduced, and as of now there is no support for generic JSON-object-contains.
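To sketch the Postgresql side of this (the table and key names here are hypothetical):

```sql
-- A GIN index with jsonb_path_ops accelerates containment queries
-- on keys that are not known in advance.
CREATE TABLE events (id bigserial PRIMARY KEY, payload jsonb);
CREATE INDEX events_payload_idx ON events USING GIN (payload jsonb_path_ops);

-- Served by the GIN index: find rows whose payload contains this pair.
SELECT id FROM events WHERE payload @> '{"status": "active"}';

-- For a single, predetermined key, a B-tree expression index suffices.
CREATE INDEX events_user_idx ON events ((payload ->> 'user_id'));
SELECT id FROM events WHERE payload ->> 'user_id' = '42';
```

The first query pattern is the one MySQL's virtual-column approach cannot cover, since a virtual column must be bound to a specific key at index-creation time.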
Postgresql has a flexible TEXT type used for all strings of arbitrary length. MySQL has four different TEXT types for various lengths, a CHAR type, and VARCHAR types where one has to know the length of potential input strings in advance. This is, in our opinion, an unnecessary and unacceptable hurdle for our users.
Postgresql supports DDL within a transaction, so during migrations (which we would be performing frequently and automatically) and any necessary manual maintenance we can be far more confident: DDL statements either all run or all roll back. This is impossible in MySQL, where DDL statements cause an implicit commit.
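A minimal sketch of what this buys us (table and column names are hypothetical):

```sql
-- Postgresql: DDL participates in transactions, so a multi-step
-- migration either fully applies or fully rolls back.
BEGIN;
ALTER TABLE accounts ADD COLUMN nickname text;
ALTER TABLE accounts DROP COLUMN legacy_flag;
-- If anything above fails, or we change our minds, nothing sticks:
ROLLBACK;  -- or COMMIT; to apply both changes atomically
```

In MySQL, each ALTER TABLE above would implicitly commit, so a failure between the two statements would leave the schema in a half-migrated state.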
Postgresql has the great LISTEN/NOTIFY feature, where triggers can send notifications to any listeners on transaction commit. This is fantastic for the "live query" semantics we were going to implement for GraphQL subscriptions, and it can also serve as an invalidation mechanism for any future caching layer. There is a GitHub project that tries to mimic this for MySQL, https://github.com/Cyclonecode/mysql-notification, but it requires building and installing a plugin, and there is no guarantee that notifications are sent on transaction commit.
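A bare-bones version of this pattern looks roughly like the following (the table, channel, and function names are made up for illustration):

```sql
-- A trigger function that notifies listeners whenever a row changes.
-- Postgresql delivers the notifications only on transaction commit.
CREATE OR REPLACE FUNCTION notify_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('table_changes', TG_TABLE_NAME || ':' || TG_OP);
  RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_notify
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW EXECUTE FUNCTION notify_change();

-- Any session can subscribe:
LISTEN table_changes;
```

A live-query layer would then hold open connections that LISTEN on the channel and re-run the affected GraphQL subscriptions when a notification arrives.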
Last but not least, we had and still have a pretty good in-depth understanding of Postgresql in the team.
Choosing between MongoDB and Postgresql would not have been rational before I familiarized myself with MongoDB, as I was no expert in that technology.
To summarize my findings on its strengths:
MongoDB is schemaless, even though it does come with "collections", which are more or less equivalent to tables. Since no schema is enforced, the documents within each collection merely hint to the programmer what they contain: a collection of "employee" documents probably represents real-life employees, with their names, DoBs, etc. The advantage of this approach is that one no longer has to design and define a schema for each entity, and, in theory, one no longer needs to run migrations.
MongoDB natively offers the ability to scale out in the form of replica sets and/or shards. Replica sets are quite common in the SQL world as well: essentially, multiple nodes contain exact replicas of the data, offering increased read performance at the cost of a slight write penalty. Where MongoDB tries to differentiate itself is native sharding support. It lets you shard your dataset on a sharding key and provides a slew of tools, such as automatic shard rebalancing, to accomplish that goal.
MongoDB is widely adopted as the go-to database for many existing no-code / low-code solutions.
But as I researched further and started comparing against what can be achieved with Postgresql, things did not seem as clear-cut as they had first appeared.
Schemaless is fine and dandy until you start using that data. With the strict schema enforcement afforded by a typical RDBMS, one is absolutely sure of each field's type (save when you declare everything as BLOBs). Without that certainty, a hefty mental burden is placed on the developer, typically resulting in a nasty array of if or case statements in multiple places, where one has to consider every possibility: is this field present, is it of type A or type B, and so on. This problem is so prominent in "normal" backend development that MongoDB itself now offers native schema validation (https://docs.mongodb.com/manual/core/schema-validation/).
Looking at other no-code offerings, it is immediately obvious that all of them implement a semantic schema for their users, because most take the form of a "spreadsheet on steroids", where you naturally have tables, each with a fixed list of columns. These columns in turn determine what type of data may be stored. Some data may appear in denormalized form in such columns, like lists or even lists of record references, but that is usually the extent to which they differ from a traditional relational schema.
What we can conclude from this observation is that, under the hood, vendors do perform extensive schema maintenance, ranging from table and column creation/deletion/renaming to foreign key constraint enforcement. Alongside those, migrations would be unavoidable from time to time as well. This is not to say, however, that MongoDB's first strength mentioned above is rendered completely moot in this application context.
Postgresql, as we are all aware, naturally enforces a SQL-compliant schema. Even granting that a schema is required, some downsides relative to MongoDB remain.
Schema changes in Postgresql are explicit in the data storage layer, so some operations, notably adding and removing columns, require storage-layer locking (a detailed discussion on locks follows later), although this problem is not as bad as it may appear on the surface (https://brandur.org/postgres-default). I ran an experiment on my personal Surface Book 2 using Postgresql 11. I first created a table and inserted 1 million rows, occupying 1GB of disk space. Then I performed various schema-modifying operations, repeated 10 times each, and averaged the time taken; altering the table itself proved almost instantaneous.
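The operations in that experiment were of this flavor (the table name is hypothetical; the near-instant behavior of ADD COLUMN with a default matches what the brandur.org article above describes for Postgresql 11):

```sql
-- Nearly instantaneous: only the catalog changes. From Postgresql 11,
-- even a constant default is stored in the catalog without a table rewrite.
ALTER TABLE big_table ADD COLUMN note text;
ALTER TABLE big_table ADD COLUMN score int DEFAULT 0;
ALTER TABLE big_table DROP COLUMN note;

-- Still expensive: changing a column's type may rewrite every row.
ALTER TABLE big_table ALTER COLUMN score TYPE bigint;
```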
Migration is more expensive on Postgresql than on MongoDB. Postgresql employs a row-based storage engine, and together with its lack of in-place update semantics (somewhat alleviated by heap-only tuples, a.k.a. HOT, and The Oversized-Attribute Storage Technique, a.k.a. TOAST), it incurs significant cost when performing updates, which are the core of any data migration process. This is not the case with MongoDB, where partial document updates were built in from the get-go. On a reasonably simple table consisting of two long fields, two text fields, and two timestamp fields, updating one timestamp field on 1 million records (the entire table) took more than 6 seconds on the same Surface Book 2. On MongoDB, the same operation took 5 seconds: better, but not by much.
Indeed, MongoDB offers great native support for horizontal scaling, in the form of both replica sets and shards, as mentioned above. But that support cannot escape the CAP (consistency, availability, partition tolerance) theorem. It is still subject to the same issues with distributed transactions and distributed joins, not to mention that it completely side-steps enforcement of foreign key constraints, since those were never offered to begin with. On the other hand, Postgresql's support for horizontal scaling is not as barren as it might initially appear, all while offering much better single-node efficiency, to the point of eliminating the need for horizontal scaling (which comes at quite a hefty engineering and maintenance cost) for a great swath of applications.
Postgresql's single-node performance is extremely impressive compared to MongoDB, especially for OLTP workloads with large datasets that do not fit in memory (https://info.enterprisedb.com/rs/069-ALB-339/images/PostgreSQL_MongoDB_Benchmark-WhitepaperFinal.pdf). The TL;DR is that Postgresql 11 is 25x-40x faster. This effectively means that, for OLTP, a single Postgresql node can handle a load that would crush a whole cluster of MongoDB nodes. In that experiment, 16 vCPUs got Postgresql 25k TPS; considering that AWS and Aliyun both offer instances with 128 vCPUs, it is not unreasonable to expect 100k-200k TPS on a single node. That is more than enough to support most applications, save for giants like Amazon or Facebook.
We have adopted a single-tenant approach to system design, so the probability of any single tenant outgrowing 100k-200k TPS is that much smaller.
There are third-party horizontal-scaling solutions for Postgresql such as Citusdata, now maintained by Microsoft. One worry with a third-party solution is the timeliness of updates; given that Citusdata's corresponding release shipped one day after Postgresql 14's GA (https://www.citusdata.com/blog/2021/10/01/how-we-shipped-postgres-14-on-azure-within-a-day/), I would say that is pretty solid evidence that Microsoft is committed to the product. Citusdata offers both replication and sharding, and on top of that, distributed transactions as well as an extensive, though not complete, set of foreign key constraint support. Beyond Citusdata, there are other Postgresql-compatible solutions such as YugabyteDB.
MongoDB’s wide adoption is certainly helpful in talent acquisition from a company’s perspective, especially when that company is still in the start-up stage, where few candidates would familiarize themselves with a technology on purpose for a job opening.
Luckily, this is when Postgresql’s similarity to MySQL and other RDBMSs becomes a saving grace. SQL-compliant relational databases are just so ubiquitous, as soon as one relaxes their hiring criteria from “Postgresql familiarity” to “relational database familiarity”, the talent pool suddenly materializes.
MongoDB's promise of freedom afforded by its schemalessness might have played a part in its popularity among no-code vendors; instinctively, one associates no-code with no-schema. But as discussed above, no-code should actually place more restrictions on its users to ensure the predictability of the resulting software. Our job is to reduce mental burden, not to increase it by affording more freedom.
Most low-code/no-code platforms are built as multi-tenant systems, so the need for horizontal scaling may arise much sooner than in a single-tenant system, and the tenancy itself is often a good candidate for a sharding key. Natively supported sharding can therefore be perceived as a great advantage. Still, none of these arguments seemed particularly convincing to me, although I am of course open to differing opinions.
It comes down to efficiency, schema enforcement, and a built-in query planner/optimizer.
We have already alluded to Postgresql's efficiency when discussing MongoDB, and how its single-node performance can be 25x-40x that of MongoDB's. One of my intuitions is that Postgresql makes much better use of available memory than MongoDB. There are plenty of complaints on StackOverflow about MongoDB consuming too much memory. One example is https://www.bmc.com/blogs/mongodb-memory-usage-and-management/, where 3.5GB worth of data was able to exhaust 8GB of memory under an unspecified query load. This surprised me, because in my experience Postgresql has been exceptionally memory-friendly. A past demo project illustrates this (though its scale surpassed most definitions of a demo).
Its typical workload involves creating "exams" consisting of a random selection of 100 multiple-choice "questions", then going through them, displaying each question's content as well as the choices. The result of answering each question is recorded for score calculation and error collection. The database had a meager 1GB memory allocation and, shamefully, no proper tuning, i.e. no changes to random_page_cost, effective_cache_size, shared_buffers, or work_mem. Yet it performs swiftly, with typical latency under 50ms.
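For reference, tuning those parameters normally looks something like the following; the values here are illustrative starting points for a small instance, not recommendations (and, as noted, the demo ran fine without any of this):

```sql
-- Illustrative values only; appropriate settings depend on the machine.
ALTER SYSTEM SET shared_buffers = '256MB';        -- ~25% of RAM is a common rule of thumb
ALTER SYSTEM SET effective_cache_size = '768MB';  -- OS cache the planner may assume exists
ALTER SYSTEM SET work_mem = '16MB';               -- per-sort/hash memory budget
ALTER SYSTEM SET random_page_cost = 1.1;          -- lower on SSD storage

-- Then reload: SELECT pg_reload_conf();
-- (shared_buffers additionally requires a server restart to take effect.)
```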
This efficiency translates directly to much lower hardware costs, perfectly meeting our “reasonably cheap” hard requirement.
Schema enforcement is great for catching errors early, and I think it will become our saving grace when it comes to preventing data corruption. However, this conclusion is founded on patterns I have recognized in my past experience and on common sense, rather than on any scientific study.
Unlike MongoDB, where the programmer has to algorithmically construct a query plan for any complex query (defined here as a query requiring at least one JOIN), Postgresql generates query plans for you, and its query optimizer picks the (hopefully) best plan based on statistics. While this can be roughly grouped under "SQL compatibility", it is a great deal more useful than it may appear on the surface. On a no-code platform, it is very difficult, if not impossible, for the programmer to know the distribution of data in each column, something that is essential to coming up with an appropriate query plan. And when it comes to query plans, the search space is often vast:
Filter first or join first?
Index intersection or nested loop index lookup?
Does hash join help?
Do we materialize the actual join?
…
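All of these questions can simply be delegated to the optimizer, and EXPLAIN shows which answers it picked (the schema below is hypothetical):

```sql
-- Ask the planner which strategy it chose for a filtered join.
EXPLAIN ANALYZE
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at > now() - interval '7 days';

-- Depending on column statistics, the resulting plan may show a hash
-- join, a nested-loop with index lookups, or a merge join, chosen
-- without any effort on the application's part.
```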
Doing without one boils down to designing and implementing one's own query optimizer, an effort that I would categorize as "re-inventing the wheel", without any immediately obvious benefit.
Locks and bloat.
Locks are something nobody can get around when it comes to concurrent access to mutable state. This is not to say the problem is exclusive to Postgresql; it manifests in MongoDB as well. It is just a lot more likely to bite with Postgresql.
Altering a table itself is almost instantaneous; common operations include adding and dropping columns. Both require an ACCESS EXCLUSIVE lock on the table being modified in Postgresql. If another transaction is running, the ALTER TABLE command queues a pending request for the table's ACCESS EXCLUSIVE lock, preventing any further access to that table, including reads, until it finishes. This scenario does not exist in MongoDB, where no operation is needed to add or drop fields (other than perhaps deleting the fields' content), side-stepping the issue entirely.
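One common mitigation, sketched here with a hypothetical table, is to bound how long the ALTER may sit in the lock queue so that a pending ACCESS EXCLUSIVE request cannot stall reads indefinitely:

```sql
-- Give up quickly instead of blocking the whole table behind us.
SET lock_timeout = '2s';
ALTER TABLE orders ADD COLUMN archived boolean;
-- If the lock is not acquired within 2 seconds, the ALTER fails fast
-- and can simply be retried during a quieter window.
```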
Bloat is a (notorious) side-effect of Postgresql's simple MVCC implementation. It is the phenomenon whereby, as a table undergoes updates or deletes, its own and its indexes' disk consumption keeps growing, until a VACUUM reclaims the space. Beyond increased disk consumption, the larger indexes also enlarge the working set, reducing overall performance. This is a somewhat unique problem to Postgresql. Workarounds do exist, such as pg_repack, but they all come at the cost of increased maintenance effort or temporary resource consumption.
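Bloat is at least easy to observe; a quick sketch (the table name is hypothetical, the catalog view is standard):

```sql
-- Dead tuples accumulated since the last vacuum, per table.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;

-- Make dead space reusable (does not shrink the file on disk):
VACUUM orders;
-- VACUUM FULL, or pg_repack, returns the space to the OS, at the cost
-- of an exclusive lock or temporary extra disk usage, respectively.
```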
In the end, what swayed our decision towards Postgresql was cost and, most importantly, the existence of a mature query planner/optimizer. We judged that the engineering effort we would have to expend on more sophisticated lock management and potential sharding management would be well worth the runtime savings and not having to "re-invent the wheel".