PostgreSQL DB setup

These couple of blog posts will go through configuring a simple PostgreSQL VM cluster for a basic production workload. This by no means is a through, bulletproof and idiotproof setup for all production workloads. But it should give you a springboard for configuring a proper production setup. I will go through settings I believe that no matter what your workload is you will have to configure.

What I constitute a basic production grade PostgreSQL VM cluster to be is one consisting of a primary DB with at least one HOT standby and at least one additional backup (Barman) server. This blog will go through installing PostgreSQL on Ubuntu and configuring it (as a primary database).

These blogs will be written with PostgreSQL version 16 which at this time was the default version available via Ubuntu's package manager. And I highly recommend that you read the official documentation first so that you have some grasp about PostgreSQL server administration yourself. You should especially read these chapters:

Lastly we will not use the postgres superuser nor the postgres database in this guide as the targets for replication, archiving or general production usage. We will instead create a new user called prod_user and a new database called prod.

Installing PostgreSQL on Ubuntu

To install PostgreSQL on your Ubutnu (or any Debian derivative for that matter) simply run this command:

apt install postgresql

After this command finishes there are a couple of new things on your VM (in your case there may be a different version number):

A new user called postgres with a home directory under /var/lib/postgresql
Main configuration files located under /etc/postgres/16/main
The cluster data directory located under /var/lib/postgresql/16/main
A couple of new SystemD services

The postgres user should be generally used to perform cluster maintenance by the administrator due to the fact that it has access to all databases and can immediately log into any database using Unix domain sockets as specified by the pg_hba.conf file.

You may assign the user a password if you prefer. Furthermore it is generally not recommended to allow remote access to any database using the postgres user since it has superuser privileges inside the cluster.

Creating a user and database

Log into the postgres user. Once you are logged into the user execute psql. This will launch the psql shell with the postgres database and the postgres superuser.

Lets create the prod_user user and assign him all the necessary roles. We will give it the password prodpass and the login and createdb roles which allow the user to login and to create its own databases.

create role prod_user with login createdb password 'prodpass';

Now lets create the prod database and assign it to prod_user.

create database prod with owner = prod_user;

Now if you enter \l into the psql shell you should see your newly created user and database as the image bellow shows.

$PostgreSQL \l command output$

Now the database and user we are going to use for our VM cluster are ready.

Allowing remote access to the prod database

To allow remote access to the prod database we will edit the pg_hba.conf file and the postgresql.conf file located under the /etc/postgres/16/main directory. First edit the postgresql.conf by modifying the listen_addresses connection setting adding the value * instead of localhost (note this allows connections from all available IP interfaces to the cluster). Furthermore edit the pg_hba.conf by adding the line bellow to the end of the file. This will allow access to the prod DB using the prod_user and his password from ANY IPv4 address.

host prod prod_user 0.0.0.0/0 scram-sha-256

For this configuration to take effect you must restart the PostgreSQL cluster. Easiest way is to simply restart the postgresql.service.

This may look unsafe at first but consider that security is usually managed using a firewall and/or VPN in such environments or a more modern take than VPN would be a Zero Trust configuration.

Server configuration - memory, connections and WAL

Ubuntu sets default settings for your PostgreSQL cluster to be fairly low considering resource usage. Which is good if you don't have resources to spare but generally production databases have lots of both CPU and RAM available to them. Therefore to create a production ready cluster we must modify some server settings. Note that we will not discuss each and every setting. For that you can simply read the PostgreSQL official Server Configuration.

Connection management ~ max_connections

A fairly important connection setting is max_connections. Determines the maximum number of concurrent connections to the database server. The default is typically 100 connections. Dully note that if you choose to increase this setting you should also configure the associated per connection memory settings.

Memory management

There are many memory settings but the most important ones to properly configure are:

shared_buffers - Should be set at 25% of your available RAM. The default is 128MB which is fairly low.
temp_buffers - This is a per connection setting. The more concurrent connections you have the larger the impact of this setting will be. The default is 8MB but multiply it by max_connections (which is your worst case scenario in case all connections are used). If your system runs large analytical queries or complex joins that need temporary storage, you may need to increase the default.
work_mem - A per connection query setting. In some cases a queries work_mem can exceed the specified value when for example executing HASH operations (influenced by the hash_mem_multiplier setting, default 2.0). The default value is 4MB. A good value for work_mem would be: (total RAM) * 0.25 / max_connections
maintenance_work_mem - This is also per connection, the default is 64MB. This can as you can see be much higher than work_mem since these maintenance operations are not used as much. A good value would be: (total RAM) * 0.05.

Setting the right amounts of memory is tricky but these suggestions should help you prepare for general production load. You may need to reconfigure these settings once production hits and you see which queries are executed more than others and whose explain plans indicate a need for reconfiguration.

WAL - Write ahead log

This server will be configured to be an asynchronous primary database. We will not wait for the standbys to confirm query consistency (aka we will not set the synchronous_commit value to remote_apply). But we will leave behind a large enough WAL backlog which in turn will help the standbys to catch up in case for example a network error occurs or a standby fails.

Settings we will change from their defaults are:

wal_sync_method - The default is fsync, but since we are using Ubutnu aka Linux we will set it to fdatasync.
wal_keep_size - This should be as high as you think a server could fail and need WAL files before successfully recovering. Consider that the more your database works e.g. the more transactions are executes the larger the WAL directory will grow. For example if you generate 1GB of WAL files in a day you can leave this as small as 200MB, but if you generate 1GB of WAL files in 10 minutes than this should be at least 10 to 20 GB. Depending on your RTO.

With these configurations we have a server ready for basic production load and for standbys to pull data from it.

Conclusion

This type of setup has generally proven good for small to medium-sized enterprise applications (around hundreds to tens of thousands transactions per hour) and up to some large enterprises (above hundred thousand transactions per hour). Usually the larger the applications usage of the DB or the number of applications using the DB the VM hosting the PostgreSQL DB should be stronger (more RAM and more CPU with higher server settings).

In the next blog post I will set up a HOT standby that uses streaming replication to stream WAL files from this primary database to itself.

Created at: 13-Dec-2024