PostgreSQL HA

Solution choice

cloudnative-pg/cloudnative-pg: CloudNativePG is a comprehensive platform designed to seamlessly manage PostgreSQL databases within Kubernetes environments, covering the entire operational lifecycle from initial deployment to ongoing maintenance

Active community, a high star count, and a polished solution tailored for K8s. For an introduction, see: 4,300 GitHub Stars and Counting: CloudNativePG Now the World’s #1 PostgreSQL Operator for Kubernetes

Installation

Installation by Helm: cloudnative-pg/charts: CloudNativePG Helm Charts. We install the cnpg/cloudnative-pg chart at version 0.22.1, which corresponds to app version 1.24.1 and defaults to PostgreSQL 17.0. That version appears to have bugs: [Bug]: Primary PostgreSQL Instance Not Gracefully Shutting Down and Missing WAL File · Issue #3680 · cloudnative-pg/cloudnative-pg. They cause replicas to fail to finish WAL archiving (a few WALs go missing) when a backup runs after a failover of the primary, so we have to pin the Docker image tag to 17.2.
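For reference, installing the operator with Helm could look roughly like the sketch below; the release name and namespace are our own choices, and the chart repository URL is the one published by the project, so adjust to your setup:

# add the CloudNativePG chart repository and install the operator,
# pinning the chart version mentioned above (release name/namespace are arbitrary)
helm repo add cnpg https://cloudnative-pg.github.io/charts
helm repo update
helm upgrade --install cnpg \
  --namespace cnpg-system --create-namespace \
  --version 0.22.1 \
  cnpg/cloudnative-pg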

As we use Incubator to create K8s manifests through Helm releases, you should follow its syntax to define the resources. Here’s an example that installs the cluster:

resources:
- apiVersion: postgresql.cnpg.io/v1
  kind: Cluster
  metadata:
    name: postgresql
  spec:
    imageName: ghcr.io/cloudnative-pg/postgresql:17.2
    instances: 3

    storage:
      storageClass: local-path # we use local-path-provisioner to provision storage
      size: 300Gi

    monitoring:
      enablePodMonitor: true # enable the PodMonitor, which instructs Prometheus to scrape metrics from PG

    backup:
      barmanObjectStore:
        destinationPath: s3://bucket_name/a/b/c
        endpointURL: https://fsn1.your-objectstorage.com # we use Hetzner Cloud Object Storage
        s3Credentials:
          accessKeyId:
            name: backup-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: backup-creds
            key: ACCESS_SECRET_KEY
          region:
            name: backup-creds
            key: REGION # defaults to fsn1
        wal:
          compression: gzip
          encryption: # Hetzner doesn't support encryption at this time, so we have to disable it
        data:
          compression: gzip
          encryption: 
          immediateCheckpoint: false
          jobs: 8
      retentionPolicy: "7d" # keep the last 7 days of backups
- apiVersion: postgresql.cnpg.io/v1
  kind: ScheduledBackup
  metadata:
    name: backup-schedule
  spec:
    immediate: true
    schedule: "0 0 * * * *" # six-field cron (seconds first): back up every hour
    backupOwnerReference: self
    cluster:
      name: postgresql

Don’t forget to create a K8s secret named backup-creds containing the Object Storage access key and secret, then create this Helm release.
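For instance, a sketch of creating that secret with kubectl; the key names must match those referenced under s3Credentials above, and the values are placeholders:

# REGION is optional in our setup and defaults to fsn1
kubectl create secret generic backup-creds \
  --from-literal=ACCESS_KEY_ID='<your-access-key>' \
  --from-literal=ACCESS_SECRET_KEY='<your-secret-key>' \
  --from-literal=REGION='fsn1'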

Configuration

The configuration is shown in the values file above; let’s go a bit further and figure a few things out.

Replication

The architecture is one Primary and two Secondaries. At any point in time there is only one Primary, which serves read/write requests, while the Secondaries accept read-only queries. We can use the postgresql-rw service to connect to the Primary and the postgresql-ro service to connect to the Secondaries.
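As a quick illustration, with a cluster named postgresql the operator creates matching Services, so application connections can be sketched like this (the namespace, database and user below are placeholders):

# read/write: always routed to the current Primary
# <namespace> is a placeholder for wherever the cluster runs
psql "host=postgresql-rw.<namespace>.svc port=5432 dbname=app user=app"

# read-only: load-balanced across the Secondaries
psql "host=postgresql-ro.<namespace>.svc port=5432 dbname=app user=app"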

Failover

To check for replication lag, query the stats on a Secondary: select * from pg_stat_wal_receiver; or go straight to the Grafana dashboard and watch for lag on the Secondaries. If the Primary crashes (e.g. disk corruption), the controller detects that it is down and triggers an election that promotes one Secondary to Primary. Once the old Primary recovers, it is demoted to Secondary and resyncs from the new Primary.
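As a rough complement to pg_stat_wal_receiver, lag can also be quantified from the Primary side by comparing the current WAL position with what each replica has replayed, for example:

-- run on the Primary: per-replica replication lag in bytes
select application_name, state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as replay_lag_bytes
  from pg_stat_replication;

The cnpg kubectl plugin (kubectl cnpg status postgresql) also reports which instance is the current Primary and the replication status of the Secondaries.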

Backup

Since we host our own K8s cluster on bare metal, there’s no way to use the Volume Snapshot feature. Therefore, we use Hetzner Object Storage as our backup destination. One point is that the endpointURL should look like https://fsn1.your-objectstorage.com rather than the AWS style https://bucket.region.xxxx.yyyy; the other is that encryption must be left empty if you are using Hetzner Object Storage (it is advertised as S3 compatible, but it is not entirely identical to AWS S3). If you intend to use AWS S3, ignore these caveats.
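Besides the ScheduledBackup shown earlier, an on-demand backup can be requested through a Backup resource; a minimal sketch in the same Incubator values style (the name backup-manual is arbitrary):

resources:
- apiVersion: postgresql.cnpg.io/v1
  kind: Backup
  metadata:
    name: backup-manual # arbitrary name for this one-off backup
  spec:
    cluster:
      name: postgresql # must match the Cluster name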

Recovery

If you want to recover from a backup, the cluster values should look like this:

resources:
- apiVersion: postgresql.cnpg.io/v1
  kind: Cluster
  metadata:
    name: cluster-restore
  spec:
    imageName: ghcr.io/cloudnative-pg/postgresql:17.2
    instances: 3
    storage:
      storageClass: local-path
      size: 300Gi
    bootstrap:
      recovery:
        source: postgresql
        targetTime: "2024-12-10 08:35:50.00000+00" # this field should perform a Point-In-Time Recovery, but I couldn't get it to work; if you know how, please let me know
    externalClusters:
      - name: postgresql
        barmanObjectStore:
          destinationPath: s3://bucket_name/a/b/c
          endpointURL: https://fsn1.your-objectstorage.com
          s3Credentials:
            accessKeyId:
              name: backup-creds
              key: ACCESS_KEY_ID
            secretAccessKey:
              name: backup-creds
              key: ACCESS_SECRET_KEY
            region:
              name: backup-creds
              key: REGION
          wal:
            maxParallel: 8

The bootstrap.recovery.source must match externalClusters[*].name so that the controller can find the corresponding cluster and backup to recover from. It then creates a brand-new cluster named cluster-restore with all the backed-up data and permissions. Note that bootstrap.recovery is mutually exclusive with backup.
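After applying the restore cluster, it’s worth checking that the recovery finished and the new cluster is healthy; a sketch, assuming the cnpg kubectl plugin is installed:

# the fully-qualified resource name avoids clashes with other "cluster" CRDs
kubectl get clusters.postgresql.cnpg.io cluster-restore
kubectl cnpg status cluster-restore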

Monitoring

How to install the Prometheus stack is beyond the scope of this post. CloudNativePG offers an out-of-the-box Grafana dashboard where we can easily see the running status and stats of the PG cluster. Enable the PodMonitor at installation time, import the Grafana dashboard, wait a moment, and the dashboard should be all right.

resources:
- apiVersion: postgresql.cnpg.io/v1
  kind: Cluster
  metadata:
    name: postgresql
  spec:
    ...
    monitoring:
      enablePodMonitor: true # enable the PodMonitor, which instructs Prometheus to scrape metrics from PG
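To verify that scraping is wired up, we can check that the operator created a PodMonitor named after the cluster and spot-check the metrics endpoint on an instance pod; a sketch, assuming the cluster is named postgresql and the instance pod is postgresql-1 (9187 is the CNPG metrics port):

kubectl get podmonitors.monitoring.coreos.com postgresql

# forward the exporter port of one instance and peek at the metrics
kubectl port-forward pod/postgresql-1 9187:9187
curl -s localhost:9187/metrics | head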

Benchmark

We use pgbench, a performance testing tool specifically for PostgreSQL, to run stress tests. pgbench evaluates performance in terms of Transactions Per Second (TPS); this discussion is limited to TPS testing and does not cover Queries Per Second (QPS). All tests were conducted following the official docs. The machine has 128 GB of memory and an AMD 7950X3D, and we didn’t set resource limits on the instances, so the QoS should be Burstable.

Generate a database with 100,000,000 records

kubectl cnpg pgbench \
  --job-name pgbench-init \
  postgresql \
  -- --initialize --scale 1000

99900000 of 100000000 tuples (99%) of pgbench_accounts done (elapsed 50.99 s, remaining 0.05 s)
100000000 of 100000000 tuples (100%) of pgbench_accounts done (elapsed 51.05 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 68.58 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 51.11 s, vacuum 0.16 s, primary keys 17.28 s).

1 Client, 1 Thread

$ kubectl cnpg pgbench \
  --job-name pgbench-run \
  postgresql \
  -- --time 30 --client 1 --jobs 1

$ kubectl logs -f jobs/pgbench-run

pgbench (17.2 (Debian 17.2-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 1
number of threads: 1
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 3535
number of failed transactions: 0 (0.000%)
latency average = 8.484 ms
initial connection time = 13.850 ms
tps = 117.872101 (without initial connection time)

10 Clients, 1 Thread

$ kubectl cnpg pgbench \
  --job-name pgbench-run \
  postgresql \
  -- --time 60 --client 10 --jobs 1

$ kubectl logs -f jobs/pgbench-run

pgbench (17.2 (Debian 17.2-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 558157
number of failed transactions: 0 (0.000%)
latency average = 1.074 ms
initial connection time = 38.780 ms
tps = 9308.454464 (without initial connection time)

30 Clients, 10 Threads

$ kubectl cnpg pgbench \
  --job-name pgbench-run \
  postgresql \
  -- --time 60 --client 30 --jobs 10
$ kubectl logs -f jobs/pgbench-run

pgbench (17.2 (Debian 17.2-1.pgdg110+1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 30
number of threads: 10
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 239103
number of failed transactions: 0 (0.000%)
latency average = 7.526 ms
initial connection time = 26.742 ms
tps = 3986.131637 (without initial connection time)
Clients  Threads  Transactions  Latency (ms)  TPS
1        1        3535          8.484         117.87
10       1        558157        1.074         9308.45
30       10       239103        7.526         3986.13

In summary, PostgreSQL proves to be a highly performant database, delivering excellent read/write capabilities. The fact that it can handle highly concurrent requests while sustaining nearly 4,000 TPS with only 3 CPUs (30 clients and 10 threads) demonstrates its remarkable performance.