Disclaimer

Sunday, 15 November 2020

How to resolve the problem of the pg_wal location running out of disk space?

How to resolve the problem of the pg_wal location (once known as pg_xlog) running out of disk space?

 

Transaction logs are a core component of all modern Relational Database Management Systems. They are designed to deliver improved performance.

PostgreSQL refers to these transaction logs as Write Ahead Logs (WAL).

By writing transaction information out to these logs, PostgreSQL is able to return control to the client before needing to perform the time-consuming task of persisting data changes to disk; and at the same time retaining the ability to restore a consistent database in the event of failure.

To do this, it is critical that PostgreSQL is able to read and modify the transaction logs at all times. So should a situation arise where PostgreSQL is not able to access the transaction log, the only option it has is to shut itself down as cleanly and as quickly as possible and wait until a DBA fixes the cause of the problem.

PostgreSQL by default stores its transaction logs in the pg_wal folder of the data directory (Note: older versions of PostgreSQL used a folder called pg_xlog – this folder was renamed in version 10 to emphasise the importance of the file within).

All changes made to the database are first written to the transaction log before anything else can occur. When the pg_wal directory fills up and no new files can be created, your database will likely shut down and create a state where it can't be brought up again.

You will see the following messages after this problem arises.

WARNING:  terminating connection because of crash of another server process

DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit because another server process exited abnormally and possibly corrupted shared memory.

HINT:  In a moment you should be able to reconnect to the database and repeat your command. server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed.

 

  

Trying to connect a new session will result in this error:

-bash-4.2$ psql -p 5438 postgres psql: could not connect to server: No such file or directory

        Is the server running locally and accepting

        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5448"?

 

Resolving the situation

This situation can be resolved by following one of two simple approaches described below. The first approach is reactive as it explains what to do once the situation has already occurred. The second is a more proactive approach and outlines a number of steps to avoid the situation occurring in the first place:

The database won't start due to the disk on which the WAL is stored becoming full.

 

This problem could be caused by a number of things:

Problemà1) Check whether the archiving is functioning properly.

Do this by ensuring the archive command is set correctly and the concerned directory has the required permissions.

If the WAL archiving option is in use, old WAL segments cannot be removed or recycled until they are archived. If WAL archiving cannot keep up with the pace that WAL files are generated, or if the archive_command fails repeatedly, old WAL logs will accumulate until the archiving issue is resolved.

The PostgreSQL logs will give you a pretty clear indication of the problem (for example, in the log snippet shown below the archive_command is failing repeatedly). The log will look similar to below:

sh: /archive/ist5/arc/0000000400000028000000CD: No such file or directory
LOG:  archive command failed with exit code 1 (11263)
DETAIL: The failed archive command was: gzip < pg_wal/0000000400000028000000CD >/archive/ist5/arc/0000000400000028000000CD (14792)
PANIC:  could not write to file "pg_wal/waltemp.4111": No space left on device
LOG:  server process (PID 4111) was terminated by signal 6: Aborted (11206)
LOG:  terminating any other active server processes (11203)
LOG:  archiver process (PID 4103) exited with exit code 1 (11204)
LOG:  connection received: host=[local] (11213)

FATAL:  the database system is in recovery mode (11191)
LOG:  all server processes terminated; reinitializing (11210)
LOG:  database system was interrupted; last known up at 2017-10-04 01:15:41 GMT
LOG: database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 29/8D242168 (10126)
LOG:  redo done at 29/9BFFFF88 (10127)
LOG:  checkpoint starting: end-of-recovery immediate
PANIC:  could not write to file "pg_wal/waltemp.4920": No space left on device
LOG:  startup process (PID 4920) was terminated by signal 6: Aborted (11206)
LOG:  aborting startup due to startup process failure (11201)

 

Steps to resolve an archiving issue:

P1-A. Thoroughly read the PostgreSQL documentation on how to set up archiving. This high-quality documentation has all the information required to properly set up archiving.

P1-B. pg_wal holds WAL files for the archiver to archive and hence is not deleting them, this will result in a disk full issue.

Once archiving is set-up properly, start PostgreSQL (pg_ctl start –D data_directory ) and PostgreSQL will automatically start catching up. All the WAL logs that are not required will be recycled which will free up space in pg_wal directory.

You can see this is the PostgreSQL log as shown below:

DETAIL: The failed archive command was: gzip pg_wal/00000004000000290000000A /archive/inst5/arc/00000004000000290000000A

gzip: stdout: No space left on device

LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 58 removed, 3 recycled; write=0.001 s, sync=0.000 s, total=0.170 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB

LOG: checkpoint starting: time

LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s) added, 113 removed, 1 recycled; write=0.000 s, sync=0.000 s, total=0.181 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB

*Note that, at this stage the database will not be available for normal operations.

 Problem à2)  If you do not have archiving in place (which is not recommended for any transactional databases) or archiving is working fine but you can see that pg_wal is full, try one of the following solutions.

Firstly, set pg_hba.conf (by commenting client connection entries or setting method=reject) so that connections cannot be made to the database whilst under maintenance. Don't forget to change it back when everything is back to normal.

P2-A. Add more space to the partition.
The quickest way to give more space to PostgreSQL is by adding more space to the partition where the pg_wal directory is located. However, this is only a solution when your O/S and file system is designed to allow resizing. If this is not possible then try the option B below.

P2-B. Move the pg_wal directory
Move the pg_wal directory to another partition with more space while the database is shut down. Then create a symbolic link pointing to a new location and restart the database. This is a fast and effective way of getting the database up and running to resolve the problem.

If needed, resize the original partition later where the pg_wal directory was located and follow the steps to put pg_wal back to original location.

Copy all WAL logs to new WAL location (e.g. /home/transaction/inst5/transactionlog).

cp -rf /database/inst5/pg_wal/* /home/transaction/inst5/transactionlog/

If possible, and you have space at some different mount point, then take a backup of pg_wal.

mv pg_wal /home/postgre/pg_wal_bkp

Since we have all the WAL logs at new location, create a symbolic link from pg_wal to the new location (/home/transaction/inst5/transactionlog).

ln -s /home/transaction/inst5/transactionlog/ pg_wal

Start PostgreSQL and check whether new WAL logs are being generated at the new location. If everything is working fine then you can safely delete the copy of pg_wal.

rm –fR /home/postgre/pg_wal_bkp

 

P2-C. A final way of getting space back to start PostgreSQL.

Assuming you have previously followed the proactive approach and have ONLY_DELETE_THIS_DUMMY_FILE_IN_A_POSTGRES_EMERGENCY file already in your pg_wal directory or on the same mount point.

Delete the abovementioned dummy file. This will free up some space in the pg_wal directory so you can bring up your database. In this way, we don't have to look for candidate WAL to be deleted to free up space (as we are doing in the next step).

Do a checkpoint after bringing it up. Ideally, the checkpoint should start to recycle unused WAL logs. Sometimes it happens that even after issuing a checkpoint, space will not be freed up. In that case, you need to follow the next step together with this step.

rm ONLY_DELETE_THIS_DUMMY_FILE_IN_A_POSTGRES_EMERGENCY
psql –d postgres
checkpoint;

Note: When everything is back to normal, don't forget to recreate this file in case you want to use it in future.

P2-D. If the above steps are not possible then follow this final option.

Our priority is to find the WAL log where the checkpoint is writing currently. We can't use commands like pg_current_wal_lsn() as the database is not accessible at the moment. Instead, use the following command and look for Latest checkpoint's REDO WAL file (e.g. 000000010000000F00000026)

pg_controldata -D /database/inst5
Database system identifier:                      6465070279055498266
Database cluster state:                          in production
pg_control last modified:                        Wed 13 Sep 2017 12:07:38 AEST
Latest checkpoint location:                      0/F826ACC8
Prior checkpoint location:                       0/F8266088
Latest checkpoint's REDO location:               0/F826ACC8
Latest checkpoint's REDO WAL file:               000000010000000F00000026

*where D is the PGDATA directory path.

Now we know the latest WAL so we can safely use it to remove all files older than WAL file name 000000010000000F00000026 using the pg_archivecleanup command in the pg_wal directory. It is always safe to execute a dry run (with option –n) first and then use
–d option to delete them later.

pg_archivecleanup -n /home/transaction/inst5/transactionlog 000000010000000F00000026

pg_archivecleanup -d /home/transaction/inst5/transactionlog 000000010000000F00000026

Once deleted, you can start the database and you will also notice that pg_wal gets space back.
Note: Once you have done all of the above, you should take a fresh backup of your cluster.

 

A proactive approach to avoid this situation:

  1. Have a proper backup of the cluster.

  2. Proper and periodic database and database server health check.

  3. Backup pg_wal regularly and free up space. Set email alerts for disk space utilisation according to set thresholds (70%, 90% disk full etc.) and react immediately to avoid this situation.

  4. Backup archives regularly. If archive storage size is a concern, you can use gzip to compress the archive logs:


archive_command = 'gzip < %p > /archive_dir/%f'

* Note: You will then need to use gunzip during recovery, like below:

restore_command = 'gunzip < /mnt/server/archivedir/%f > %p'

  1. Make a dummy file of some specific size (eg 300MB) using the following command in a pg_wal directory or anywhere else on the same disk mount point.

Sometimes, this file is very handy to have when we need the space back to get PostgreSQL running after a disk full issue. Note there is no need to find and delete candidate WAL logs as the first approach. While having a dummy file is not mandatory, it's good to be able to free up a little work space when you need it:


dd if=/dev/zero of=/database/inst5/pg_wal/
ONLY_DELETE_THIS_DUMMY_FILE_IN_A_POSTGRES_EMERGENCY  bs=1MB count=300

 

No comments:

Post a Comment

How to recovery PDB when PDB database is dropped in Oracle

  How to recovery PDB when PDB database is dropped :) [oracle@rac01 ~]$ sqlplus '/as sysdba' SQL*Plus: Release 21.0.0.0.0 - Product...