How to resolve the problem of the pg_wal directory (formerly known as pg_xlog) running out of disk space?
Transaction logs are a core component of all modern Relational Database Management Systems and are designed to deliver durability as well as improved performance.
PostgreSQL refers to these transaction logs as Write-Ahead Logs (WAL). By writing transaction information out to these logs, PostgreSQL can return control to the client before performing the time-consuming task of persisting data changes to disk, while at the same time retaining the ability to restore a consistent database in the event of a failure.
To do this, it is critical that PostgreSQL is able to read and modify the transaction logs at all times. Should a situation arise where PostgreSQL cannot access the transaction log, its only option is to shut itself down as cleanly and as quickly as possible and wait for a DBA to fix the cause of the problem.
By default, PostgreSQL stores its transaction logs in the pg_wal folder of the data directory. (Note: older versions of PostgreSQL used a folder called pg_xlog; it was renamed to pg_wal in version 10 to emphasise the importance of the files within.)
All changes made to the database are first written to the transaction log before anything else can occur. If the pg_wal directory fills up and no new files can be created, the database will shut itself down and refuse to start again until space is made available.
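When you suspect pg_wal is the culprit, a quick way to see how much space it is using and how many segment files it holds is shown below (the data directory /database/inst5 is the one used in the examples later in this article):
du -sh /database/inst5/pg_wal
ls /database/inst5/pg_wal | wc -l
df -h /database/inst5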
You will see messages like the following after this problem arises:
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Trying to open a new session will result in this error:
-bash-4.2$ psql -p 5438 postgres
psql: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5438"?
Resolving the situation
This situation can be resolved by following one of the two approaches described below. The first approach is reactive, as it explains what to do once the situation has already occurred. The second is proactive and outlines a number of steps to avoid the situation occurring in the first place:
The database won't start because the disk on which the WAL is stored has become full. This problem can be caused by a number of things:
Problem 1) Check whether archiving is functioning properly.
Do this by ensuring that the archive_command is set correctly and that the target archive directory has the required permissions.
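For example, a quick check from the shell could look like this (the paths are the ones used in this article's examples, and it is assumed the server runs as the postgres OS user):
# which archiving settings are configured?
grep -E "archive_mode|archive_command" /database/inst5/postgresql.conf
# does the archive directory exist, and can the postgres user write to it?
ls -ld /archive/inst5/arc
sudo -u postgres touch /archive/inst5/arc/.write_test && sudo -u postgres rm /archive/inst5/arc/.write_test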
If the WAL archiving option is in use, old WAL segments cannot be removed or recycled until they have been archived. If WAL archiving cannot keep up with the pace at which WAL files are generated, or if the archive_command fails repeatedly, old WAL files will accumulate until the archiving issue is resolved.
The PostgreSQL logs will give you a pretty clear indication of the problem (for example, in the log snippet shown below the archive_command is failing repeatedly). The log will look similar to this:
sh: /archive/ist5/arc/0000000400000028000000CD:
No such file or directory
LOG: archive command failed with exit code 1 (11263)
DETAIL: The failed archive command was: gzip <
pg_wal/0000000400000028000000CD >/archive/ist5/arc/0000000400000028000000CD
(14792)
PANIC: could not write to file "pg_wal/waltemp.4111": No space
left on device
LOG: server process (PID 4111) was terminated by signal 6: Aborted (11206)
LOG: terminating any other active server processes (11203)
LOG: archiver process (PID 4103) exited with exit code 1 (11204)
LOG: connection received: host=[local] (11213)
FATAL: the database
system is in recovery mode (11191)
LOG: all server processes terminated; reinitializing (11210)
LOG: database system was interrupted; last known up at 2017-10-04
01:15:41 GMT
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 29/8D242168 (10126)
LOG: redo done at 29/9BFFFF88 (10127)
LOG: checkpoint starting: end-of-recovery immediate
PANIC: could not write to file "pg_wal/waltemp.4920": No space
left on device
LOG: startup process (PID 4920) was terminated by signal 6: Aborted
(11206)
LOG: aborting startup due to startup process failure (11201)
Steps to resolve an archiving issue:
P1-A. Thoroughly read the PostgreSQL documentation on how to set up archiving. This high-quality documentation contains all the information required to set up archiving properly.
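As a minimal sketch of what a working configuration in postgresql.conf can look like (the archive directory is an assumption, and the archive_command is the simple copy-based example from the documentation; adapt both to your environment):
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /archive/inst5/arc/%f && cp %p /archive/inst5/arc/%f'
Note that changing archive_mode requires a server restart, whereas a changed archive_command only needs a configuration reload.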
P1-B. Remember that pg_wal holds WAL files until the archiver has successfully processed them; while archiving is failing, PostgreSQL cannot delete them, which results in a disk-full issue.
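Each WAL segment that is still waiting to be archived has a corresponding .ready file in pg_wal/archive_status, so counting those files gives a quick view of the backlog (the data directory path is the one used in this article's examples):
ls /database/inst5/pg_wal/archive_status/ | grep -c '\.ready$'
du -sh /database/inst5/pg_wal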
Once archiving is set up properly, start PostgreSQL (pg_ctl start -D data_directory) and it will automatically start catching up. All the WAL files that are no longer required will be recycled, which will free up space in the pg_wal directory.
You can see this in the PostgreSQL log, as shown below:
DETAIL: The failed archive command was: gzip < pg_wal/00000004000000290000000A > /archive/inst5/arc/00000004000000290000000A
gzip: stdout: No space left on device
LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
added, 58 removed, 3
recycled; write=0.001 s, sync=0.000 s, total=0.170 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction log file(s)
added, 113 removed,
1 recycled; write=0.000 s, sync=0.000 s, total=0.181 s; sync files=0,
longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
*Note that, at this stage, the database will not be available for normal operations.
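Once the server is accepting connections again, the built-in pg_stat_archiver view is a convenient way to confirm that archiving is catching up (an advancing last_archived_wal and a failed_count that stops growing are good signs):
psql -d postgres -c "SELECT archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_wal FROM pg_stat_archiver;"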
Problem 2) If you do not have archiving in place (which is not recommended for any transactional database), or archiving is working fine but you can see that pg_wal is full, try one of the following solutions.
First, adjust pg_hba.conf (by commenting out the client connection entries or setting the method to reject) so that connections cannot be made to the database while it is under maintenance. Don't forget to change it back when everything is back to normal.
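For example, a temporary pg_hba.conf along these lines keeps local access for the postgres superuser and rejects everything else (the entries are illustrative; keep a copy of the original file so it can be restored afterwards):
# TYPE  DATABASE  USER      ADDRESS       METHOD
local   all       postgres                peer
local   all       all                     reject
host    all       all       0.0.0.0/0     reject
Reload the configuration for the change to take effect, for example with pg_ctl reload -D /database/inst5.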
P2-A. Add more space to the partition.
The quickest way to give PostgreSQL more space is to add more space to the partition where the pg_wal directory is located. However, this is only a solution when your OS and file system allow resizing. If this is not possible, try option B below.
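How this is done depends entirely on your storage layout, so the following is only an illustrative sketch, assuming the pg_wal partition lives on an LVM logical volume (the volume name /dev/vg_data/lv_pgwal is hypothetical) with an ext4 or XFS file system:
# grow the logical volume by 10 GB and resize the file system in one step
lvextend --resizefs -L +10G /dev/vg_data/lv_pgwal
# confirm the extra space is visible
df -h /database/inst5/pg_wal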
P2-B. Move the pg_wal directory.
Move the pg_wal directory to another partition with more space while the database is shut down, then create a symbolic link pointing to the new location and restart the database. This is a fast and effective way of getting the database up and running again. If needed, you can later resize the original partition where the pg_wal directory was located and follow the same steps in reverse to put pg_wal back in its original location.
Copy all WAL files to the new WAL location (e.g. /home/transaction/inst5/transactionlog).
cp -rf /database/inst5/pg_wal/* /home/transaction/inst5/transactionlog/
If possible, and you have space at some other mount point, also take a backup of the original pg_wal.
mv pg_wal /home/postgre/pg_wal_bkp
Since all the WAL files are now at the new location, create a symbolic link from pg_wal to the new location (/home/transaction/inst5/transactionlog).
ln -s /home/transaction/inst5/transactionlog/ pg_wal
Start PostgreSQL and check whether new WAL files are being generated at the new location (a quick check is sketched after the command below). If everything is working fine, you can safely delete the copy of pg_wal.
rm -rf /home/postgre/pg_wal_bkp
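The quick check mentioned above, to be run before removing the backup, could look like this (the paths are the ones used in the examples above):
ls -ld /database/inst5/pg_wal                                # should now show a symbolic link
ls -ltr /home/transaction/inst5/transactionlog | tail -5     # the newest WAL segments should appear here
psql -d postgres -c "SELECT pg_current_wal_lsn();"           # confirms the server is up and writing WAL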
P2-C. A final way of getting space back to start PostgreSQL.
This assumes you have previously followed the proactive approach below and already have an ONLY_DELETE_THIS_DUMMY_FILE_IN_A_POSTGRES_EMERGENCY file in your pg_wal directory or on the same mount point.
Delete the above-mentioned dummy file. This will free up some space in the pg_wal directory so you can bring up your database. This way, there is no need to hunt for candidate WAL files to delete in order to free up space (as we do in the next step).
Do a checkpoint after bringing the database up. Ideally, the checkpoint will start to recycle unused WAL files. Sometimes, even after issuing a checkpoint, space is not freed up; in that case you will need to follow the next step as well.
rm ONLY_DELETE_THIS_DUMMY_FILE_IN_A_POSTGRES_EMERGENCY
psql -d postgres
checkpoint;
Note: when everything is back to normal, don't forget to recreate this file in case you want to use it again in future.
P2-D. If the above steps are not possible, follow this final option.
Our priority is to find the WAL file containing the latest checkpoint's REDO location. We can't use functions like pg_current_wal_lsn() because the database is not accessible at the moment. Instead, run the following command and look for "Latest checkpoint's REDO WAL file" (e.g. 000000010000000F00000026):
pg_controldata -D /database/inst5
Database system identifier:           6465070279055498266
Database cluster state:               in production
pg_control last modified:             Wed 13 Sep 2017 12:07:38 AEST
Latest checkpoint location:           0/F826ACC8
Prior checkpoint location:            0/F8266088
Latest checkpoint's REDO location:    0/F826ACC8
Latest checkpoint's REDO WAL file:    000000010000000F00000026
*where -D specifies the data directory (PGDATA) path.
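To pull out just the line we need, something like the following also works:
pg_controldata -D /database/inst5 | grep "REDO WAL file"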
Now that we know the latest checkpoint's WAL file, we can safely remove all files older than 000000010000000F00000026 using the pg_archivecleanup command against the pg_wal directory. It is always safer to execute a dry run first (with the -n option, which only prints the file names) and then run the command with -d to actually remove the files (the -d flag simply adds verbose output while deleting).
pg_archivecleanup -n /home/transaction/inst5/transactionlog 000000010000000F00000026
pg_archivecleanup -d /home/transaction/inst5/transactionlog 000000010000000F00000026
Once the old files have been deleted, you can start the database and you will notice that pg_wal has free space again.
Note: once you have done all of the above, you should take a fresh backup of your cluster.
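One straightforward way to take that fresh backup is pg_basebackup; the target directory below is illustrative and must have enough free space on another mount point:
pg_basebackup -D /backup/inst5/base_$(date +%Y%m%d) -Ft -z -P -X stream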
A proactive approach to avoid this situation:
- Keep a proper backup of the cluster.
- Perform proper and periodic health checks of the database and the database server.
- Back up pg_wal regularly and free up space. Set email alerts for disk space utilisation according to set thresholds (e.g. 70%, 90% disk full) and react immediately to avoid this situation (a simple sketch of such a check is shown after this list).
- Back up archives regularly. If archive storage size is a concern, you can use gzip to compress the archive logs:
archive_command = 'gzip < %p > /archive_dir/%f'
* Note: you will then need to use gunzip during recovery, like below:
restore_command = 'gunzip < /mnt/server/archivedir/%f > %p'
- Make a dummy file of some specific size (e.g. 300 MB) using the following command, either in the pg_wal directory or anywhere else on the same disk mount point. This file is very handy when you need space back to get PostgreSQL running after a disk-full issue, because there is no need to find and delete candidate WAL files as in the first approach. While having a dummy file is not mandatory, it's good to be able to free up a little working space when you need it:
dd if=/dev/zero of=/database/inst5/pg_wal/ONLY_DELETE_THIS_DUMMY_FILE_IN_A_POSTGRES_EMERGENCY bs=1MB count=300
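As a simple sketch of the disk-space alert mentioned in the list above (the threshold, mount point and e-mail address are assumptions to adapt; a proper monitoring tool will do this better):
#!/bin/bash
# warn when the mount point holding pg_wal crosses a usage threshold
THRESHOLD=70
USAGE=$(df --output=pcent /database/inst5/pg_wal | tail -n 1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
    echo "pg_wal mount point is ${USAGE}% full" | mail -s "Disk space warning" dba@example.com
fi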