Amit's Oracle DBA Blog: How to Investigate Node Reboot or Database System Restart in 10 Minutes

Thursday, 28 November 2024

A node reboot or database system restart is an event that a DBA often needs to investigate. Whether the database is a single instance or part of a Real Application Clusters (RAC) environment, identifying the cause is essential to keeping the system stable. Here’s a simple step-by-step approach to investigating the cause of a node reboot or system restart:

1. Check the System Log: `/var/log/messages`

The first step is to check the system log (/var/log/messages) to determine whether the reboot was caused by a node eviction, memory pressure, or a kernel bug.
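As a rough illustration of this first step, the Python sketch below finds each restart marker in a log and returns the lines that came just before it, which is usually where the cause shows up. The function name `lines_before_restart`, the `context` size, and the marker pattern are assumptions for illustration; the pattern is modeled on the `syslogd 1.4.1: restart.` lines in this post's samples, and systems running rsyslog or journald write a different marker.

```python
import re
from collections import deque

# Assumption: the reboot marker looks like the "syslogd 1.4.1: restart."
# lines shown in this post. Adjust the pattern for rsyslog or journald.
RESTART = re.compile(r"syslogd \S+: restart\.\s*$")


def lines_before_restart(lines, context=20):
    """Yield (restart_line, preceding_lines) for every restart marker found."""
    recent = deque(maxlen=context)  # rolling window of the most recent lines
    for raw in lines:
        line = raw.rstrip("\n")
        if RESTART.search(line):
            yield line, list(recent)
            recent.clear()  # start a fresh window for the next reboot
        else:
            recent.append(line)


# Sample lines copied from the eviction example in this post.
sample_log = [
    "Feb 18 17:20:42 db01 kernel: SysRq : Resetting",
    "Feb 18 17:20:44 db01 kernel: printk: 6 messages suppressed.",
    "Feb 18 17:24:26 db01 syslogd 1.4.1: restart.",
]

for restart, before in lines_before_restart(sample_log, context=5):
    print("\n".join(before))
    print(">>> restart marker:", restart)
```

To run it against a real system, pass an open file object for /var/log/messages instead of `sample_log`; reading that file normally requires root privileges.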

2. System Restart Due to Node Eviction

Symptoms: If the node was evicted from the cluster by another node, you will see the SysRq : Resetting message in the log. This indicates that Oracle Clusterware (CRS) initiated the reboot as part of a node eviction.

Log Example:

Feb 18 17:20:42 db01 kernel: SysRq : Resetting
Feb 18 17:20:44 db01 kernel: printk: 6 messages suppressed.
Feb 18 17:20:44 db01 kernel: type=1701 audit(1392744044.855:28194): auid=4294967295 uid=1000 gid=1001 ses=4294967295 pid=8368 comm="ocssd.bin" sig=6
Feb 18 17:24:26 db01 syslogd 1.4.1: restart.

Action: Investigate the node eviction further by collecting diagnostic logs from the evicted node and any other nodes in the cluster. This will help you pinpoint the reason for the eviction.
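When collecting diagnostics, it helps to know the exact reset time, how long the node was down, and which process the audit record names, so you can search the cluster logs around the right window. The Python sketch below pulls those three facts out of the sample above (with the audit record joined onto one line). The function names and the `year` argument are assumptions for illustration: classic syslog timestamps carry no year, so the caller has to supply one.

```python
import re
from datetime import datetime

COMM = re.compile(r'comm="([^"]+)" sig=(\d+)')


def parse_stamp(line, year):
    """Parse the leading 'Feb 18 17:20:42' stamp; the year is supplied by the caller."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def eviction_summary(lines, year):
    """Collect reset time, restart time, and any process named in an audit record."""
    summary = {"reset_at": None, "restart_at": None, "process": None, "signal": None}
    for line in lines:
        if "SysRq : Resetting" in line:
            summary["reset_at"] = parse_stamp(line, year)
        elif "syslogd" in line and line.rstrip().endswith("restart."):
            summary["restart_at"] = parse_stamp(line, year)
        match = COMM.search(line)
        if match:
            summary["process"] = match.group(1)
            summary["signal"] = int(match.group(2))
    return summary


eviction_sample = [
    "Feb 18 17:20:42 db01 kernel: SysRq : Resetting",
    "Feb 18 17:20:44 db01 kernel: printk: 6 messages suppressed.",
    'Feb 18 17:20:44 db01 kernel: type=1701 audit(1392744044.855:28194): '
    'auid=4294967295 uid=1000 gid=1001 ses=4294967295 pid=8368 comm="ocssd.bin" sig=6',
    "Feb 18 17:24:26 db01 syslogd 1.4.1: restart.",
]

info = eviction_summary(eviction_sample, year=2014)  # year is an assumption
gap = info["restart_at"] - info["reset_at"]
print(info["process"], "signal", info["signal"], "gap", gap)
```

For this sample the gap between the reset and the restart marker is 3 minutes 44 seconds, and the process named in the audit record is ocssd.bin, the Cluster Synchronization Services daemon.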

3. Node Reboot/System Restart Due to Memory Pressure

Symptoms: When the system is under high load and memory pressure, it may reboot without any node eviction. You won’t see a SysRq message, but you may see messages about swap and memory usage.

Log Example:

Feb 12 07:32:42 db02 kernel: Total swap = 25165816kB
Feb 12 07:32:42 db02 kernel: Free swap: 97972kB
Feb 12 07:35:49 db02 xinetd[7315]: START: omni pid=8176 from=::ffff:10.77.9.254
Feb 12 07:57:57 db02 syslogd 1.4.1: restart.

Action: Investigate OS Watcher output (or similar tools) for high load or memory pressure leading up to the restart. Look for processes that might have consumed excessive resources, such as CPU, memory, or swap.
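The swap figures in the log are easier to judge as a percentage. The Python sketch below is one way to compute swap usage from the `Total swap` and `Free swap` lines; the function name `swap_usage` and the regular expression are assumptions based only on the two line formats shown in this post's sample.

```python
import re

# Matches both "Total swap = 25165816kB" and "Free swap: 97972kB".
SWAP = re.compile(r"(Total swap|Free swap)\s*[=:]\s*(\d+)\s*kB")


def swap_usage(lines):
    """Return (total_kb, free_kb, used_percent) from the last swap figures seen."""
    values = {}
    for line in lines:
        match = SWAP.search(line)
        if match:
            values[match.group(1)] = int(match.group(2))
    if "Total swap" not in values or "Free swap" not in values:
        return None  # the log did not contain both figures
    total, free = values["Total swap"], values["Free swap"]
    if total == 0:
        return None  # no swap configured, nothing to compute
    used_percent = 100.0 * (total - free) / total
    return total, free, used_percent


swap_sample = [
    "Feb 12 07:32:42 db02 kernel: Total swap = 25165816kB",
    "Feb 12 07:32:42 db02 kernel: Free swap: 97972kB",
]

total_kb, free_kb, used = swap_usage(swap_sample)
print(f"swap used: {used:.1f}% ({free_kb} kB free of {total_kb} kB)")
```

For the sample above, swap was about 99.6% used roughly 25 minutes before the restart, which points to memory exhaustion rather than an eviction.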

4. Node Reboot/System Restart Due to Linux Kernel Bug

Symptoms: If a kernel panic occurred, it will be recorded in the log with a panic message. This is typically caused by a bug in the Linux kernel, although a faulty driver or hardware problem can trigger a panic as well.

Log Example:

---[ end trace 288cce3e7b8bd8ba ]---
Kernel panic - not syncing: Fatal exception
Pid: 6381, comm: Thread-13686 Tainted: G D 2.6.32-300.4.1.el5uek #1

Action: A panic message in the log confirms that the kernel crashed. Involve the Linux team to investigate the crash and the underlying kernel bug.
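Before escalating, it is useful to hand the Linux team the panic reason, the process that was running, and the exact kernel version. The Python sketch below extracts those fields from lines shaped like the sample above. The function name `panic_details` and both regular expressions are assumptions fitted to this one sample; panic output varies between kernel versions, so treat it as a starting point.

```python
import re

PANIC = re.compile(r"Kernel panic - not syncing: (?P<reason>.+)")
# Fitted to lines such as:
#   Pid: 6381, comm: Thread-13686 Tainted: G D 2.6.32-300.4.1.el5uek #1
PROC = re.compile(
    r"Pid: (?P<pid>\d+), comm: (?P<comm>\S+) "
    r"(?:Not tainted|Tainted: (?P<taint>.*?)) "
    r"(?P<kernel>\d+\.\d+\.\S+) #\d+"
)


def panic_details(lines):
    """Collect the panic reason, process, taint flags, and kernel version."""
    details = {}
    for line in lines:
        match = PANIC.search(line)
        if match:
            details["reason"] = match.group("reason").strip()
        match = PROC.search(line)
        if match:
            details["pid"] = int(match.group("pid"))
            details["comm"] = match.group("comm")
            details["taint"] = (match.group("taint") or "").strip()
            details["kernel"] = match.group("kernel")
    return details


panic_sample = [
    "---[ end trace 288cce3e7b8bd8ba ]---",
    "Kernel panic - not syncing: Fatal exception",
    "Pid: 6381, comm: Thread-13686 Tainted: G D 2.6.32-300.4.1.el5uek #1",
]

found = panic_details(panic_sample)
for key, value in found.items():
    print(f"{key}: {value}")
```

The kernel version string (here 2.6.32-300.4.1.el5uek) is the detail the Linux team will most likely ask for first, since known bugs are tracked per kernel release.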

5. Important Steps for Investigation:

Check System Logs: Look for SysRq, memory usage, swap, or panic messages in /var/log/messages.
OS Watcher/Top Command: If the issue is memory-related, check OS Watcher or top output for spikes in memory, CPU, or I/O activity before the restart.
Cluster Logs: For RAC environments, review the Clusterware alert log and other diagnostic files to understand node eviction events.
Kernel Panic: If a kernel panic is suspected, collect the panic messages from the log and escalate to the Linux team for detailed troubleshooting.
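The checks above can be combined into a quick first pass over the log. The Python sketch below counts how many lines match the signature strings named in this post for each cause. The `SIGNATURES` table and the `classify` function are illustrative assumptions, and a match is only a hint: swap lines, for instance, show that memory statistics were logged, not that memory pressure caused the reboot.

```python
from collections import Counter

# Signature strings taken from the log examples in this post.
SIGNATURES = {
    "node eviction": ("SysRq : Resetting",),
    "memory pressure": ("Total swap", "Free swap"),
    "kernel panic": ("Kernel panic",),
}


def classify(lines):
    """Count, per suspected cause, the lines containing one of its signatures."""
    hits = Counter()
    for line in lines:
        for cause, needles in SIGNATURES.items():
            if any(needle in line for needle in needles):
                hits[cause] += 1
    return hits


mixed_sample = [
    "Feb 12 07:32:42 db02 kernel: Total swap = 25165816kB",
    "Feb 12 07:32:42 db02 kernel: Free swap: 97972kB",
    "Feb 12 07:57:57 db02 syslogd 1.4.1: restart.",
]

for cause, count in classify(mixed_sample).most_common():
    print(f"{cause}: {count} matching line(s)")
```

Whichever cause has matches tells you which of the follow-up actions above to start with: cluster logs for an eviction, OS Watcher for memory pressure, or the Linux team for a panic.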

6. Conclusion

When investigating a node reboot or system restart, the first and most important step is to review the system logs (/var/log/messages). Understand whether the event was due to node eviction, memory pressure, or a kernel panic. This will help you identify the root cause quickly and take corrective actions to prevent recurrence. If needed, escalate to other teams (e.g., Linux) for further investigation in case of kernel-related issues.

This methodical approach allows a DBA to quickly pinpoint the cause of a reboot or eviction within 10 minutes and resolve the issue effectively.
