How to Investigate Node Reboot or Database System Restart in 10 Minutes
A node reboot or database system restart is an event that a DBA often needs to investigate. Whether the environment is a single-instance database or Real Application Clusters (RAC), finding the cause is essential to keeping the system stable. Here is a simple step-by-step approach to investigating a node reboot or system restart:
1. Check the System Log: /var/log/messages
The first step is to check the system log (/var/log/messages) to understand whether the reboot was caused by a node eviction, memory pressure, or a kernel bug.
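As a rough illustration of this first pass, the sketch below scans log lines for the three signatures discussed in the following sections. The patterns are taken from the log examples in this post; the function names `classify_lines` and `classify_file` are hypothetical, and real systems may word these messages differently, so treat the pattern list as a starting point rather than a complete catalogue.

```python
import re

# Signatures copied from the log examples in this post. They are assumptions
# about what your system logs; extend or adjust them for your environment.
PATTERNS = {
    "node eviction": re.compile(r"SysRq : Resetting"),
    "memory pressure": re.compile(r"Total swap|Free swap"),
    "kernel panic": re.compile(r"Kernel panic"),
}


def classify_lines(lines):
    """Return {cause: [matching lines]} for every line that hits a pattern."""
    hits = {}
    for line in lines:
        for cause, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(cause, []).append(line.rstrip("\n"))
    return hits


def classify_file(path="/var/log/messages"):
    """Classify a log file; reading it usually requires root privileges."""
    with open(path, errors="replace") as handle:
        return classify_lines(handle)
```

Calling `classify_file()` on the affected node returns a dictionary keyed by suspected cause. An empty result only means none of these particular strings appeared, not that the reboot was clean.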
2. System Restart Due to Node Eviction
- Symptoms: If the system was evicted from the cluster by another node, you will see the SysRq : Resetting message in the logs. This indicates that the Clusterware (CRS) initiated the reboot as part of a node eviction.
- Log Example:
    Feb 18 17:20:42 db01 kernel: SysRq : Resetting
    Feb 18 17:20:44 db01 kernel: printk: 6 messages suppressed.
    Feb 18 17:20:44 db01 kernel: type=1701 audit(1392744044.855:28194): auid=4294967295 uid=1000 gid=1001 ses=4294967295 pid=8368 comm="ocssd.bin" sig=6
    Feb 18 17:24:26 db01 syslogd 1.4.1: restart.
- Action: Investigate the node eviction further by collecting diagnostic logs from the evicted node and any other nodes in the cluster. This will help you pinpoint the reason for the eviction.
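When collecting diagnostics from several nodes, it helps to know the exact time window of the event. The sketch below is one possible way to find it: it looks for the syslogd restart line shown in the log example and reports how long the log was silent before it. The function names are hypothetical, the year must be supplied because this syslog format omits it, and systems running a different syslog daemon may mark restarts with different text.

```python
import re
from datetime import datetime

# Classic syslog timestamp at the start of a line, e.g. "Feb 18 17:20:44".
STAMP = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")


def parse_stamp(line, year):
    """Parse the leading timestamp; the year is an assumption you supply."""
    match = STAMP.match(line)
    if not match:
        return None
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def restart_gaps(lines, year):
    """Return (last_line_before, restart_line, silent_seconds) per restart."""
    results = []
    previous = None
    for raw in lines:
        line = raw.rstrip("\n")
        # Restart marker as it appears in this post's examples.
        if "syslogd" in line and "restart" in line and previous is not None:
            before = parse_stamp(previous, year)
            after = parse_stamp(line, year)
            if before and after:
                gap = (after - before).total_seconds()
                results.append((previous, line, gap))
        previous = line
    return results
```

For the eviction example above, the last message before the restart is stamped 17:20:44 and the restart line 17:24:26, a gap of 222 seconds. That window is the one to search for in the Clusterware logs on the other nodes.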
3. Node Reboot/System Restart Due to Memory Pressure
- Symptoms: When the system is under high load and memory pressure, it may reboot without any node eviction. You won't see a SysRq message, but you might see messages related to swap and memory usage.
- Log Example:
    Feb 12 07:32:42 db02 kernel: Total swap = 25165816kB
    Feb 12 07:32:42 db02 kernel: Free swap: 97972kB
    Feb 12 07:35:49 db02 xinetd[7315]: START: omni pid=8176 from=::ffff:10.77.9.254
    Feb 12 07:57:57 db02 syslogd 1.4.1: restart.
- Action: Investigate OS Watcher output (or similar tools) for high load or memory pressure leading up to the restart. Look for processes that might have consumed excessive resources, such as CPU, memory, or swap.
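The swap figures in the log example can be turned into a single number that is easier to judge. The sketch below, with a hypothetical helper named `swap_used_percent`, parses the Total swap and Free swap lines in the form shown in this post and computes the percentage of swap in use. The regular expression assumes that exact wording; other kernel versions may format these lines differently.

```python
import re

# Matches "Total swap = 25165816kB" and "Free swap: 97972kB" as shown in
# this post's log example (an assumption about the message format).
SWAP = re.compile(r"(Total swap|Free swap)\s*[:=]\s*(\d+)\s*kB")


def swap_used_percent(lines):
    """Return the percentage of swap in use, or None if data is missing."""
    values = {}
    for line in lines:
        match = SWAP.search(line)
        if match:
            values[match.group(1)] = int(match.group(2))
    total = values.get("Total swap")
    free = values.get("Free swap")
    if not total or free is None:
        return None
    return 100.0 * (total - free) / total
```

With the values from the example (25165816 kB total, 97972 kB free), this works out to roughly 99.6% of swap in use shortly before the restart, which is consistent with severe memory pressure.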
4. Node Reboot/System Restart Due to Linux Kernel Bug
- Symptoms: If a kernel panic occurred, it will be recorded in the log with a panic message. This is typically caused by a bug in the Linux kernel.
- Log Example:
    ---[ end trace 288cce3e7b8bd8ba ]---
    Kernel panic - not syncing: Fatal exception
    Pid: 6381, comm: Thread-13686 Tainted: G D 2.6.32-300.4.1.el5uek #1
- Action: A panic message in the logs confirms a kernel panic. Involve the Linux team to investigate the kernel crash or bug further.
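Before escalating, it can save a round trip to hand the Linux team the process, PID, and kernel version named in the panic. The sketch below pulls those fields out of a panic record in the format shown in the log example; the helper name `panic_details` is hypothetical, and the regular expression is an assumption based on that single example, so other kernel versions may need a different pattern.

```python
import re

# Pattern for lines such as:
#   Pid: 6381, comm: Thread-13686 Tainted: G D 2.6.32-300.4.1.el5uek #1
# It is derived from the one example in this post, not from a specification.
PANIC_LINE = re.compile(
    r"Pid: (?P<pid>\d+), comm: (?P<comm>\S+) "
    r"(?P<taint>Not tainted|Tainted: .*?)\s+"
    r"(?P<kernel>\d+\.\d+\.\S+) #"
)


def panic_details(text):
    """Return pid, process name, taint flags and kernel version, or None."""
    if "Kernel panic" not in text:
        return None
    match = PANIC_LINE.search(text)
    if not match:
        return None
    return {
        "pid": int(match.group("pid")),
        "comm": match.group("comm"),
        "taint": match.group("taint"),
        "kernel": match.group("kernel"),
    }
```

For the example above, this yields PID 6381, process Thread-13686, and kernel 2.6.32-300.4.1.el5uek, which are the details worth including in the escalation.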
5. Important Steps for Investigation:
- Check System Logs: Look for SysRq, memory usage, swap, or panic messages in /var/log/messages.
- OS Watcher/Top Command: If the issue is memory-related, check OS Watcher or top command logs for any spikes in memory, CPU, or I/O activity.
- Cluster Logs: For RAC environments, review Cluster Alert logs and other diagnostic files to understand node eviction events.
- Kernel Panic: If a kernel panic is suspected, get logs for the panic error and escalate to the Linux team for detailed troubleshooting.
6. Conclusion
When investigating a node reboot or system restart, the first and most important step is to review the system logs (/var/log/messages). Understand whether the event was due to node eviction, memory pressure, or a kernel panic. This will help you identify the root cause quickly and take corrective actions to prevent recurrence. If needed, escalate to other teams (e.g., Linux) for further investigation in case of kernel-related issues.
This methodical approach allows a DBA to quickly pinpoint the cause of a reboot or eviction within 10 minutes and resolve the issue effectively.