About Linux server integration pre-built alerts
The Linux server integration provides a variety of pre-built alerts that you can use right away to begin troubleshooting issues. In this step of the journey, you’ll become familiar with these pre-built alerts and learn how to use them to address various problems.
Did you know?
If your machine is functioning properly, you won’t receive any alerts. No news is good news!
Node exporter alerts
Description: High CPU usage
What this means: This alert could mean that a process has failed or that a single node is overloaded.
What to do: Check that workloads are evenly distributed among all nodes and that all processes are operating as expected.
Description: Clock not synchronising
What this means: The system is currently unable to synchronize its internal clock with an external time source. This could lead to time discrepancies if the system’s clock drifts.
What to do: Check the Network Time Protocol (NTP) configuration and confirm that the node can reach the specified time server.
Description: Clock skew detected
What this means: The system’s internal clock is inaccurate and hasn’t self-corrected.
What to do: Check the Network Time Protocol (NTP) service to ensure it’s working correctly and the node can communicate with the designated time server.
Description: Disk IO queue is high
What this means: The system is currently experiencing a significant amount of disk input/output operations. This high level of disk activity can slow down the system’s performance.
What to do: Check that all processes are running as expected and consider spreading disk-intensive tasks across multiple nodes.
Description: Kernel is predicted to exhaust file descriptors limit soon
What this means: The kernel, the core component of the operating system, can only manage a limited number of open files simultaneously. The system is nearing this limit, which may cause problems with opening new files.
What to do: Identify the process that’s consuming file descriptors. This alert is often caused by a process that’s opening many files and failing to close them properly.
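To make this check concrete, the following is a minimal sketch, assuming a Linux host with the standard /proc filesystem. It reads the system-wide file handle counters from /proc/sys/fs/file-nr and counts open descriptors per process. The function names are hypothetical and chosen for illustration only; they aren't part of the integration.

```python
from pathlib import Path

FILE_NR = Path("/proc/sys/fs/file-nr")  # Linux-only kernel interface


def parse_file_nr(text: str) -> tuple[int, int]:
    """Return (allocated, maximum) file handles from the contents of file-nr.

    The file holds three numbers: allocated handles, allocated-but-unused
    handles, and the system-wide maximum.
    """
    allocated, _unused, maximum = (int(field) for field in text.split())
    return allocated, maximum


def fd_usage_percent(allocated: int, maximum: int) -> float:
    """Percentage of the system-wide file handle limit currently in use."""
    return 100.0 * allocated / maximum


def open_fd_counts(proc_root: Path = Path("/proc")) -> dict[int, int]:
    """Map each PID to its number of open file descriptors.

    Processes that exit mid-scan or that you lack permission to inspect
    are skipped, so run this as root for a complete picture.
    """
    counts = {}
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            counts[int(entry.name)] = sum(1 for _ in (entry / "fd").iterdir())
        except OSError:
            continue
    return counts


if __name__ == "__main__" and FILE_NR.exists():
    allocated, maximum = parse_file_nr(FILE_NR.read_text())
    print(f"{allocated}/{maximum} file handles in use "
          f"({fd_usage_percent(allocated, maximum):.1f}%)")
    # Show the five processes holding the most descriptors.
    top = sorted(open_fd_counts().items(), key=lambda item: item[1], reverse=True)
    for pid, count in top[:5]:
        print(f"pid {pid}: {count} open descriptors")
```

A process whose descriptor count keeps growing between runs is a likely candidate for the leak.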
Description: Number of conntrack are getting close to the limit
What this means: Conntrack is a component of the Linux firewall that keeps track of active network connections. The system is currently tracking a high number of connections, which could be a sign of a network problem or a potential security threat.
What to do: Analyze network traffic to determine the root cause.
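Before analyzing traffic, it can help to see how close the connection tracking table is to its limit. The sketch below assumes a Linux host with the nf_conntrack kernel module loaded, which exposes the current count and the maximum under /proc/sys/net/netfilter. The function names are hypothetical.

```python
from pathlib import Path

# Present only when the nf_conntrack kernel module is loaded.
NETFILTER = Path("/proc/sys/net/netfilter")


def read_conntrack(base: Path = NETFILTER) -> tuple[int, int]:
    """Return (tracked connections, table limit) read from the kernel."""
    count = int((base / "nf_conntrack_count").read_text())
    limit = int((base / "nf_conntrack_max").read_text())
    return count, limit


def conntrack_usage_percent(count: int, limit: int) -> float:
    """Percentage of the connection tracking table currently in use."""
    return 100.0 * count / limit


if __name__ == "__main__" and (NETFILTER / "nf_conntrack_count").exists():
    count, limit = read_conntrack()
    print(f"{count}/{limit} tracked connections "
          f"({conntrack_usage_percent(count, limit):.1f}%)")
```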
Description: Host is running out of memory
What this means: A memory leak in a program could be causing high memory consumption.
What to do: Check that all processes are operating as expected. If applicable, fix the memory leak and restart the affected process, or distribute memory-intensive tasks across multiple nodes.
Description: Memory major page faults are occurring at very high rate
What this means: The system is heavily relying on disk swapping, which means it’s using more memory than is physically available. This significantly degrades performance.
What to do: Investigate whether a memory leak is the cause and, if so, resolve it.
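To observe the fault rate directly on the host, the following sketch samples the pgmajfault counter in /proc/vmstat twice and computes faults per second. It assumes a Linux host; the one-second sampling interval and the function names are illustrative choices, not values used by the alert.

```python
import time
from pathlib import Path

VMSTAT = Path("/proc/vmstat")  # Linux-only kernel interface


def major_faults(vmstat_text: str) -> int:
    """Extract the cumulative major page fault counter from vmstat contents."""
    for line in vmstat_text.splitlines():
        name, _, value = line.partition(" ")
        if name == "pgmajfault":
            return int(value)
    raise KeyError("pgmajfault not found")


def faults_per_second(before: int, after: int, seconds: float) -> float:
    """Rate of major page faults between two counter samples."""
    return (after - before) / seconds


if __name__ == "__main__" and VMSTAT.exists():
    interval = 1.0  # seconds between samples (illustrative)
    before = major_faults(VMSTAT.read_text())
    time.sleep(interval)
    after = major_faults(VMSTAT.read_text())
    print(f"{faults_per_second(before, after, interval):.1f} major faults/s")
```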
Description: Network interface is reporting many receive errors
What this means: Network connectivity issues have been detected.
What to do: Check for hardware malfunctions or malicious attacks, either of which could cause these errors.
Description: Network interface is reporting many transmit errors
What this means: Network connectivity issues have been detected.
What to do: Check for hardware malfunctions or incorrect network settings, either of which could cause these errors.
Description: RAID Array is degraded
What this means: The RAID array is in a critical state and there’s a high risk of data loss.
What to do: To prevent data loss, repair or replace the failed disks and rebuild the RAID array as soon as possible.
Description: Failed device in RAID array
What this means: One of the disks in the RAID array has failed.
What to do: While the array is currently operational, replacing the faulty disk is crucial to prevent potential data loss.
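To confirm which array is affected, the sketch below parses /proc/mdstat and lists arrays whose member-status map contains a missing device. It assumes Linux software RAID (md); hardware RAID controllers report their state through vendor tools instead. The function name is hypothetical.

```python
import re
from pathlib import Path

MDSTAT = Path("/proc/mdstat")  # exists only when the md driver is loaded

# Matches the member-status map on an array's status line, such as "[UU]"
# (all members up) or "[U_]" (second member missing).
STATUS_MAP = re.compile(r"\[([U_]+)\]")


def degraded_arrays(mdstat_text: str) -> list[str]:
    """Return the names of md arrays that have at least one missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        if line.startswith("md"):
            current = line.split()[0]  # array name, for example "md0"
        elif current:
            match = STATUS_MAP.search(line)
            if match:
                if "_" in match.group(1):
                    degraded.append(current)
                current = None  # status found, wait for the next array
    return degraded


if __name__ == "__main__" and MDSTAT.exists():
    names = degraded_arrays(MDSTAT.read_text())
    print("Degraded arrays:", ", ".join(names) if names else "none")
```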
Description: System saturated, load per core is very high
What this means: All CPU cores are operating at maximum capacity, indicating excessive workload.
What to do: Consider distributing tasks across multiple servers.
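As a quick check on the host itself, this sketch divides the load averages by the number of CPU cores. It assumes a Unix-like system where os.getloadavg() is available; the function name is hypothetical, and the sketch doesn't reproduce the alert's own threshold.

```python
import os


def load_per_core(load_average: float, cores: int) -> float:
    """Load average normalized by the number of CPU cores."""
    return load_average / cores


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # cpu_count() can return None
    labels = ("1 min", "5 min", "15 min")
    for label, load in zip(labels, os.getloadavg()):
        print(f"{label}: {load_per_core(load, cores):.2f} load per core")
```

A value that stays well above 1 means more tasks are waiting to run than there are cores to run them.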
Description: Systemd service keeps restarting, possibly crash looping
What this means: A particular service is experiencing repeated crashes.
What to do: Investigate and resolve the issue to ensure service stability.
Description: Systemd service has entered failed state
What this means: A specific service has failed and hasn’t restarted automatically.
What to do: Investigate and resolve the issue to restore service functionality.
Description: Node Exporter text file collector failed to scrape
What this means: A log file or status indicator that is typically used to gather data for a particular metric is currently unavailable. This is preventing the system from collecting and reporting the necessary data.
What to do: Consult the Grafana Alloy and system logs to determine which specific file is inaccessible.
Node exporter filesystem alerts
Description: Filesystem has less than 5% space left
What this means: The disk is almost full, indicating limited storage space.
What to do: Add storage capacity or remove unnecessary files to free up space.
Description: Filesystem has less than 3% space left
What this means: The disk is almost full, indicating limited storage space.
What to do: Add storage capacity or remove unnecessary files to free up space.
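To verify the remaining space from the host, the following sketch uses Python's shutil.disk_usage. The mount point "/" is an assumption for illustration; substitute the filesystem named in the alert. The function name is hypothetical.

```python
import shutil


def percent_free(total: int, free: int) -> float:
    """Percentage of the filesystem's capacity that is still available."""
    return 100.0 * free / total


if __name__ == "__main__":
    mount_point = "/"  # assumption: replace with the filesystem from the alert
    usage = shutil.disk_usage(mount_point)
    print(f"{mount_point}: {percent_free(usage.total, usage.free):.1f}% free "
          f"({usage.free // 2**20} MiB of {usage.total // 2**20} MiB)")
```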
Description: Filesystem is predicted to run out of inodes within the next 24 hours
What this means: While there may be some free space remaining on the device, the maximum number of files that can be stored is almost reached.
What to do: This is often caused by numerous small files, which can be a symptom of a process that’s creating files without proper cleanup, particularly in the /tmp directory.
Description: Filesystem is predicted to run out of inodes within the next 4 hours
What this means: While there may be some free space remaining on the device, the maximum number of files that can be stored is almost reached.
What to do: This is often caused by numerous small files, which can be a symptom of a process that’s creating files without proper cleanup, particularly in the /tmp directory.
Description: Filesystem has less than 5% inodes left
What this means: While there may be some free space remaining on the device, the maximum number of files that can be stored is almost reached.
What to do: This is often caused by numerous small files, which can be a symptom of a process that’s creating files without proper cleanup, particularly in the /tmp directory.
Description: Filesystem has less than 3% inodes left
What this means: While there may be some free space remaining on the device, the maximum number of files that can be stored is almost reached.
What to do: This is often caused by numerous small files, which can be a symptom of a process that’s creating files without proper cleanup, particularly in the /tmp directory.
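For the inode alerts above, the following sketch reports the percentage of free inodes on a filesystem and lists the directories that hold the most entries, which can help locate a process that isn't cleaning up after itself. It assumes a Unix-like host; the paths "/" and "/tmp" and the function names are illustrative.

```python
import os
from typing import Optional


def inodes_percent_free(total: int, free: int) -> Optional[float]:
    """Percentage of inodes still free, or None if the filesystem reports no total."""
    if total == 0:
        return None  # some filesystems don't report a fixed inode count
    return 100.0 * free / total


def entries_per_directory(root: str) -> dict[str, int]:
    """Map each directory under root to the number of entries directly inside it."""
    counts = {}
    # os.walk silently skips directories it can't read.
    for dirpath, dirnames, filenames in os.walk(root):
        counts[dirpath] = len(dirnames) + len(filenames)
    return counts


if __name__ == "__main__":
    stats = os.statvfs("/")  # assumption: replace with the affected mount point
    print("Free inodes (%):", inodes_percent_free(stats.f_files, stats.f_ffree))
    busiest = sorted(entries_per_directory("/tmp").items(),
                     key=lambda item: item[1], reverse=True)
    for path, count in busiest[:5]:
        print(f"{count:>8}  {path}")
```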