What is IT systems monitoring?

Put simply, the term “IT monitoring” refers to any processes and tools you use to determine whether your organization’s IT equipment and digital services are working properly. Monitoring helps you detect, and ultimately resolve, all sorts of problems.
Today, monitoring is complicated because our systems and architectures are complicated: the IT systems we use are distributed, just like the people we work with.
Let’s look at a couple of official definitions.
Google’s SRE book defines monitoring as the “collecting, processing, aggregating, and displaying real-time quantitative data about your system”. This data can include query counts and types, error counts and types, processing times, and server lifetimes.
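That pipeline of collecting, aggregating, and displaying quantitative data can be sketched in a few lines of Python. The metric names and sample values here are hypothetical, standing in for the query counts and processing times a real collector would scrape:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw samples, (metric name, value) pairs, as a monitoring
# agent might collect them over one scrape interval.
samples = [
    ("http_requests_total", 120),
    ("http_requests_total", 95),
    ("request_latency_ms", 40.0),
    ("request_latency_ms", 55.0),
    ("request_latency_ms", 38.0),
]

def aggregate(samples):
    """Group samples by metric name and compute simple aggregates."""
    by_metric = defaultdict(list)
    for name, value in samples:
        by_metric[name].append(value)
    return {
        name: {"count": len(vals), "sum": sum(vals), "avg": mean(vals)}
        for name, vals in by_metric.items()
    }

summary = aggregate(samples)
print(summary["request_latency_ms"])
```

A real monitoring system adds time-series storage and dashboards on top, but the core loop is this same collect-then-aggregate step.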
In ITIL® 4, information about service health and performance falls under the “Monitoring and Event Management” practice. They define monitoring as a capability that enables organizations to:
- Respond appropriately to past service-affecting events.
- Take proactive action to prevent future adverse events.
Monitoring is closely linked with many of the IT service management (ITSM) practices including incident management, problem management, availability management, capacity and performance management, information security management, service continuity management, configuration management, deployment management, and change enablement.
What to monitor in IT systems
IT systems monitoring is about answering two fundamental questions: what is happening, and why is it happening? Often this is reactive: a system malfunction triggers an alert, which is displayed so an engineer can act on it.
Metrics are the raw measurement data collected, aggregated, and analyzed by monitoring systems. IT system metrics span multiple layers, including:
- Low-level infrastructure metrics: These are measured at the level of host, server, network and facilities, and include CPU, disk space, power and interface status among others.
- Application metrics: These are measured at software level and include response time, error rate, and resource usage among others.
- Service level metrics: These are infrastructure-, connectivity-, application-based and service action-based, where applicable.
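A few low-level infrastructure metrics can be read directly with Python’s standard library. This sketch assumes a Unix-like host (`os.getloadavg` is not available on Windows):

```python
import os
import shutil

def disk_usage_percent(path="/"):
    """Low-level infrastructure metric: percentage of disk space used."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

def load_average():
    """1-, 5-, and 15-minute CPU load averages (Unix-like systems only)."""
    return os.getloadavg()

print(f"disk used: {disk_usage_percent():.1f}%")
print(f"load averages: {load_average()}")
```

Production collectors gather the same numbers, but continuously and for whole fleets rather than a single host.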
In Google’s terminology, monitoring based on metrics exposed by a system’s internals (whether low-level infrastructure metrics or application metrics) is known as “white-box monitoring”, while “black-box monitoring” tests externally visible behavior as a user would see it. Infrastructure-level monitoring is generally the preserve of system administrators and DevOps engineers; application-level monitoring is usually the work of developers and application support engineers.
IT system monitoring metrics are usually sourced from native monitoring features that are designed and built within the IT components being observed.
Beyond that, some IT monitoring systems deploy the use of custom-built instrumentation (such as lightweight software agents) which can extract more advanced service level metrics.
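A minimal sketch of such custom instrumentation is a decorator that records how long each call takes, the kind of measurement a lightweight software agent might export as a service-level metric (the function and metric store here are hypothetical):

```python
import time
from functools import wraps

# In-process store of observed call latencies, keyed by function name.
latencies = {}

def instrumented(func):
    """Record the wall-clock duration of every call to `func`."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            latencies.setdefault(func.__name__, []).append(elapsed_ms)
    return wrapper

@instrumented
def handle_request():
    time.sleep(0.01)  # stand-in for real work
    return "ok"

handle_request()
print(latencies["handle_request"])
```

A real agent would ship these measurements to a monitoring backend instead of keeping them in a dictionary.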
Four golden signals
According to Google, there are four golden signals that should be the focus of IT systems monitoring:
- Latency. The time it takes to service a request, i.e. the round-trip time, usually in milliseconds. The higher the latency, the poorer the level of service being experienced; this is where users complain about slowness and lack of responsiveness.
- Traffic. A measure of how much demand is being placed on your system, i.e. requests handled or the number of sessions within a period of time, taking up configured capacity. As the traffic increases, so does the stress on IT systems, and the potential to affect customer experience.
- Errors. The rate of requests that fail, either explicitly, implicitly, or by policy. Errors point to configuration issues or failure by elements within the service model.
- Saturation. A measure of how “full” the service is, emphasizing the resources that are most constrained. Exceeding the set utilization levels is likely to lead to performance issues.
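The four signals above can all be derived from a simple request log. This sketch assumes a hypothetical window of requests, each recorded as a latency and an HTTP status code, plus an assumed capacity figure for the saturation calculation:

```python
from statistics import median

# Hypothetical request log: (latency in ms, HTTP status) per request,
# observed over a 60-second window.
requests = [(42, 200), (55, 200), (48, 500), (61, 200), (390, 200), (47, 503)]
WINDOW_SECONDS = 60
CAPACITY_RPS = 1.0  # assumed maximum requests/second the service can absorb

latency_ms = median(lat for lat, _ in requests)                # Latency
traffic_rps = len(requests) / WINDOW_SECONDS                   # Traffic
error_rate = sum(1 for _, s in requests if s >= 500) / len(requests)  # Errors
saturation = traffic_rps / CAPACITY_RPS                        # Saturation

print(latency_ms, traffic_rps, error_rate, saturation)
```

Real systems compute these continuously and per endpoint, often preferring high percentiles of latency over the median, but the arithmetic is the same.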
Best practices for avoiding alert fatigue
As system administrators set up monitoring systems to capture more data, they run the risk of being overwhelmed by:
- The sheer volume of alerts they are paged for.
- The complexity of correlating alerts and logs.
It is good practice to set up simple, predictable, and reliable rules that catch real issues more often than not.
Future trends in Monitoring
Impact of ML and AI
The impact of AI/ML on IT systems monitoring will continue to grow, especially given the rising capability of large language models (LLMs). Modern tools with integrated AI can now handle the entire lifecycle from detection to response: analyzing large volumes of event data and taking on tedious activities such as event correlation and log analysis across distributed systems.
With appropriate training, these tools are well suited to sorting through alert “noise” and “false positives/negatives” faster and more effectively than any human team. However, this does not mean people will be eliminated from IT systems monitoring entirely; instead, their focus will shift to building better orchestration and automation tools to respond to alerts and resolve them.
Unified observability
The other trend shaping IT systems monitoring is the advent of unified observability. The rise of platforms that provide a single view across infrastructure, applications, and user experience by analyzing logs, metrics, and traces gives you a valuable magnifying glass: more thorough analysis of alerts to pinpoint the exact issues that users are facing across complex environments.
A word from our dedicated IT specialist
At Conscia Belgium, our goal is always to inform you about the latest developments as precisely and transparently as possible. With genuine IT experts who have hands-on knowledge, you can be sure you are always correctly informed.