Alerting systems are an important tool for IT support teams. These teams are often responsible for maintaining and fixing services, so they need to be informed in a timely manner of any faults in order to keep services running at peak performance.
Traditionally, alerting systems consist of simple threshold-based anomaly detection processes, where the threshold value is set manually by the team. For example, the error count for each service is calculated over rolling time periods (say, five minutes), and if this count is greater than or equal to the threshold value, the team is alerted.
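As a rough illustration, the threshold check described above could be sketched as follows. The class name `ThresholdAlerter`, its parameters, and the use of timestamps in seconds are assumptions made for this example, not details of any particular alerting product; it also assumes errors arrive in time order.

```python
from collections import deque


class ThresholdAlerter:
    """Static-threshold alerting over a rolling time window (illustrative only)."""

    def __init__(self, threshold, window_seconds=300):
        self.threshold = threshold            # manually chosen by the team
        self.window_seconds = window_seconds  # 300 s = the five-minute example
        self._errors = deque()                # timestamps of recent errors

    def record_error(self, timestamp):
        """Record one error; return True if the team should be alerted."""
        self._errors.append(timestamp)
        # Drop errors that have fallen out of the rolling window.
        while self._errors and self._errors[0] <= timestamp - self.window_seconds:
            self._errors.popleft()
        # Alert when the count is greater than or equal to the threshold.
        return len(self._errors) >= self.threshold
```

With a threshold of 3, three errors inside five minutes would trigger an alert, while the same three errors spread over a longer period would not. Note that the threshold never changes with the time of day, which is the first drawback of this approach.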
The main benefit of this approach is that it gives the support team a degree of control over the alerting process (e.g. adjusting the thresholds according to planned activity). However, there are a couple of drawbacks:
- The threshold is static – it is the same value every day and at every time of day, whereas user activity varies;
- Low thresholds – usually chosen for services with lower user activity – tend to alert on noise, or on an atypical burst of activity, rather than on symptoms of a faulty service.
We decided to test the theory that machine learning could circumvent these drawbacks. An innovation project set up by the team prototyped an alerting system that dynamically and automatically determines whether there is a genuine fault within the services.
Fundamental to the prototype is a predictive analysis tool that, for any given rolling 15-minute period, determines if the number of errors is typical or atypical.
It does this by using statistical formulae to benchmark the real-time error count against historic data for that same 15-minute time period. Then, if the number of errors is atypical, the prototype determines whether or not this is noise by comparing the ratio of healthy calls to error calls in this 15-minute period.
Thus the prototype only alerts on a service if both the number of errors is atypical for the time period, and the error count is not the consequence of noise. In this way, the alerting of a service is controlled by an automated, unsupervised machine learning algorithm.
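The two-stage decision described above might look something like the sketch below. The article does not say which statistical formulae the prototype uses, so the z-score test, the cutoff of 3.0, and the 5% error-ratio floor are all assumptions chosen purely for illustration, as are the function names.

```python
from statistics import mean, stdev


def is_atypical(current_errors, historic_errors, z_cutoff=3.0):
    """Stage 1 (assumed z-score test): is the current error count unusually
    high compared with historic counts for the same 15-minute period?
    Requires at least two historic values."""
    mu = mean(historic_errors)
    sigma = stdev(historic_errors)
    if sigma == 0:
        # No historic variation: any count above the usual value is atypical.
        return current_errors > mu
    return (current_errors - mu) / sigma > z_cutoff


def is_noise(error_calls, healthy_calls, min_error_ratio=0.05):
    """Stage 2 (assumed cutoff): treat the errors as noise when they are a
    small fraction of the healthy calls in the same period."""
    if healthy_calls == 0:
        return False  # nothing is succeeding, so this is not noise
    return error_calls / healthy_calls < min_error_ratio


def should_alert(error_calls, healthy_calls, historic_errors):
    """Alert only if the error count is atypical AND it is not noise."""
    return is_atypical(error_calls, historic_errors) and not is_noise(
        error_calls, healthy_calls
    )
```

Under these assumptions, 40 errors against a history of two to four errors would be atypical, but would only raise an alert if healthy traffic in the same period were low enough for those errors to be a meaningful share of calls.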
To determine if machine learning can improve the alerting process, we tested the effectiveness of this prototype by comparing it to a traditional alerting system over a period of 14 days, for a single service and a single type of error.
Fortunately, this was a period where there was a lot of alerting activity on this service. The traditional system alerted on 32 separate occasions. In contrast, the prototype alerted on only four occasions (these overlapped with four of the 32 occasions when the traditional system alerted).
Clearly, the prototype has the potential to reduce the number of alerts, but is it missing any genuine system faults that the traditional system is alerting on? The team found the answer to this was ‘no.’ They arrived at this answer by comparing the ratio of healthy calls to error calls for each alerting period.
In the periods alerted only by the traditional system, the percentage of error calls to healthy calls was typically less than 0.5%, with a maximum of 2.93%. In contrast, in the periods alerted by both systems, this percentage was 50.2% at its lowest.
In conclusion, the prototype demonstrates empirically that machine learning has the potential to substantially improve traditional alerting systems. Although this is good news, there are many other issues to consider in order to improve the prototype further. These include collecting more suitable historical data and improving the predictive analysis model.
Let us know if you’ve had any interesting findings or how you’ve used ML to improve service performance.