As enterprises modernize their applications and infrastructure in the interest of achieving greater agility, they are evaluating security solutions that would apply to the cloud-native platforms they’re adopting. But due to the ephemeral nature of containers, by the time anyone investigates the causes of an incident, the containers responsible could have already disappeared. Meanwhile, manual processes can’t keep up with how fast the components of applications and microservices talk to each other.
Machine learning (ML) can help you find anomalous and malicious activity and automate workflows so that operations teams and developers can address issues faster. Traditional security approaches, by comparison, can’t keep up with the speed or scale of these new architectures.
Unfortunately, ML’s widespread adoption has created excessive hype and confusion around what it can actually do. The vast majority of enterprise security solutions — including antivirus tools, firewalls, intrusion detection, and intrusion prevention systems — either use or claim to use ML to detect threats that traditional approaches can’t.
Who, or What, Learns What from Whom
Machine learning isn’t a panacea. Merely employing it within a threat detection engine cannot guarantee an improvement in your security posture. In fact, if used improperly, the results from ML can actually be detrimental, driving up both the noise a solution generates and the rate of false positives and false discoveries. Under the protective guise of a security platform, machine learning can be a double-edged sword.
The term “machine learning” is something of a misnomer. No machine learns the way humans do, at least not yet. As it exists today, ML is the practice of harnessing modern computational power to build mathematical and statistical models that explain patterns in data. This holds true for cybersecurity, where engineers build models on security data to find patterns that indicate possible compromises within infrastructure or an operating environment.
Many of ML’s mathematical underpinnings were developed decades or even centuries ago. For example, Bayesian statistics (which expresses the certainty of a proposition as a probability that is updated as evidence arrives) dates back to the 18th century, and neural networks (which learn patterns from large sets of training data) date back to the 1940s. Recent advances in hardware parallelization, including graphics processing units (GPUs) and FPGA accelerators, have made it possible for systems to process vastly larger data sets and build models that yield richer sets of behaviors than ever before. The mathematics behind these machine learning models has withstood the test of time, with more recent improvements and changes emerging from academic circles.
Although there are several ML algorithms designed to classify and predict threats, the effectiveness of models based on those algorithms depends heavily on the volume and quality of data that are fed into them. Put another way, machine learning can only be as good as the data you feed it.
The Predictive Power of Data
The dataset itself needs to have predictive power. For example, consider the task of building a basic model to predict whether a file operation exhibits malicious behavior. What are some relevant data points that may provide us with information on whether this behavior is indeed malicious?
- Whether the file is a valid executable
- Whether the file is located within a protected system folder
- Which permissions apply to the file
- Whether the file references system libraries
- Whether the file has a valid digital signature
Each of these indicators is likely to have direct predictive power and may help detect the malicious intent of a particular file operation. Other properties such as file size may have a more indirect predictive power. By itself, file size tells us nothing about whether a file modification was malicious. But viewed within a context of multiple different indicators, it could help us better understand the type of malicious behavior at play.
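As a rough illustration of how the indicators above might be handed to a model, the sketch below packs them into a numeric feature vector. The class name, field names, and the choice of which permission bits to expose are all assumptions made for this example, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass
class FileOperationFeatures:
    """Hypothetical record of the indicators listed above for one file operation."""
    is_valid_executable: bool
    in_protected_system_folder: bool
    permission_bits: int               # POSIX-style mode bits, e.g. 0o755
    references_system_libraries: bool
    has_valid_signature: bool
    size_bytes: int                    # indirect indicator: only useful in context

    def to_vector(self) -> list:
        """Flatten the record into numbers a classifier can consume."""
        return [
            float(self.is_valid_executable),
            float(self.in_protected_system_folder),
            float(bool(self.permission_bits & 0o002)),  # writable by "other"?
            float(bool(self.permission_bits & 0o111)),  # executable by anyone?
            float(self.references_system_libraries),
            float(self.has_valid_signature),
            float(self.size_bytes),
        ]


# An unsigned executable written into a protected folder (illustrative values).
op = FileOperationFeatures(True, True, 0o755, True, False, 2_500_000)
print(op.to_vector())
```

Direct indicators become 0/1 flags, while file size stays a raw number that only gains meaning alongside the other columns.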
Ransomware files tend to be significantly larger than standard binaries. If your enterprise builds or uses legitimate binaries similar in size to a typical ransomware variant, then scanning and filtering binaries according to file size alone won’t protect you against a threat like this. But file size coupled with other information can provide critical context that enables more confident ransomware detection. For example, file entropy is a measure of how closely the contents of a file resemble white noise, as opposed to following more predictable patterns the way written text would. Encrypted or compressed content sits near the white-noise end of that scale, so a file operation that is both large and high in entropy deserves more scrutiny than either signal would warrant on its own.
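The entropy measure mentioned above is commonly computed as Shannon entropy over byte frequencies. The following is a minimal sketch; the sample inputs are made up, and `os.urandom` merely stands in for encrypted or compressed content.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Written text uses few distinct byte values, so its entropy is low.
text = b"the quick brown fox jumps over the lazy dog " * 200
# Random bytes spread evenly across all 256 values, like white noise.
noise = os.urandom(len(text))

print(f"text:  {shannon_entropy(text):.2f} bits/byte")
print(f"noise: {shannon_entropy(noise):.2f} bits/byte")
```

Values near 8 bits per byte indicate content that is close to white noise, while written text scores much lower.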
Machine learning classifiers such as random forests (ensembles of decision trees whose individual votes are combined into a single prediction) are well suited to multidimensional spaces. They can find patterns like these and classify them with a lower false positive rate than independent rules applied to each indicator would achieve. When the number of indicators runs into the thousands and beyond, exploring these relationships and building classifiers around them becomes an issue of scale. In this case, machine learning can quickly find correlations among many features, a task that would otherwise be nearly impossible to do manually.
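To make the comparison with independent rules concrete, here is a small sketch using bagged decision trees, the core idea behind a random forest. Everything in it is an assumption for illustration: the data is synthetic, the two features (size in MB and entropy) and the rule thresholds are invented, and with only two features it omits the per-split feature sampling that a full random forest performs.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (impurity, feature, threshold) with the lowest weighted Gini, or None."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, depth):
    """Grow a small tree; leaves are class labels, inner nodes are tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels)
    if split is None:
        return majority
    _, f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    if not left or not right:
        return majority
    return (f, t,
            build_tree([rows[i] for i in left], [labels[i] for i in left], depth - 1),
            build_tree([rows[i] for i in right], [labels[i] for i in right], depth - 1))


def predict_tree(node, row):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def train_forest(rows, labels, rng, n_trees=15, depth=3):
    """Bagging: train each tree on a bootstrap resample of the data."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], depth))
    return forest


def predict_forest(forest, row):
    """Majority vote across all trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]


def make_samples(rng, n_per_cluster):
    """Synthetic (size_mb, entropy) rows; only large AND high-entropy is malicious."""
    clusters = [
        ((0.1, 2.0), (3.0, 6.0), 0),   # small, low entropy
        ((5.0, 20.0), (3.0, 6.0), 0),  # large, low entropy
        ((0.1, 2.0), (7.2, 8.0), 0),   # small, high entropy
        ((5.0, 20.0), (7.2, 8.0), 1),  # large, high entropy: ransomware-like
    ]
    rows, labels = [], []
    for size_range, entropy_range, label in clusters:
        for _ in range(n_per_cluster):
            rows.append((rng.uniform(*size_range), rng.uniform(*entropy_range)))
            labels.append(label)
    return rows, labels


def independent_rules(row):
    """Flag a file if EITHER indicator alone crosses its (made-up) threshold."""
    size_mb, entropy = row
    return int(size_mb > 3.5 or entropy > 6.6)


def false_positive_rate(predict, rows, labels):
    benign = [r for r, y in zip(rows, labels) if y == 0]
    return sum(predict(r) for r in benign) / len(benign)


rng = random.Random(42)
train_rows, train_labels = make_samples(rng, 30)
test_rows, test_labels = make_samples(rng, 30)
forest = train_forest(train_rows, train_labels, rng)

rules_fpr = false_positive_rate(independent_rules, test_rows, test_labels)
forest_fpr = false_positive_rate(lambda r: predict_forest(forest, r), test_rows, test_labels)
print(f"independent rules FPR: {rules_fpr:.2f}")
print(f"bagged trees FPR:      {forest_fpr:.2f}")
```

The independent rules flag every large file and every high-entropy file, so two of the three benign clusters become false positives. The trees learn that only the combination matters.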
Beware of Spurious Correlations
There is a critical principle to consider when using machine learning to identify relationships between several indicators: Correlation does not necessarily imply causation.
This is an essential concept when the number of indicators is very large and many of them have little predictive power. The result can be a model that appears highly accurate in training but reflects nothing more than an artifact of the data set used to train it, and is not representative of the overall relationship you’re trying to model (a problem known as overfitting, which selection bias in the training data makes worse).
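The effect is easy to reproduce with pure noise. In the sketch below, which uses invented sample sizes and a fixed random seed, a thousand indicators with no relationship to the target are tested against a small training set. The best of them looks strongly correlated, and the correlation vanishes on fresh data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(7)
n_train, n_fresh, n_features = 20, 2000, 1000

# The target and every indicator are independent noise: no real relationship exists.
target = [rng.gauss(0, 1) for _ in range(n_train)]
features = [[rng.gauss(0, 1) for _ in range(n_train)] for _ in range(n_features)]

# Pick the indicator that looks most predictive on the small training set.
train_corrs = [abs(pearson(f, target)) for f in features]
best = max(range(n_features), key=train_corrs.__getitem__)

# Fresh observations of that indicator and the target are new, equally unrelated draws.
fresh_target = [rng.gauss(0, 1) for _ in range(n_fresh)]
fresh_feature = [rng.gauss(0, 1) for _ in range(n_fresh)]
fresh_corr = abs(pearson(fresh_feature, fresh_target))

print(f"best of {n_features} noise indicators on {n_train} samples: |r| = {train_corrs[best]:.2f}")
print(f"same indicator on {n_fresh} fresh samples: |r| = {fresh_corr:.2f}")
```

Searching enough weak indicators against a small sample will nearly always turn up an impressive-looking match, which is why held-out data is needed to vet a model.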
Author and statistics enthusiast Tyler Vigen maintains a website called Spurious Correlations, featuring several examples in which two trends are logically independent but show high statistical correlation. One such example correlates drowning statistics with Nicolas Cage films. Models built on relationships like these will almost invariably fall apart when they are tested against new data.
Any machine learning model or system needs to be backed by solid data science work to vet not just the model, but also the data that it is trained on. Machine learning models can be updated to use new data that can improve efficacy in real time.
However, it’s important to note that with an ever-changing security landscape, the underlying data sources used to build models will invariably change as well, and certain features that are currently strong indicators of a security event may not remain so strong in the future. Day-to-day developments in cybersecurity may dictate how the data used to generate machine learning models needs to change. Keeping that data current helps the models maintain a high degree of accuracy, along with a minimal false discovery rate.
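One simple way to notice when indicators are losing their strength is to track the false discovery rate, the share of raised alerts that turn out to be benign, over time. The sketch below is only an illustration: the weekly counts, the 20% threshold, and the three-week window are hypothetical values, not recommendations.

```python
def false_discovery_rate(true_positives, false_positives):
    """Share of raised alerts that turned out to be benign: FP / (FP + TP)."""
    alerts = true_positives + false_positives
    return false_positives / alerts if alerts else 0.0


def needs_retraining(period_counts, max_fdr=0.2, window=3):
    """True if the FDR exceeded `max_fdr` in each of the last `window` periods."""
    recent = period_counts[-window:]
    return len(recent) == window and all(
        false_discovery_rate(tp, fp) > max_fdr for tp, fp in recent
    )


# Hypothetical (true_positive, false_positive) alert counts per week after triage.
history = [(45, 5), (44, 6), (40, 10), (33, 17), (30, 20), (27, 23)]

for week, (tp, fp) in enumerate(history, start=1):
    print(f"week {week}: FDR = {false_discovery_rate(tp, fp):.2f}")
print("retrain:", needs_retraining(history))
```

A sustained rise like the one in this made-up history suggests that the features the model relies on no longer separate malicious from benign activity as well as they once did.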
StackRox sponsored this story.
Feature image: An actual random forest chart by Avi Yaschin, Senior Product Manager, IBM Watson Group, posted to GitHub.