March 2017

Solving the Bigger Data Question in Cybersecurity

By guest author Cristina Ion 

Today, even the smallest company can generate huge sets of data. Fortunately, storage technology has kept pace. With the dawn of Big Data, we are now able to store and analyze huge sets of digital information. But while this may look like a “Big Answer”, there is an even Bigger Question at stake.

Big Data is not about exploring and finding new sources of information. It is more like modern-day archaeology: using newly found methods to collect and unveil information that is already there. The purpose is to extract Small Data, that is, valuable insights, from the interpretation of these modern data relics. All this sounds great in theory, but we cannot help asking: how do enterprises manage to transfer oodles of data, within and between networks, in a secure manner?

Today, cybersecurity experts are having a tough time analyzing data quickly and efficiently, and stealthy attacks often go unnoticed. What do IT execs do in this case? More often than not, they just hire more personnel. What’s one more person spending his or her time reviewing false positives? We are not so sure about this approach, especially given the much-hyped cybersecurity talent gap. Threats grow more sophisticated, attackers change tactics, and the attack surface keeps shifting, so employing more staff may prove not only costly but ineffective as well.

May the best robot win… or not

Machine learning (ML) moves modern security solutions from an “if/then” paradigm to algorithm-based judgment calls, letting the system act as the referee in "similar to" situations. The shift is comparable to switching between programming paradigms, from functional to imperative, for instance. A functional approach composes the problem as a set of functions to be executed, carefully defining the input to each function (the value returned is therefore entirely dependent on the input). With an imperative approach to problem solving, also referred to as algorithmic programming, a developer defines a sequence of steps or instructions that are executed in order to accomplish the goal.
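To make the contrast concrete, here is a minimal sketch in Python, not a production detector. The three features, the thresholds and the "known bad" profile are all hypothetical values chosen for illustration. The first function is a classic if/then rule; the second makes a "similar to" call by measuring how far an event sits from a previously seen incident.

```python
import math

# Hypothetical event features: (logins per hour, MB uploaded, distinct hosts contacted).
# KNOWN_BAD is an illustrative profile of a past incident, not real data.
KNOWN_BAD = (40.0, 500.0, 25.0)


def rule_based(event):
    """If/then paradigm: fires only when every hard-coded threshold is crossed."""
    logins, upload_mb, hosts = event
    return logins > 50 and upload_mb > 400 and hosts > 20


def relative_distance(event, reference):
    """Euclidean distance after scaling each feature by the reference value."""
    return math.sqrt(sum(((x - r) / r) ** 2 for x, r in zip(event, reference)))


def similar_to_known_bad(event, max_distance=0.25):
    """'Similar to' paradigm: flags events that sit close to a known incident."""
    return relative_distance(event, KNOWN_BAD) <= max_distance


suspicious = (45.0, 480.0, 22.0)   # just under the login threshold of the rule
benign = (3.0, 10.0, 2.0)

print(rule_based(suspicious), similar_to_known_bad(suspicious))
print(rule_based(benign), similar_to_known_bad(benign))
```

The suspicious event slips past the rule because one threshold is narrowly missed, while the distance-based check still flags it. A real ML system would learn the reference profiles and the distance cut-off from data instead of having them hard-coded.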

By definition a subset of Artificial Intelligence, machine learning can be supervised, unsupervised or semi-supervised. As the names imply, each ML type requires a different degree of operator involvement and a specific set of algorithms. Many voices say that, given how scarce experienced cybersecurity professionals are becoming, the goal should be to replace them altogether with a sort of supreme Artificial Intelligence, omniscient and capable of rooting out all security threats: your typical Man versus Machine dystopian scenario, where the All Powerful AI wins. Translating this from fiction to fact: the world is waiting for that perfect unsupervised machine learning system, a system capable of knowing what we want to know before we even know it. And that’s where we tend to disagree.

Robots and AIs are becoming better than humans at more and more jobs (find out “what are the 10 jobs robots already do better than you” here), but cybersecurity is not your average occupation. While machine learning is awesome (there’s really no other word for it) and companies such as Facebook and Netflix have hit the jackpot with it, IT security poses a different problem. We want neither to tag our photos better nor to receive more movie suggestions. In cybersecurity, we need to detect unknown threats despite weak signals and to cut detection time to near real-time, and unsupervised machine learning excels at neither. Leaving all decisions up to an ML-powered system will inevitably lead to alert fatigue, generating an unmanageable number of potential threats, beyond the analysis capacity of even the best of us. Seeing how detecting a breach can take months on average, something needs to change.

Machine Learning: the Jarvis to your Iron Man

If neither the machine nor the man can fight cyber-threats alone, why not combine forces? The goal should be neither to replace humans with AI nor to leave everything to the AI. If we look for inspiration elsewhere, say in the Marvel universe, the best superheroes are those whose powers have been enhanced by some not-so-realistic gadget. Machine learning is far from perfect, but it has the potential to be a true sidekick for the expert analyst: the real-life (realistic) equivalent of Tony Stark’s artificially intelligent computer, JARVIS (Just A Rather Very Intelligent System). JARVIS warns of potential dangers and dismisses them once Tony Stark makes a decision. This is the essence of ML: improve over time how we distinguish between normal and malicious behaviors, then inform the decision maker. Integrated into the Iron Man armor and Stark’s home defenses, JARVIS is the perfect metaphor for the human/AI symbiosis we should aspire to.
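The warn-then-decide loop described above can be sketched in a few lines of Python. This is only an illustration of the idea: the class names, the score scale from 0 to 1 and the alert threshold are assumptions of ours, not part of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    entity: str
    score: float                 # hypothetical anomaly score between 0 and 1
    verdict: str = "pending"     # "pending", "confirmed" or "dismissed"


@dataclass
class TriageQueue:
    alerts: list = field(default_factory=list)
    feedback: list = field(default_factory=list)  # (entity, score, label) kept for retraining

    def warn(self, entity, score, threshold=0.7):
        """The machine's part: raise an alert only above the threshold."""
        if score < threshold:
            return None
        alert = Alert(entity, score)
        self.alerts.append(alert)
        return alert

    def decide(self, alert, is_threat):
        """The human's part: the analyst has the last say, and the verdict is recorded."""
        alert.verdict = "confirmed" if is_threat else "dismissed"
        self.feedback.append((alert.entity, alert.score, int(is_threat)))

    def pending(self):
        """Alerts still awaiting a decision, highest score first."""
        waiting = [a for a in self.alerts if a.verdict == "pending"]
        return sorted(waiting, key=lambda a: a.score, reverse=True)


queue = TriageQueue()
queue.warn("workstation-17", 0.92)
ignored = queue.warn("printer-03", 0.35)      # below threshold, no alert raised
laptop = queue.warn("laptop-04", 0.75)
queue.decide(laptop, is_threat=False)         # analyst dismisses the alert

print([a.entity for a in queue.pending()])
print(queue.feedback)
```

The `feedback` list is the piece that matters: every analyst verdict becomes a labeled example that can be fed back to the model, which is how the system gets better at telling normal from malicious over time.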

So where do you start? Well, first, for a more dramatic effect, put your Iron Man suit on. Then, try pinpointing the issue. Do you just need to detect compromised users? Or do you suspect that you have been, or will be, attacked? Either way, a specific use case needs to be developed. From there, identify the data required to solve the problem. If you’re after advanced persistent threats, then look for information regarding the existing security and network infrastructure. Be sure to combine multiple sources (not necessarily more, just diverse) to get a 360° view of your user activity. If your machine learning analytics are multi-dimensional, you should be able to catch malware early in the kill chain, spotting anomalies such as privilege escalation, lateral movement and data exfiltration.
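As a rough illustration of why diverse sources matter, the Python sketch below scores each user against their own history in three hypothetical log sources and only asks for a review when at least two sources look abnormal at once. The source names, the counts and the z-score threshold are invented for the example; a real deployment would use far richer models.

```python
from statistics import mean, pstdev

# Hypothetical daily counts per user and per source; the last value is "today".
history = {
    "alice": {
        "auth_failures": [1, 0, 2, 1, 1, 9],
        "mb_uploaded": [20, 25, 18, 22, 21, 400],
        "hosts_contacted": [3, 4, 3, 3, 4, 4],
    },
    "bob": {
        "auth_failures": [0, 1, 0, 1, 0, 1],
        "mb_uploaded": [30, 28, 35, 31, 29, 33],
        "hosts_contacted": [5, 5, 6, 5, 5, 6],
    },
}


def z_score(past, today):
    """How many standard deviations today's value is above the user's own baseline."""
    mu = mean(past)
    sigma = pstdev(past)
    if sigma == 0:
        return 0.0 if today == mu else float("inf")
    return (today - mu) / sigma


def anomalous_sources(per_source, threshold=3.0):
    """Names of the sources where today's activity is far above the baseline."""
    flagged = []
    for source, series in per_source.items():
        *past, today = series
        if z_score(past, today) >= threshold:
            flagged.append(source)
    return flagged


def users_to_review(all_history, min_sources=2):
    """Users whose behavior is abnormal in several independent sources at once."""
    result = {}
    for user, per_source in all_history.items():
        flagged = anomalous_sources(per_source)
        if len(flagged) >= min_sources:
            result[user] = flagged
    return result


print(users_to_review(history))
```

Requiring agreement between sources is one simple way to cut down on false positives: a spike in a single log is often noise, while failed logins combined with an unusual upload volume is a pattern worth an analyst's time.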

Finally, be patient. The core task of machine learning is to replicate and predict, and that takes time. The system needs to gather enough data and feed it to its behavior analysis engines in order to classify normal and abnormal behaviors accurately. Starting with a training set, a sample of good code and one of bad code, ML filters them with the help of statistical algorithms and, through multiple iterations, slowly learns to distinguish between the two. We say “slowly”, but it’s actually incredibly fast compared to past technologies: known threats are identified almost instantly with the help of existing knowledge bases, while in the case of unknown threats it’s a matter of days (1 week with Reveelium, read our article here). But remember: there are some behaviors that we don’t know yet and, as such, cannot teach to the system. Also, while malware can be predicted this way with a high degree of probability, it is still the human in the Iron Man suit who has the last say in the matter.
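The "learn through multiple iterations" idea can be shown with one of the simplest supervised algorithms, a perceptron, written here in plain Python. This is a teaching sketch under our own assumptions and not how Reveelium or any other product works: the two features and the six labeled samples are made up, and real training sets are vastly larger.

```python
# Hypothetical training set: ((feature_1, feature_2), label).
# Label 1 stands for a "bad" sample, label 0 for a "good" one; values are invented.
training_set = [
    ((0.9, 0.8), 1),
    ((0.8, 0.9), 1),
    ((0.7, 0.95), 1),
    ((0.2, 0.1), 0),
    ((0.1, 0.3), 0),
    ((0.3, 0.2), 0),
]


def predict(weights, bias, features):
    """Classify a sample with the current weights: 1 = bad, 0 = good."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0


def train(samples, epochs=50, learning_rate=0.1):
    """Perceptron learning: nudge the weights after every misclassified sample."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for features, label in samples:
            error = label - predict(weights, bias, features)
            if error:
                errors += 1
                bias += learning_rate * error
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
        if errors == 0:      # a full pass without mistakes: stop iterating
            break
    return weights, bias


weights, bias = train(training_set)
print(predict(weights, bias, (0.85, 0.9)))   # unseen sample close to the "bad" group
print(predict(weights, bias, (0.15, 0.2)))   # unseen sample close to the "good" group
```

Each pass over the training set corrects a few more mistakes until none are left. The limit mentioned above is visible here too: the model can only separate what the labeled samples show it, so a behavior that never appears in the training set cannot be learned this way.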