Global Big Data Conference

Predictive analytics and machine learning: A dynamic duo

Posted on: Dec 06, 2016

Predictive analytics and machine learning are currently seen as a pair of tools that can save the day for most organizations. We try to de-mystify both, taking a look at what they are, how they work, and what they are good for.

Predictive analytics and machine learning working separately or together can be just what a company needs to succeed. But understanding how they work is key to figuring out how they can help businesses thrive.

So, what is predictive analytics? Mark van Rijmenam uses the car metaphor, according to which traditional, descriptive analytics is like looking at the rear-view mirror to see what has happened, while predictive analytics is using a navigation system to tell you what will happen, and prescriptive analytics is a self-driving car that knows how to take you to your destination.

This metaphor, while easy to comprehend, may also be deceptively simple. It certainly is open to interpretation, so it's a good starting point for discussion. Some might say that a navigation system presumably has access to all the data regarding potential routes. So is suggesting a route based on that data really a prediction? Isn't that something algorithmic, deterministic, thus not really "intelligent"? Or is this a matter of definitions -- semantics?

It depends on how a navigation system is defined and how it works. Typically, navigation systems do not try to predict where you want to go today. Instead, they wait for specific instructions and then figure out how to get from point A (either explicitly given as the starting point or calculated using GPS geo-location) to point B.

Let us examine a different example: Boarding Gate Readers (BGRs). BGRs are able to indicate whether a certain person should be granted access to a certain area of an airport at a certain time. For non-tech people, this is as mystifying as a navigation system: how does the system "know" what to do, what the right answer/action is?

For techies, both examples are nothing to write home about: there is a database with all the information (streets and distances, passenger lists), there is an algorithm determining the output for the given input (fastest route from A to B, whether passenger X is in the list for flight Y), there is a medium that connects the system with the outside world (GPS position, bar-code reader). In fact, there is no real prediction involved in either system.
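To make the "no real prediction" point concrete, here is a minimal Python sketch of what a BGR-style check could look like. The manifest, the departure times, the 60-minute boarding window and the function name may_board are all hypothetical and chosen for illustration; a real system would be far more involved.

```python
from datetime import datetime, timedelta

# Hypothetical passenger lists: flight -> set of boarding-pass codes.
MANIFEST = {
    "XY123": {"PASS-001", "PASS-002"},
    "XY456": {"PASS-003"},
}

# Hypothetical departure times; we assume boarding opens 60 minutes before.
DEPARTURES = {
    "XY123": datetime(2016, 12, 6, 14, 30),
    "XY456": datetime(2016, 12, 6, 18, 0),
}
BOARDING_WINDOW = timedelta(minutes=60)


def may_board(pass_code, flight, now):
    """Return True if the pass is on the flight's list and the gate is open."""
    # Step 1: plain database lookup (is passenger X on the list for flight Y?).
    if pass_code not in MANIFEST.get(flight, set()):
        return False
    # Step 2: fixed rule on time; nothing here is learned or predicted.
    departure = DEPARTURES[flight]
    return departure - BOARDING_WINDOW <= now <= departure


if __name__ == "__main__":
    print(may_board("PASS-001", "XY123", datetime(2016, 12, 6, 14, 0)))
```

Every answer follows from a lookup and a fixed rule, which is why few people would call this a predictive system.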

Seen through that lens, these systems may differ in terms of implementation details and complexity of algorithms and data, but they are fundamentally not that far apart. Still, while few people in the tech industry would classify a BGR as a predictive system, presumably some would do so for a navigation system. Is the differentiating factor that BGRs respond with a binary (access/no access) answer, while a navigator responds with specific instructions?

Machine Learning for the win?

To answer this, let's look at another example: identifying malware. As described by Kaspersky's Alexey Malanov, this used to be possible using rather straightforward algorithms and rules. At some point, the search space (i.e. the number of potential malware samples to identify) became so big and started expanding so fast that it was very hard to devise rules that would cover it in its entirety and keep them up to date. Hence, enter Machine Learning (ML).

Malanov shows how ML can be used to perform the same task -- identifying malware -- more efficiently. The essence of how this works is by using an algorithm implementing heuristic rules based on metrics (in this case, letter sequence frequency) and a curated dataset to train the algorithm. The process is different, there are quite a few gotchas along the way, but the end result is basically the same: the ability to respond to input with a binary answer of malware/not malware.
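The idea of training on letter sequence frequencies can be sketched in a few lines of Python. This is not Malanov's implementation: the scoring method (smoothed letter-pair log-likelihoods, in the spirit of naive Bayes), the function names and the tiny training strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigrams(text):
    """Return the adjacent letter pairs in the text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train(samples):
    """Count how often each letter pair appears in a list of strings."""
    counts = Counter()
    for sample in samples:
        counts.update(bigrams(sample))
    return counts


def log_likelihood(text, counts, vocab_size):
    """Sum of add-one-smoothed log-probabilities of the text's letter pairs."""
    total = sum(counts.values())
    return sum(
        math.log((counts[pair] + 1) / (total + vocab_size))
        for pair in bigrams(text)
    )


def is_malware(text, clean_counts, bad_counts):
    """Binary answer: does the text look more like the 'bad' training set?"""
    # +1 leaves room for letter pairs never seen in either training set.
    vocab_size = len(set(clean_counts) | set(bad_counts)) + 1
    bad_score = log_likelihood(text, bad_counts, vocab_size)
    clean_score = log_likelihood(text, clean_counts, vocab_size)
    return bad_score > clean_score


# Hypothetical curated dataset, far too small for real use.
CLEAN = train(["readfile", "openwindow", "printpage"])
BAD = train(["xqzjkv", "qzxvjk", "zzqxkj"])

if __name__ == "__main__":
    print(is_malware("qzxjkv", CLEAN, BAD))
    print(is_malware("openfile", CLEAN, BAD))
```

Note how much of the outcome depends on the curated samples: change the training strings and the same code gives different answers, which is where the gotchas tend to come from.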

So, is a navigator all that different? The two examples share some similarities -- they have a big search space and devising algorithms to cover it in its entirety is pretty hard. What Malanov's example shows is how an ML algorithm works as a function that classifies input into binary output. The same principle can be extended to non-binary outputs, such as choosing a route from A to B.

This is actually an optimization problem. The optimal solution for getting from A to B would be to drive in a straight line between the two points. This, however, is not possible, as only certain routes offer unobstructed access from A to B. One way to approach this would be to encode a set of rules that define what is and is not possible when driving, and then use the navigator's database in conjunction with the rules to figure out a (or the) best way to get there.
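As a rough illustration of the rule-based approach, the sketch below runs Dijkstra's shortest-path algorithm over a toy road database. The road network, the distances and the name shortest_route are hypothetical; directed edges stand in for driving rules such as one-way streets. Whether any particular navigator works this way is not something we can confirm.

```python
import heapq

# Hypothetical road database: directed edges with distances in km.
# A missing edge means "you cannot drive this way" (e.g. a one-way street).
ROADS = {
    "A": {"C": 2.0, "D": 5.0},
    "C": {"D": 1.0, "B": 7.0},
    "D": {"B": 3.0},
    "B": {},
}


def shortest_route(roads, start, goal):
    """Dijkstra's algorithm: return (distance, route), or (inf, []) if none."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == goal:
            return dist, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, length in roads.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(
                    queue, (dist + length, neighbor, path + [neighbor])
                )
    return float("inf"), []


if __name__ == "__main__":
    print(shortest_route(ROADS, "A", "B"))  # (6.0, ['A', 'C', 'D', 'B'])
```

The result is fully determined by the database and the rules: the same input always yields the same route, with no training data involved.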

The ML way to approach the same problem would be to get data on the routes people have used to go from A to B and use that to train an algorithm. In this case, there may be many alternatives for the same route, so simply responding with a "yes/no" would not do. But the same principle can be applied to classify inputs into more than two potential bins of outputs -- what is known in ML terminology as multiclass classification. A simplistic classification for potential routes could be something like "Impossible," "Bad," "Good," or "Optimal."
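A toy version of such a multiclass classifier might look like the following. It uses a nearest-centroid rule, picked here only because it fits in a few lines; the two features (detour ratio versus the straight-line distance, and average delay in minutes), the labeled trips and the function names are all hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical labeled trips: (detour ratio, average delay in minutes) -> label.
TRAINING = [
    ((1.0, 2.0), "Optimal"),
    ((1.1, 4.0), "Optimal"),
    ((1.3, 10.0), "Good"),
    ((1.4, 12.0), "Good"),
    ((2.0, 30.0), "Bad"),
    ((2.2, 35.0), "Bad"),
    ((5.0, 120.0), "Impossible"),
    ((6.0, 150.0), "Impossible"),
]


def fit_centroids(examples):
    """'Training': compute the mean feature vector for each label."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (ratio, delay), label in examples:
        entry = sums[label]
        entry[0] += ratio
        entry[1] += delay
        entry[2] += 1
    return {
        label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()
    }


def classify(features, centroids):
    """Assign the label whose centroid is closest (Euclidean distance).

    Features are left unscaled, so the delay dominates the distance;
    a real system would normalize them first.
    """
    return min(
        centroids, key=lambda label: math.dist(features, centroids[label])
    )


CENTROIDS = fit_centroids(TRAINING)

if __name__ == "__main__":
    print(classify((1.5, 14.0), CENTROIDS))
```

The output is one of four bins rather than a yes/no answer, but the mechanics are the same as in the binary case: labeled examples go in, a function that maps inputs to classes comes out.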

Predictions are hard, especially about the future

Presumably, however, most navigators don't rely on ML -- at least not for their core function. Malanov touches upon some of the reasons why ML is not a panacea: false positives, model bypass, and model updates. While valid, these may not actually be the most serious drawbacks of using ML. There seems to be a widely popular misconception at the moment that ML is something that automagically works out of the box -- you just need to throw data at it. But as Oren Etzioni of AI2 put it, "99% of machine learning is human work."

There is human work involved in finding, devising, selecting, and combining the right algorithms for the task at hand, in finding and appropriately labeling datasets to train the algorithms, in fine-tuning system parameters and so on. But equally importantly, there are cases for which ML is a great tool, others for which it is ill-suited, and others for which it needs to be combined with other techniques.
