Immature Machine Learning Models As A Threat Vector
Software is eating the world. The industry has evolved since that movement began in the 1960s.
First we digitized processes, then enriched them with additional information to provide more telemetry, and put them online. Now the knowledge of the world is just an endpoint away from any app we are developing, if we just know where to look. All of that information can now be combined and synthesized to produce the types of insights that wouldn’t have even been considered just a couple decades ago.
Along the way, a cadre of computer scientists, mathematicians, and statisticians developed new and interesting ways to take increasingly complicated mathematical questions and express them as computational algorithms, effectively turning each into a series of Turing machines. As we found the intersections between hundreds of diverse fields, machine learning capabilities exploded. We can now generate predictive models at a cost and speed that makes doing so profitable almost immediately for internal use cases.
Let’s take the simple example of a basic categorization tool. These frameworks took a couple of days to develop for generic categorization uses: https://github.com/krypted/CoreML_CLI and https://github.com/krypted/lightweightcategorizer. Within only a few hours’ worth of work, a developer can take one of these (or many others) and ship a tool that analyzes text and recommends a category for a given object. Nifty, right? Based on the title of this article, it’s easy to imagine the response is both a yes and a no. Basic tools should be limited to basic, and internal, uses.
Now, let’s extend the example. Let’s manually train the model. This means each time we review a prediction we edit the model file. Maybe this means identifying why our brains chose to categorize an object with a given label and trying to create an entry in the file that weights the parameters that guided our choice. If the model produces a result that makes sense to our brains, we add a line or multiple lines into that file that weight a category to a given set of parameters. If the model predicts the wrong category, we add additional information that indicates why. This allows the model to incrementally become more accurate.
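To make that manual training loop concrete, here is a minimal sketch in Python. It is not how either of the linked frameworks works. The "category,keyword,weight" file format, the sample entries, and the function names are all assumptions invented for illustration.

```python
from collections import defaultdict

# Hypothetical hand-edited "model file": one "category,keyword,weight" entry per line.
MODEL_LINES = [
    "billing,invoice,2.0",
    "billing,refund,1.5",
    "support,crash,2.0",
    "support,error,1.0",
]

def load_model(lines):
    """Parse model lines into {keyword: {category: weight}}."""
    model = defaultdict(dict)
    for line in lines:
        category, keyword, weight = line.split(",")
        model[keyword.strip().lower()][category.strip()] = float(weight)
    return model

def categorize(model, text):
    """Sum the weights of matching keywords; return the top category or None."""
    scores = defaultdict(float)
    for word in text.lower().split():
        for category, weight in model.get(word, {}).items():
            scores[category] += weight
    if not scores:
        return None
    return max(scores, key=scores.get)

ticket = "App shows an error after refund"

model = load_model(MODEL_LINES)
first_guess = categorize(model, ticket)
print(first_guess)  # billing

# A human decides this is really a support ticket and adds one line that
# captures why: in this made-up data set, "app" signals a support issue.
corrected_model = load_model(MODEL_LINES + ["support,app,1.0"])
second_guess = categorize(corrected_model, ticket)
print(second_guess)  # support
```

Each correction is one more line in the file. That is easy to reason about, and it is also why a model like this stays simple for a long time.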
The time it takes to train a model until it’s in a mature state (and the definition of mature) is different for most use cases. Maybe the model was 60% accurate on day one and becomes 80% accurate within a few days. Maybe we consider 80% accurate mature for a use case. Or maybe there are medical implications that mean we need an image analysis tool to be 99.999% accurate. Either way, we can quickly see machine learning augmenting the ability of a human by enabling us to make choices faster, provided we’re still hitting a button before committing to a given result, and the possible outcomes of that result.
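One way to write down "mature" is as an accuracy threshold per use case, checked against human-verified labels. The sketch below assumes that definition. The function names, the sample labels, and the two modes ("suggest" and "manual") are hypothetical, and in both modes a human still makes the final call.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the human-verified labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

def is_mature(predictions, labels, required_accuracy):
    """A model is 'mature' for a use case once it meets that use case's bar."""
    return accuracy(predictions, labels) >= required_accuracy

def review_mode(model_is_mature):
    """Mature models pre-fill a suggestion for one-click approval;
    immature ones leave the human to categorize unaided."""
    return "suggest" if model_is_mature else "manual"

# Ten hypothetical predictions, eight of which match what a human chose.
labels      = ["a", "b", "a", "c", "b", "a", "c", "c", "b", "a"]
predictions = ["a", "b", "a", "c", "b", "a", "c", "c", "a", "b"]

score = accuracy(predictions, labels)
internal_mode = review_mode(is_mature(predictions, labels, 0.80))
medical_mode = review_mode(is_mature(predictions, labels, 0.99999))

print(score)          # 0.8
print(internal_mode)  # suggest
print(medical_mode)   # manual
```

The same model is good enough for one use case and nowhere near good enough for another. Only the threshold changed.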
Now, let’s take a few examples of some areas where machine learning can be useful to:
Predict how news about a company impacts its stock price using a sentiment analysis tool (e.g. gradient boosting)
Identify suspicious network traffic (anomaly detection)
Forecast purchases in a supply chain using linear regression analysis (or a logarithmic or polynomial fit) and bolt on additional variables to get to multiple regression
Detect steganography using a Support Vector Machine (SVM)
Use Convolutional Neural Networks (CNN) to perform facial recognition to open and close doors
Detect credit card fraud using K-means clustering or a K-Nearest Neighbors (KNN) classifier
Block spam at the organization level with a Naive Bayes algorithm
Predict marketing trends using a Gaussian Mixture Model (GMM)
Determine an insurance rate using Principal Component Analysis (PCA)
A simple model might be able to augment our human ability to make decisions for each of these. But now let’s consider the potential threat that an overly-simplistic approach to machine learning can produce. Let’s say that a competitor, attacker, or even nation state realizes we are using these machine learning models to make decisions and takes an action to see what our results are. Maybe that’s putting out a bogus press release with a lot of negative sentiment about a company, attempting to drive down the stock price. Maybe that’s blocking an email from a supplier by tricking a spam filter into blocking emails with certain words in them. Maybe that’s reverse engineering the supply chain purchases a company makes to cause them to over (or under) purchase in a given area when prices are at a peak or valley.
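The spam filter scenario is easy to sketch. Below is a tiny Naive Bayes filter written from scratch with add-one smoothing, not any particular product's implementation. The training messages are made up, and the sketch assumes the attacker can get messages labeled as spam, for example by sending junk that users dutifully report.

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def predict(self, text):
        # Assumes at least one training message exists for each label.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.LABELS:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

spam_filter = TinyNaiveBayes()
for msg in ("invoice for your order attached",
            "shipment schedule for next week",
            "meeting notes attached"):
    spam_filter.train(msg, "ham")
for msg in ("win a free prize now",
            "free money click now",
            "claim your free prize"):
    spam_filter.train(msg, "spam")

supplier_email = "invoice and shipment schedule attached"
before = spam_filter.predict(supplier_email)
print(before)  # ham

# The attacker sends junk built from the supplier's vocabulary, and each
# copy ends up reported (and so trained) as spam.
for _ in range(5):
    spam_filter.train("invoice shipment schedule attached", "spam")

after = spam_filter.predict(supplier_email)
print(after)  # spam
```

Five reported messages were enough to flip the result here because the training set is tiny. A filter trained on years of mail is harder to move, which is the argument for patience.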
The more simplistic our algorithms and models, the easier it is to reverse engineer our actions based on given data sets. This danger is amplified if we sell access to the data we make decisions based on. Security through obscurity has long been considered in poor taste - and yet the more data points we make available the easier it is to reverse engineer other data points, our decision trees, and ultimately our actions. Now if we attempt to short a stock based on an algorithm, we may find others attempting to profit by outmaneuvering us, because they can build models that predict our behavior accurately from a limited data set.
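To see how little effort that reverse engineering can take, consider a sketch in which our forecast is a plain linear model and an outsider can submit inputs and observe outputs. The weights, the intercept, and the function names below are invented for illustration. Real models and real access are rarely this clean, but the principle holds.

```python
def make_black_box(weights, intercept):
    """Return a predict() function that hides its parameters from the caller."""
    def predict(features):
        return intercept + sum(w * x for w, x in zip(weights, features))
    return predict

def extract_linear_model(predict, n_features):
    """Recover a linear model's parameters using n_features + 1 queries."""
    # An all-zero input reveals the intercept.
    baseline = predict([0.0] * n_features)
    weights = []
    for i in range(n_features):
        # Turning on one feature at a time reveals that feature's weight.
        probe = [0.0] * n_features
        probe[i] = 1.0
        weights.append(predict(probe) - baseline)
    return weights, baseline

# Our "secret" purchasing model: three features and an intercept.
victim = make_black_box(weights=[2.0, -0.5, 4.0], intercept=10.0)

stolen_weights, stolen_intercept = extract_linear_model(victim, n_features=3)
print(stolen_weights)    # [2.0, -0.5, 4.0]
print(stolen_intercept)  # 10.0
```

Four queries recover the whole model. An ensemble of models trained on enriched, private data does not yield to this kind of probing nearly as easily.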
Now, let’s say we train our models internally for a year and get to 90% accuracy. Two years to get to 94%. Three years to get to 95%. Five years to get to 97%. And along the way we’ve enriched our data sets and likely added a custom algorithm that leverages multiple models to derive the same results. And we’ve gone from importing all of a tool like TensorFlow to bringing in much more custom (and so performant) tooling. We’re efficient, likely moving further into deep learning, and we have made it far more difficult for others to predict what we will do with a level of accuracy that is weaponizable against us.
Many organizations incur a fair amount of cost in anonymizing and preparing data for their own use in machine learning, and some choose to pass that cost on by making that data available to other organizations for a fee. This isn’t an argument not to sell data. There are scenarios where making our data sets available makes the whole world better. But before we do so, we should understand the full impact in our risk management. Not only do we need to make sure we’re making data available in a compliant fashion - but we also need to take into account how the data helps others derive algorithms and models, which not only reduces our competitive advantages but can also become a threat.
Imagine ordering 10,000 times the number of widgets we need, recommending offensive content to people, buying thousands of shares of a stock that’s about to tank, giving high-risk loans, becoming the underwriter for a block of people about to file tons of insurance claims, or any of the examples from the list of great use cases for machine learning earlier in the article. To be competitive we need hyper-automation based on machine learning. We just need to control our expectations for how quickly to expect results.
Machine learning is an important part of computer science’s quest to augment the human intellect. We didn’t learn calculus in the first grade. And we can’t expect machines to learn too quickly either. It’s better to be patient, train models, be deliberate about where and when we implement automation, and define when a model is accurate enough to predict an outcome given the risk threshold involved. That often means setting those expectations with those in the business so we have plenty of time (although preferably in a way they can see quantifiable outcomes at various milestones along our journey).
Computers may have helped our decision making - but they certainly haven’t made us more patient.