The Role of Machine Learning in Data Management
The proliferation of modern applications built on Hadoop and NoSQL creates new operational challenges for IT teams around security, compliance and workflow management, raising barriers to broader adoption of these platforms. Unprecedented data volumes and the complexity of managing data across multi-cloud infrastructure only exacerbate the problem. Fortunately, recent developments in machine-learning-based data management tools are helping organizations address these challenges.
The Big Data Management Challenge
Big Data platforms such as Hadoop and NoSQL databases started life as innovative open source projects, and are now gradually moving from niche research-focused pockets within enterprises to occupying the center stage in modern data centers.
These Big Data platforms are complex distributed beasts with many moving parts that can be scaled independently, and they can support extremely high data throughput as well as a high degree of workload concurrency; they closely match the evolving needs of enterprises in today's Big Data world.
But because these platforms are evolving, they don’t have the same level of policy rigor that’s taken for granted in traditional record-of-truth platforms such as Relational Database Management Systems (RDBMSs), email servers and data warehouses.
The sheer volume and variety of today's Big Data lend themselves to a machine-learning-based approach, which can relieve a growing burden on IT teams that will otherwise soon become unsustainable. That burden carries a number of risks to the enterprise that may undermine the value of adopting newer platforms such as NoSQL and Hadoop, which is why I believe machine learning can help IT teams undertaking the challenges of data management. Next, let's look at these key operational challenges in more detail.
Security, Auditing and Compliance
From a security and auditing perspective, the enterprise readiness of these systems is still rapidly evolving as they adapt to growing demands for strict, granular data access control, authentication and authorization. This presents a series of challenges.
Firstly, Kerberos, Apache Ranger and Apache Sentry represent several of the tools enterprises use to secure their Hadoop and NoSQL databases, but often these are perceived as complex to implement and manage, and disruptive in nature. This may simply be a function of product maturity and/or the underlying complexity of the problem they are trying to address, but the perception remains nonetheless.
Secondly, identifying and protecting critical Personally Identifiable Information (PII) from leaking is a challenge as the ecosystem required to manage PII on Big Data platforms hasn’t matured yet to the stage where it would gain full compliance confidence.
Finally, Big Data DevOps groups typically struggle with managing the sheer number of workloads running on their systems. These could be Extract, Transform and Load (ETL) processes, backup jobs, model computations, recommendation engines, and other analytics workflows.
Then, there's the challenge of calculating the best times to run jobs such as backups or test/dev in order to ensure business-mandated Recovery Point Objectives (RPOs) are being met. This can be an extremely difficult exercise given the chaotic nature and number of varied workloads running at any given time.
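Before any machine learning is involved, the scheduling problem can be stated simply: find the window where the system is quietest. The sketch below is a toy heuristic over hypothetical hourly load metrics, not the article's ML-based method; the metric values and the four-hour backup window are illustrative assumptions.

```python
# Sketch: pick the quietest hour to start a backup job, given historical
# hourly load averages. Toy heuristic with made-up numbers, not a real
# scheduler -- a baseline against which an ML-driven approach would compete.

def best_backup_window(hourly_load, window_hours):
    """Return the starting hour whose window has the lowest total load.

    hourly_load: list of 24 average load samples, index = hour of day.
    """
    best_start, best_total = 0, float("inf")
    for start in range(24):
        # Wrap around midnight so late-night windows are considered too.
        total = sum(hourly_load[(start + h) % 24] for h in range(window_hours))
        if total < best_total:
            best_start, best_total = start, total
    return best_start

# Example: synthetic load profile peaking during business hours.
load = [5, 4, 3, 2, 2, 3, 10, 30, 60, 80, 85, 90,
        88, 86, 80, 75, 70, 60, 40, 25, 15, 10, 8, 6]
print(best_backup_window(load, 4))  # -> 2 (quietest 4-hour stretch)
```

A real system would feed metrics like these into a model rather than a static rule, precisely because workloads shift in ways a fixed heuristic cannot anticipate.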
Invariably, developers and data scientists tend to make ad-hoc copies of data for their individual needs, often unmindful of the critical PII exposed in the process. To mitigate this, organizations may bar anyone from making copies of production data, forcing developers and data scientists to rely on synthetically generated data. This results in poorer-quality tests and models, since synthetic data usually isn't representative of production data.
Similarly, rule-based systems can only go so far in alleviating some of these problems because it isn’t possible to encode everything in rules in a highly dynamic environment. Instead, intelligent machine learning driven approaches must supplant humans and rule-based systems for automating many of the data management tasks in the new world of big data.
Possible Applications of Machine Learning in Data Management
For CIOs and CISOs worried about security, compliance and scheduling SLAs, it's critical to realize that with ever-increasing volumes and varieties of data, it's not humanly possible for an administrator, or even a team of administrators and data scientists, to solve these challenges. Fortunately, machine learning can help.
A variety of machine learning and deep learning techniques may be employed to accomplish this. Broadly speaking, machine/deep learning techniques may be classified as either unsupervised learning, supervised learning, or reinforcement learning:
- Supervised learning involves learning from data that is already "labeled," i.e., the classification or "outcome" for each data point is known in advance.
- Conversely, unsupervised learning, such as k-means clustering, is used when the data is “unlabeled,” which is another way of saying that the data is unclassified.
- Reinforcement learning relies on a set of rules or constraints defined for a system to determine the best strategy to attain an objective.
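The supervised/unsupervised distinction above can be made concrete in a few lines. This is a minimal sketch using scikit-learn (an assumption; the article names no specific library) on synthetic "system metric" samples:

```python
# Sketch: supervised vs. unsupervised learning on synthetic system metrics
# (e.g. CPU% and IO rate). scikit-learn and the data are illustrative
# assumptions, not the article's implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Two synthetic clusters of metric samples: quiet and busy periods.
low = rng.normal(loc=[20, 10], scale=2, size=(50, 2))
high = rng.normal(loc=[80, 90], scale=2, size=(50, 2))
X = np.vstack([low, high])

# Supervised: the label for each sample ("quiet" = 0, "busy" = 1) is known.
y = np.array([0] * 50 + [1] * 50)
clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict([[22, 11]]))  # -> [0]

# Unsupervised: no labels -- k-means recovers the two groups on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(len(set(km.labels_)))     # -> 2
```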
The choice of technique will be driven by the problem being solved. For example, a supervised learning mechanism such as random forest may be used to establish a baseline of what constitutes "normal" behavior for a system by monitoring relevant attributes; the baseline can then be used to detect anomalies that stray from it. Such a system could be used to detect security threats. This is especially relevant for identifying ransomware attacks that are slow-evolving in nature and don't encrypt data all at once, but rather gradually over time. Random forest (as well as gradient-boosted tree) techniques could also be used to solve the aforementioned workflow-scheduling problem by modeling system load and resource-availability metrics as training attributes, then using that model to determine the best times to run certain jobs.
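The baseline-then-detect pattern can be sketched briefly. The article describes a supervised random-forest baseline; the stand-in below instead uses scikit-learn's IsolationForest, which learns "normal" from unlabeled samples — a deliberate swap for brevity, not the article's exact method, and the metrics are invented:

```python
# Sketch of baseline-then-detect anomaly monitoring. IsolationForest is a
# stand-in for the supervised random-forest baseline the article describes;
# the "file-modification rate / IO rate" metrics are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Baseline period: per-hour file-modification counts and IO rates under
# normal operation.
normal = rng.normal(loc=[100, 50], scale=5, size=(200, 2))
detector = IsolationForest(random_state=1).fit(normal)

# A slow-evolving ransomware pattern shows up as a creeping modification rate.
suspect = np.array([[101.0, 51.0], [300.0, 52.0]])
print(detector.predict(suspect))  # 1 = fits the baseline, -1 = anomaly
```

The point is the shape of the pipeline — learn the baseline once, then score live metrics against it — rather than the particular estimator.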
However, oftentimes the initial training data used in model creation will be unlabeled, thus rendering supervised learning techniques useless. While unsupervised learning may seem like a natural fit, an alternative approach that could result in more accurate models involves a pre-processing step to assign labels to unlabeled data in a way that makes it usable for supervised learning.
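That pre-processing step can be sketched as pseudo-labeling: cluster the unlabeled data, then treat the cluster assignments as labels for a supervised model. This is a minimal illustration under invented data; real cluster ids would need to be mapped to meaningful outcomes before use:

```python
# Sketch of the pre-processing step above: assign pseudo-labels to unlabeled
# data via clustering, then train a supervised model on those labels.
# Purely illustrative -- cluster ids are stand-ins for real outcome labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([10, 10], 1, (40, 2)),
               rng.normal([50, 50], 1, (40, 2))])  # unlabeled samples

# Step 1: unsupervised clustering produces pseudo-labels.
pseudo = KMeans(n_clusters=2, n_init=10, random_state=2).fit_predict(X)

# Step 2: a supervised model trains on the pseudo-labeled data.
clf = RandomForestClassifier(random_state=2).fit(X, pseudo)
print(clf.predict([[11, 9]]))  # same label k-means gave the first cluster
```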
Another promising area of research is using deep learning to identify, tag and mask PII data. While regular expressions and static rules may be used for this purpose, deep learning allows a system to learn the specific formats (even custom PII types) used in an organization. Convolutional Neural Networks (CNNs) have been used successfully for image recognition, so exploring their use for PII compliance is another intriguing possibility.
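For contrast, here is the static-rule approach the paragraph mentions, reduced to a single sketch: one regular expression masking anything shaped like a US Social Security number. The pattern is illustrative; real PII detection needs far more than one rule, which is exactly why learned detectors are attractive:

```python
# Sketch of regex-based PII masking -- the static-rule baseline that deep
# learning approaches aim to improve on. One illustrative pattern only.
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_pii(text):
    """Replace anything shaped like a US SSN with a fixed mask."""
    return SSN.sub("XXX-XX-XXXX", text)

print(mask_pii("Customer 123-45-6789 called."))
# -> Customer XXX-XX-XXXX called.
```

A rule like this misses every format it wasn't written for — reformatted numbers, custom employee ids, organization-specific PII — which is the gap a learned model could fill.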
Big Data represents an enormous opportunity for organizations to become more agile, reduce cost, and ensure compliance, but only if they are able to successfully deploy and scale their big data platforms. Machine learning represents an exciting new technology that is poised to play a key role in helping organizations address these data management challenges.
Feature image via Pixabay.