MC2: Secure Collaborative Analytics for Machine Learning
Machine Learning (ML) has gained prominence in recent years because of its ability to be applied across scores of industries and solve complex problems effectively. Yet, research shows that nearly 90% of AI/ML models never actually make it into production or hit the market. The main challenge is that ML/AI models require huge volumes of high-quality, accurate, and timely data to be effective, but organizations have long been reluctant to share sensitive information due to security and privacy concerns.
As personal data becomes more pervasive, privacy concerns grow. Global data protection laws have become stricter in response, and organizations face increasing noncompliance risk. Mitigating these concerns and taking AI/ML to the next level requires a new approach to collaboration: secure collaborative learning.
Secure collaborative learning enables multiple parties to jointly build robust ML models without openly sharing sensitive data with each other. With this technology, banks can use these models to detect financial crime and money laundering. Health care organizations can improve clinical insights from multiple patient datasets without exposing sensitive information, and mobile network operators can predict fluctuations in call rates by collectively analyzing their traffic data.
After years of extensive research on this paradigm at the University of California, Berkeley RISELab, co-creators Raluca Ada Popa and Rishabh Poddar developed the MC2 open source platform to address this key multiparty collaboration challenge.
MC2 (Multiparty Collaboration and Coopetition) enables rich analytics and machine learning on encrypted data, ensuring that data remains concealed even while it is being processed. Because jobs run inside secure enclaves, which act as a temporary “black box,” the data stays hidden even from the server running the job. It may sound contradictory, but it’s true: multiple data owners can jointly run analytics or train ML models on their collective data without revealing that data to anyone else. This alleviates concerns about offloading confidential workloads to untrusted third parties or cloud providers. MC2 resolves the tension between expanding cloud adoption, the need for data sharing, and the increasing concern over data privacy.
The rest of this article will detail the key technical aspects of this popular open source project charting the path towards secure, collaborative ML and AI.
A Software Stack That Powers Secure Enclaves
Secure enclaves enable the creation of a trusted execution environment (TEE), a domain where multiple parties can collaborate on confidential data, within an otherwise untrusted machine. Prior approaches dump data into a TEE and provide access to those who need it to collaborate, but that opens doors to hidden risks and third-party leakage that businesses simply can’t afford in this regulatory climate.
Each secure enclave has access to only a restricted portion of the system’s memory, and any data or software placed within the enclave is encrypted and isolated from the rest of the system. This creates an additional layer of security that protects against intrusion, even from the system itself. Taking this even further, secure enclaves support remote attestation, which enables users to cryptographically verify that an enclave is running trusted, unmodified code.
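The final step of an attestation check can be illustrated with a simplified sketch. On real enclave hardware, the code measurement arrives inside a report signed by the processor, and the client verifies that signature first. The sketch below skips signature verification entirely and shows only the comparison of a reported measurement against the value the client expects. The names `measure` and `verify_measurement` are hypothetical and are not part of MC2 or any enclave SDK.

```python
import hashlib
import hmac


def measure(code: bytes) -> str:
    """Return a SHA-256 digest standing in for an enclave code measurement."""
    return hashlib.sha256(code).hexdigest()


def verify_measurement(reported: str, expected: str) -> bool:
    """Compare two measurements in constant time."""
    return hmac.compare_digest(reported, expected)


# The client derives the expected measurement from code it has audited.
trusted_code = b"def train(data): ..."  # placeholder for the audited code
expected = measure(trusted_code)

# An unmodified enclave reports the same measurement; a tampered one does not.
print(verify_measurement(measure(trusted_code), expected))
print(verify_measurement(measure(trusted_code + b"# backdoor"), expected))
```

Any change to the loaded code, even a single byte, yields a different digest, which is why a matching measurement is taken as evidence that the enclave runs unmodified code.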
MC2 seamlessly runs popular analytics and machine learning frameworks (Apache Spark, XGBoost, etc.) within enclaves securely and efficiently, abstracting away the complexities of writing enclave code from the end user. Additionally, MC2 handles partitioning so that only the components that need to compute directly on the sensitive data are automatically loaded into the enclave.
Lastly, MC2 fortifies those enclave components using cryptographic techniques in two ways:
- MC2 has built-in measures that verify the integrity of jobs that require distributed execution.
- Because secure enclaves remain exposed to side-channel leakage that developers would otherwise have to monitor for and handle themselves, MC2 uses data-oblivious techniques in enclave code so that memory access patterns reveal no side-channel information.
MC2 thus combines hardware and software data protection. This dual security reduces the risk of side-channel attacks, a key enclave vulnerability.
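To make the idea of data-oblivious code concrete, here is a minimal sketch of an oblivious array lookup. It is not taken from MC2; the function name `oblivious_lookup` is hypothetical. The point is that the code touches every element in the same order regardless of which index is secret, so an observer watching memory accesses learns nothing about that index. Python gives no real timing or access-pattern guarantees, so production enclave code would implement this in a lower-level language; the sketch only shows the access pattern.

```python
from typing import Sequence


def oblivious_lookup(values: Sequence[int], secret_index: int) -> int:
    """Return values[secret_index] while reading every element exactly once."""
    result = 0
    for i, value in enumerate(values):
        # mask is 1 at the secret index and 0 elsewhere. Every element is
        # read and multiplied, so the sequence of accesses is always the same.
        mask = int(i == secret_index)
        result += mask * value
    return result


print(oblivious_lookup([10, 20, 30, 40], 2))
```

The trade-off is cost: a direct lookup is a single read, while the oblivious version scans the whole array, which is why such techniques are usually applied only to the components that handle sensitive data.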
MC2 in Practice
At the outset of the collaboration, each institution prepares the script that will run the computation. The script is the same for each organization and is agreed upon ahead of time.
As encrypted data is uploaded to the server, MC2 receives numerous local updates. The program trains a decision tree model on the encrypted data, and that model is later used to make predictions. By aggregating the local updates, MC2 produces a final model based on the analysis of encrypted data collected from each party.
Once the model is finalized, each organization downloads the results. This global model is what provides analytical insight. Even at this stage, no party can see the data from other organizations. Each has access only to the collective analysis, which it can then apply to its own private data set.
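The workflow above can be sketched in plain Python to show its two checkpoints: confirming that every party submitted the same script, and combining the local updates into one result. This is an illustration under stated assumptions, not MC2 code. In MC2 these steps run inside enclaves on encrypted data, whereas the sketch operates on plaintext and omits encryption altogether. The function names, the party names, and the representation of a local update as a list of numbers are all hypothetical.

```python
import hashlib
from typing import Dict, List


def script_digest(script_text: str) -> str:
    """Hash the computation script so parties can compare it cheaply."""
    return hashlib.sha256(script_text.encode("utf-8")).hexdigest()


def parties_agree(scripts: Dict[str, str]) -> bool:
    """True only if every party submitted a byte-identical script."""
    return len({script_digest(s) for s in scripts.values()}) == 1


def aggregate_updates(updates: List[List[float]]) -> List[float]:
    """Sum equal-length local updates element by element."""
    if not updates:
        raise ValueError("no updates to aggregate")
    if len({len(u) for u in updates}) != 1:
        raise ValueError("updates must all have the same length")
    return [sum(column) for column in zip(*updates)]


# Step 1: every party prepares the same, pre-agreed script.
script = "train_decision_tree(max_depth=4)"  # placeholder script text
scripts = {"bank_a": script, "bank_b": script, "bank_c": script}
print(parties_agree(scripts))

# Step 2: local updates (assumed here to be per-bin statistics) are combined.
local_updates = [[1.0, 2.0, 0.5], [0.0, 1.0, 1.5], [2.0, 0.0, 1.0]]
print(aggregate_updates(local_updates))
```

Each party sees only the aggregated result, never another party's individual update, which mirrors the guarantee described above.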
It may seem simple in practice, and that’s because it is! MC2 makes multiparty collaboration on encrypted data possible for anyone.
The Next Wave of Confidential Analytics
Yes, personal data is only becoming more pervasive, privacy concerns are growing daily, and data protection laws are becoming stricter in response. Yet, at the same time, organizations are realizing the enormous benefits of being able to share their data with each other: banks can collaborate to detect financial crime, health institutions can collaborate on medical studies, and so on.
More than $300 billion of the world’s most valuable data remains untapped because there is no secure processing environment in which ML can be applied to it. Gartner predicts that by 2025, over 50% of organizations will adopt privacy-enhancing computation to process sensitive data and conduct multiparty analytics, underscoring the importance of safe access to encrypted data.
The confidential computing space will not lose momentum any time soon, and now is the time for enterprises to adopt this technology. With the amount of sensitive data increasing every day, the need for the MC2 platform has never been greater. Confidential computing and analytics on encrypted data will soon become a must for all industries looking to collaborate on sensitive information.