The Architect’s Guide to Storage for AI
Choosing the best storage for all phases of a machine learning (ML) project is critical. Research engineers need to create multiple versions of datasets and experiment with different model architectures. When a model is promoted to production, it must operate efficiently when making predictions on new data. A well-trained model running in production is what adds AI to an application, so this is the ultimate goal.
As an AI/ML architect, this has been my life for the past few years. I wanted to share what I’ve learned about the project requirements for training and serving models and the options available. Consider this a survey. I will cover everything from traditional file system variants to modern cloud native object stores designed for the performance-at-scale requirements of large language models (LLMs) and other generative AI systems.
Once we understand requirements and storage options, I will review each option to see how it stacks up against our requirements.
Before we dive into requirements, reviewing what is happening in the software industry today will be beneficial. As you will see, the evolution of machine learning and artificial intelligence is driving these requirements.
The Current State of ML and AI
Large language models (LLMs) that can chat with near-human precision are dominating the media. These models require massive amounts of data to train. There are many other exciting advances concerning generative AI, for example, text-to-image and sound generation. These also require a lot of data.
It is not just about LLMs. Other model types exist that solve basic lines of business problems. Regression, classification and multilabel are model types that are nongenerative but add real value to an enterprise. More and more organizations are looking to these types of models to solve a variety of problems.
Another phenomenon is that an increasing number of enterprises are becoming SaaS vendors that offer model-training services using a customer’s private data. Consider an LLM that an engineer trained on data from the internet and several thousand books to answer questions, much like ChatGPT. This LLM would be a generalist capable of answering basic questions on various topics.
However, it might not provide a helpful, detailed and complete answer if a user asks a question requiring advanced knowledge of a specific industry like health care, financial services or professional services. It is possible to fine-tune a trained model with additional data.
So an LLM trained as a generalist can be further trained with industry-specific data. The model would then provide better answers to questions about the specified industry. Fine-tuning is especially beneficial when done with LLMs, as their initial training can cost millions, and the fine-tuning cost is much cheaper.
Regardless of the model you are building, once it is in production, it has to be available, scalable and resilient, just like any other service you deploy as a part of your application. If you are a business offering a SaaS solution, you will have additional security requirements. You will need to prevent direct access to any models that represent your competitive advantage, and you will need to secure your customers’ data.
Let’s look at these requirements in more detail.
Machine Learning Storage Requirements
The storage requirements listed below are from the lens of a technical decision-maker assessing the viability of any piece of technology. Specifically, technologies used in a software solution must be scalable, available, secure, performant, resilient and simple. Let’s see what each requirement means to a machine learning project.
Scalable: Scalability in a storage solution refers to its ability to handle an increasing amount of storage without requiring significant changes. In other words, scalable storage can continue to function optimally as capacity and throughput requirements increase. Consider an organization starting its ML/AI journey with a single project. This project by itself may not have large storage requirements, but soon other teams will create their own initiatives. Each of these teams may have small storage requirements individually, yet collectively they can place a considerable demand on a central storage solution. A scalable storage solution should scale its resources (either out or up) to handle the additional capacity and throughput needed as new teams onboard their data.
Available: Availability is a property that refers to an operational system’s ability to carry out a task. Operations personnel often measure availability for an entire system over time. For example, the system was available for 99.999% of the month. Availability can also refer to the wait time an individual request experiences before a resource can start processing it. Excessive wait times render a system unavailable.
Regardless of the definition, availability is essential for model training and storage. Model training should not experience delays due to lack of availability in a storage solution. A model in production should be available for 99.999% of the month. Requests for data or the model itself, which may be large, should experience low wait times.
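To make the “five nines” target concrete, a short calculation (assuming a 30-day month for illustration) shows how small the downtime budget really is:

```python
def allowed_downtime_seconds(availability: float, period_seconds: float) -> float:
    """Return the downtime budget implied by an availability target."""
    return (1.0 - availability) * period_seconds

MONTH = 30 * 24 * 3600  # seconds in a 30-day month

# 99.999% availability leaves less than half a minute of downtime per month.
print(round(allowed_downtime_seconds(0.99999, MONTH), 2))  # → 25.92
```

In other words, a storage system promising five nines can be unavailable for roughly 26 seconds a month, which is why redundancy has to be designed in rather than bolted on.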
Secure: Before all read or write operations, a storage system should know who you are and what you can do. In other words, storage access needs to be authenticated and authorized. Data should also be secure at rest and provide options for encryption. The hypothetical SaaS vendor mentioned in the previous section must pay close attention to security as they provide multitenancy to their customers. The ability to lock data, version data and specify retention policy are also considerations that are part of the security requirement.
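To make the authenticate-then-authorize gate concrete, here is a toy sketch of the check a storage system performs before every operation. All names here (the `POLICIES` table, the `authorize` helper) are hypothetical illustrations, not any real storage API:

```python
# A toy access-control table: principal -> bucket -> permitted actions.
# Hypothetical names for illustration only, not a real storage API.
POLICIES = {
    "alice": {"datasets": {"read", "write"}, "models": {"read"}},
    "bob":   {"datasets": {"read"}},
}

def authorize(user: str, bucket: str, action: str) -> bool:
    """Authenticate the principal, then check the action against policy."""
    if user not in POLICIES:  # authentication: reject unknown principals
        return False
    return action in POLICIES[user].get(bucket, set())  # authorization

assert authorize("alice", "models", "read")
assert not authorize("bob", "datasets", "write")     # bob is read-only
assert not authorize("mallory", "datasets", "read")  # unknown user
```

Real systems layer encryption in transit and at rest, object locking and retention policies on top of this basic gate, but every request still passes through the same two questions: who are you, and what may you do?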
Performant: A performant storage solution is optimized for high throughput and low latency. Performance is crucial during model training because higher performance means experiments complete faster, and the more experiments an ML engineer can run, the more accurate the final model is likely to be. If a neural network is used, it will take many experiments to determine the optimal architecture, and hyperparameter tuning requires even further experimentation. Organizations using GPUs must take care to prevent storage from becoming the bottleneck: if a storage solution cannot deliver data at a rate equal to or greater than a GPU’s processing rate, the system will waste precious GPU cycles.
Resilient: A resilient storage solution should not have a single point of failure. A resilient system tries to prevent failure, but when failures occur, it can gracefully recover. Such a solution should be able to participate in failover exercises, where the loss of an entire data center is simulated to test the resiliency of a whole application.
Models running in a production environment require resiliency. However, resiliency can also add value to model training. Suppose an ML team uses distributed training techniques that use a cluster. In that case, the storage that serves this cluster, as well as the cluster itself, should be fault tolerant, preventing the team from losing hours or days due to failures.
Simple: Engineers often use the words “simple” and “beautiful” interchangeably, and there is a reason for this: when a software design is simple, it is well thought out. Simple designs fit many different scenarios and solve many problems. A storage system for ML should be simple, especially in the proof of concept (PoC) phase of a new ML project, when researchers need to focus on feature engineering, model architectures and hyperparameter tuning while improving a model until it is accurate enough to add value to the business.
The Storage Landscape
There are several storage options for machine learning and serving. Today, these options fall into the following categories: local file storage, network-attached storage (NAS), storage-area networks (SAN), distributed file systems (DFS) and object storage. In this section, I’ll discuss each and compare them to our requirements. The goal is to find an option that measures up the best across all requirements.
Local file storage: The file system on a researcher’s workstation and the file system on a server dedicated to model serving are examples of local file systems used for ML storage. The underlying device for local storage is typically a solid-state drive (SSD), but it could also be a more advanced nonvolatile memory express drive (NVMe). In both scenarios, compute and storage are on the same system.
This is the simplest option. It is also a common choice during the PoC phase, where a small R&D team attempts to get enough performance out of a model to justify further expenses. While common, there are drawbacks to this approach.
Local file systems have limited storage capacity and are unsuitable for larger datasets. Since there is no replication or autoscaling, a local file system cannot operate in an available, reliable and scalable fashion. They are as secure as the system they are on. Once a model is in production, there are better options than a local file system for model serving.
Network-attached storage (NAS): A NAS is a storage device attached to a TCP/IP network with its own IP address, much like a computer. The underlying storage is typically a RAID array of drives, and files are delivered to clients over TCP. These devices are often sold as appliances: the compute needed to manage the data and the RAID array are packaged into a single device.
NAS devices can be secured, and the RAID configuration of the underlying storage provides some availability and reliability. NAS devices serve files using protocols such as Server Message Block (SMB) and Network File System (NFS), which run on top of TCP.
NAS devices run into scaling problems when there are a large number of files. This is due to the hierarchy and pathing of their underlying storage structure, which maxes out at millions of files. This is a problem with all file-based solutions. Maximum storage for a NAS is on the order of tens of terabytes.
Storage-area network (SAN): A SAN combines servers and RAID storage on a high-speed interconnect. With a SAN, you can put storage traffic on a dedicated Fibre Channel network using the Fibre Channel Protocol (FCP). A request for a file operation may arrive at a SAN via TCP, but all data transfer occurs over a network dedicated to delivering data efficiently. If a dedicated Fibre Channel network is unavailable, a SAN can use Internet Small Computer System Interface (iSCSI), which carries storage traffic over TCP.
A SAN is more complicated to set up than a NAS device since it is a network and not a device. You need a separate dedicated network to get the best performance out of a SAN. Consequently, a SAN is costly and requires considerable effort to administer.
While a SAN may look compelling when compared to a NAS (improved performance and similar levels of security, availability and reliability), it is still a file-based approach with all the problems previously described. The improved performance does not make up for the extra complexity and cost. Total storage maxes out around hundreds of petabytes.
Distributed file system: A distributed file system (DFS) is a file system that spans multiple computers or servers and enables data to be stored and accessed in a distributed manner. Instead of a single centralized system, a distributed file system distributes data across multiple servers or containers, allowing users to access and modify files as if they were on a single, centralized file system.
Some popular examples of distributed file systems include Hadoop Distributed File System (HDFS), Google File System (GFS), Amazon Elastic File System (EFS) and Azure Files.
Files can be secured, as in the file-based solutions above, since the operating system is presented with an interface that looks like a traditional file system. Distributed file systems run in a cluster, which provides reliability. Running in a cluster may yield better throughput than a SAN; however, distributed file systems still run into scaling problems when there are a large number of files (as do all file-based solutions).
Object storage: Object storage has been around for quite some time but was revolutionized when Amazon made it the first AWS service in 2006 with Simple Storage Service (S3). Modern object storage is cloud native, and other clouds soon brought their offerings to market: Microsoft offers Azure Blob Storage, and Google has its Google Cloud Storage service. The S3 API is the de facto standard for interacting with object storage, and multiple companies offer S3-compatible storage for public cloud, private cloud, edge and colocated environments. Regardless of where an object store is located, it is accessed via a RESTful interface.
The most significant difference with object storage compared to the other storage options is that data is stored in a flat structure. Buckets are used to create logical groupings of objects. Using S3 as an example, a user would first create one or more buckets and then place their objects (files) in one of these buckets. A bucket cannot contain other buckets, and a file must exist in only one bucket. This may seem limiting, but objects have metadata, and using metadata, you can emulate the same level of organization that directories and subdirectories provide within a file system.
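To illustrate, a “directory listing” can be emulated over a flat key space by treating “/” in object keys as a delimiter, much as S3-style list operations group keys into common prefixes. The helper below is an illustrative sketch, not a real client call:

```python
def list_prefix(keys, prefix="", delimiter="/"):
    """Emulate a directory listing over a flat object-key space.

    Returns (objects, common_prefixes), mirroring how S3-style listing
    groups keys sharing the next delimiter into pseudo-folders.
    """
    objects, folders = [], set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            # Everything up to the next delimiter acts like a subdirectory.
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(objects), sorted(folders)

keys = ["train/images/0001.png", "train/labels.csv", "test/images/0001.png"]
print(list_prefix(keys, prefix="train/"))
# → (['train/labels.csv'], ['train/images/'])
```

The store itself never maintains a directory tree; the hierarchy is purely a naming convention applied at query time, which is what lets the underlying address space stay flat and fast.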
Object storage solutions also perform best when running as a distributed cluster. This provides them with reliability and availability.
Object stores differentiate themselves when it comes to scale. Due to the flat address space of the underlying storage (every object in only one bucket and no buckets within buckets), object stores can find an object among potentially billions of objects quickly. Additionally, object stores offer near-infinite scale to petabytes and beyond. This makes them perfect for storing datasets and managing large models.
Below is a storage scorecard showing solutions against requirements.
The Best Storage Option for AI
Ultimately, the choice of storage options will be informed by a mix of requirements, reality and necessity; however, for production environments, there is a strong case to be made for object storage.
The reasons are as follows:
- Performance at scale: Modern object stores are fast and remain fast even in the face of hundreds of petabytes and concurrent requests. You cannot achieve that with other options.
- Unstructured data: Many machine learning datasets are unstructured — audio, video and images. Even tabular ML datasets that could be stored in a database are more easily managed in an object store. For example, it is common for an engineer to treat the thousands or millions of rows that make up a training set as a single entity that can be stored and retrieved via a single simple request. The same is true for validation sets and test sets.
- RESTful APIs: RESTful APIs have become the de facto standard for communication between services. Consequently, proven messaging patterns exist for authentication, authorization, security in motion and notifications.
- Encryption: If your datasets contain personally identifiable information, your data must be encrypted while at rest.
- Cloud native (Kubernetes and containers): A solution that can run its services in containers that are managed by Kubernetes is portable across all the major public clouds. Many enterprises have internal Kubernetes clusters that could run a Kubernetes-native object storage deployment.
- Immutable: It’s important for experiments to be repeatable, and they’re not repeatable if the underlying data moves or is overwritten. In addition, protecting training sets and models from deletion, accidental or intentional, will be a core capability of an AI storage system when governments around the world start regulating AI.
- Erasure coding vs. RAID for data resiliency and availability: Erasure coding uses simple drives to provide the redundancy required for resilient storage. A RAID array (made up of a controller and multiple drives), on the other hand, is another device type that has to be deployed and managed. Erasure coding works on an object level, while RAID works on a block level. If a single object is corrupted, erasure coding can repair that object and return the system to a fully operational state quickly (as in minutes). RAID would need to rebuild the entire volume before any data can be read or written, and rebuilding can take hours or days, depending on the size of the drive.
- As many files as needed: Many large datasets used to train models are created from millions of small files. Imagine an organization with thousands of IoT devices, each taking a measurement every second. If each measurement is a file, then over time, the total number of files will be more than a file system can handle.
- Portable across environments: A software-defined object store can use a local file system, NAS, SAN or containers running with NVMe drives in a Kubernetes cluster as its underlying storage. Consequently, it is portable across different environments and provides access to underlying storage via the S3 API everywhere.
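The erasure-coding point above can be illustrated with the simplest possible scheme: a single XOR parity shard, which lets any one lost shard be rebuilt from the survivors. Real object stores use Reed-Solomon codes that tolerate multiple simultaneous losses; this is only a toy sketch of the idea, with hypothetical helper names:

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, n: int) -> list:
    """Split data into n data shards and append one XOR parity shard."""
    assert len(data) % n == 0, "pad the object to a multiple of n first"
    size = len(data) // n
    shards = [data[i * size:(i + 1) * size] for i in range(n)]
    shards.append(reduce(xor_bytes, shards))  # parity shard
    return shards

def repair(shards: list) -> list:
    """Rebuild a single missing shard (None) by XOR-ing the survivors."""
    lost = shards.index(None)
    survivors = [s for s in shards if s is not None]
    shards[lost] = reduce(xor_bytes, survivors)
    return shards

obj = b"training-set-bytes!!"        # 20 bytes, splits evenly into 4 shards
shards = encode(obj, 4)
shards[2] = None                     # simulate a corrupted drive/shard
repaired = repair(shards)
assert b"".join(repaired[:4]) == obj # object fully recovered
```

Because the repair operates per object, only the damaged object’s shards need to be read and rewritten, which is why recovery takes minutes rather than the hours or days a block-level RAID rebuild can take.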
MinIO for ML Training and Inference
MinIO has become a foundational component in the AI/ML stack for its performance, scale, performance at scale and simplicity. MinIO is ideally configured in a cluster of containers that use NVMe drives; however, you have options to use almost any storage configuration as requirements demand.
An advantage of implementing a software-defined, cloud native approach is that the code becomes portable. Training code and model-serving code do not need to change as an ML project matures from a proof of concept to a funded project, and finally to a model in production serving predictions in a cluster.
While portability and flexibility are important, they are meaningless if they come at the expense of performance. MinIO’s performance characteristics are well known and published as benchmarks.
Researchers starting a new project should install MinIO on their workstations; the getting-started guides in the MinIO documentation will help.
If you are responsible for the clusters that make up your development, testing and production environments, then consider adding MinIO.
Learn by Doing
Developers and data scientists increasingly control their own storage environments. The days of IT closely guarding storage access are gone. Developers naturally gravitate toward technologies that are software defined, open source, cloud native and simple. That essentially defines object storage as the solution.
If you are setting up a new workstation with your favorite ML tools, consider adding an object store to your toolkit by installing a local solution. Additionally, if your organization has formal environments for experimentation, development, testing and production, then add an object store to your experimental environment. This is a great way to introduce new technology to all developers in your organization. You can also run experiments using a real application running in this environment. If your experiments are successful, promote your application to dev, test and prod.