Understanding Software-Defined Storage, Part One
Storage systems are essential to the data center. Given the exponential rate of data growth, it is becoming increasingly challenging to scale the enterprise storage infrastructure in a cost-effective way.
Storage technology has advanced incrementally over the years. The early days of enterprise storage were mainly direct-attached storage (DAS) with host bus adapters (HBAs) and redundant arrays of independent disks (RAID). DAS then advanced through faster and more reliable interfaces and protocols such as advanced technology attachment (ATA), serial ATA (SATA), external SATA (eSATA), small computer system interface (SCSI), serial attached SCSI (SAS), and Fibre Channel.
However, with the dot-com boom and steep data growth in the mid-1990s, DAS was widely criticized as “an island of information.” Network File System (NFS), in particular NFSv3, and Common Internet File System (CIFS) started gaining traction around 1995, and the idea of centralized, shared, special-purpose storage systems became more and more popular. As a result, network-attached storage (NAS) and storage area network (SAN) systems became the de facto standard for any serious enterprise storage infrastructure in the following years. The adoption of SAN/NAS storage systems was also driven by the fact that the complexity and cost of such systems fell drastically in the early 2000s.
Today, adding a NAS/SAN system to a data center has become a plug-and-play task that needs only network design consideration; everything else can be done via a Web UI. These boxes offer almost all enterprise-friendly features, such as reliability, fault tolerance, high availability and clustering, deduplication, backup, low latency, high throughput and bandwidth, and QoS. There are even different boxes built around different enterprise themes and needs, such as a 2X faster box, a 4X capacity box or a 0.5X $ per GB box.
However, in recent times the growth of cloud and big data, with their petabyte and exabyte focus, has made enterprises rethink their storage infrastructure. The box-based approach is proving more and more expensive and hard to scale to suit these workloads. Vendor lock-in and the premiums paid for maintenance are also a cause for worry.
The Software Defined Storage Buzz
In oversimplified terms, software defined storage (SDS) is an approach to enterprise storage that pairs more intelligent, hardware-independent software with commodity hardware. It is too early to see it as a complete substitute for the box-based approach, but for now let’s assume it is a serious challenger. However, an idea that aims to replace something that is working “just fine” always requires a lot of convincing and persuasion. To convince others that the idea has the potential to replace existing solutions, it is important to understand what those solutions are. So we will first focus on what existing storage systems have to offer, their strengths and weaknesses, and how they compare. We will look at modern-day DAS, NAS and SAN systems, and then move on to SDS to see how it addresses enterprise storage problems compared to existing storage systems.
As the name suggests, DAS is a topology in which storage is attached directly to the server that also hosts the application consuming the storage. To overcome drive failures and the limited throughput and bandwidth of mechanical disk spindles, this topology makes use of RAID subsystems. RAID combines an array of disks to achieve high throughput, bandwidth, and fault tolerance by means of data striping, replication, parallel I/O and disk virtualization. RAID is a mature and proven technology, and plenty of literature is available on RAID concepts and the different RAID levels. The details are beyond the scope of this article, but it is advisable to read about RAID if you are not already familiar with it. Basic data flow in a DAS system is explained in the figure below:
The problem with DAS topology is its intrinsic design limitation of attaching storage to a single server.
That makes DAS unsuitable for modern data center requirements of highly scalable and highly available applications and storage systems.
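The striping and fault-tolerance ideas behind RAID mentioned earlier can be illustrated with a minimal sketch. It is not how any particular RAID controller works. It assumes a toy layout with one parity chunk per stripe (in the spirit of the parity-based RAID levels), and the function names stripe_with_parity and rebuild_chunk are hypothetical, chosen only for this illustration.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def stripe_with_parity(data: bytes, data_disks: int, chunk_size: int):
    """Split data into stripes of `data_disks` chunks plus one parity chunk.

    Returns a list of stripes. Each stripe is a list of data_disks + 1
    chunks with the parity chunk last. A short final stripe is zero-padded.
    """
    stripes = []
    stripe_bytes = data_disks * chunk_size
    for offset in range(0, len(data), stripe_bytes):
        block = data[offset:offset + stripe_bytes].ljust(stripe_bytes, b"\0")
        chunks = [
            block[i * chunk_size:(i + 1) * chunk_size]
            for i in range(data_disks)
        ]
        parity = reduce(xor_bytes, chunks)
        stripes.append(chunks + [parity])
    return stripes


def rebuild_chunk(stripe, lost_index: int) -> bytes:
    """Recover the chunk at lost_index by XOR-ing all surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    return reduce(xor_bytes, survivors)


if __name__ == "__main__":
    stripes = stripe_with_parity(b"enterprise storage", data_disks=3, chunk_size=4)
    first = stripes[0]
    # Pretend the disk holding chunk 1 failed and rebuild it from the rest.
    print("rebuilt chunk matches:", rebuild_chunk(first, 1) == first[1])
```

Each chunk of a stripe would live on a different disk, so reads and writes can proceed in parallel, and any single lost chunk can be recomputed from the others. Real RAID levels differ in where parity is placed and how many failures they tolerate.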
SAN refers to a storage topology in which many virtualized disks (a.k.a. logical units, or LUNs) are made available to application servers over an Ethernet or Fibre Channel network by dedicated storage appliances connected to that network. SAN boxes usually come with certain QoS parameters for throughput, bandwidth and fault tolerance.
The enablers of these features are RAID technology and the vendor-specific ‘magic’ added to it. These boxes also provide ease of use, as storage management and provisioning for application servers are delegated to a centralized and specialized device (or device pool).
This also isolates data from the problems of application servers. For high availability, LUNs can also be shared among several hosts by using a clustered file system or an OS-specific clustering mechanism. Basic data flow in a SAN system is explained in the figure below:
However, all this comes at a cost. If your data needs are growing, be ready to pay more for a ‘compatible’ box or go with a less expensive yet unsupported piece of hardware. So the choices are limited, and there are no U-turns. Contrary to popular belief, most vendors use more or less the same disks; the value proposition and stickiness come from the logic embedded in the box, which the outside world has no insight into.
NAS refers to a storage topology in which storage clients (application servers or user machines) store and share files using network file sharing protocols hosted on a dedicated storage appliance. These protocols, such as NFS and CIFS, take care of authentication, access control, file locking, transactional integrity of data, namespace management, etc., while the responsibility for providing better throughput, bandwidth and fault tolerance is left to the underlying RAID subsystems. Advanced NAS systems come with additional features such as file snapshots, sophisticated backup and recovery methods, QoS, easier storage provisioning and management, high availability, and the ability to scale out the infrastructure as demand for storage grows.
NAS adds one more layer of abstraction over typical SAN systems, thus relieving clients of the responsibility of creating and managing general-purpose or clustered file systems. NAS can be seen as an incremental development on SAN systems, not replacing them but complementing them with an easier interface. There are also hybrid NAS/SAN systems that provide both interfaces. The basic data flow of a NAS system is explained in the figure below:
Nevertheless, the problems of vendor lock-in and the premiums paid for SAN systems are inherited by NAS systems.
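The difference between the block interface a SAN exposes and the file interface a NAS layers on top of it can be sketched in a few lines. This is a toy in-memory model, not any vendor's or protocol's API. The class names BlockDevice and FileServer and the 512-byte block size are assumptions made purely for illustration.

```python
BLOCK_SIZE = 512  # assumed block size for this sketch


class BlockDevice:
    """SAN-style view: a LUN is a flat array of fixed-size blocks."""

    def __init__(self, num_blocks: int):
        self.blocks = [bytes(BLOCK_SIZE)] * num_blocks

    def write_block(self, lba: int, data: bytes) -> None:
        if len(data) > BLOCK_SIZE:
            raise ValueError("data larger than one block")
        # Pad short writes so every stored block has the same size.
        self.blocks[lba] = data.ljust(BLOCK_SIZE, b"\0")

    def read_block(self, lba: int) -> bytes:
        return self.blocks[lba]


class FileServer:
    """NAS-style view: the appliance owns the file system on top of blocks.

    Clients only see file names; the name-to-block mapping stays inside.
    """

    def __init__(self, device: BlockDevice):
        self.device = device
        self.index = {}      # file name -> (list of block addresses, length)
        self.next_free = 0   # naive allocator: never reuses blocks

    def write_file(self, name: str, data: bytes) -> None:
        lbas = []
        for offset in range(0, len(data), BLOCK_SIZE):
            self.device.write_block(self.next_free, data[offset:offset + BLOCK_SIZE])
            lbas.append(self.next_free)
            self.next_free += 1
        self.index[name] = (lbas, len(data))

    def read_file(self, name: str) -> bytes:
        lbas, length = self.index[name]
        raw = b"".join(self.device.read_block(lba) for lba in lbas)
        return raw[:length]  # drop the padding of the last block


if __name__ == "__main__":
    nas = FileServer(BlockDevice(num_blocks=8))
    nas.write_file("report.txt", b"quarterly numbers")
    print(nas.read_file("report.txt"))
```

With the block interface, the client must decide what each block address means, which is why SAN clients need their own file system. With the file interface, that bookkeeping lives in the appliance.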
Understanding Software Defined Storage
SAN and NAS boxes remain the mainstay of the enterprise storage market. However, start-ups that appeared between 2003 and 2013, with more still arriving, have attracted enough attention to suggest that radical changes are coming to the way enterprise storage is thought about and bought. These start-ups prefer to differentiate their offerings from those of the industry heavyweights by using terms such as “dispersed storage network”, “scale out NAS”, “clustered file system”, “converged storage”, “virtualized storage”, “parallel fault tolerant file system”, etc. These can all be broadly classified under the single umbrella of SDS.
According to the Storage Networking Industry Association (SNIA), the term SDS is a marketing ‘buzzword’ that follows on from the term ‘software defined networking’, which was first used to describe an approach in network technology that abstracts various elements of networking and creates an abstraction or virtualized layer in software. SDS has been proposed (SNIA, 2013 Storage Developer Conference) as a new category of storage software products. In March 2014, SNIA began work on a draft technical document on SDS that is available for public review.
Defining the term SDS is a tricky task, as there are many stakeholders who want the term to be most closely associated with their own idea of enterprise storage’s future. Put simply, SDS is a class of storage solutions that can be used with commodity storage media and compute hardware, where neither the storage media nor the compute hardware has any special intelligence embedded in it. All the intelligence of data management and access is provided by a software layer. The solution may provide some or all of the features of modern enterprise storage systems, such as scale-up and scale-out architecture, reliability and fault tolerance, high availability, unified storage management and provisioning, geographically distributed data center awareness and handling, disaster recovery, QoS, resource pooling, integration with existing storage infrastructure, etc. It may provide some or all data access methods, such as file, block and object.
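To make the idea of intelligence living in a software layer more concrete, here is a minimal sketch of one decision such a layer has to make: which commodity nodes should hold the replicas of a piece of data. It uses rendezvous (highest random weight) hashing as an illustrative technique. This is an assumption for the sake of the example, not a description of how any specific SDS product places data, and the names place_replicas and node-a through node-e are hypothetical.

```python
import hashlib
from typing import List, Sequence


def _score(node: str, key: str) -> int:
    """Deterministic pseudo-random weight for a (node, key) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place_replicas(key: str, nodes: Sequence[str], replicas: int = 3) -> List[str]:
    """Pick `replicas` distinct nodes for `key` using rendezvous hashing.

    Every client that knows the node list computes the same placement,
    so no central lookup table or special hardware is needed.
    """
    if replicas > len(nodes):
        raise ValueError("not enough nodes for the requested replica count")
    ranked = sorted(nodes, key=lambda n: _score(n, key), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    cluster = ["node-a", "node-b", "node-c", "node-d", "node-e"]
    before = place_replicas("volume1/object42", cluster)
    # Simulate the failure of one node that held a replica.
    survivors = [n for n in cluster if n != before[0]]
    after = place_replicas("volume1/object42", survivors)
    print("before:", before)
    print("after: ", after)
```

When a node disappears, the replicas on the surviving nodes stay where they are and only the lost replica is re-created elsewhere. That is the kind of fault-tolerance and scale-out behavior an SDS layer has to deliver in software rather than in a purpose-built box.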
SNIA has published a detailed view of SDS functionality and attributes; most of them are covered in the simplified description above. A generic data flow in an SDS solution is explained in the figure below:
SDS can also be seen as an extension of the textbook term ‘distributed file systems.’ Popular open source examples of distributed file systems include OpenStack Swift, Gluster, Ceph, HDFS, RozoFS, HekaFS, Lustre, XtreemFS, MooseFS, Quantcast File System, and many more. Some of these open source projects are also available with commercial support licenses. A few are more general purpose, while others are tightly coupled with Hadoop or other ecosystems. Proprietary SDS examples include Nexenta, Cleversafe, Riak CS, etc.
Initially, the big players in the box-based business remained in denial about any possible threat from SDS to their business models. Creating stickiness for its product has always been every business’s fantasy and ultimate goal, but market forces and innovation decide the shape of future technology. The myth that SDS is merely hype, or the brainchild of some open source enthusiasts and VCs who want to bet on the next big thing, has already been busted. The big box-based players are now coming out with their own enterprise SDS solutions, such as EMC’s ViPR and ScaleIO and NetApp’s Data ONTAP Edge. With these developments, it is becoming clearer which way the wind is blowing. It would not be an exaggeration to call SDS a fundamental shift from the box-based storage approach. Nevertheless, understanding, deploying and managing SDS in a production environment is still a challenge; it will be interesting to see the adoption rate of SDS in production environments in the near future.
Open source alternatives look promising and would be a good starting point for getting familiar with SDS. We will try to cover some popular open source SDS solutions in the near future.
Disclaimer: Views expressed here are personal, and not of my employer.
Pushpesh Sharma is currently working as a senior test development engineer at SanDisk India Device Design Center, Bangalore. He has more than six years of experience in evaluating cloud, virtualization and storage technologies. He holds a Bachelor of Engineering (Information Technology) from Govt. Engineering College, Kota (Raj.), India. He also holds a Certificate in Marketing and HRM from SJMSOM, IIT, Bombay. In his free time he likes to read (mostly anything), listen to good music and enjoy good food and wine.