A Closer Look at the Portworx Storage Cluster Architecture
Portworx is a modern, distributed, cloud native storage platform designed to work with orchestrators such as Kubernetes. The platform, from the company of the same name, brings some of the proven techniques applied to traditional storage architecture to the cloud native environment.
Continuing the series on stateful workloads and the cloud native storage offerings, I will introduce the architecture of Portworx.
What Is Portworx?
Portworx is a software-defined storage platform built for containers and microservices. It abstracts multiple storage devices to expose a unified, overlay storage layer to cloud native applications. Portworx users can deploy highly available stateful applications across multiple physical hosts in a data center, compute instances running in multiple zones, regions, and even different cloud providers.
Portworx can be easily installed on any host that runs a container runtime such as Docker. Since the platform relies on its own distributed services, it is possible to configure a multinode Portworx storage cluster without the need for installing Kubernetes. But through tight integration with Kubernetes, Portworx makes it possible to create hyperconverged or disaggregated deployments. In a hyperconverged scenario, compute and storage run on the same node while in a disaggregated scenario, only designated nodes act as storage nodes.
During the last year, Portworx has matured from being an overlay storage layer to an enterprise data platform. The current offering includes everything from integrated security to business continuity to dynamic scaling of storage pools.
Like most distributed platforms, Portworx implements a control plane and a data plane. The control plane acts as the command and control center for all the storage nodes participating in the cluster. Each storage node runs a data plane responsible for managing the I/O and the attached storage devices.
Both the control plane and the data plane run in a distributed mode. This ensures the high availability of the storage service. To achieve the best uptime, Portworx recommends running at least three storage nodes in a cluster. Depending on the size of the cluster, each node may run the control plane as well as the data plane components. In large clusters, it is possible to have nodes that don’t participate in the data plane, which means they are not designated storage nodes.
The above illustration depicts a three-node Portworx storage cluster. The control plane runs on three individual nodes that share the same key/value database. The cluster is identified by a unique id that all the participating nodes of the control plane use. When a new node that runs the control plane joins, it is expected to use the same cluster-id.
The data plane runs on one or more storage nodes that have an attached block storage device. The data plane is responsible for managing node-level operations and I/O redirection across the storage nodes.
The control plane uses gRPC to communicate with the data plane.
Let’s take a closer look at the components of the control plane and the data plane.
Portworx Control Plane
The control plane exposes an external interface for managing the cluster. It is used by the native CLI of Portworx, pxctl, to perform all the storage-related tasks such as the creation of volumes and storage pools. Orchestrators such as Kubernetes use this API for coordinating the placement and scheduling of stateful pods.
The Portworx service API is available as a REST endpoint, gRPC service, and through the Container Storage Interface (CSI). Portworx has open sourced OpenStorage SDK, a specification and a library that defines common storage operations performed in the context of cloud native environments. The SDK has bindings for Golang and Python which makes it easy to invoke the API. There is also a Swagger UI available within a Portworx cluster that can be accessed on port 9021 of any node.
The nodes in the control plane use a gossip protocol to exchange heartbeats and real-time statistics, such as I/O usage and available CPU and memory, across the nodes. This mechanism ensures the high availability of the control plane. The stats from the control plane are also shared with the data plane, which helps in making scheduling decisions.
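To make the idea concrete, here is a minimal Python sketch of how gossip-style state exchange converges. It is not Portworx's implementation: the names `GossipNode` and `NodeStats`, the stat fields, and the merge rule (keep whichever entry has the higher heartbeat counter) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeStats:
    heartbeat: int      # counter bumped by the owning node on every tick
    cpu_free: float     # hypothetical stat: fraction of CPU available
    mem_free_mb: int    # hypothetical stat: free memory in MiB


@dataclass
class GossipNode:
    name: str
    view: dict = field(default_factory=dict)  # node name -> NodeStats

    def tick(self, cpu_free: float, mem_free_mb: int) -> None:
        """Refresh this node's own entry and bump its heartbeat."""
        prev = self.view.get(self.name)
        beat = prev.heartbeat + 1 if prev is not None else 1
        self.view[self.name] = NodeStats(beat, cpu_free, mem_free_mb)

    def gossip_with(self, peer: "GossipNode") -> None:
        """Exchange views with one peer; both keep the freshest entry per node."""
        for name in set(self.view) | set(peer.view):
            candidates = [s for s in (self.view.get(name), peer.view.get(name))
                          if s is not None]
            newest = max(candidates, key=lambda s: s.heartbeat)
            self.view[name] = newest
            peer.view[name] = newest


a, b, c = GossipNode("node-a"), GossipNode("node-b"), GossipNode("node-c")
a.tick(0.60, 2048)
b.tick(0.25, 512)
c.tick(0.90, 4096)

a.gossip_with(b)  # a and b now know about each other
b.gossip_with(c)  # c learns about a through b, without contacting a directly
print(sorted(c.view))
```

The point of the sketch is the second exchange: node-c ends up with node-a's statistics even though the two never talked, which is why gossip scales without a central collector. A node whose heartbeat stops advancing can be flagged as unhealthy by its peers.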
The cluster’s metadata is stored in etcd, a distributed key/value database. The root of the KVDB consists of the cluster id common to all the nodes, along with other information such as volume configuration and node registration status. This acts as a single source of truth reflecting the current state of the cluster.
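As a rough illustration of how a shared key/value store can serve as the single source of truth, the following Python sketch keeps node registrations and volume configuration under a common cluster id. The key layout, the `FakeKVDB` class, and the helper functions are hypothetical; they do not reflect the actual keys Portworx writes to etcd.

```python
class FakeKVDB:
    """In-memory stand-in for a distributed key/value store such as etcd."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get_prefix(self, prefix):
        """Return every key/value pair whose key starts with the prefix."""
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}


def register_node(kvdb, cluster_id, node_id, has_storage):
    # Hypothetical layout: every key is rooted at the cluster id.
    kvdb.put(f"{cluster_id}/nodes/{node_id}",
             {"status": "online", "storage": has_storage})


def create_volume(kvdb, cluster_id, name, size_gib, replicas):
    kvdb.put(f"{cluster_id}/volumes/{name}",
             {"size_gib": size_gib, "replicas": replicas})


kvdb = FakeKVDB()
cluster_id = "px-demo"  # all nodes of one cluster must agree on this value

for node_id in ("node-a", "node-b", "node-c"):
    register_node(kvdb, cluster_id, node_id, has_storage=True)
create_volume(kvdb, cluster_id, "pgdata", size_gib=10, replicas=3)

# Any node using the same cluster id sees the same state.
print(sorted(kvdb.get_prefix(f"{cluster_id}/nodes/")))
# A node configured with a different cluster id sees nothing.
print(kvdb.get_prefix("other-cluster/"))
```

Rooting every key at the cluster id is what lets several clusters share one key/value database without seeing each other's state, and it is why a joining node has to present the same id.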
The provisioning management component is responsible for configuring the storage pools, provisioning volumes, sending instructions to the data plane for mounting and unmounting volumes, and even distributing the replicas of storage blocks across multiple fault-domains. Essentially, the provisioning service deals with the lifecycle of storage pools and volumes.
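One way to picture replica distribution across fault domains is a placement function that never puts two replicas of a volume in the same zone. The sketch below is an assumption-laden illustration, not Portworx's provisioning logic: the `place_replicas` function, the node fields, and the tie-breaking rule (prefer the node with the most free capacity) are invented for this example.

```python
def place_replicas(nodes, replica_count):
    """Pick replica_count nodes, each from a different fault domain.

    nodes: list of dicts with 'name', 'zone', and 'free_gib' keys.
    Nodes with more free capacity are preferred (an assumed policy).
    """
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")

    chosen, used_zones = [], set()
    for node in sorted(nodes, key=lambda n: n["free_gib"], reverse=True):
        if node["zone"] in used_zones:
            continue  # this fault domain already holds a replica
        chosen.append(node["name"])
        used_zones.add(node["zone"])
        if len(chosen) == replica_count:
            return chosen
    raise ValueError("not enough fault domains for the requested replica count")


nodes = [
    {"name": "n1", "zone": "zone-a", "free_gib": 500},
    {"name": "n2", "zone": "zone-a", "free_gib": 400},
    {"name": "n3", "zone": "zone-b", "free_gib": 300},
    {"name": "n4", "zone": "zone-c", "free_gib": 100},
]
print(place_replicas(nodes, 3))  # ['n1', 'n3', 'n4']
```

Note that n2 is skipped despite having more free space than n3 or n4, because zone-a already holds a replica. Losing any single zone therefore leaves two of the three copies intact.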
Finally, the background tasks component performs RAID scans, increments the HA count, forces resync of replicas, and takes snapshots of volumes based on a predefined schedule.
Portworx Data Plane
The control plane and data plane talk to each other over gRPC. All decisions taken by the control plane are sent as instructions to the corresponding node in the data plane.
The data plane performs I/O to the devices through the POSIX interface. Each node of the data plane communicates with the other nodes over RPC, which is used for replicating the data across multiple nodes.
In a Portworx cluster, a disk or a block storage device can be attached to only one node at a time, and that node assumes responsibility for the data path.
When data is written to a volume exposed through a bind-mount within the container/pod, it goes to the Linux kernel via the I/O queue. The data then goes to the write-through cache, which eventually commits the data to the disk. If the data is available within the cache, it responds to a read operation without going to the underlying storage. Each write operation is also associated with a timestamp, which helps identify the most recent data across all the nodes. In case one of the nodes participating in a replication becomes unavailable, the timestamp helps resync the missing data on the node.
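The interplay between the cache, the disk, and the per-write timestamp can be sketched in a few lines of Python. This is a simplified model under stated assumptions: `ReplicaNode`, `resync`, the block-level granularity, and the use of a logical counter in place of real timestamps are all hypothetical and are not taken from Portworx.

```python
import itertools


class ReplicaNode:
    def __init__(self, name):
        self.name = name
        self.disk = {}   # block id -> (timestamp, data): durable copy
        self.cache = {}  # block id -> (timestamp, data): in-memory copy

    def write(self, block, data, ts):
        # Write-through: the cache and the disk are updated together.
        self.cache[block] = (ts, data)
        self.disk[block] = (ts, data)

    def read(self, block):
        if block in self.cache:
            return self.cache[block][1]  # served without touching the disk
        entry = self.disk[block]
        self.cache[block] = entry        # populate the cache for next time
        return entry[1]


def resync(stale, healthy):
    """Copy every block that is missing or older on the stale node."""
    copied = []
    for block, (ts, data) in healthy.disk.items():
        if block not in stale.disk or stale.disk[block][0] < ts:
            stale.write(block, data, ts)
            copied.append(block)
    return sorted(copied)


clock = itertools.count(1)  # logical clock standing in for real timestamps
n1, n2 = ReplicaNode("n1"), ReplicaNode("n2")

ts = next(clock)
for node in (n1, n2):
    node.write(0, b"v1", ts)    # both replicas receive the first write

# n2 becomes unavailable; only n1 sees the next two writes.
n1.write(0, b"v2", next(clock))
n1.write(1, b"new", next(clock))

# n2 comes back and catches up by comparing timestamps.
print(resync(n2, n1))
print(n2.read(0))
```

Comparing timestamps means only the blocks that changed during the outage are copied, rather than the whole volume.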
The I/O dispatch component identifies the node with the target storage volume and redirects the operation to that specific node. If a volume has multiple replicas, the I/O dispatcher ensures that each node receives a replica. The device store acts as an interface between the attached storage and the node.
The node to which the block storage device is attached becomes the transaction coordinator. The I/O targeting the volume always goes through the transaction coordinator. Portworx also supports shared volumes that are visible to all the nodes. Even when shared volumes are used, the transaction coordinator takes responsibility for committing the write and sending an acknowledgment to the upstream components.
When a write operation has been performed across a quorum of replicas, the write is acknowledged by the Linux kernel, and that acknowledgment is forwarded to the user.
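The coordinator and quorum behavior can be modeled with a short Python sketch. It assumes a simple majority quorum and synchronous replication; the `Replica` and `Coordinator` classes are hypothetical, and the actual quorum rules used by Portworx are not specified in this article.

```python
class Replica:
    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.blocks = {}

    def apply(self, block, data):
        """Persist the block and return True, or return False if unreachable."""
        if not self.online:
            return False
        self.blocks[block] = data
        return True


class Coordinator:
    """Hypothetical transaction coordinator for a single volume."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.quorum = len(replicas) // 2 + 1  # simple majority (assumption)

    def write(self, block, data):
        # All I/O for the volume funnels through this one method.
        acks = sum(1 for r in self.replicas if r.apply(block, data))
        # Acknowledge upstream only when a quorum has the data. A real
        # system would also have to resync or roll back partial writes.
        return acks >= self.quorum


replicas = [Replica("n1"), Replica("n2"), Replica("n3", online=False)]
coord = Coordinator(replicas)

print(coord.write(7, b"hello"))  # True: 2 of 3 replicas acknowledged

replicas[1].online = False
print(coord.write(8, b"world"))  # False: only 1 of 3 replicas acknowledged
```

Routing every write through one coordinator gives a single place to decide whether an operation succeeded, which keeps the replicas from diverging when some of them are unreachable.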
Portworx’s data plane also ensures the security of data at rest. It is done through dm-crypt, a disk encryption system built into the Linux kernel based on the crypto subsystem and device-mapper. On Intel CPUs, Portworx takes advantage of hardware acceleration to minimize the burden on hosts. Every write operation goes through the encryption process before being committed to the disk.
Portworx has a fascinating architecture to implement a modern, distributed, cloud native storage platform. In future articles of this series, I will cover the security, custom scheduling, migration, disaster recovery, and dynamic volume management aspects of Portworx. Stay tuned!
Janakiram MSV’s Webinar series, “Machine Intelligence and Modern Infrastructure (MI2)” offers informative and insightful sessions covering cutting-edge technologies. Sign up for the upcoming MI2 webinar at http://mi2.live.
Portworx is a sponsor of The New Stack.
Feature image by dariasophia from Pixabay.