Software Startup Aims at Cluster Computing Barriers
Poorly written code often leads to poor utilization of hardware resources. Hardware makers are actively trying to solve that problem, but software makers are also providing tools to use the best computing resources available in a geographically distributed cloud environment.
Agnostiq, which has its roots in high-performance and quantum computing, has upgraded a tool called Covalent that offers a simple way to offload parts of code execution to the best hardware resources in the cloud. A coder needs to write only a few lines of code to unlock compute resources in Amazon Web Services for scientific computing and simulation.
The tool helps IT organizations, researchers and scientists expand the scope of hybrid clouds deployed on-premises and in the public cloud. It is designed to treat on-premises and AWS infrastructure as a single system, with Covalent scaling hardware availability to the size of the problem being solved.
Agnostiq’s tool fits into a wider shift of computing beyond CPUs into specialized accelerators like graphics and artificial intelligence processors, which are doing the heavy lifting for machine-learning and scientific tasks. In many cases, coders still specify the hardware on which the code should be executed, but toolkits from companies such as Nvidia and Intel are now automating that process.
“Users are able to manage all of the different software dependencies in their environment for particular tasks,” Cunningham said.
For example, a software stack may be deployed in multiple clouds, and Covalent will orchestrate jobs across multiple clusters, manage software revisions, submit scripts, and manage data.
“Without a tool that is able to span multiple on-prem clusters like that, it becomes very difficult to understand — where’s the latest version of code? Where is data in the case that it becomes fragmented, and where is my compute going to be most readily available?” Cunningham said.
Covalent is a Python-based tool, and a few lines of code add AWS’ Lambda, ECS and EC2 computing resources to the toolkit. The current implementation focuses on inputting a specific code set for execution on an Amazon cluster and waiting for the output to arrive, which can then be put into the stack.
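The pattern the article describes — decorating tasks in Python and letting a named executor determine where each one runs — can be sketched in a few lines. This is an illustrative pure-Python mock, not Covalent’s actual API; the executor names and functions here are stand-ins for backends such as AWS Lambda, ECS, or EC2.

```python
# Illustrative sketch of decorator-based task offloading (not Covalent's real API).
# An executor name chosen at decoration time determines where each task runs.

EXECUTORS = {
    "local": lambda fn, *args: fn(*args),       # run in-process
    "aws_lambda": lambda fn, *args: fn(*args),  # stand-in: would submit to AWS Lambda
}

def task(executor="local"):
    """Decorator: tag a function with the executor that should run it."""
    def wrap(fn):
        def run(*args):
            return EXECUTORS[executor](fn, *args)
        run.executor = executor  # record the routing decision for inspection
        return run
    return wrap

@task(executor="aws_lambda")
def simulate(x):
    # Placeholder for a compute-heavy step offloaded to the cloud.
    return x * x

@task()  # defaults to local execution
def postprocess(y):
    # Output from the offloaded step feeds back into the local stack.
    return y + 1

def workflow(x):
    return postprocess(simulate(x))
```

In this sketch, `workflow(3)` runs `simulate` under the `"aws_lambda"` executor and `postprocess` locally, returning 10 — mirroring the round trip of sending input to a remote cluster and folding the output back into the stack.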
The tool includes a software development kit with which users construct their workflows, and a server component that comprises a back-end dispatch service and a user interface.
“That server itself can be running on the HPC cluster in the case that certain nodes are available for workflow tools, which is more common these days. Or it could be running in the cloud, which could be more appropriate if the user is using a hybrid cloud configuration,” Cunningham said.
On-premises servers may not have the massive hardware resources like Nvidia’s GPUs needed for applications like machine learning. Machine learning typically requires lots of memory and storage close to the GPU and CPU inside a server, which is readily available on AWS.
A typical offload to the cloud would involve sending input to the remote cloud server, and waiting to receive the output, which can then be incorporated into the software stack. Every second counts in the execution of code, and real-time simulations in hybrid clouds with on-premises and public cloud infrastructure are seen as one of the biggest challenges in cloud computing. The barrier is the interconnect, which isn’t optimal for real-time data exchange.
The toolkit does not yet include streaming technologies, which would add an element of real-time processing and analysis of data as it executes on the hardware.
“That’s something that’s on our roadmap for a future release,” Cunningham said.