Rescale optimizes architecture for HPC with NVIDIA AI.
In 1965, Gordon Moore, who later cofounded Intel, observed that the number of transistors on a computer chip was doubling roughly every year; in 1975 he revised the rate to approximately every two years. This observation became known as “Moore’s Law” and held true for decades. Approximately every two years, the doubling of transistors led to a doubling of a chip’s computational power.
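The doubling rule can be written as a simple exponential. The sketch below is only an illustration of that arithmetic: the function name and the starting count of 1,000 transistors are hypothetical, and the two-year doubling period is the commonly quoted figure.

```python
def projected_transistors(initial_count: float, years: float,
                          doubling_period_years: float = 2.0) -> float:
    """Project a transistor count assuming a fixed doubling period."""
    return initial_count * 2 ** (years / doubling_period_years)


# Hypothetical starting point: a chip with 1,000 transistors.
for years in (2, 10, 20):
    print(years, projected_transistors(1_000, years))
```

Ten doublings over 20 years multiply the count by 1,024, which is why even a small slowdown in the doubling period has a large effect over a decade or two.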

But Moore’s prediction is not a law of nature, and it was only a matter of time before nature broke it. We are already reaching the limits of the performance gains that adding transistors can deliver. Chips can only hold so many transistors before the energy and financial costs eat up any performance gains. As a result, significant changes in hardware or transistor design are required to achieve ongoing increases in performance. But if we aren’t close to fundamentally changing the underlying hardware of a chipset, how can computational power continue to be optimized?
One strategy is to develop a domain-specific architecture and better utilize hardware to meet specific workload needs. Even in the cloud, there are thousands of potential hardware and software architecture configurations that can be employed in high performance computing (HPC), so it can be challenging to know where to begin with the optimization process. In late 2022, Rescale and NVIDIA announced a partnership to develop a full-stack solution capable of supporting this decision-making process and to continue increasing computational performance.
The result was Rescale’s Compute Recommendation Engine (CRE), powered by NVIDIA AI technology. It can provide architecture configuration recommendations that best suit a given workload. Rescale reports that the CRE platform already leads to 200 percent increases in simulation speeds and more than 60 percent savings in computational costs.
The Core of the Compute Recommendation Engine
The success of large-scale data analytics and cloud-based research and development is entirely dependent on available computing power. For HPC-as-a-service products, like those offered by Rescale, companies can choose the architecture they need for a given workload. The configuration can then have a big impact on the power and efficiency of cloud-based workloads—especially simulations. So, choosing the ideal infrastructure configuration can lead to faster problem-solving, more design iterations and energy savings.
Rescale’s CRE solution aims to address this problem by quickly suggesting architectures optimized for a company’s computational needs. The CRE platform is based on usage patterns and performance from Rescale customer workloads, as well as data on tooling, automation, hardware, software, storage and networks from a variety of sources. Using this wealth of information, Rescale employed recurrent neural networks by NVIDIA AI to create a series of tags that align hardware configuration with performance for given use cases. The tool can make recommendations based on performance, energy or cost metrics.
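To make the idea of metric-based recommendations concrete, here is a minimal sketch of choosing a configuration for a given objective. It is not Rescale's method, which the company says relies on neural networks trained on customer workload data; it simply ranks a few candidates by one metric. The class, the function, the configuration names and all the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ConfigEstimate:
    """Hypothetical estimate for running one workload on one configuration."""
    name: str
    runtime_hours: float
    cost_usd: float
    energy_kwh: float


# Each objective maps to the metric it tries to minimize.
OBJECTIVES = {
    "performance": lambda c: c.runtime_hours,
    "cost": lambda c: c.cost_usd,
    "energy": lambda c: c.energy_kwh,
}


def recommend(candidates: Iterable[ConfigEstimate], objective: str) -> ConfigEstimate:
    """Return the candidate with the lowest value of the chosen metric."""
    if objective not in OBJECTIVES:
        raise ValueError(f"unknown objective: {objective!r}")
    return min(candidates, key=OBJECTIVES[objective])


# Made-up estimates, for illustration only.
candidates = [
    ConfigEstimate("cpu-small", runtime_hours=12.0, cost_usd=48.0, energy_kwh=30.0),
    ConfigEstimate("cpu-large", runtime_hours=5.0, cost_usd=60.0, energy_kwh=42.0),
    ConfigEstimate("gpu-node", runtime_hours=2.0, cost_usd=70.0, energy_kwh=25.0),
]

for objective in OBJECTIVES:
    print(objective, "->", recommend(candidates, objective).name)
```

Even in this toy example the best configuration depends on the objective: the fastest option is not the cheapest, which is the kind of trade-off a recommendation engine has to surface.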
NVIDIA and Rescale are uniting their HPC and AI expertise to deliver their architecture recommendations. The partnership is a full-stack solution for optimizing cloud-based workloads for anything from big data-supported decision-making to early-stage simulations for science and engineering research.
Continuous Optimization—for Hardware and Beyond
As with any AI-powered tool, one of the biggest attractions of the recommendation engine is that it is iterative by design. The CRE platform continuously collects and analyzes metadata so it can recommend further optimization based on how the workload performs on the initial chip architecture. For example, if the message-passing metadata reveals a networking bottleneck, the CRE would recommend an architecture with lower interconnect latency. Additional recommendations are continuously offered within the tool, and the AI-powered suggestions are designed to stay on top of new hardware and software as they become available. With new HPC options continually hitting the market, it can be difficult for engineering teams to critically evaluate each new solution in real time. Instead, tools like the CRE can monitor new products and feed them into their optimization suggestions as data is generated.
The end result allows engineers to focus on their specific expertise instead of needing to stay on top of new chipset or GPU releases and how they might alter their simulation strategy or timing. The CRE can monitor simulation performance, manage troubleshooting and automate HPC workloads to help engineers focus on more critical tasks instead of job setup and execution.
Beyond hardware architecture-based recommendations, the tool can also identify critical bottlenecks in the HPC architecture, like storage or memory performance. Not only can this information dictate the optimal architecture, but it can also then be used as feedback for developers to alter code architecture to further improve simulation performance.
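The bottleneck-driven recommendations described in this section can be illustrated with a simple rule. The sketch below assumes, hypothetically, that job metadata has already been reduced to the fraction of runtime spent in each category; the category names, the 40 percent threshold and the suggestion text are all invented for illustration and do not reflect how the CRE works internally.

```python
from typing import Mapping

# Hypothetical mapping from a bottleneck category to a suggested change.
SUGGESTIONS = {
    "network": "try an architecture with lower interconnect latency",
    "storage": "try faster or higher-throughput storage",
    "memory": "try nodes with more memory capacity or bandwidth",
}

NO_CHANGE = "no change suggested"


def suggest_change(time_fractions: Mapping[str, float], threshold: float = 0.4) -> str:
    """Suggest a change for the largest non-compute overhead, if it is big enough."""
    # Only consider categories we have a suggestion for (compute is ignored).
    overheads = {k: v for k, v in time_fractions.items() if k in SUGGESTIONS}
    if not overheads:
        return NO_CHANGE
    worst = max(overheads, key=overheads.get)
    if overheads[worst] < threshold:
        return NO_CHANGE
    return SUGGESTIONS[worst]


# Made-up profile: half of the runtime is spent waiting on the network.
profile = {"compute": 0.35, "network": 0.50, "storage": 0.10, "memory": 0.05}
print(suggest_change(profile))
```

A production system would learn these relationships from data rather than hard-code them, but the input and output have the same shape: performance metadata in, an architecture suggestion out.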
Currently, Rescale reports a greater than 95 percent accuracy for the CRE tool based on its internal testing and validation. This will likely continue to increase with further use and ongoing optimization. As with most AI-powered tools, the more simulation workloads that are optimized using the CRE platform, the more robust the suggestions will become over time.
The Next Step in Optimized Performance
For many engineers, cloud-based computing seemed like the solution to complex infrastructure and architecture design optimization. However, cloud computing architecture and HPC as a service are creating a new set of challenges, because large-scale simulations and workloads can still be optimized further through iterative full-stack design. Rescale and NVIDIA are investing in AI to deliver the optimization needed to provide better and faster solutions in the cloud.
Rescale states that the CRE tool is the most critical change to full-stack computing since the introduction of cloud-based computing. Although there is likely some significant bias in Rescale’s opinion, the company is correct that this could change how engineers think about cloud computing. Overall, this stands to decrease the time and energy required to optimize architecture for individual workloads and simulations.
While AI is often touted as a strategy to save time and money, a big benefit of the CRE platform is simply being able to run more simulations with the same investment. If workloads are optimized, engineers can run more simulations, or more detailed simulations, in the same amount of time and often for the same amount of money as they previously spent on unoptimized workflows.
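As a rough illustration of that arithmetic, the sketch below converts the figures Rescale reports earlier in the article (a 200 percent increase in simulation speed and more than 60 percent savings in computational cost) into multipliers. It assumes, for simplicity, that those figures apply uniformly to every simulation, which the article does not state; the function names are hypothetical.

```python
def speed_multiplier(percent_increase: float) -> float:
    """A 200 percent speed increase means running 3x as fast."""
    return 1 + percent_increase / 100


def runs_per_budget_multiplier(percent_savings: float) -> float:
    """If each run costs less, the same budget pays for more runs."""
    if not 0 <= percent_savings < 100:
        raise ValueError("savings must be at least 0 and below 100 percent")
    return 100 / (100 - percent_savings)


print(speed_multiplier(200))           # 3.0: three times the runs in the same time
print(runs_per_budget_multiplier(60))  # 2.5: two and a half times the runs per dollar
```

Under those assumptions, a team could fit three simulations into the time one used to take, or two and a half into the same budget.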
Simulations are becoming increasingly complex, but the deadline for results doesn’t change. As such, workload-specific architecture optimization will become an absolute necessity. With the CRE platform, Rescale and NVIDIA are relying on engineers wanting to focus on the design of their workloads and simulations and the results themselves, leaving the architecture design to an AI-powered tool. However, there will likely be a balance between teams that still want control over their architecture design and those that want instantaneous suggestions to accelerate their workflow.