Artificial intelligence (AI) has had a meteoric rise in the past year, with products such as ChatGPT and DALL-E demonstrating what large language models (LLMs) are capable of. Microsoft Bing Chat and Google Bard brought cloud-hosted chat bots to millions, and Stability AI brought image generation from the cloud to people’s PCs and laptops. And with Qualcomm recently demonstrating a port of Stable Diffusion on a smartphone equipped with a prototype Snapdragon 8 Gen 2, the prospect of ubiquitous AI is now firmly established.
But the growing scarcity of compute infrastructure capable of accommodating these workloads in the cloud threatens to stall AI’s rise. Training models is a particularly intensive task: in August 2022, Stability AI CEO Emad Mostaque disclosed that training Stable Diffusion took a cluster of 256 Nvidia A100 GPUs a combined 150,000 GPU-hours, at a market price of $600,000. With more researchers, start-ups and enterprises looking to train or customize LLMs, GPU supply has since tightened, with lead times for high-end GPUs lengthening and capacity on cloud platforms becoming scarcer and more expensive.
Concerns about GPU capacity extend beyond model training, however. Once a model is trained, a product is built around it and deployed to production, where the model runs inference for every user request. Training is a largely one-time expense, but inference costs scale with usage, so inference spending will soon outpace training costs as the number of users grows.
Integrating AI functionality while managing constrained cloud computing capacity, and balancing operating budgets against inference costs, requires a deliberate approach to how AI is built into software and services. Development and commercialization of LLMs are still in their early days, but some strategies can already be adopted to keep operational costs in check as applications grow.
Why Right-Sizing Reduces Costs
The performance characteristics of LLMs vary widely, with the size of the model and of its training data set being important factors to consider. Other significant factors, including the quality of the training data, are sadly beyond the scope of this article. Google’s first-generation Pathways Language Model, or PaLM, is a technical feat at 540 billion parameters, designed to push the limits of the maximum practical size of an LLM. For comparison, OpenAI’s well-known GPT-3 model has 175 billion parameters, and LLaMA, developed by Meta’s AI team, is available in 7 billion, 13 billion, 33 billion and 65 billion parameter versions.
The varying sizes of LLaMA are useful, as models with fewer parameters require fewer resources to operate: LLaMA-13B can run on a single GPU in a standard developer workstation, and smaller models are correspondingly more cost-effective to run on cloud virtual machines. In an academic preprint, the Meta AI team showed LLaMA-13B outperforming GPT-3 on five of seven common sense reasoning benchmarks, while the full LLaMA-65B model held its own against Google’s much larger PaLM-540B. The point is literally academic, however, as LLaMA is licensed only for non-commercial research. Still, the preprint shows that using a larger model doesn’t guarantee more accurate results.
One of the earliest strategies for cloud cost optimization was “right-sizing” workloads. If, for example, you deploy an m7g.16xlarge instance on Amazon Web Services to host a WordPress installation, you’ve massively overprovisioned for your needs: 64 virtual CPUs and 256GB of RAM is far more than the task requires. Just as you would choose a virtual machine that fits the workload, choosing the best-fit model size meaningfully decreases the resources required to run an AI workload, potentially unlocking greater savings in cloud spending and greater flexibility in how AI is deployed in your application.
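To put rough numbers on what right-sizing means for an LLM, the memory needed just to hold a model’s weights can be estimated from its parameter count and numeric precision. The sketch below is a back-of-the-envelope calculation, not a sizing guide; real deployments also need memory for activations, key-value caches and request batching.

# Back-of-the-envelope estimate of the memory needed to hold model weights.
# Treat these figures as a floor: serving also needs room for activations,
# key-value caches and batching.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    # billions of parameters x bytes per parameter = gigabytes of weights
    return params_billions * BYTES_PER_PARAM[precision]

for name, size in [("LLaMA-13B", 13), ("GPT-3 (175B)", 175), ("PaLM (540B)", 540)]:
    print(f"{name}: ~{weight_memory_gb(size):.0f}GB of weights at 16-bit precision")
# LLaMA-13B: ~26GB   (a single data center GPU, or a workstation GPU once quantized)
# GPT-3 (175B): ~350GB   (a multi-GPU server)
# PaLM (540B): ~1080GB   (a cluster)

The gap between a model that fits on one GPU and a model that needs a cluster is exactly the gap that right-sizing is meant to exploit.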
Relatedly, it’s possible to fine-tune models to increase the accuracy of the output and reduce the compute resources needed for inference. In situations where the expected input and requested output are predictable, such as translating text between English and Japanese, relying on parameter-efficient fine-tuning techniques can reduce the compute and storage requirements of LLMs while providing better performance.
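As a concrete illustration, the sketch below uses the open source Hugging Face peft library to attach LoRA adapters, one popular parameter-efficient technique, to a small sequence-to-sequence translation model. The base model name, target modules and hyperparameters are illustrative placeholders, not recommendations.

# Sketch: parameter-efficient fine-tuning with LoRA adapters via Hugging Face peft.
# The base model name and all hyperparameters below are illustrative placeholders.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

base_model = "Helsinki-NLP/opus-mt-en-jap"  # placeholder English-to-Japanese model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

# LoRA injects small low-rank adapter matrices into the attention projections
# and trains only those, leaving the original weights frozen.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    target_modules=["q_proj", "v_proj"],  # attention projections in this architecture
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# Fine-tune as usual (for example with transformers.Seq2SeqTrainer) on a parallel
# English-Japanese corpus, then save only the adapter weights, which are measured
# in megabytes rather than gigabytes: model.save_pretrained("en-ja-lora-adapter")

Because only the small adapter matrices are trained and stored, a single frozen base model can serve many narrow tasks, which is where the compute and storage savings come from.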
How Hybrid AI Adds Flexibility
Given evidence that larger models don’t necessarily perform better, and that models can be fine-tuned to reduce the resources they use, running inference workloads at the edge in production applications becomes viable. Qualcomm promotes this idea in a report, highlighting scenarios such as the ability to simultaneously run smaller models locally and larger models in the cloud, with the latter correcting inaccuracies in the former as needed.
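In code, one plausible reading of that scenario is a router that answers every request with a small local model and escalates to a larger cloud-hosted model only when the local draft looks unreliable. The sketch below is hypothetical rather than Qualcomm’s implementation; the endpoint URL, response schema and confidence heuristic are placeholders.

# Hypothetical hybrid router: answer locally with a small model when possible,
# escalate to a larger cloud-hosted model when the local result looks weak.
import requests
from transformers import pipeline

local_model = pipeline("text-generation", model="distilgpt2")  # stand-in small model
CLOUD_ENDPOINT = "https://example.com/v1/generate"  # placeholder cloud LLM API

def needs_escalation(draft: str) -> bool:
    # Placeholder heuristic; a real system might use token log-probabilities,
    # a verifier model or task-specific checks.
    return len(draft.strip()) < 20

def generate(prompt: str) -> str:
    draft = local_model(prompt, max_new_tokens=64)[0]["generated_text"]
    if needs_escalation(draft):
        resp = requests.post(CLOUD_ENDPOINT, json={"prompt": prompt}, timeout=30)
        return resp.json()["text"]  # response schema is illustrative
    return draft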
Perhaps the most persuasive and obvious example of workload division is Qualcomm’s example of sensory segmentation. When a user speaks to a smart assistant on a smartphone, that input can be transcribed to text using an automatic speech recognition (ASR) model, which is relayed to a cloud-hosted AI model for processing and sent back to the smartphone to be read out by a text-to-speech model. Initial versions of Siri, Alexa and Google Assistant performed all of this in the cloud, although ASR has been moved on-device for all three smart assistants in the past two years. Qualcomm, Apple, Amazon and Google also highlight the privacy implications of performing this inference on-device, as user audio wouldn’t need to be uploaded to the cloud.
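A rough sketch of that division of labor follows, using the open source whisper package for on-device transcription and pyttsx3 for local text-to-speech; the cloud endpoint and response schema are placeholders. The point it illustrates is that only transcribed text, never raw audio, leaves the device.

# Sketch of the hybrid assistant pipeline described above: speech recognition and
# speech synthesis stay on the device; only transcribed text goes to the cloud.
# The endpoint and response schema are placeholders.
import requests
import whisper   # open source speech recognition
import pyttsx3   # local text-to-speech engine

CLOUD_ENDPOINT = "https://example.com/v1/chat"  # placeholder cloud LLM API

asr_model = whisper.load_model("base")  # small ASR model suited to local hardware
tts_engine = pyttsx3.init()

def handle_utterance(audio_path: str) -> None:
    # 1. On-device ASR: the raw audio never leaves the device.
    text = asr_model.transcribe(audio_path)["text"]
    # 2. Cloud inference on the transcribed text only.
    reply = requests.post(CLOUD_ENDPOINT, json={"prompt": text}, timeout=30).json()["reply"]
    # 3. On-device TTS reads the response back to the user.
    tts_engine.say(reply)
    tts_engine.runAndWait()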
Why Hybrid AI Matters
Dedicated extensions for AI workloads are coming to client devices, making it possible to offload inference tasks from the cloud without expecting users to have a full-power GPU to bear the load. Applications can be built now, for current and upcoming devices, to take proper advantage of these capabilities, improving performance and lowering operating costs in the process.
More details, including resources and documentation for programmers, are expected in 2023 from Apple at its Worldwide Developers Conference in June, from Intel at its Innovation event in September and from Qualcomm at its Snapdragon Summit toward the end of the year. To learn more about our extensive AI research for clients, get in touch with us.