How businesses can achieve greener generative AI with more sustainable inference

Generative AI can produce content, images, music and code much as humans can, but at phenomenal speed and often with impressive accuracy, and it is designed to help businesses become more efficient and accelerate innovation. As AI becomes more mainstream, more scrutiny will be leveled at what it takes to produce such outcomes and at the associated costs, both financial and environmental.

We have a chance now to get ahead of the issue and assess where the most significant resources are being directed. Inference, the process by which AI models analyze new data based on the intelligence stored in their artificial neurons, is at scale the most energy-intensive and costly part of running AI models. The balance to strike is implementing more sustainable solutions without jeopardizing quality and throughput.

What makes a model

For the uninitiated, it may be difficult to imagine how AI and the algorithms that underpin programming can carry such extensive environmental or financial burdens. A brief synopsis of machine learning (ML) would describe the process in two stages.

The first is training the model to develop intelligence and label information in certain categories. For instance, an e-commerce operation might feed images of its products and customer habits to the model to allow it to interrogate these data points further down the line.

The second is the identification, or inference, where the model will use the stored information to understand new data. The e-commerce business, for instance, will be able to catalog the products into type, size, price, color and a whole host of other segmentations while presenting customers with personalized recommendations.

The inference stage is the less compute-intensive of the two, but once deployed at scale, for example on a platform such as Siri or Alexa, the accumulated computation can consume huge amounts of power, which drives up both cost and carbon emissions.
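A back-of-the-envelope sketch shows how per-query inference energy, small on its own, can overtake a one-time training cost once volume is high. All three figures below are hypothetical placeholders chosen to keep the arithmetic simple, not measurements of any real model or platform.

```python
# Hypothetical placeholder figures, not measurements of any real system.
TRAINING_ENERGY_KWH = 1_000_000.0   # one-time energy spent on training
ENERGY_PER_QUERY_KWH = 0.001        # energy spent on a single inference
QUERIES_PER_DAY = 10_000_000        # daily inference volume at scale


def cumulative_inference_kwh(energy_per_query_kwh: float,
                             queries_per_day: int,
                             days: int) -> float:
    """Total inference energy after `days` of serving traffic."""
    return energy_per_query_kwh * queries_per_day * days


def days_to_match_training(training_kwh: float,
                           energy_per_query_kwh: float,
                           queries_per_day: int) -> float:
    """Days of serving after which inference energy equals training energy."""
    return training_kwh / (energy_per_query_kwh * queries_per_day)


if __name__ == "__main__":
    days = days_to_match_training(
        TRAINING_ENERGY_KWH, ENERGY_PER_QUERY_KWH, QUERIES_PER_DAY
    )
    print(f"Inference matches training energy after about {days:.0f} days")
```

Under these assumed numbers, inference catches up with training in roughly 100 days and keeps growing for as long as the service runs, which is why the per-query cost matters so much.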

Perhaps the most jarring difference between inference and training is how each is funded. Inference is attached to the cost of sale and, therefore, affects the bottom line, while training is usually attached to R&D spending, which is budgeted separately from the actual product or service.

Therefore, inference requires specialized hardware that optimizes cost and power consumption efficiencies to support viable, scalable business models — a solution where, refreshingly, business interests and environmental interests are aligned.

Hidden costs

The lodestar of gen AI — ChatGPT — is a shining example of hefty inference costs, amounting to millions of dollars per day (and that's not even including its training costs).

OpenAI’s recently released GPT-4 is estimated to be about three times more computationally demanding than the prior iteration. With a rumored 1.8 trillion parameters across 16 expert models, claimed to run on clusters of 128 GPUs, it will devour exorbitant amounts of energy.

High computational demand is exacerbated by the length of prompts, since longer prompts need more energy to process. GPT-4’s context length jumps from 8,000 to 32,000 tokens, which increases the inference cost and makes the GPUs less efficient. Inevitably, the ability to scale gen AI is restricted to the largest companies with the deepest pockets, leaving those without the necessary resources unable to exploit the benefits of the technology.
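To see why a longer context raises inference cost, consider a rough scaling sketch. It assumes standard dense self-attention, in which every token attends to every other token, so the attention step grows with the square of the prompt length while the rest of the model grows roughly linearly. GPT-4's internals are not public, so this illustrates the general scaling behavior rather than its actual cost; the function names are hypothetical.

```python
def linear_cost_ratio(old_tokens: int, new_tokens: int) -> float:
    """Relative cost of per-token work that scales linearly with length."""
    return new_tokens / old_tokens


def attention_cost_ratio(old_tokens: int, new_tokens: int) -> float:
    """Relative cost of dense self-attention, which scales quadratically.

    Assumption: a full-length prompt and no attention optimizations.
    """
    return (new_tokens / old_tokens) ** 2


if __name__ == "__main__":
    old_ctx, new_ctx = 8_000, 32_000
    print("linear work:   ", linear_cost_ratio(old_ctx, new_ctx), "x")
    print("attention work:", attention_cost_ratio(old_ctx, new_ctx), "x")
```

Under these assumptions, a fourfold increase in context length means four times the per-token work but sixteen times the attention work when a prompt fills the whole window.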

The power of AI

Generative AI and large language models (LLMs) can have serious environmental consequences. The computing power and energy consumption they require lead to significant carbon emissions. There is only limited data on the carbon footprint of a single gen AI query, but some analysts suggest it is four to five times higher than that of a search engine query.

One estimate put ChatGPT's electricity consumption on par with that of 175,000 people. Back in 2019, a University of Massachusetts Amherst study, widely reported by MIT Technology Review, estimated that training a single large AI model can emit more than 626,000 pounds of carbon dioxide equivalent, nearly five times the lifetime emissions of an average car.

Despite some compelling research and assertions, the lack of concrete data on gen AI and its carbon emissions is a major problem that needs to be rectified if we are to drive change. Organizations and data centers that host gen AI models must likewise be proactive in addressing the environmental impact. By prioritizing more energy-efficient computing architectures and sustainable practices, businesses can align their own imperatives with efforts to limit climate degradation.

The limits of a computer

A Central Processing Unit (CPU), which is integral to a computer, is responsible for executing instructions and mathematical operations — it can handle millions of instructions per second and, until not so long ago, has been the hardware of choice for inference.

More recently, there has been a shift away from CPUs toward running the heavy deep learning processing on companion chips attached to the CPU as offload engines, also known as deep learning accelerators (DLAs). Problems arise because the CPU that hosts those DLAs must handle high-throughput data movement in and out of the inference server, as well as the data processing tasks that feed the DLA with input and post-process its output.

Being a largely serial processing component, the CPU creates a bottleneck and simply cannot work fast enough to keep those DLAs busy.

When a company relies on a CPU to manage inference in deep learning models, no matter how powerful the DLA, the CPU will reach its performance ceiling and then start to buckle under the weight. Consider a car that can only run as fast as its engine will allow: If the engine in a smaller car is replaced with one from a sports car, the smaller car will fall apart from the speed and acceleration the stronger engine is exerting.

The same is true of a CPU-led AI inference system. DLAs in general, and GPUs more specifically, run at breakneck speed and complete tens of thousands of inference tasks per second, but they will not achieve what they are capable of when a limited CPU throttles their input and output.
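The bottleneck argument can be sketched as a simple pipeline model: end-to-end throughput is capped by the slowest stage, so a fast accelerator sits idle whenever the host CPU cannot feed it. The stage names and rates below are hypothetical, picked only to illustrate the effect.

```python
def pipeline_throughput(stage_rates: dict[str, float]) -> float:
    """End-to-end rate of a pipeline, limited by its slowest stage."""
    return min(stage_rates.values())


def accelerator_utilization(stage_rates: dict[str, float],
                            accelerator: str) -> float:
    """Fraction of the accelerator's peak rate that the pipeline can use."""
    return pipeline_throughput(stage_rates) / stage_rates[accelerator]


# Hypothetical rates in inferences per second, for illustration only.
STAGES = {
    "cpu_preprocess": 5_000.0,    # decode and prepare inputs on the host CPU
    "dla_inference": 20_000.0,    # peak rate of the accelerator itself
    "cpu_postprocess": 8_000.0,   # handle accelerator output on the host CPU
}

if __name__ == "__main__":
    rate = pipeline_throughput(STAGES)
    used = accelerator_utilization(STAGES, "dla_inference")
    print(f"System delivers {rate:.0f} inferences/s")
    print(f"Accelerator runs at {used:.0%} of its peak")
```

With these assumed rates the accelerator is used at only a quarter of its capacity, so adding a faster accelerator would change nothing until the CPU-side stages speed up.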

The need for system-wide solutions

As NVIDIA CEO Jensen Huang put it, “AI requires a whole reinvention of computing… from chips to systems.”

With the exponential growth of AI applications and dedicated hardware accelerators such as GPUs or TPUs, we need to turn our attention to the system surrounding those accelerators and build system-wide solutions that can support the volume and velocity of data processing required to exploit those DLAs. We need solutions that can handle large-scale AI applications as well as accomplish seamless model migration at a reduced cost and energy input.

Alternatives to CPU-centric AI inference servers are imperative to provide an efficient, scalable and financially viable way to sustain the soaring demand for AI in businesses while also addressing the environmental knock-on effect of this growth in AI usage.

Democratizing AI

There are many solutions currently floated by industry leaders to retain the buoyancy and trajectory of gen AI while reducing its cost. Focusing on green energy to power AI could be one route; another could be timing computational processes for the points in the day when renewable energy is available.
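Timing workloads around renewable availability can be sketched as a small scheduling problem: given an hourly forecast of grid carbon intensity, pick the contiguous window with the lowest total. The forecast values and the `greenest_window` helper below are hypothetical, and a real deployment would pull the forecast from a grid operator or data provider.

```python
from typing import Sequence


def greenest_window(intensity: Sequence[float], duration_hours: int) -> int:
    """Return the start hour of the lowest-carbon contiguous window.

    `intensity` holds one forecast value per hour (for example gCO2/kWh).
    Ties go to the earliest window; windows do not wrap past the last hour.
    """
    if not 0 < duration_hours <= len(intensity):
        raise ValueError("duration must fit inside the forecast")
    window = sum(intensity[:duration_hours])
    best_start, best_total = 0, window
    for start in range(1, len(intensity) - duration_hours + 1):
        # Slide the window by one hour: add the new hour, drop the old one.
        window += intensity[start + duration_hours - 1] - intensity[start - 1]
        if window < best_total:
            best_start, best_total = start, window
    return best_start


# Hypothetical 24-hour forecast with a midday dip from solar generation.
FORECAST = [450, 440, 430, 420, 410, 400, 380, 340, 300, 250, 200, 160,
            150, 155, 180, 230, 300, 360, 410, 440, 460, 470, 465, 455]

if __name__ == "__main__":
    start = greenest_window(FORECAST, duration_hours=3)
    print(f"Run the 3-hour batch job starting at hour {start}")
```

This only suits work that can wait, such as batch jobs or scheduled retraining; interactive inference has to be served when the request arrives.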

There is an argument for AI-driven energy management systems for data centers that would deliver cost savings and improve the environmental credentials of the operation. In addition to these tactics, one of the most valuable investments for AI lies in the hardware. This is the anchor for all its processing and bears the weight of its most energy-hungry calculations.

A hardware platform or AI inference server chip that can support all the processing at a lower financial and energy cost will be transformative. This will be the way we can democratize AI, as smaller companies can take advantage of AI models that aren’t dependent on the resources of large enterprises.

It takes millions of dollars a day to power the ChatGPT query machine. An alternative server-on-a-chip solution running on far less power and fewer GPUs would save resources and soften the burden on the world’s energy systems, resulting in gen AI that is cost-conscious, environmentally sound and available to all.

Moshe Tanach is founder and CEO of NeuReality.
