You’re using the latest high-efficiency servers. You keep your server room at a higher temperature. You’ve virtualized to cut down on the number of servers. It seems like you’re doing everything right. But you’re still stuck at a less than enviable power-usage effectiveness (PUE) rating.
How do you get from where you are to where you want to go? I’ve been around data centers a long time, from closets with window AC units stuck into a wall to overseeing Amazon Web Services operations when the number of servers grew by an order of magnitude as the cloud began to take hold.
There are five essential moves to achieve ultra-low PUEs. First, let me quickly outline the four foundational pillars, then we’ll get into what I think will be the true differentiator.
- Get the right, high-efficiency equipment. Give minute attention to mechanical infrastructure like fans (no belts, no slippage!), direct drive units, and variable frequency drives on all motors.
- Emphasize measurement and control. Choose a monitoring system that allows for precise high-speed industrial process control. For example, as the outside air temperature drops, the system should be able to recirculate more heated air from the data module into the supply air stream, allowing for precise temperature control.
- Increase the voltage. High-voltage distribution to the IT racks reduces transmission loss, decreases the amount of copper running into the building, and minimizes transformation losses.
- Reduce pressure drop. Start with the outside air damper array and filter bank. Choose components that minimize air resistance, from oversized dampers and filters to cooling coils with a larger cross-section.
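To make the measurement-and-control pillar concrete, here is a minimal sketch of the mixing calculation a control loop might run as outside air cools. It assumes ideal mixing of two air streams and ignores humidity and fan heat; the function name, parameters, and temperatures are illustrative assumptions, not any vendor's control logic.

```python
def recirculation_fraction(outside_c, return_c, supply_setpoint_c):
    """Fraction of heated return air to blend into the supply air stream.

    Assumes ideal mixing:
        supply = fraction * return + (1 - fraction) * outside
    All temperatures are in degrees Celsius (hypothetical sensor readings).
    """
    if return_c <= outside_c:
        # Return air is no warmer than outside air, so blending it in
        # cannot raise the supply temperature.
        return 0.0
    fraction = (supply_setpoint_c - outside_c) / (return_c - outside_c)
    # Dampers cannot go below fully closed or above fully open.
    return min(1.0, max(0.0, fraction))


if __name__ == "__main__":
    # The colder it gets outside, the more return air gets recirculated.
    for outside in (30.0, 20.0, 10.0, 0.0):
        share = recirculation_fraction(outside, return_c=35.0, supply_setpoint_c=25.0)
        print(f"outside {outside:5.1f} C -> recirculate {share:.0%}")
```

A real system would close the loop on a measured supply temperature rather than trust the ideal-mixing estimate, which is why the monitoring system needs to be fast and precise.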
I could talk wattage, air pressure, precision monitoring all day — there’s awesome stuff in the details! — but I want to focus on what could take you from a good PUE to a truly amazing PUE.
As a former chief information officer and current chief of operations, I think we need to talk about this and ask some tough questions. Why are we afraid to allow equipment to fail? Why are we working so hard to avoid breaking a server? Too often we are needlessly cautious when it comes to hardware. Too often we make IT decisions based on fear. These decisions cost us more than we suspect.
IT equipment can handle higher temperatures than the industry currently runs it at.
The industry needs to diversify equipment-level risk. Why should the loss of any single piece of equipment compromise your business?
[Figure: the operating envelope]
Why not move out of the ASHRAE recommended zone and into the allowable range? Running hotter may mean more hardware failures, but the savings you’ll see from energy efficiency will more than cover the cost of replacing hardware.
For instance, a large data center running five-plus megawatts can spend $500,000 or more on energy per month. But let’s say you run the cold aisle up to 80 degrees Fahrenheit (about 27 degrees Celsius), taking your PUE down from 1.3 to 1.2. In this scenario, you will save over $500,000 per year in energy, and break a whole lot less than that in hardware.
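The arithmetic behind that scenario can be sketched in a few lines. The electricity price below ($0.12 per kWh) and the reading of "five-plus megawatts" as a constant 5 MW IT load are assumptions for illustration; neither figure comes from a utility bill, and the result moves in direct proportion to both.

```python
HOURS_PER_YEAR = 8760


def annual_energy_cost(it_load_kw, pue, price_per_kwh):
    """Yearly facility energy cost: IT load scaled up by PUE, times hours and price."""
    return it_load_kw * pue * HOURS_PER_YEAR * price_per_kwh


def annual_savings(it_load_kw, pue_before, pue_after, price_per_kwh):
    """Yearly cost difference from lowering PUE at a constant IT load."""
    return (annual_energy_cost(it_load_kw, pue_before, price_per_kwh)
            - annual_energy_cost(it_load_kw, pue_after, price_per_kwh))


if __name__ == "__main__":
    it_load_kw = 5000      # assumed: 5 MW of IT load, running flat out all year
    price_per_kwh = 0.12   # assumed: hypothetical blended electricity rate in USD

    monthly = annual_energy_cost(it_load_kw, 1.3, price_per_kwh) / 12
    saved = annual_savings(it_load_kw, 1.3, 1.2, price_per_kwh)
    print(f"Monthly energy cost at PUE 1.3: ${monthly:,.0f}")
    print(f"Annual savings from 1.3 to 1.2: ${saved:,.0f}")
```

Under these assumptions the monthly bill at a PUE of 1.3 comes to roughly $569,000, and the drop to 1.2 saves roughly $526,000 a year. At a cheaper rate of $0.10 per kWh the savings would be closer to $438,000, so plug in your own load and tariff before setting a hardware replacement budget against it.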
Netflix is a great example of bravery put into practice. Here’s a company that has challenged its own systems in order to plan for recovery and resilience. They created Chaos Monkey, a software tool that forces their engineers to deal with small failures so that those failures don’t turn into major outages. When Netflix loses a server, it’s no big deal. The software recovers the sessions on other available hardware.
The challenge shouldn’t be to create bullet-proof hardware. The challenge instead is to write applications to be resilient. Managing state in your applications well enough to recover from server failure is not easy. But the rewards in lowered PUE and TCO are well worth the investment.
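As a rough illustration of what "managing state well enough to recover" can look like, here is a toy sketch in which session state lives in a shared store instead of on any one server, so a surviving server can pick up where a failed one left off. Every class and function name here is hypothetical; this is not Netflix's implementation or Chaos Monkey's interface, and the in-memory dictionary stands in for what would be a replicated store in practice.

```python
import random


class SessionStore:
    """Stand-in for an external, replicated session store (hypothetical)."""

    def __init__(self):
        self._data = {}

    def save(self, session_id, state):
        self._data[session_id] = dict(state)

    def load(self, session_id):
        return dict(self._data.get(session_id, {}))


class Server:
    """A stateless worker: everything it knows about a session is in the store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.alive = True

    def handle(self, session_id, seconds_watched):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        state = self.store.load(session_id)
        state["position"] = state.get("position", 0) + seconds_watched
        self.store.save(session_id, state)
        return state["position"]


def route(servers, session_id, seconds_watched):
    """Try each server in turn; any healthy one can continue the session."""
    for server in servers:
        try:
            return server.handle(session_id, seconds_watched), server.name
        except ConnectionError:
            continue
    raise RuntimeError("no healthy servers available")


def kill_random_server(servers, rng=random):
    """Chaos-style fault injection: take one healthy server out of service."""
    victim = rng.choice([s for s in servers if s.alive])
    victim.alive = False
    return victim


if __name__ == "__main__":
    store = SessionStore()
    servers = [Server(name, store) for name in ("a", "b", "c")]
    print(route(servers, "viewer-1", 30))
    print("killed", kill_random_server(servers).name)
    print(route(servers, "viewer-1", 30))  # position carries on past the failure
```

The design choice that matters is that no server holds anything it cannot afford to lose. Once that is true, a failed box is a replacement ticket and not an outage.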
Be brave, embrace (hardware) failure, and focus on recovery. That’s the path to truly awesome PUEs.
As senior vice president of operations at Vantage Data Centers, Chris Yetman has over 18 years of operations, engineering and IT experience in the Internet infrastructure industry. He is responsible for leading operations, security, network and IT for Vantage. Previously, Chris was VP of AWS Infrastructure Operations at Amazon, where he had worldwide responsibility for operations and network for Amazon’s data centers.