Power Configuration for AI Computing

The capabilities of artificial intelligence, both now and in the future, are remarkable. But that awe-inspiring functionality comes with an enormous demand for power, so data centers need to be prepared.

Artificial intelligence isn’t known for its power efficiency. The processor density that AI requires is so great that it will mean real changes for data centers, and those changes need to be made with some urgency. Cooling and power requirements will be a particular focus for data centers catering to AI.

Companies, too, need to be on the lookout for data centers that can realistically accommodate their increasing demand. In this article, we’ll explore some typical systems to see how they work and which businesses they suit best. Then we’ll cover what to look for in a data center, so you can choose a facility that’s fully equipped to support growth in AI computing.

Typical systems and how they work

Let’s begin by looking at how a typical GPU system operates. Such systems generally have six plugs and power supplies but can operate on four. They throttle at three, which represents a significant performance reduction, and with fewer than three they are inoperable.
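To make those thresholds concrete, here is a minimal sketch in Python (the function name and state labels are ours, purely for illustration, not any vendor’s specification):

```python
def psu_state(healthy_psus: int) -> str:
    """Rough model of the six-PSU GPU system described above:
    full performance with four or more healthy supplies,
    throttled at exactly three, inoperable below three."""
    if healthy_psus >= 4:
        return "full performance"
    if healthy_psus == 3:
        return "throttled (significant performance reduction)"
    return "inoperable"

for n in (6, 4, 3, 2):
    print(n, "healthy supplies ->", psu_state(n))
# 6 healthy supplies -> full performance
# 4 healthy supplies -> full performance
# 3 healthy supplies -> throttled (significant performance reduction)
# 2 healthy supplies -> inoperable
```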

Option 0 is batch-based processing, used by many seismic/HPC firms. In a planned or unplanned scenario, there might be a few hours of outage, so cooling and asset protection need to be considered. If a power failure would present a real risk to your operation or could damage equipment, this option isn’t recommended.

Option 1 is the traditional 2N arrangement (3+3 power supplies). In the event of a power failure on either side, the system loses three power supplies and throttles significantly. Whether that is acceptable depends on an organization’s specific risk tolerance.
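Using the same assumed thresholds as the sketch above, the Option 1 arithmetic is simple: losing either side of a 3+3 arrangement leaves exactly three supplies, which is the throttling point.

```python
# Option 1 (2N, 3+3): three supplies on the A side, three on the B side.
A_SIDE, B_SIDE = 3, 3

def state(healthy: int) -> str:
    return "full performance" if healthy >= 4 else "throttled" if healthy == 3 else "inoperable"

print("A side fails ->", state(B_SIDE))  # throttled
print("B side fails ->", state(A_SIDE))  # throttled
```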

Option 2 is sold as A/B power: 2N with a point-of-use STS on two power supplies. The proposed architecture is:

  • 2x power supplies on A only
  • 2x power supplies on B only
  • 2x power supplies on A/B via a point-of-use STS (independent power bank sub-breaker on the PDU)

In the event of an A-side failure, normal operations continue, and the same is true for a B-side or STS failure. The lone risk is a theoretical major hard fault on a PSU propagating upstream far enough to knock both sides offline via the STS. Given the small size of this STS relative to the proposed circuit (3~), as well as the sub-breaker on the PDU, the level of risk here is low, but not completely insignificant.
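Here is a minimal sketch of the Option 2 failure modes, assuming the same four-supply threshold as above (the dictionary layout and function names are illustrative, not a vendor specification):

```python
# Option 2: 2x PSUs on A only, 2x on B only, 2x on A/B via point-of-use STS.
# Each PSU is listed with the set of sources that can keep it energized.
OPTION_2 = {
    "PS1": {"A"}, "PS2": {"A"},
    "PS3": {"B"}, "PS4": {"B"},
    "PS5": {"A", "B"}, "PS6": {"A", "B"},  # fed through the STS
}

def surviving_psus(psu_feeds, failed):
    """Count PSUs that still have at least one live source.
    Losing "STS" strands the PSUs that rely on the transfer switch."""
    alive = 0
    for psu, feeds in psu_feeds.items():
        if len(feeds) > 1 and "STS" in failed:
            continue  # dual-fed PSUs here depend on the STS
        if feeds - failed:
            alive += 1
    return alive

for scenario in ({"A"}, {"B"}, {"STS"}):
    n = surviving_psus(OPTION_2, scenario)
    state = "full performance" if n >= 4 else "throttled" if n == 3 else "inoperable"
    print(scenario, "->", n, "supplies,", state)
# {'A'} -> 4 supplies, full performance
# {'B'} -> 4 supplies, full performance
# {'STS'} -> 4 supplies, full performance
```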

Option 3 is a 4N/3 feed (only available in data centers with 4N/3 capability). The proposed architecture is:

  • 4x power feeds to the rack, with A/B sized to carry 66% of the load and C/D sized for 33%
  • PS1 – A
  • PS2 – A
  • PS3 – B
  • PS4 – B 
  • PS5 – C
  • PS6 – D

In this setup there’s no STS, so the risk of power failure is theoretically lower. It is more complicated, though: four whips to the rack means the cabling must match the upstream breakering, so the initial setup is more involved.

Option 3 is the “purest” approach, although point-of-use STS units are widely used and completely acceptable. It is entirely dependent on the availability of 4x power feeds.
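And a similar sketch for the Option 3 feed mapping, again assuming the four-supply threshold (names are illustrative):

```python
# Option 3 (4N/3): four feeds to the rack, A/B sized for 66% of the load
# and C/D for 33%. PSU-to-feed mapping as proposed above.
OPTION_3 = {
    "PS1": "A", "PS2": "A",
    "PS3": "B", "PS4": "B",
    "PS5": "C", "PS6": "D",
}

def surviving_after_feed_loss(mapping, failed_feed):
    """Count PSUs whose feed is still live after a single feed fails."""
    return sum(1 for feed in mapping.values() if feed != failed_feed)

for feed in "ABCD":
    n = surviving_after_feed_loss(OPTION_3, feed)
    state = "full performance" if n >= 4 else "throttled" if n == 3 else "inoperable"
    print("loss of feed", feed, "->", n, "supplies,", state)
# loss of feed A -> 4 supplies, full performance
# loss of feed B -> 4 supplies, full performance
# loss of feed C -> 5 supplies, full performance
# loss of feed D -> 5 supplies, full performance
```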

What to look for from your data center

When interviewing data centers, look for a facility that’s familiar with the configurations we’ve touched on, as well as with high-density racks. Review the following tips before you approach your shortlisted data centers so you know what to look for.

Spreading the load
Consider spreading the load across 12-24 kW racks. Denser is not always better: there is no unit-economic advantage to packing more power into fewer racks.
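As a hypothetical worked example (the 1 MW figure is ours, chosen only for illustration), the same IT load simply spreads across more racks at lower densities; the total power you contract for doesn’t change:

```python
# Hypothetical 1 MW IT load spread across racks of different densities.
it_load_kw = 1000
for rack_kw in (12, 24, 40):
    racks = -(-it_load_kw // rack_kw)  # ceiling division
    print(f"{rack_kw} kW racks: {racks} racks needed")
# 12 kW racks: 84 racks needed
# 24 kW racks: 42 racks needed
# 40 kW racks: 25 racks needed
```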

Specialist partners
Most AI loads are extremely high density relative to typical enterprise computing. You want to be working with specialists, or at the very minimum be a good neighbor to the tenants around you.

Certainty in contracts
Look for contracts that offer certainty and reasonable expectations, and for a facility that’s willing to work with you on how your load fits alongside the deployments around it.

Plenum sizes and cooling flexibility
Pay attention to plenum sizes and cooling flexibility. Many legacy data centers have forced-air return with small retrofit plenums, which means they may not be able to handle spot-density requirements the way a spec-built facility can.

Questions to ask
Don’t be afraid to ask questions. Find out what the flex capacity is from cooling, and enquire about spot density. See if your chosen data center can show you 25 kW racks.

Cold air aisle temperatures
Check the temperatures of cold air aisles, and think about how you can meet ASHRAE guidelines.

Containment
Blanking and containment are important, and both are achievable.

A word of advice
Make sure your chosen operator is comfortable with your project and business. It’s the wild west right now, with companies of all sizes making a whole range of power requests. Think about the terms you’re asking for relative to the size of your deployment.

If you’re asking for more than 100-200 kW of power, you should be prepared for scrutiny of your requirements and contract terms. So, take the time to ensure you’re clear on your own needs before you start reaching out to data centers and requesting “10 MW.” And don’t underestimate hot-air containment and traditional air cooling, which can handle racks up to 30-40 kW.

Vendor Solution Options Explained

We requested a short summary of vendor solution options for deploying liquid-cooled ITE within the existing air-cooled data center infrastructure.

The existing facilities and planned new facilities utilize DX cooling systems with no chilled water, so the emphasis is on systems that will reject the server heat to the room air to be removed by the room CRACs. 

Because liquid cooling topologies vary in their suitability for DX-cooled data centers, we have focused the list on single-phase cold-plate solutions and chassis immersion systems that use liquid-to-air heat rejection.

SINGLE PHASE COLD PLATE: These solutions couple either an outdoor dry-cooler or an indoor liquid-to-air heat exchanger to remove processor heat via a dedicated closed cooling loop and reject the heat to the data hall or outside air. 

For stability and performance of the ITE, indoor heat exchangers provide more predictable performance, enabling easier compliance with TRG’s SLAs.

Servers are typically modified by a third-party integrator, either recommended or engaged by the liquid cooling solution manufacturer. 

In our experience it is never the datacenter operator’s responsibility to provide or install any of the hardware inside the server chassis. Typical options are as follows:

  • jetcool SmartPlate System: Cold plate solution which rejects processor heat to air at the server chassis level. Dell servers are available with OEM installation of this cold plate solution. 
  • Ingrasys AALC (Air Assisted Liquid Cooling): Sidecar server rack with self-contained cold plate coolant loop, rejects heat to data center air. Designed around OCP ORv3 rack and server specification. The OCP ORv3 RPU spec is still fairly new, but other manufacturers are developing this product line and, in accordance with OCP requirements, it will be interoperable with OCP servers and racks. 
  • CoolIT DLC RDHx: Rear-door heat exchanger which, rather than rejecting server exhaust heat to a CHW system, rejects cold-plate coolant heat to the RDHx and then to the room air.

SINGLE PHASE CHASSIS IMMERSION: These solutions couple an outdoor dry-cooler to remove processor heat via a dedicated closed cooling loop and reject the heat to the outside air. Solutions are generally able to run with cooling water as warm as 113 deg F without ITE performance loss, making true dry cooling a viable option in Houston.

Bear in mind that it would be prudent to build SLA allowances for degraded ITE performance when the ambient air temperature exceeds the ASHRAE 20-year extreme.

As with the cold-plate options, servers are typically modified by a third-party integrator, either recommended or engaged by the liquid cooling solution manufacturer, and it is not the datacenter operator’s responsibility to provide or install any of the hardware inside the server chassis. Options are as follows:

  • Iceotope Chassis Immersion: Sealed server chassis with self-contained dielectric cooling loop. 3D printed fluid distribution within the server installed by the integrator. Heat removed from server chassis with liquid loop connected to dry coolers. 
  • Liquid Cooled Solutions: A direct competitor to Iceotope; the supporting infrastructure is the same.

The infrastructure provided at TRG Datacenters supports practical solutions for high-density deployments. The options above are by no means exhaustive, but they should give you a good idea of the manufacturers currently offering these solutions.

Remember, chassis immersion with dry cooling, cold plate to air at the chassis level, and cold plate to air at the rack level (via rear door or sidecar) are all feasible solutions for high-density deployments within our infrastructure.

We’ll continue to keep you updated on new solutions as and when they become available, but if you have any questions on this topic or you’d like more information, make sure you contact our team.