Efficient cooling for high-performance computing

Liquid cooling for high-performance computing

For many years now, predictive AI has been an integral part of many data center control and monitoring systems. It helps data center operators increase energy efficiency and detect impending failures at an early stage. With the success of ChatGPT, generative AI is now also evolving into an international megatrend. Drafts, text, images, videos: nowadays almost anything can be created or modified using generative AI. And with Copilot, Microsoft has even integrated OpenAI's GPT-4 language model, which also forms the basis for ChatGPT, directly into Windows 11, its Microsoft 365 Office products, and its Bing search engine. This development requires data and storage capacities that would have been unthinkable just a few years ago.

In science, medicine, and applications such as autonomous driving, too, the requirements for computing performance are becoming ever more demanding. In addition, servers used for AI applications multiply their performance and their energy requirements with each new generation. This creates huge power densities in the racks and pushes the air cooling systems commonly used in data centers to their limits. Liquid cooling is therefore the obvious choice for cooling servers efficiently and reliably despite this additional load.

The most common method of cooling a data center is to separate it into hot and cold aisles, an approach known as containment. Cold air is blown through the raised floor or directly into the cold aisles. The servers take in the cold air at the front, transfer their heat to it, and blow it out into the hot aisles at the back of the rack. From there, the air is conveyed through ducts to the air conditioning units and cooled once more. Alternatively, server racks can be supplied with cold air by a rack-based cooling system. Here, side coolers supply cold air to the front of the servers and take in the heated air at the back of the rack to cool it again.

However, when air is conveyed through IT equipment, it generally does not reach all components uniformly. This effect is especially pronounced with room air cooling, whereas rack-based side cooling systems such as the Stulz CyberRow carry a much lower risk of hot spot formation. If only air cooling is used, the achievable power density is roughly 50 kW per rack in practice. This figure is more than adequate for many IT applications, but rapidly becomes a limiting factor where high-performance AI systems are concerned.

Liquid cooling: Energy-efficient cooling even at high power densities

If liquid cooling is used, hot and cold aisles are not needed in some cases, because most of the heat transfer takes place in a closed system without an intermediate medium. Additional air cooling is then only required for certain components, such as power supply units, and for the heat load given off by the tank itself if immersion cooling is used. Nevertheless, there needs to be sufficient space between the racks or tanks to allow for maintenance work or equipment replacement. As it requires less room, liquid cooling is also well suited to edge locations with little space and frequently changing ambient temperatures. A liquid can absorb far more heat per unit volume than air, which means that the power density can be increased significantly: with liquid cooling, figures of 120 kW per rack can be achieved without difficulty, and even power densities of 250 kW are not uncommon in industry. In practical use, this places additional demands on the electricity infrastructure and hydraulics. Where waste heat recovery is concerned, liquid cooling has advantages over pure air cooling, because a higher temperature level can be reached, making direct connection to a transfer heat exchanger easier.
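
To make the difference in heat absorption more tangible, the following minimal sketch estimates the volumetric flow needed to carry away a given rack heat load, using the sensible-heat relation Q = density x flow x specific heat x temperature rise. The property values are rounded textbook figures for air and water near room temperature, and the 10 K temperature rise and the function name required_flow_m3_per_s are assumptions chosen purely for illustration; none of them come from a specific product. Dielectric fluids have different properties from water, so the result should be read as an order-of-magnitude comparison only.

```python
# Rounded textbook properties near room temperature (assumed values).
AIR = {"density_kg_m3": 1.2, "cp_j_kg_k": 1005.0}
WATER = {"density_kg_m3": 998.0, "cp_j_kg_k": 4180.0}


def required_flow_m3_per_s(heat_load_w, fluid, delta_t_k):
    """Volumetric flow needed to absorb heat_load_w with a temperature rise of delta_t_k."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    # Heat absorbed per cubic metre of fluid and per kelvin of temperature rise.
    volumetric_heat_capacity = fluid["density_kg_m3"] * fluid["cp_j_kg_k"]
    return heat_load_w / (volumetric_heat_capacity * delta_t_k)


if __name__ == "__main__":
    rack_load_w = 120_000  # 120 kW per rack, the figure quoted in the text
    delta_t_k = 10.0       # assumed temperature rise of the cooling medium
    air = required_flow_m3_per_s(rack_load_w, AIR, delta_t_k)
    water = required_flow_m3_per_s(rack_load_w, WATER, delta_t_k)
    print(f"air:   {air:.2f} m^3/s")
    print(f"water: {water * 1000:.2f} l/s")
    print(f"ratio: {air / water:.0f}x")
```

Under these assumptions, a 120 kW rack needs roughly 10 cubic metres of air per second but only about 3 litres of water per second, which is why the achievable power density rises so sharply with liquid cooling.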

Versions of liquid cooling

Currently, different versions of liquid cooling are available, which differ in design and efficiency. In one version, the parts to be cooled come directly into contact with the cooling liquid (immersion cooling); in the other, the components are equipped with a heat sink, through which the cooling liquid flows (direct-to-chip liquid cooling).

With direct-to-chip liquid cooling, it is somewhat easier to convert air-cooled systems, because there is generally no need to completely replace the servers or racks. In an ideal scenario, the existing servers can simply be equipped with different heat sinks and the racks with a distribution system, to which the pipes of the individual servers can be connected. From here, a pipe is then routed out of the rack and connected to a CDU (coolant distribution unit).

The CDU is then connected via a heat exchanger to the water circuit of the building. The pipes required for this can be routed in the existing raised floor, for example. Direct-to-chip liquid cooling works well with water and does not necessarily rely on relatively expensive dielectric fluid. However, there is a risk of water escaping in the event of a leak. If dielectric fluid is used, on the other hand, leaks have no impact on the operational reliability of the IT system, because the fluid is not electrically conductive.

When immersion cooling is used, the cost of converting air-cooled systems is relatively high. Usually, the existing servers must be replaced by ones that have been specially developed for immersion cooling, and which are then operated in trays or tanks of dielectric fluid. Existing racks can therefore no longer be used. As well as providing very uniform heat dissipation, the liquid also ensures that the motherboards no longer collect dust and therefore no longer need cleaning.

Circulation with or without pumps: 1-phase and 2-phase liquid cooling

Another difference is the way in which the cooling liquid is circulated. In 1-phase liquid cooling, the dielectric fluid is selected so that it stays below its boiling point under the absorbed heat load and always remains liquid. To discharge the heat, the fluid is constantly pumped through an external heat exchanger.

In 2-phase liquid cooling, the fluid continually changes its physical state as a result of the temperature differences. As it absorbs heat, the dielectric fluid reaches its boiling point, at a temperature that depends on its specification, whereupon it evaporates and rises. A condenser in the upper part of the tank is cooled from outside by a water circuit. When the vapor reaches this condenser, it cools, condenses, and runs back down, where it absorbs heat again. The advantage of the 2-phase version is that circulation inside the tank manages entirely without pumps and therefore requires fewer moving parts. On the other hand, the higher global warming potential (GWP) of these fluids has to be taken into consideration.
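
The pump-free circulation can be illustrated with a rough steady-state energy balance: the mass of fluid evaporating per second equals the heat load divided by the fluid's latent heat of vaporization, and the condenser's water circuit has to remove the same power. The sketch below is an illustration only. The latent heat of 100 kJ/kg is a placeholder of a plausible order of magnitude that should be replaced with the datasheet figure for the fluid actually used, the 100 kW tank load and 10 K water temperature rise are assumed, and the function names are hypothetical.

```python
WATER_CP_J_KG_K = 4180.0    # rounded textbook value for liquid water
WATER_DENSITY_KG_L = 0.998  # rounded textbook value near room temperature


def evaporation_rate_kg_per_s(heat_load_w, latent_heat_j_per_kg):
    """Mass of dielectric fluid that must evaporate per second to absorb heat_load_w."""
    if latent_heat_j_per_kg <= 0:
        raise ValueError("latent_heat_j_per_kg must be positive")
    return heat_load_w / latent_heat_j_per_kg


def condenser_water_flow_l_per_s(heat_load_w, water_delta_t_k):
    """Water flow through the condenser needed to remove heat_load_w."""
    if water_delta_t_k <= 0:
        raise ValueError("water_delta_t_k must be positive")
    mass_flow_kg_per_s = heat_load_w / (WATER_CP_J_KG_K * water_delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_L


if __name__ == "__main__":
    tank_load_w = 100_000         # assumed 100 kW tank
    latent_heat_j_per_kg = 100_000  # placeholder, take the real value from the datasheet
    vapor = evaporation_rate_kg_per_s(tank_load_w, latent_heat_j_per_kg)
    water = condenser_water_flow_l_per_s(tank_load_w, water_delta_t_k=10.0)
    print(f"fluid evaporating: {vapor:.2f} kg/s")
    print(f"condenser water:   {water:.2f} l/s")
```

The point of the balance is that the evaporating fluid itself needs no pump, whereas the external water circuit that cools the condenser still has to be sized for the full heat load.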

Conclusion

Rising heat loads per rack demand new methods of data center air conditioning. When power densities exceed 50 kW per rack, there are currently no alternatives to liquid cooling. If the direct-to-chip version is used, existing servers and racks can generally be converted and continue to be used. However, several conversion measures are required in the server rooms, and further components such as CDUs will need to be purchased. If a completely new high-performance data center is being built, immersion cooling should also be included in the plans, and both systems should be compared extensively with one another at the planning stage. Whatever the version, it is important to bear in mind that a certain proportion of air cooling will still need to be used in addition (direct-to-chip: 20–30%, immersion: 5–10%).
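
As a simple planning aid, the residual air-cooling shares quoted above can be turned into a load range per rack. This is a minimal sketch that assumes the percentages apply to the total IT load of the rack; the dictionary AIR_SHARE and the function residual_air_load_kw are hypothetical names introduced for illustration.

```python
# Residual air-cooling shares quoted in the text, as (low, high) fractions.
AIR_SHARE = {
    "direct-to-chip": (0.20, 0.30),
    "immersion": (0.05, 0.10),
}


def residual_air_load_kw(it_load_kw, method):
    """Return the (low, high) heat load in kW that still has to be removed by air."""
    if method not in AIR_SHARE:
        raise ValueError(f"unknown cooling method: {method!r}")
    low, high = AIR_SHARE[method]
    return it_load_kw * low, it_load_kw * high


if __name__ == "__main__":
    rack_load_kw = 120  # example rack load, the liquid-cooling figure quoted earlier
    for method in AIR_SHARE:
        low, high = residual_air_load_kw(rack_load_kw, method)
        print(f"{method}: {low:.0f} to {high:.0f} kW via air")
```

For a 120 kW rack, this gives roughly 24 to 36 kW of residual air load with direct-to-chip cooling and roughly 6 to 12 kW with immersion cooling, which shows that the air-side systems still need to be dimensioned deliberately.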