Open Compute Project Global Summit Presentation

October 17, 2024

Overview

When deploying liquid-cooled infrastructure for AI-enabling accelerated compute, failure is not an option. While data center efficiency has been studied extensively, data center reliability with new liquid cooling technologies remains a gap. Understanding component-level reliability in a liquid cooling system supports robust rack infrastructure and efficient data center operation.

This presentation from the OCP Global Summit draws on reliability engineering theory and describes techniques applied to quick-disconnect components. Predictive techniques, using models derived from empirical data gathered through artificial aging with thermal acceleration factors, will be shared to demonstrate a general base failure rate. As a complement, a physics-of-failure approach will highlight specific failure mechanisms within a cooling loop for a qualitative yet broadly applicable, system-specific analysis. Examples of critical variables within a quick-disconnect, such as materials, environment, and hydraulic and mechanical stresses, will be provided.
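
As a rough illustration of how a thermal acceleration factor can turn accelerated-aging results into a base failure rate, the sketch below assumes the widely used Arrhenius model. The abstract does not say which model the presenters used, so the model choice, the function names, and every number here are hypothetical and chosen only for illustration.

```python
import math

# Boltzmann constant in eV/K.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def arrhenius_acceleration_factor(activation_energy_ev, use_temp_c, stress_temp_c):
    """Acceleration factor between a stress temperature and a use temperature.

    Assumes the Arrhenius model; the activation energy depends on the
    failure mechanism and material and must come from test data.
    """
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)


def base_failure_rate_per_hour(failures, units, test_hours, acceleration_factor):
    """Point estimate of a constant failure rate at use conditions.

    Accelerated test hours are scaled by the acceleration factor to get
    equivalent field hours, then failures are divided by that total.
    """
    equivalent_hours = units * test_hours * acceleration_factor
    return failures / equivalent_hours


if __name__ == "__main__":
    # Hypothetical inputs: 0.7 eV activation energy, 40 C use, 85 C aging.
    af = arrhenius_acceleration_factor(0.7, use_temp_c=40.0, stress_temp_c=85.0)
    rate = base_failure_rate_per_hour(failures=2, units=100, test_hours=1000, acceleration_factor=af)
    print(f"acceleration factor: {af:.1f}")
    print(f"estimated base failure rate: {rate:.3e} failures/hour")
```

A single point estimate like this ignores confidence bounds and assumes one dominant, thermally driven failure mechanism, which is why a physics-of-failure analysis of the specific cooling loop is a useful complement.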