What's actually inside a $100 billion AI data center?
OpenAI and Microsoft are reportedly planning to build a $100 billion data center codenamed Stargate. We discuss how this compares to existing data centers and other planned investments in AI-centric infrastructure. The timeline does not seem to leave enough room to design special-purpose networking and other hardware, but it is a massive investment compared to other plans.
We discuss the design problems you have to solve when creating a data center. You need to power it, keep it cool, provide networking, and build in resilience and redundancy for everything. We also discuss how Google data centers differ from the norm, for example by running hotter.
Finally, we discuss the AI chips that could be present in a data center. Primarily, this means Nvidia GPUs or Google TPUs. The choice has implications for the software stack and the ultimate usability of the system.
Why Does OpenAI Need a 'Stargate' Supercomputer? Ft. Perplexity CEO Aravind Srinivas
https://www.youtube.com/watch?v=KXG2f-So9oo
Making AI accessible with Andrej Karpathy and Stephanie Zhan
https://www.youtube.com/watch?v=c3b-JASoPi0
Google AI Infrastructure Supremacy: Systems Matter More Than Microarchitecture
https://www.semianalysis.com/p/google-ai-infrastructure-supremacy
Microsoft & OpenAI consider $100bn, 5GW 'Stargate' AI data center - report
https://www.datacenterdynamics.com/en/news/microsoft-openai-consider-100bn-5gw-stargate-ai-data-center-report/
ELI5 : How's it that just 400 cables under the ocean provides all the internet to entire world and who actually owns and manages these cables
https://www.reddit.com/r/explainlikeimfive/comments/1390m3h/eli5_hows_it_that_just_400_cables_under_the_ocean/
Newmark: US data center power consumption to double by 2030
https://www.datacenterdynamics.com/en/news/us-data-center-power-consumption/
Data Centres and Data Transmission Networks
https://www.iea.org/energy-system/buildings/data-centres-and-data-transmission-networks
Understanding Data Center Costs and How they Compare to the Cloud
https://granulate.io/blog/understanding-data-center-costs-and-how-they-compare-to-the-cloud/
Cost estimate to build and run a data center with 100k AI accelerators - and plenty questions
https://www.reddit.com/r/datacenter/comments/1b5nv1v/cost_estimate_to_build_and_run_a_data_center_with/
Amazon Bets $150 Billion on Data Centers Required for AI Boom
https://www.bloomberg.com/news/articles/2024-03-28/amazon-bets-150-billion-on-data-centers-required-for-ai-boom
The world’s top data centre investors
https://www.fdiintelligence.com/content/data-trends/the-worlds-top-data-centre-investors-82669
#ai #datacenter #openai
0:00 Intro
0:26 Contents
0:33 Part 1: Data center gold rush
0:46 Server racks and data centers
1:28 What about spending $1 billion?
1:47 50,000 AI accelerators
2:12 $7 trillion previous plan
2:34 OpenAI asks for $10 billion
2:50 OpenAI asks for $100 billion
3:13 Codename Stargate, from science fiction show
3:37 100,000 GPU limit
4:03 Amazon invests $148 billion
4:29 Google has significant data center investment
5:10 Do we actually need this much compute?
5:57 Part 2: So you want to build a datacenter
6:46 Design challenge 1: power consumption
7:11 Proportion of global power consumption
8:09 Collapse of carbon credit market
8:41 Data centers use prepurchased renewable electricity
9:27 Design challenge 2: Cooling
9:44 Power for cooling exceeds power for servers
10:06 Google runs hotter data centers
10:40 Design challenge 3: Networking
11:21 Fiber optic cables
12:26 Intra-rack and inter-GPU networks
13:20 Design challenge 4: Resilience and redundancy
14:36 Don't rely on a single data center
15:22 Stargate has a single region design
15:49 Part 3: The hardware secret sauce
16:25 Hardware stacks
16:28 Nvidia has a monopoly due to CUDA
16:59 Nvidia charges very high prices
17:23 Consumer-grade GPU clusters are cheaper
17:51 Google has TPUs as a GPU alternative
18:29 TPU microarchitecture
19:02 TPU network is a 3D torus
19:59 Software stacks
20:13 PyTorch and TensorFlow
20:52 Computational graph representation
21:25 Python reflection to create graph
21:59 XLA compiler from Google
22:26 Leverages LLVM compiler technology
23:04 Example: LLVM also used in web browsers
23:22 Will Stargate use their own chips, network?
24:19 AI chips use a lot more power than usual
24:45 Implications of building Stargate
25:12 Conclusion
26:00 AI-specific hardware suppliers
26:53 Outro
Posted May 22