Google Cloud Upgrades Kubernetes Engine to Meet Large Language Model Demands

2024-11-14

As generative AI models continue to grow, some have reached the two-trillion-parameter level, driving a surge in the compute and storage that large language models demand.

Google Cloud recently announced an upgrade to Google Kubernetes Engine (GKE) to meet the demands of these larger models. GKE now supports clusters of up to 65,000 nodes, more than four times the previous limit of 15,000, giving it the scale and compute to handle some of the world's most complex and resource-intensive AI workloads.

Training models with trillions of parameters requires clusters of more than 10,000 nodes running AI accelerators. Parameters are the variables inside an AI model that control its behavior and predictive ability; increasing their number can improve prediction accuracy. They function like knobs or switches that developers adjust to tune performance and precision.
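To make that scale concrete, here is a rough back-of-the-envelope calculation (not from the article); the bytes-per-parameter and per-accelerator memory figures below are illustrative assumptions:

```python
# Back-of-the-envelope sizing for a 2-trillion-parameter model.
# Assumed figures: 2 bytes per weight (bf16); ~16 bytes of training
# state per parameter (weights + gradients + optimizer state, a common
# rule of thumb for Adam in mixed precision); 80 GB of memory per
# accelerator.

PARAMS = 2 * 10**12          # 2 trillion parameters
WEIGHT_BYTES = 2             # bf16 precision (assumption)
TRAIN_BYTES = 16             # per-parameter training state (assumption)
ACCEL_MEM = 80 * 10**9       # 80 GB HBM per accelerator (assumption)

print(f"Weights alone: {PARAMS * WEIGHT_BYTES / 1e12:.0f} TB")   # ~4 TB
print(f"Training state: {PARAMS * TRAIN_BYTES / 1e12:.0f} TB")   # ~32 TB
print(f"Accelerators to hold state: {PARAMS * TRAIN_BYTES / ACCEL_MEM:.0f}")
```

Memory alone sets a floor of only a few hundred accelerators; the tens of thousands of nodes come from needing enough aggregate compute to finish training in a practical amount of time.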

The Senior Director of Kubernetes and Serverless Products at Google Cloud said that as large language models (LLMs) continue to grow in scale, running them efficiently requires exceptionally large clusters, and those clusters must also be reliable and scalable enough to handle the demands of LLM training workloads.

GKE is Google's managed Kubernetes service, designed to simplify running containerized environments. It can automatically add or remove hardware resources, such as dedicated AI chips or GPUs, as workload demands change, and it handles Kubernetes upgrades and other maintenance tasks.
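As a minimal sketch of what that looks like in practice, the snippet below uses the google-cloud-container Python client to add an autoscaled GPU node pool to an existing cluster; the project, cluster, machine type, and node counts are hypothetical placeholders:

```python
# Minimal sketch: an autoscaled GPU node pool on GKE via the
# google-cloud-container client. All names and sizes are placeholders.
from google.cloud import container_v1

client = container_v1.ClusterManagerClient()

# Parent path of an existing cluster (hypothetical project/cluster).
parent = "projects/my-project/locations/us-central1/clusters/llm-training"

node_pool = container_v1.NodePool(
    name="gpu-pool",
    initial_node_count=0,
    config=container_v1.NodeConfig(
        machine_type="a2-highgpu-8g",  # accelerator-optimized VM (assumed)
        accelerators=[
            container_v1.AcceleratorConfig(
                accelerator_count=8,
                accelerator_type="nvidia-tesla-a100",
            )
        ],
    ),
    # GKE adds or removes nodes within these bounds as demand changes.
    autoscaling=container_v1.NodePoolAutoscaling(
        enabled=True,
        min_node_count=0,
        max_node_count=128,
    ),
)

operation = client.create_node_pool(
    request=container_v1.CreateNodePoolRequest(parent=parent, node_pool=node_pool)
)
print(f"Node pool creation started: {operation.name}")
```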

The new 65,000-node clusters can manage AI models distributed across 250,000 Tensor Processing Units (TPUs), Google's specialized processors for accelerating machine learning and generative AI workloads. That is a fivefold increase in TPU chips per GKE cluster, up from the previous 50,000.

This upgrade significantly improves the reliability and efficiency of running large-scale AI workloads. Scale matters for big training and inference jobs because Kubernetes can route around hardware failures without downtime, and the extra capacity lets more model iterations run in a given time frame, so jobs finish sooner.
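One way that failure handling surfaces to users: a Kubernetes Job automatically replaces failed pods on healthy nodes. Here is a minimal sketch using the official kubernetes Python client, with placeholder names and image:

```python
# Minimal sketch: a Kubernetes Job whose failed pods are recreated
# (up to backoff_limit) on healthy nodes, so a hardware failure does
# not kill the whole run. Names and image are placeholders.
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig pointing at the GKE cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="training-step"),
    spec=client.V1JobSpec(
        backoff_limit=10,  # retry failed pods up to 10 times
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",  # let the Job controller reschedule
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="gcr.io/my-project/trainer:latest",  # placeholder
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```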

To achieve this upgrade, Google Cloud is migrating GKE from the open-source etcd (a distributed key-value store) to a more robust system based on Google's distributed database, Spanner. This will enable GKE clusters to handle nearly limitless scale and offer lower latency.

Google has also made significant improvements to the GKE infrastructure, greatly increasing how quickly it can scale so customers can meet demand faster. A single cluster can now run five jobs at once, each matching the record scale Google Cloud previously achieved when training LLMs.

The upgrade is driven by customers' focus on AI and its rapid, widespread adoption across the industry. Google Cloud customers, including cutting-edge AI model developers such as Anthropic PBC, have been using GKE's clustering capabilities to train their models.

Reportedly, TPU and GPU usage on GKE has grown by 900% over the past year. That growth reflects the rapid advance of AI, which is expected to account for the vast majority of Kubernetes Engine usage in the future.