Google Launches LLM-Evalkit to Bring Order and Metrics to Prompt Engineering
Google has introduced LLM-Evalkit, an open-source framework built on the Vertex AI SDKs and designed to bring structure and measurability to prompt engineering for large language models. This lightweight tool aims to replace fragmented documents and guesswork-driven iterations with a unified, data-driven workflow.
As Michael Santoro points out, anyone who has worked with LLMs understands the frustration—teams experimenting in one console, saving prompts elsewhere, and facing inconsistent measurements. LLM-Evalkit consolidates these efforts into a single, cohesive environment where prompts can be created, tested, versioned, and compared side by side. A shared record of changes lets teams track which modifications actually improve performance, rather than relying on memory or spreadsheets.
The toolkit operates on a simple principle: stop guessing, start measuring. Instead of asking which prompt “feels” better, users define a specific task, assemble a representative dataset, and evaluate outputs using objective metrics. This framework ensures every improvement is quantifiable, transforming intuition into evidence.
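To make that principle concrete, here is a minimal, hypothetical sketch of a measured prompt comparison in Python. The dataset, prompt templates, and exact-match metric are illustrative assumptions, and `generate` is a placeholder for whatever model call a team already uses; this is not LLM-Evalkit's actual interface.

```python
# Hypothetical sketch: compare two prompt variants on a fixed dataset with an
# objective metric instead of judging outputs by feel. `generate` is a
# placeholder for any model call.
from typing import Callable

# A small, representative dataset for the task: (input text, expected answer) pairs.
dataset = [
    ("Paris is the capital of France.", "France"),
    ("Tokyo is the capital of Japan.", "Japan"),
]

prompt_a = "Which country is mentioned? Text: {text}"
prompt_b = "Read the text and reply with only the country name. Text: {text}"

def exact_match_rate(prompt_template: str, generate: Callable[[str], str]) -> float:
    """Score a prompt template by exact-match accuracy over the dataset."""
    hits = 0
    for text, expected in dataset:
        output = generate(prompt_template.format(text=text))
        hits += int(output.strip().lower() == expected.lower())
    return hits / len(dataset)

# Each prompt change now yields a number that can be recorded and compared:
# score_a = exact_match_rate(prompt_a, generate)
# score_b = exact_match_rate(prompt_b, generate)
```

Once every variant has a score against the same dataset, "which prompt is better" becomes a comparison of numbers rather than impressions.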
The approach integrates seamlessly with existing Google Cloud workflows. Built on Vertex AI SDKs and connected to Google’s evaluation tools, LLM-Evalkit establishes a structured feedback loop between experimentation and performance tracking. Teams can run tests, compare outputs, and maintain a single source of truth for all prompt iterations—without switching between multiple environments.
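As an illustration of that feedback loop, the sketch below uses the Vertex AI SDK's Gen AI evaluation module (`vertexai.evaluation.EvalTask`, available under `vertexai.preview.evaluation` in older SDK versions), which is the kind of tooling LLM-Evalkit builds on. The project ID, dataset, prompt templates, model name, and experiment names are assumptions for the example; LLM-Evalkit's own no-code interface wraps this style of workflow rather than requiring users to write it.

```python
# Sketch of the experiment-and-track loop with the Vertex AI Gen AI evaluation
# SDK. Project, dataset, prompts, and experiment names are illustrative.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # assumed project

# Representative dataset: columns referenced by the prompt template, plus references.
eval_dataset = pd.DataFrame({
    "context": ["Paris is the capital of France.", "Tokyo is the capital of Japan."],
    "reference": ["France", "Japan"],
})

# One EvalTask = one defined task, dataset, and metric set; runs are logged to
# a named Vertex AI experiment so iterations stay comparable over time.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "rouge_l_sum"],
    experiment="prompt-iterations",
)

model = GenerativeModel("gemini-1.5-flash")  # assumed model choice

# Evaluate two prompt versions against the same dataset and metrics.
for run_name, template in [
    ("v1-baseline", "Which country is mentioned? {context}"),
    ("v2-concise", "Reply with only the country name. {context}"),
]:
    result = eval_task.evaluate(
        model=model,
        prompt_template=template,
        experiment_run_name=run_name,
    )
    print(run_name, result.summary_metrics)
```

Because every run lands in the same experiment, the history of prompt versions and their scores stays in one place instead of being scattered across consoles and spreadsheets.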
Google also designed the framework to be inclusive. Through its no-code interface, LLM-Evalkit makes prompt engineering accessible to a broader range of professionals—from developers and data scientists to product managers and UX writers. By lowering technical barriers, it enables faster iteration and fosters closer collaboration between technical and non-technical team members, making prompt design a truly interdisciplinary effort.
Santoro shared his enthusiasm on LinkedIn:
I'm thrilled to announce a new open-source framework I've been developing—LLM-Evalkit! It's designed to streamline the prompt engineering process for teams working with LLMs on Google Cloud.

The announcement has drawn attention from industry professionals. One user commented on LinkedIn:
This looks great, Michael. We’ve struggled with the lack of a centralized system to track changes in prompts over time—especially during model upgrades. Looking forward to trying this out.

LLM-Evalkit is now available as an open-source project on GitHub, integrated with Vertex AI, and comes with tutorials in the Google Cloud Console. New users can take advantage of Google’s $300 credit to explore its capabilities. With LLM-Evalkit, Google aims to transform prompt engineering from a process of improvisation into a repeatable, transparent practice—one that becomes smarter with every iteration.