Google Launches LLM-Evalkit to Bring Order and Metrics to Prompt Engineering
Google has introduced LLM-Evalkit, an open-source framework built on the Vertex AI SDKs and designed to bring structure and measurability to prompt engineering for large language models. This lightweight tool aims to replace fragmented documents and guesswork-driven iterations with a unified, data-driven workflow.
As Michael Santoro points out, anyone who has worked with LLMs understands the frustration—teams experimenting in one console, saving prompts elsewhere, and facing inconsistent measurements. LLM-Evalkit consolidates these efforts into a single, cohesive environment where prompts can be created, tested, versioned, and compared side by side. A shared record of changes lets teams track which modifications actually improve performance, rather than relying on memory or spreadsheets.
The toolkit operates on a simple principle: stop guessing, start measuring. Instead of asking which prompt “feels” better, users define a specific task, assemble a representative dataset, and evaluate outputs using objective metrics. This framework ensures every improvement is quantifiable, transforming intuition into evidence.
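To make that principle concrete, here is a minimal, hypothetical sketch of a measured prompt comparison in Python. The dataset, prompt templates, and exact-match metric are illustrative assumptions, and `generate` is a placeholder for whatever model call a team already uses; this is not LLM-Evalkit's actual interface.

```python
# Hypothetical sketch: compare two prompt variants on a fixed dataset with an
# objective metric instead of judging outputs by feel. `generate` is a
# placeholder for any model call.
from typing import Callable

# A small, representative dataset for the task: (input text, expected answer) pairs.
dataset = [
    ("Paris is the capital of France.", "France"),
    ("Tokyo is the capital of Japan.", "Japan"),
]

prompt_a = "Which country is mentioned? Text: {text}"
prompt_b = "Read the text and reply with only the country name. Text: {text}"

def exact_match_rate(prompt_template: str, generate: Callable[[str], str]) -> float:
    """Score a prompt template by exact-match accuracy over the dataset."""
    hits = 0
    for text, expected in dataset:
        output = generate(prompt_template.format(text=text))
        hits += int(output.strip().lower() == expected.lower())
    return hits / len(dataset)

# Each prompt change now yields a number that can be recorded and compared:
# score_a = exact_match_rate(prompt_a, generate)
# score_b = exact_match_rate(prompt_b, generate)
```

Once every variant has a score against the same dataset, "which prompt is better" becomes a comparison of numbers rather than impressions.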
The approach integrates seamlessly with existing Google Cloud workflows. Built on Vertex AI SDKs and connected to Google’s evaluation tools, LLM-Evalkit establishes a structured feedback loop between experimentation and performance tracking. Teams can run tests, compare outputs, and maintain a single source of truth for all prompt iterations—without switching between multiple environments.
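As an illustration of that feedback loop, the sketch below uses the Vertex AI SDK's Gen AI evaluation module (`vertexai.evaluation.EvalTask`, available under `vertexai.preview.evaluation` in older SDK versions), which is the kind of tooling LLM-Evalkit builds on. The project ID, dataset, prompt templates, model name, and experiment names are assumptions for the example; LLM-Evalkit's own no-code interface wraps this style of workflow rather than requiring users to write it.

```python
# Sketch of the experiment-and-track loop with the Vertex AI Gen AI evaluation
# SDK. Project, dataset, prompts, and experiment names are illustrative.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")  # assumed project

# Representative dataset: columns referenced by the prompt template, plus references.
eval_dataset = pd.DataFrame({
    "context": ["Paris is the capital of France.", "Tokyo is the capital of Japan."],
    "reference": ["France", "Japan"],
})

# One EvalTask = one defined task, dataset, and metric set; runs are logged to
# a named Vertex AI experiment so iterations stay comparable over time.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "rouge_l_sum"],
    experiment="prompt-iterations",
)

model = GenerativeModel("gemini-1.5-flash")  # assumed model choice

# Evaluate two prompt versions against the same dataset and metrics.
for run_name, template in [
    ("v1-baseline", "Which country is mentioned? {context}"),
    ("v2-concise", "Reply with only the country name. {context}"),
]:
    result = eval_task.evaluate(
        model=model,
        prompt_template=template,
        experiment_run_name=run_name,
    )
    print(run_name, result.summary_metrics)
```

Because every run lands in the same experiment, the history of prompt versions and their scores stays in one place instead of being scattered across consoles and spreadsheets.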
Google also designed the framework to be inclusive. Through its no-code interface, LLM-Evalkit makes prompt engineering accessible to a broader range of professionals—from developers and data scientists to product managers and UX writers. By lowering technical barriers, it enables faster iteration and fosters closer collaboration between technical and non-technical team members, making prompt design a truly interdisciplinary effort.
Santoro shared his enthusiasm on LinkedIn:
I'm thrilled to announce a new open-source framework I've been developing—LLM-Evalkit! It's designed to streamline the prompt engineering process for teams working with LLMs on Google Cloud.

The announcement has drawn attention from industry professionals. One user commented on LinkedIn:
This looks great, Michael. We’ve struggled with the lack of a centralized system to track changes in prompts over time—especially during model upgrades. Looking forward to trying this out.

LLM-Evalkit is now available as an open-source project on GitHub, integrated with Vertex AI, and comes with tutorials in the Google Cloud Console. New users can take advantage of Google’s $300 credit to explore its capabilities. With LLM-Evalkit, Google aims to transform prompt engineering from a process of improvisation into a repeatable, transparent practice—one that becomes smarter with every iteration.