
Not All Chinese LLMs Censor
We tested 168 sensitive China-related topics across 10 LLMs. One Chinese model matched GPT-5.2 and Claude. Another rewrote the Tiananmen massacre as state-approved fiction.

Running Kubernetes in production on multiple cloud providers means juggling OpenTofu configurations, Helm charts, and deployment pipelines. Here's how we use Claude Code as an infrastructure copilot with safety guardrails, custom skills, and encoded domain knowledge.
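
One concrete guardrail pattern, sketched below rather than taken from the article's actual setup: Claude Code can run a PreToolUse hook before executing a shell command. The hook receives the pending tool call as JSON on stdin and blocks it by exiting with status 2, with stderr fed back to the model. The blocked-command patterns here are illustrative only.

```python
#!/usr/bin/env python3
"""Illustrative PreToolUse hook: block destructive infra commands.

Claude Code pipes the pending tool call to the hook as JSON on stdin;
exiting with status 2 rejects the call and returns stderr to the model.
The pattern list is an example policy, not a complete one.
"""
import json
import re
import sys

BLOCKED = [
    r"\bkubectl\s+delete\b",        # no deleting live resources
    r"\btofu\s+(apply|destroy)\b",  # OpenTofu changes go through CI
    r"\bhelm\s+(uninstall|rollback)\b",
]

payload = json.load(sys.stdin)
command = payload.get("tool_input", {}).get("command", "")

for pattern in BLOCKED:
    if re.search(pattern, command):
        print(f"Blocked by policy: {pattern}", file=sys.stderr)
        sys.exit(2)  # exit code 2 tells Claude Code to reject the tool call

sys.exit(0)  # anything else is allowed
```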


A complete framework for RAG evaluation covering test set design, targeted criteria for retrieval and generation, experiment analysis, and continuous production monitoring.
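
A minimal sketch of what "targeted criteria" means in practice (the names and data structures below are hypothetical): score retrieval and generation separately, so a failing case points at the component that caused it.

```python
from dataclasses import dataclass

@dataclass
class RagCase:
    question: str
    gold_doc_ids: set[str]   # documents a correct answer must draw on
    must_mention: list[str]  # facts the generated answer has to contain

def retrieval_pass(retrieved_ids: list[str], case: RagCase, k: int = 5) -> bool:
    """Binary retrieval criterion: every gold document appears in the top k."""
    return case.gold_doc_ids <= set(retrieved_ids[:k])

def generation_pass(answer: str, case: RagCase) -> bool:
    """Binary generation criterion: the answer states each required fact."""
    return all(fact.lower() in answer.lower() for fact in case.must_mention)

# Scoring the stages separately tells you *which* component to fix:
# retrieval failures call for better chunking or embeddings,
# generation failures call for prompt or model changes.
case = RagCase(
    question="What is the deductible for plan B?",
    gold_doc_ids={"plan-b-terms"},
    must_mention=["500"],
)
print(retrieval_pass(["plan-b-terms", "faq-12"], case))              # True
print(generation_pass("The deductible for plan B is 500 EUR.", case))  # True
```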

Your evaluations say your AI is perfect. You know it's not. Here's how we used MCP to iterate rapidly and surface real limitations.
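
For context on the MCP angle, here is a minimal Model Context Protocol server built with the official Python SDK's FastMCP helper; the run_eval tool and its return shape are invented for illustration. Exposing evaluation runs as a tool lets an assistant trigger them and dig into failures conversationally.

```python
# Minimal MCP server (pip install "mcp[cli]"); run_eval is a
# hypothetical stand-in for whatever your evaluation backend exposes.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("eval-server")

@mcp.tool()
def run_eval(prompt_version: str) -> dict:
    """Run the evaluation suite against a prompt version and report results."""
    # Placeholder result; a real implementation would call your eval backend.
    return {"prompt_version": prompt_version, "passed": 41, "failed": 7}

if __name__ == "__main__":
    mcp.run()  # serves over stdio so an MCP client can connect
```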

Import your Langfuse datasets directly into elluminate. Turn production traces into structured evaluations - no export scripts or CSV wrangling required.
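
Under the hood, this kind of import boils down to reading dataset items through the Langfuse SDK. A rough sketch, assuming Langfuse credentials in the environment; the plain list of dicts stands in for whatever the importer actually consumes, since elluminate's import API isn't shown here.

```python
# Sketch: pull a Langfuse dataset and flatten it into evaluation cases.
# Assumes LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY are set in the env.
from langfuse import Langfuse

langfuse = Langfuse()
dataset = langfuse.get_dataset("production-traces")

cases = [
    {"input": item.input, "expected": item.expected_output}
    for item in dataset.items
]
print(f"Fetched {len(cases)} cases from Langfuse")
```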

Binary pass/fail evaluations beat Likert scales for LLM and agent evaluation. Here's why, and how to keep nuance without the inconsistency.
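
The core move, sketched below with hypothetical criteria: instead of one 1-5 "quality" rating, split the judgment into several yes/no checks and report the pass rate. Nuance survives as the fraction of checks that pass, while the ambiguity of scale points disappears.

```python
# Sketch: replace a single Likert "quality: 1-5" rating with binary checks.
# Each check is a question a judge (human or LLM) can answer yes/no
# consistently; the criteria themselves are toy examples.

CRITERIA = {
    "answers_the_question": lambda r: "deductible" in r.lower(),
    "cites_a_source": lambda r: "[" in r and "]" in r,
    "no_filler_disclaimers": lambda r: "as an ai" not in r.lower(),
}

def evaluate(response: str) -> dict[str, bool]:
    return {name: check(response) for name, check in CRITERIA.items()}

response = "The deductible is 500 EUR [plan-b-terms]."
results = evaluate(response)
print(results)                               # per-criterion pass/fail
print(sum(results.values()) / len(results))  # graded signal: 1.0
```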

How we built a test set for a German health insurer's AI search—from 50 real user queries to 80 cases, 57 experiments, and a pass rate that climbed from 35% to over 80%.

Learn how to systematically test and improve your AI prompts using elluminate's evaluation platform. Walk through a complete example using pizza toppings to understand prompt templates, collections, criteria, and experiments.
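
In spirit, that workflow looks like the sketch below (the structures are generic stand-ins, not elluminate's API): a prompt template is rendered over a collection of inputs, each response is scored against criteria, and an experiment is the resulting pass rate you compare across template versions.

```python
# Generic sketch of template -> collection -> criteria -> experiment.
# model() and meets_criteria() are toy stand-ins, not a real client.

TEMPLATE = "List three pizza toppings that go well with {ingredient}."
COLLECTION = ["pineapple", "anchovies", "basil"]

def model(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return "1. mozzarella 2. tomato 3. olives"

def meets_criteria(response: str) -> bool:
    # Binary criterion: the answer actually contains three numbered items.
    return all(f"{i}." in response for i in (1, 2, 3))

results = [meets_criteria(model(TEMPLATE.format(ingredient=x))) for x in COLLECTION]
pass_rate = sum(results) / len(results)
print(f"Experiment pass rate: {pass_rate:.0%}")  # compare across template versions
```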