What You’ll Learn: How do you know if your LangChain application actually works well? This article covers the evaluation toolkit: LangSmith for tracing and debugging, custom evaluators for measuring quality, dataset management for regression testing, and the debugging workflow that turns ‘it’s not working’ into ‘I can see exactly why it failed on line 47.’ Think of LangSmith like Chrome DevTools for LLM applications: you see every step, every decision, and every token.
1. LangSmith Setup
1.1 Configuration
import os
# Required environment variables
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..." # From smith.langchain.com
os.environ["LANGCHAIN_PROJECT"] = "my-ai-app" # Project name for grouping traces
# Optional: override the endpoint for a self-hosted instance
# (the URL below is the hosted default)
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
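A misspelled or empty variable is easy to overlook, so it can help to check the configuration once at startup. The following is a minimal sketch, assuming the three variable names set above; the helper `missing_langsmith_vars` is hypothetical and not part of the LangSmith SDK.

```python
import os

# The three variables set in the configuration above.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT")


def missing_langsmith_vars(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_langsmith_vars()` before building any chains and logging the returned names gives an early signal that traces may not show up in the expected project.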
1.2 Automatic Tracing
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# With LANGCHAIN_TRACING_V2=true, ALL LangChain calls are automatically traced
model = ChatOpenAI(model="gpt-4o-mini")
chain = ChatPromptTemplate.from_template("Explain {topic}") | model | StrOutputParser()
# This invocation is automatically logged to LangSmith
result = chain.invoke({"topic": "vector databases"})
# View trace at: smith.langchain.com → my-ai-app project
1.3 Manual Trace Annotation
from langsmith import traceable
from langchain_openai import ChatOpenAI
model = ChatOpenAI(model="gpt-4o-mini")
@traceable(name="my_custom_pipeline", tags=["production", "v2"])
def my_pipeline(question: str) -> str:
    """Custom pipeline with manual tracing."""
    # All LangChain calls inside are nested under this trace
    response = model.invoke(question)
    processed = response.content.upper()  # Custom processing
    return processed
result = my_pipeline("What is LCEL?")
Real-World Application
Systematic Quality Improvement
A customer support chatbot team used LangSmith to identify why 15% of responses were unhelpful. Tracing revealed the issue: their retriever was returning irrelevant chunks for questions about billing (wrong similarity threshold). After tuning the threshold and adding a re-ranker, unhelpful responses dropped to 3%. The evaluation pipeline now runs on every PR.
LangSmith Tracing | Evaluation Pipeline | CI/CD Integration
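Running the evaluation pipeline on every PR usually comes down to a pass/fail gate on the scores. The following is a rough sketch of such a gate, assuming per-example scores between 0 and 1 have already been collected; the function names and the 0.8 threshold are illustrative assumptions, not details from the team described above.

```python
from statistics import mean


def gate(scores, threshold=0.8):
    """Return (passed, mean_score) for a list of per-example scores in [0, 1].

    An empty score list fails the gate, so a broken run cannot pass silently.
    """
    if not scores:
        return False, 0.0
    avg = mean(scores)
    return avg >= threshold, avg


def main(scores, threshold=0.8):
    """Print a one-line summary and return a process exit code (0 = pass)."""
    passed, avg = gate(scores, threshold)
    print(f"mean score {avg:.2f} (threshold {threshold:.2f})")
    return 0 if passed else 1
```

In a CI job, passing the result of `main(scores)` to `sys.exit` makes the build fail whenever the mean score drops below the threshold.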
4. Automated Evaluation
4.1 Built-in Evaluators
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator
client = Client()
# Define your chain/function to evaluate
def my_app(inputs: dict) -> dict:
    from langchain_openai import ChatOpenAI
    model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    result = model.invoke(inputs["question"]).content
    return {"output": result}
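# Run evaluation against a dataset
results = evaluate(
    my_app,
    data="my-qa-dataset",  # Dataset name in LangSmith
    evaluators=[
        LangChainStringEvaluator("qa"),  # Correctness
        # "helpfulness" and "conciseness" are criteria, not evaluator types,
        # so they are passed to the "criteria" evaluator via config
        LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}),
        LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),
    ],
    experiment_prefix="gpt4o-mini-v1"
)
# Assumes pandas is installed and a langsmith release whose results
# object provides to_pandas()
df = results.to_pandas()
print(f"Mean scores:\n{df.mean(numeric_only=True)}")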
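The `evaluate` call above assumes a dataset named `my-qa-dataset` already exists in LangSmith. The following is a sketch of one way to create it from question/answer pairs, assuming the `create_dataset` and `create_examples` methods of `langsmith.Client`. The helper names and the `answer` output key are hypothetical choices for illustration; the `question` input key matches what `my_app` reads.

```python
def to_examples(pairs):
    """Split (question, answer) pairs into parallel input/output lists."""
    inputs = [{"question": q} for q, _ in pairs]
    outputs = [{"answer": a} for _, a in pairs]  # "answer" key is an assumption
    return inputs, outputs


def upload_dataset(client, name, pairs):
    """Create a dataset and add one example per pair.

    `client` is expected to be a langsmith.Client (or any object exposing
    create_dataset and create_examples with the same keyword arguments).
    """
    inputs, outputs = to_examples(pairs)
    dataset = client.create_dataset(dataset_name=name)
    client.create_examples(inputs=inputs, outputs=outputs, dataset_id=dataset.id)
    return dataset


# Usage (needs network access and LANGCHAIN_API_KEY):
#   from langsmith import Client
#   upload_dataset(Client(), "my-qa-dataset",
#                  [("What is LCEL?", "The LangChain Expression Language.")])
```

Keeping the pair-to-example conversion separate from the upload makes it easy to check the dataset contents locally before anything is sent to LangSmith.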
4.2 Custom Evaluators
from langsmith.evaluation import evaluate, EvaluationResult
from langsmith.schemas import Run, Example
def response_length_evaluator(run: Run, example: Example) -> EvaluationResult:
    """Check if response is within acceptable length."""
    # run.outputs can be None if the run failed, so fall back to an empty dict
    output = (run.outputs or {}).get("output", "")
    word_count = len(output.split())
    score = 1.0 if 50 <= word_count <= 200 else 0.0
    return EvaluationResult(
        key="response_length",
        score=score,
        comment=f"Word count: {word_count}"
    )
def contains_citation_evaluator(run: Run, example: Example) -> EvaluationResult:
    """Check if response includes citations."""
    output = (run.outputs or {}).get("output", "")
    # Crude heuristic: any pair of square brackets counts as a citation
    has_citation = "[" in output and "]" in output
    return EvaluationResult(
        key="has_citations",
        score=1.0 if has_citation else 0.0
    )
results = evaluate(
    my_app,
    data="my-qa-dataset",
    evaluators=[response_length_evaluator, contains_citation_evaluator],
    experiment_prefix="citation-check-v1"
)
Summary & Next Steps
This completes the LangChain SDK implementation for the concepts covered in Part 15: Evaluation & LLMOps.
Try It Yourself: Set up a complete evaluation pipeline: (1) create a test dataset of 15 Q&A pairs with ground truth, (2) build a custom evaluator that scores relevance (0–1), (3) run your RAG chain against the dataset in LangSmith, (4) identify the 3 worst-performing questions, (5) improve your retrieval/prompt and re-run to measure improvement. Document the before/after scores.
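Steps (2) and (4) of the exercise can be prototyped locally before involving LangSmith. The following is a rough sketch, assuming relevance is approximated by token overlap with the ground-truth answer, which is a crude stand-in for an LLM-based or embedding-based judge; the function names are hypothetical.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric word tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance_score(prediction, reference):
    """Fraction of reference tokens that also appear in the prediction (0 to 1)."""
    ref = tokens(reference)
    if not ref:
        return 0.0
    return len(ref & tokens(prediction)) / len(ref)


def worst_questions(rows, n=3):
    """Return the n lowest-scoring (score, question) pairs.

    rows: iterable of (question, prediction, reference) tuples.
    """
    scored = [(relevance_score(p, r), q) for q, p, r in rows]
    return sorted(scored)[:n]
```

To use the score inside LangSmith, one option is to wrap `relevance_score` in an evaluator with the `(run, example)` signature from section 4.2 and return an `EvaluationResult` with a key such as `"relevance"`.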
Related Articles
Foundation: Part 15: Evaluation & LLMOps
The framework-agnostic concepts behind this article.
LC Part 5: LangGraph Workflows
Previous article in the LangChain SDK Track.