LangChain SDK Track Part 6: LangSmith — Tracing & LLMOps

                        
What You’ll Learn: How do you know if your LangChain application actually works well? This article covers the evaluation toolkit: LangSmith for tracing and debugging, custom evaluators for measuring quality, dataset management for regression testing, and the debugging workflow that turns ‘it’s not working’ into ‘I can see exactly why it failed on line 47.’ Think of LangSmith like Chrome DevTools for LLM applications: you see every step, every decision, and every token.

1. LangSmith Setup

SDK Track Note: This is the LangChain SDK Track — a hands-on companion to Foundation Track Part 15 (Evaluation & LLMOps). Read that article first for observability concepts.

1.1 Configuration

import os

# Required environment variables
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "lsv2_pt_..."  # From smith.langchain.com
os.environ["LANGCHAIN_PROJECT"] = "my-ai-app"     # Project name for grouping traces

# Optional: override the API endpoint for self-hosted deployments
# (the value shown below is the default cloud URL)
# os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
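
A missing or misspelled variable usually shows up as traces silently not appearing, so it can help to check the configuration before the first call. The following is a minimal sketch, not part of the LangSmith SDK: `missing_tracing_config` and `tracing_enabled` are hypothetical helper names, and the variable list simply mirrors the three variables set above.

```python
import os

# Variable names mirror the configuration shown above.
REQUIRED_VARS = ("LANGCHAIN_TRACING_V2", "LANGCHAIN_API_KEY", "LANGCHAIN_PROJECT")


def missing_tracing_config(env=None):
    """Return the required tracing variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def tracing_enabled(env=None):
    """True only when the tracing flag is set to the string 'true'."""
    env = os.environ if env is None else env
    return env.get("LANGCHAIN_TRACING_V2", "").strip().lower() == "true"
```

Calling `missing_tracing_config()` at startup and logging the returned names gives a quick hint about which variable to fix; passing a plain dict instead of relying on `os.environ` keeps the helpers easy to test.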

1.2 Automatic Tracing

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# With LANGCHAIN_TRACING_V2=true, ALL LangChain calls are automatically traced
model = ChatOpenAI(model="gpt-4o-mini")
chain = ChatPromptTemplate.from_template("Explain {topic}") | model | StrOutputParser()

# This invocation is automatically logged to LangSmith
result = chain.invoke({"topic": "vector databases"})
# View trace at: smith.langchain.com → my-ai-app project

1.3 Manual Trace Annotation

from langsmith import traceable
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o-mini")

@traceable(name="my_custom_pipeline", tags=["production", "v2"])
def my_pipeline(question: str) -> str:
    """Custom pipeline with manual tracing."""
    # All LangChain calls inside are nested under this trace
    response = model.invoke(question)
    processed = response.content.upper()  # Custom processing
    return processed

result = my_pipeline("What is LCEL?")

Real-World Application

Systematic Quality Improvement

A customer support chatbot team used LangSmith to identify why 15% of responses were unhelpful. Tracing revealed the issue: their retriever was returning irrelevant chunks for questions about billing (wrong similarity threshold). After tuning the threshold and adding a re-ranker, unhelpful responses dropped to 3%. The evaluation pipeline now runs on every PR.

LangSmith Tracing · Evaluation Pipeline · CI/CD Integration
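
Running an evaluation pipeline on every PR usually ends in a pass/fail decision. The sketch below shows one way such a gate could look, assuming the evaluator scores have already been collected as a list of 0 to 1 floats. The names `gate` and `main` and the 0.9 threshold are illustrative assumptions, not the team's actual setup or a LangSmith API.

```python
from statistics import mean


def gate(scores, threshold=0.9):
    """Return (passed, mean_score) for a list of 0-1 evaluator scores."""
    if not scores:
        # An empty run should fail loudly rather than pass silently.
        return False, 0.0
    avg = mean(scores)
    return avg >= threshold, avg


def main(scores, threshold=0.9):
    """Print a one-line summary and return a process exit code."""
    passed, avg = gate(scores, threshold)
    status = "PASS" if passed else "FAIL"
    print(f"mean score {avg:.2f} (threshold {threshold:.2f}): {status}")
    return 0 if passed else 1
```

In a CI job, the script would end with `sys.exit(main(scores))` so that a non-zero exit code blocks the merge.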

4. Automated Evaluation

4.1 Built-in Evaluators

from langchain_openai import ChatOpenAI
from langsmith import Client
from langsmith.evaluation import evaluate, LangChainStringEvaluator

client = Client()  # Reads the API key from the environment

# Build the model once instead of on every call
model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Define your chain/function to evaluate
def my_app(inputs: dict) -> dict:
    """Target function: takes one example's inputs and returns its outputs."""
    result = model.invoke(inputs["question"]).content
    return {"output": result}

# Run evaluation against a dataset
results = evaluate(
    my_app,
    data="my-qa-dataset",  # Dataset name in LangSmith
    evaluators=[
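        LangChainStringEvaluator("qa"),  # Correctness against the reference answer
        # "helpfulness" and "conciseness" are criteria names, not evaluator types
        LangChainStringEvaluator("criteria", config={"criteria": "helpfulness"}),
        LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),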
    ],
    experiment_prefix="gpt4o-mini-v1"
)

# Aggregate scores per evaluator are shown in the LangSmith UI for this experiment
print(f"Experiment: {results.experiment_name}")

4.2 Custom Evaluators

from langsmith.evaluation import evaluate, EvaluationResult
from langsmith.schemas import Run, Example

def response_length_evaluator(run: Run, example: Example) -> EvaluationResult:
    """Check if response is within acceptable length."""
    output = (run.outputs or {}).get("output", "")  # outputs is None if the run failed
    word_count = len(output.split())
    score = 1.0 if 50 <= word_count <= 200 else 0.0
    return EvaluationResult(
        key="response_length",
        score=score,
        comment=f"Word count: {word_count}"
    )

def contains_citation_evaluator(run: Run, example: Example) -> EvaluationResult:
    """Check if response includes bracketed citations such as [1] (crude heuristic)."""
    output = (run.outputs or {}).get("output", "")  # outputs is None if the run failed
    has_citation = "[" in output and "]" in output
    return EvaluationResult(
        key="has_citations",
        score=1.0 if has_citation else 0.0
    )

# Reuses my_app, the target function defined in section 4.1
results = evaluate(
    my_app,
    data="my-qa-dataset",
    evaluators=[response_length_evaluator, contains_citation_evaluator],
    experiment_prefix="citation-check-v1"
)
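
Before spending API calls on a full experiment, it can be worth sanity-checking evaluator logic locally. The sketch below re-implements the length rule from `response_length_evaluator` as a hypothetical `length_score` function that returns a plain dict, so it runs without `langsmith` installed. The `SimpleNamespace` objects are stand-ins that expose only the `outputs` attribute the evaluator reads; they are not real `Run` objects.

```python
from types import SimpleNamespace


def length_score(run, example=None):
    """Same 50-200 word rule as response_length_evaluator, as a plain dict."""
    output = (run.outputs or {}).get("output", "")
    word_count = len(output.split())
    return {
        "key": "response_length",
        "score": 1.0 if 50 <= word_count <= 200 else 0.0,
        "comment": f"Word count: {word_count}",
    }


# Stand-ins exposing only the attribute the evaluator reads.
short_run = SimpleNamespace(outputs={"output": "too short"})
ok_run = SimpleNamespace(outputs={"output": " ".join(["word"] * 60)})
failed_run = SimpleNamespace(outputs=None)  # simulates a run with no outputs

for run in (short_run, ok_run, failed_run):
    print(length_score(run))
```

Once the rule behaves as expected on edge cases such as empty or missing outputs, the same logic can go back into the `EvaluationResult`-returning version used with `evaluate`.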

Summary & Next Steps

This completes the LangChain SDK implementation for the concepts covered in Part 15: Evaluation & LLMOps.

                        
Try It Yourself: Set up a complete evaluation pipeline: (1) create a test dataset of 15 Q&A pairs with ground truth, (2) build a custom evaluator that scores relevance (0–1), (3) run your RAG chain against the dataset in LangSmith, (4) identify the 3 worst-performing questions, (5) improve your retrieval/prompt and re-run to measure improvement. Document the before/after scores.
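
As a starting point for steps (2) and (4), here is a minimal sketch of a relevance score and a worst-N ranking. It uses token overlap with the ground-truth answer as a crude stand-in for relevance; an LLM-as-judge evaluator would likely be more faithful. The names `relevance_score` and `worst_questions` are hypothetical, and the rows are assumed to be `(question, answer, reference)` tuples collected from your own run.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance_score(answer, reference):
    """Fraction of reference tokens that also appear in the answer (0-1)."""
    ref = _tokens(reference)
    if not ref:
        return 0.0
    return len(ref & _tokens(answer)) / len(ref)


def worst_questions(rows, n=3):
    """Return the n lowest-scoring (score, question) pairs, worst first."""
    scored = [(relevance_score(a, r), q) for q, a, r in rows]
    return sorted(scored)[:n]
```

Recording the output of `worst_questions` before and after a retrieval or prompt change gives the before/after comparison the exercise asks for.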