Company Profile: NeuralEdge Inc.
For this capstone, we'll design the North Star Architecture for NeuralEdge Inc. — a Series C AI-first company that builds enterprise productivity tools powered by foundation models, multi-agent systems, and real-time learning.
| Attribute | Details |
|---|---|
| Industry | Enterprise AI SaaS (productivity, automation, analytics) |
| Stage | Series C, 350 employees, $80M ARR |
| Products | AI writing assistant, code copilot, analytics agent, workflow automation |
| Users | 200K enterprise seats across 400 companies |
| Current State | Monolithic Python/Flask app; single PostgreSQL DB; manual model deployment |
| Pain Points | 2-week model deploy cycle, no feature reuse, scaling bottlenecks, no agent framework |
Business Objectives
- Ship new AI features weekly (currently: monthly)
- Enable multi-agent workflows for complex enterprise tasks
- Reduce inference costs by 40% via model routing and caching
- Support 10x user growth without proportional infrastructure cost
- Achieve SOC2 + enterprise compliance for large customer deals
AI-First Architectural Principles
- Inference-Native — Every service has ML inference as a first-class output, not a bolted-on feature
- Data Flywheel — Every user interaction produces training signal; systems improve with usage
- Agent-Orchestrated — Complex workflows are AI-agent-driven, not hard-coded pipelines
- Model-Agnostic — Architecture supports any model (proprietary, open-source, fine-tuned) behind unified interfaces
- Composable Intelligence — AI capabilities are building blocks; products assemble them differently
- Observability-First — Every inference, decision, and agent step is traced, scored, and auditable
Target State Architecture
flowchart TB
subgraph Experience["🌐 Product Layer"]
direction LR
P1[Writing Assistant]
P2[Code Copilot]
P3[Analytics Agent]
P4[Workflow Automation]
end
subgraph Agents["🤖 Agent Orchestration Layer"]
direction LR
A1[Agent Router]
A2[Tool Registry]
A3[Memory Store]
A4[Safety Guard]
end
subgraph ML["🧠 ML Platform"]
direction LR
M1[Model Registry]
M2[Inference Gateway]
M3[Feature Store]
M4[Fine-Tune Pipeline]
end
subgraph Data["📊 Data Platform"]
direction LR
D1[Event Stream]
D2[Interaction Lake]
D3[Feedback Loop]
D4[Eval Pipeline]
end
subgraph Infra["☁️ Infrastructure"]
direction LR
I1[GPU Cluster]
I2[K8s + Autoscale]
I3[Edge Cache]
I4[Observability]
end
Experience --> Agents
Agents --> ML
ML --> Data
Data --> Infra
style Experience fill:#e8f4f4,stroke:#3B9797
style Agents fill:#f0f4f8,stroke:#16476A
style ML fill:#e8f4f4,stroke:#3B9797
style Data fill:#f0f4f8,stroke:#16476A
style Infra fill:#e8f4f4,stroke:#3B9797
Platform Layer Details
ML Platform
The ML Platform is the core differentiator — it makes model development, deployment, and monitoring a self-service experience for product teams. In a traditional company, deploying a model requires a handoff from data scientists to ML engineers to DevOps. In an AI-first architecture, the platform automates this entire chain.
Model Registry & Versioning
Every model artifact — from experimental notebooks to production-ready weights — lives in a centralized registry with full lineage tracking. Teams can trace any production prediction back to the exact training data, hyperparameters, and code commit that produced it.
- Metadata tracking: Training dataset hash, evaluation metrics, hardware used, training duration, and cost
- Promotion stages: Experimental → Staging → Shadow (receives mirrored production traffic; responses are discarded) → Canary (5% traffic) → Production
- Rollback capability: Any production model can be rolled back to the previous version in under 60 seconds
- A/B comparison: Built-in experiment framework compares model versions on live traffic with statistical significance testing
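The promotion stages above can be read as a one-way ladder. The following is a minimal sketch, assuming a hypothetical `gates_passed` flag that stands in for whatever evaluation the registry runs at each stage; the names `Stage` and `next_stage` are illustrative and not part of MLflow.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Promotion stages in the order listed above."""
    EXPERIMENTAL = 0
    STAGING = 1
    SHADOW = 2
    CANARY = 3
    PRODUCTION = 4


def next_stage(current: Stage, gates_passed: bool) -> Stage:
    """Advance exactly one stage when the (hypothetical) gates pass; never skip a stage."""
    if not gates_passed or current is Stage.PRODUCTION:
        return current
    return Stage(current + 1)
```

Moving one stage at a time means a model cannot reach Production without first spending time in Shadow and Canary.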
Inference Gateway
The inference gateway is a unified API layer that abstracts model complexity from product teams. Instead of calling specific model endpoints, products call a capability endpoint (e.g., /v1/summarize or /v1/classify), and the gateway routes to the optimal model based on cost, quality, and latency constraints.
This enables several critical capabilities:
- Cost optimization: Route simple queries to smaller, cheaper models; escalate complex queries to frontier models
- Graceful degradation: If GPT-4o is down, automatically fall back to Claude Sonnet, then to self-hosted models
- Semantic caching: Cache responses for semantically similar queries (embeddings within a cosine-distance threshold of a cached query) to avoid redundant inference
- Rate limiting & quotas: Per-customer, per-product, and per-model usage tracking with configurable guardrails
| Component | Purpose | Technology |
|---|---|---|
| Model Registry | Version, track, promote models | MLflow + custom metadata |
| Inference Gateway | Unified API; routes to best model per request | Custom router + vLLM / TGI |
| Feature Store | Real-time + batch features for model input | Feast + Redis + DeltaLake |
| Fine-Tune Pipeline | Continuous improvement from user feedback | Ray Train + LoRA adapters |
| Eval Pipeline | Automated quality gates before promotion | Custom evals + human-in-loop |
Inference Gateway Configuration
{
"inference_gateway": {
"routing_strategy": "cost_quality_latency_optimize",
"models": [
{ "id": "gpt-4o", "provider": "openai", "cost_per_1k": 0.005, "quality_score": 0.95 },
{ "id": "claude-sonnet", "provider": "anthropic", "cost_per_1k": 0.003, "quality_score": 0.93 },
{ "id": "neuraledge-v3", "provider": "self-hosted", "cost_per_1k": 0.001, "quality_score": 0.88 }
],
"fallback_chain": ["gpt-4o", "claude-sonnet", "neuraledge-v3"],
"cache": { "semantic_cache": true, "ttl_seconds": 3600 },
"routing_rules": [
{ "condition": "token_count < 200 AND complexity_score < 0.4", "route_to": "neuraledge-v3" },
{ "condition": "requires_reasoning OR token_count > 2000", "route_to": "gpt-4o" },
{ "condition": "default", "route_to": "claude-sonnet" }
]
}
}
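One way to read the `routing_rules` above is as an ordered list where the first matching rule wins. The sketch below hard-codes the three rules from the configuration for clarity; the `Request` type and both function names are assumptions, and the fallback order is passed in as a parameter rather than read from the config.

```python
from dataclasses import dataclass


@dataclass
class Request:
    token_count: int
    complexity_score: float
    requires_reasoning: bool = False


def route(req: Request) -> str:
    """Evaluate the routing rules in order; the first match wins."""
    if req.token_count < 200 and req.complexity_score < 0.4:
        return "neuraledge-v3"
    if req.requires_reasoning or req.token_count > 2000:
        return "gpt-4o"
    return "claude-sonnet"  # the "default" rule


def with_fallback(primary: str, chain: list[str], healthy: set[str]) -> str:
    """Return the primary model if healthy, else the first healthy model in the chain."""
    for model_id in [primary, *chain]:
        if model_id in healthy:
            return model_id
    raise RuntimeError("no healthy model available")
```

Because the rules are checked top to bottom, a short request that also sets `requires_reasoning` still goes to the self-hosted model if its complexity score is below 0.4; whether that is the intended behavior is a design decision the configuration leaves open.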
Feature Store Architecture
The feature store bridges the gap between raw data and model-ready features. It provides two access patterns: batch features (computed hourly/daily for training) and real-time features (computed per-request for inference). Without a feature store, every team recomputes the same features independently — leading to training/serving skew and duplicated compute.
Key feature categories for NeuralEdge:
- User behavior features: Session duration, interaction frequency, acceptance rate history, preferred output length
- Document context features: Document type, language, domain classification, readability score, entity density
- Model performance features: Per-user quality scores, latency percentiles, error rates, cost per interaction
- Temporal features: Time-of-day patterns, weekly usage trends, seasonal demand variations
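The skew argument is easier to see in code. In this sketch, one feature definition is used to produce both the training value and the value served at inference time. It is a toy stand-in rather than Feast's API: `InMemoryFeatureStore`, `materialize`, and `read_online` are hypothetical names, and the action label `"accepted"` is an assumed event value.

```python
def accept_rate(actions: list[str]) -> float:
    """Share of interactions the user accepted; one definition for both paths."""
    if not actions:
        return 0.0
    return sum(a == "accepted" for a in actions) / len(actions)


class InMemoryFeatureStore:
    """Stand-in for the online store (Redis in the table above)."""

    def __init__(self) -> None:
        self._online: dict[str, dict[str, float]] = {}

    def materialize(self, user_id: str, actions: list[str]) -> dict[str, float]:
        """Batch path: compute features, publish them for serving, return them for training."""
        features = {"user_accept_rate_7d": accept_rate(actions)}
        self._online[user_id] = features
        return dict(features)

    def read_online(self, user_id: str) -> dict[str, float]:
        """Serving path: look up precomputed values instead of recomputing them."""
        return dict(self._online.get(user_id, {}))
```

Since training and serving both read the output of `accept_rate`, the two paths cannot drift apart the way independently written computations can.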
Data Platform
In an AI-first company, the data platform exists primarily to feed the learning flywheel. Unlike traditional analytics-focused data warehouses, NeuralEdge's data platform is optimized for ML consumption — producing clean, labeled, feature-rich datasets that continuously improve model quality.
Event Streaming & Interaction Capture
Every user interaction is captured as a structured event and published to Kafka within milliseconds. This includes not just explicit actions (clicks, submissions) but implicit signals that reveal quality:
- Acceptance signals: User accepts AI suggestion as-is (strong positive signal)
- Edit signals: User modifies AI output before using it (partial positive — the diff becomes training data)
- Rejection signals: User dismisses suggestion or regenerates (negative signal)
- Latency signals: Time between suggestion appearing and user acting (correlates with quality)
- Context signals: What the user was doing before/after the AI interaction (enriches training pairs)
These signals feed a seven-step data flywheel:
- Capture: Every user interaction → Kafka event stream (p99 latency <50ms)
- Store: Raw events → Interaction Lake (Apache Iceberg on S3, partitioned by date and product)
- Label: Implicit signals (accepted/rejected, edits, time-to-accept) → training labels via automated labeling pipeline
- Curate: Deduplication, PII removal, quality filtering, diversity sampling → clean training dataset
- Train: Continuous fine-tuning on latest interaction data (daily LoRA adapters, weekly full fine-tunes)
- Deploy: Promote improved model via automated eval gates (must exceed incumbent on held-out test set)
- Measure: A/B test new model vs incumbent on live traffic → statistical significance before full rollout
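The statistical significance check in the final step could take many forms. One common choice, shown here purely as an assumption about how it might be done, is a two-sided two-proportion z-test on acceptance rates; the function name and the counts in the usage note are illustrative.

```python
import math


def two_proportion_p_value(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two acceptance rates."""
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("rates are degenerate (all successes or all failures)")
    z = (success_b / n_b - success_a / n_a) / std_err
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))
```

For example, an incumbent with 700 accepts out of 1,000 suggestions against a challenger with 760 out of 1,000 gives a p-value below 0.05, so a gate using that threshold would let the rollout proceed.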
Interaction Lake Schema
The Interaction Lake stores every AI interaction in a schema designed for ML training. Each record captures the full context needed to reproduce and improve the interaction:
{
"interaction_id": "uuid-v7",
"timestamp": "2026-04-30T14:23:17.442Z",
"user_id": "usr_hashed_abc123",
"product": "writing_assistant",
"context": {
"document_type": "email",
"preceding_text": "...(last 500 chars)...",
"cursor_position": 1247,
"session_interactions_count": 8
},
"model_input": {
"prompt_tokens": 342,
"system_prompt_version": "wa-v3.2",
"features_snapshot": { "user_accept_rate_7d": 0.73, "doc_readability": 8.2 }
},
"model_output": {
"model_id": "neuraledge-v3",
"completion_tokens": 89,
"latency_ms": 312,
"output_text": "...(generated text)..."
},
"outcome": {
"action": "accepted_with_edit",
"edit_distance": 12,
"time_to_action_ms": 2340,
"final_text": "...(what user actually used)..."
}
}
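A record like the one above can be turned into a training label from its `outcome` block alone. The sketch below is one possible scheme, not the article's pipeline: the action values `"accepted"` and `"rejected"` are assumed (only `"accepted_with_edit"` appears in the schema), and scaling the label by edit distance relative to output length is an illustrative heuristic.

```python
def label_interaction(record: dict) -> float:
    """Map an interaction outcome to a label in [0, 1]."""
    outcome = record["outcome"]
    action = outcome["action"]
    if action == "accepted":  # assumed value: used as-is
        return 1.0
    if action == "rejected":  # assumed value: dismissed or regenerated
        return 0.0
    if action == "accepted_with_edit":
        # Heavier edits relative to the output length give a weaker positive.
        output_len = max(len(record["model_output"]["output_text"]), 1)
        return max(0.0, 1.0 - outcome["edit_distance"] / output_len)
    raise ValueError(f"unknown action: {action!r}")
```

With an `edit_distance` of 12 on a 48-character output, the label comes out at 0.75: a positive example, but weaker than an unedited acceptance.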
Privacy & Compliance Layer
Enterprise customers require strict data handling. The data platform includes built-in privacy controls:
- Data residency: Per-customer configuration determines which region stores their interaction data
- Retention policies: Automatic deletion after configurable period (default 90 days for raw, 1 year for aggregated)
- PII detection: Automated scanning removes personally identifiable information before training use
- Opt-out controls: Customers can opt out of data use for model improvement while still using the product
- Audit trail: Complete lineage from training data → model → prediction for compliance audits
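Two of these controls, opt-out and raw-data retention, reduce to a filter applied before any record reaches the training set. This is a minimal sketch under stated assumptions: the `org_id` field is hypothetical (the schema shown earlier has no organization field), and the 90-day window is the default mentioned above.

```python
from datetime import datetime, timedelta, timezone

RAW_RETENTION = timedelta(days=90)  # default raw-data retention from the list above


def eligible_for_training(record: dict, opted_out_orgs: set[str], now: datetime) -> bool:
    """True if the record may be used for model improvement."""
    if record["org_id"] in opted_out_orgs:  # hypothetical field
        return False
    # Normalize a trailing "Z" so fromisoformat also works before Python 3.11.
    captured = datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00"))
    return now - captured <= RAW_RETENTION
```

Running the filter at read time is a safety net; the automatic deletion described above would still need its own scheduled job.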
Agent Orchestration Layer
The agent layer is what makes NeuralEdge's products "intelligent" — instead of hard-coded workflows, AI agents dynamically compose tools to solve user problems. This is the key architectural distinction between an AI-feature company (adds ML to existing flows) and an AI-first company (agents are the flows).
Agent Router
The agent router classifies incoming requests by complexity and routes them to the appropriate execution path:
- Single-shot requests: Simple completions, classifications, or lookups that need one model call (70% of traffic, <500ms latency target)
- Multi-step requests: Complex tasks requiring tool use, reasoning chains, or multiple model calls (25% of traffic, <10s latency target)
- Agentic workflows: Long-running tasks spanning minutes/hours — research, report generation, multi-system orchestration (5% of traffic, async with progress updates)
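The three execution paths suggest a simple classifier at the front of the router. The sketch below assumes the router already has rough estimates of tool needs, step count, and duration; those inputs, the `Path` enum, and the use of the 10-second latency target as the cut-off for asynchronous handling are all assumptions for illustration.

```python
from enum import Enum


class Path(Enum):
    SINGLE_SHOT = "single_shot"
    MULTI_STEP = "multi_step"
    AGENTIC = "agentic"


def classify(needs_tools: bool, estimated_steps: int, estimated_seconds: float) -> Path:
    """Pick an execution path from coarse (hypothetical) request estimates."""
    if estimated_seconds > 10:  # beyond the multi-step latency target: run async
        return Path.AGENTIC
    if needs_tools or estimated_steps > 1:
        return Path.MULTI_STEP
    return Path.SINGLE_SHOT
```

In practice the estimates would probably come from a small classification model rather than fixed inputs, but the routing decision itself can stay this simple.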
Tool Registry & Safety
Agents can only use tools that are registered, versioned, and sandboxed. Each tool has a capability description (used by the agent to decide when to invoke it), input/output schemas, rate limits, and permission scopes:
{
"tool_registry": {
"tools": [
{
"id": "web_search",
"description": "Search the web for current information",
"permissions": ["read_external"],
"rate_limit": "10 calls/minute/user",
"sandbox": "network_isolated_container"
},
{
"id": "code_execution",
"description": "Execute Python code in a sandboxed environment",
"permissions": ["compute_limited"],
"rate_limit": "5 calls/minute/user",
"sandbox": "firecracker_microvm",
"resource_limits": { "cpu": "0.5 cores", "memory": "512MB", "timeout": "30s" }
},
{
"id": "database_query",
"description": "Query customer's connected data sources",
"permissions": ["read_customer_data"],
"rate_limit": "20 calls/minute/user",
"sandbox": "row_level_security_enforced"
}
],
"safety_guard": {
"pre_execution": ["pii_detection", "prompt_injection_scan", "scope_validation"],
"post_execution": ["output_filtering", "hallucination_check", "toxicity_scan"]
}
}
}
Memory Architecture
Agents maintain context across interactions through a three-tier memory system:
- Working memory: Current conversation context (held for the duration of the session, then discarded)
- Episodic memory: Past interactions with this user/document (stored in vector DB, retrieved by similarity)
- Semantic memory: Organizational knowledge — company style guides, product docs, domain terminology (shared across users in same org)
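The three tiers differ mainly in lifetime and lookup style, which a small sketch can make concrete. This version keeps everything in process memory and assumes embeddings are supplied by the caller; `AgentMemory` and its methods are hypothetical names, and a real deployment would back the episodic tier with a vector database.

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class AgentMemory:
    def __init__(self, org_knowledge: dict[str, str]) -> None:
        self.working: list[str] = []  # current conversation only
        self.episodic: list[tuple[list[float], str]] = []  # (embedding, text)
        self.semantic = org_knowledge  # shared across users in the same org

    def remember(self, embedding: list[float], text: str) -> None:
        """Persist a past interaction for later similarity lookup."""
        self.episodic.append((list(embedding), text))

    def recall(self, query: list[float], k: int = 1) -> list[str]:
        """Return the k past interactions most similar to the query embedding."""
        ranked = sorted(self.episodic, key=lambda e: _cosine(query, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

    def end_session(self) -> None:
        """Working memory is discarded; the other tiers persist."""
        self.working.clear()
```

Only `end_session` removes anything, which mirrors the point that episodic and semantic memory outlive a single conversation.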
flowchart LR
U[User Request] --> R[Agent Router]
R --> |Simple| S[Single-Shot Agent]
R --> |Complex| M[Multi-Step Agent]
M --> T1[Tool: Search]
M --> T2[Tool: Code Exec]
M --> T3[Tool: API Call]
M --> T4[Tool: Data Query]
T1 --> Mem[Memory Store]
T2 --> Mem
T3 --> Mem
T4 --> Mem
Mem --> Resp[Response Synthesizer]
S --> Resp
Resp --> G[Safety Guard]
G --> U2[User Response]
style R fill:#3B9797,stroke:#3B9797,color:#fff
style G fill:#BF092F,stroke:#BF092F,color:#fff
style Mem fill:#16476A,stroke:#16476A,color:#fff
Gap Analysis: Current vs Target
| Dimension | Current State | North Star Target | Gap Severity |
|---|---|---|---|
| Model Deployment | Manual, 2-week cycle | Automated, <1 hour | Critical |
| Feature Reuse | None — features computed per-service | Centralized feature store | Critical |
| Agent Framework | None | Multi-agent orchestration | High |
| Inference Routing | Hardcoded to single model | Dynamic cost/quality routing | High |
| Data Flywheel | Manual data collection | Automated capture → label → train | Critical |
| Observability | Basic logs | Full trace per inference + agent step | High |
| Scalability | Single Flask app | Auto-scaling microservices | Critical |
Migration Roadmap
The migration follows the "strangler fig" pattern — new capabilities are built alongside the existing monolith, gradually taking over traffic until the legacy system can be decommissioned. Each phase delivers independent value, so the transformation pays for itself along the way.
| Phase | Timeline | Focus | Key Deliverables |
|---|---|---|---|
| Phase 1 | Months 1-4 | Foundation | Kubernetes migration, inference gateway, basic observability |
| Phase 2 | Months 5-8 | ML Platform | Feature store, model registry, automated eval pipeline |
| Phase 3 | Months 9-12 | Agent Layer | Tool registry, agent router, memory store, safety guard |
| Phase 4 | Months 13-18 | Flywheel | Data flywheel automation, continuous fine-tuning, full decomposition |
Phase 1: Foundation (Months 1–4)
The first phase focuses on infrastructure that unblocks everything else. Moving from a single Flask app to Kubernetes enables independent scaling and deployment of services. The inference gateway provides immediate value by reducing costs through model routing and caching.
- Month 1: Containerize monolith (Docker), deploy to Kubernetes, set up CI/CD pipelines
- Month 2: Extract inference gateway as first microservice — all model calls route through it
- Month 3: Implement semantic caching (targeting a 20-30% reduction in inference costs, depending on how often queries repeat), set up observability (distributed tracing with OpenTelemetry, metrics with Prometheus)
- Month 4: Add model fallback chains and cost-based routing; deploy Grafana dashboards for cost/latency/quality monitoring
Success metrics: Model deploy time reduced from 2 weeks to <4 hours. Inference costs reduced 25%. System handles 3x current peak traffic without degradation.
Phase 2: ML Platform (Months 5–8)
With infrastructure stable, the team builds the ML platform that enables self-service model development and deployment:
- Month 5: Deploy model registry (MLflow); migrate all existing models with versioning and metadata
- Month 6: Build feature store — batch features (DeltaLake) + real-time features (Redis); migrate top-10 features from hardcoded computation
- Month 7: Implement automated evaluation pipeline — models must pass quality gates (accuracy, latency, cost) before promotion
- Month 8: Build fine-tuning pipeline with LoRA adapters; first automated fine-tune on user interaction data
Success metrics: Model deploy time reduced to <1 hour (automated pipeline). Feature reuse across 3+ products. First model improvement from automated fine-tuning (measurable quality uplift on eval set).
Phase 3: Agent Layer (Months 9–12)
The agent layer transforms products from "AI-enhanced" to "AI-native" — enabling dynamic, multi-step problem solving:
- Month 9: Build tool registry with sandboxed execution environments; register first 5 tools (search, code exec, data query, document retrieval, API call)
- Month 10: Implement agent router with complexity classification; deploy single-shot and multi-step execution paths
- Month 11: Build memory store (vector DB for episodic memory, Redis for working memory); integrate with agent execution
- Month 12: Deploy safety guard (pre/post execution validation); launch first agentic product feature (workflow automation agent)
Success metrics: Agent-powered features handle 30% of complex user requests. Safety guard blocks 99.9% of out-of-scope actions. User satisfaction on multi-step tasks improves 40%.
Phase 4: Flywheel (Months 13–18)
The final phase closes the learning loop — every interaction automatically improves the system:
- Months 13-14: Complete event streaming pipeline (all interactions → Kafka → Interaction Lake); deploy automated labeling
- Months 15-16: Implement continuous fine-tuning (daily LoRA, weekly full); automated A/B testing framework for model promotions
- Months 17-18: Decompose remaining monolith services; achieve full microservices architecture; optimize cost per inference to target
Success metrics: Models improve weekly without manual intervention. Inference cost reduced 40% from baseline. System supports 10x user growth. Full SOC2 compliance achieved.
Key risks and their mitigations:
- Data quality degradation: Automated data quality monitoring with alerts when label distribution shifts unexpectedly
- Model regression: Shadow deployment mandatory before any production promotion; automated rollback on quality degradation
- Agent safety failures: Red-team testing before each tool addition; production kill-switches per-tool and per-agent
- Cost explosion: Per-customer cost budgets with hard caps; automated model downgrades when budget exhausted
Conclusion
An AI-first North Star Architecture fundamentally differs from traditional enterprise architecture. The entire stack exists to produce, serve, and improve intelligence. The data platform feeds the ML platform, the ML platform powers the agent layer, and the agent layer delivers product value — all connected by a continuous learning flywheel that makes the system smarter with every interaction.
Architecture Decision Summary
| Decision | Choice | Rationale |
|---|---|---|
| Inference strategy | Gateway with dynamic routing | Balances cost, quality, and latency; enables model-agnostic products |
| Data architecture | Event-driven interaction lake | Captures implicit training signals; enables continuous improvement |
| Agent framework | Tool registry with sandboxed execution | Safety-first design; extensible without code changes |
| Feature management | Centralized feature store | Eliminates training/serving skew; enables cross-product feature reuse |
| Migration approach | Strangler fig with phased delivery | Each phase delivers independent value; no big-bang risk |
| Model improvement | Automated flywheel with human eval gates | Speed of automation with safety of human oversight |
Measuring Success
The North Star Architecture (NSA) isn't complete until these outcomes are achieved:
- Speed: New AI feature from idea to production in <1 week (was: 1 month)
- Cost: Inference cost per user interaction reduced 40% through routing, caching, and self-hosted models
- Quality: Model quality improves automatically week-over-week without manual intervention
- Scale: System handles 10x current load without proportional cost increase
- Safety: Zero critical safety incidents; full audit trail for enterprise compliance
- Autonomy: Agent-powered features handle 50%+ of complex user workflows