Marco Patzelt — Software Engineer · AI Harness Layer
I build the harness. So agents own the workflow.
Tools, schemas, environments, feedback loops — the layer that wraps the LLM. Environment design over prompt engineering.

Open Source · Receipts
Methodology, shipped publicly.
Four repositories. Same thesis — environment design over prompt engineering — tested across four layers.
Brunnfeld
A multi-agent simulation testing environment-driven alignment. 19 LLM agents operate inside a deterministic game engine through structured tool calls — minimal instructions, rich environmental feedback. Methodology proof: the same patterns translate to commercial workflows.

Reception
146K views.
Two threads.
Brunnfeld picked up on r/Anthropic and r/BlackboxAI_. The discussion landed on the methodology — environment-driven alignment, structured tool calls — not the medieval surface.
Invitation
Invited by Anthropic's Head of Community to the Code w/ Claude event off the back of this project.
Sales Agent
Skill-based outbound automation on MCP. Pluggable CRM adapters, per-channel rate limits, never-invent-details rule, hard error stops, human-in-the-loop feedback. The harness handles the rails — the agent owns the workflow.
Agent Factory
Autonomous system that finds real problems on Reddit/HN/GitHub and ships specialized agents to solve them. Demonstrates the loop: discovery → scoring → build → ship, with environment shaping at every step.
Code Commander
Desktop command center for managing multiple AI coding agent sessions across codebases. Multi-agent orchestration over MCP. Built because I needed it.
Harness Layer.
Context · Tools · Memory · Sandbox · Feedback.
The real engineering work in agentic systems isn't writing better prompts — it's building the harness around the LLM. Tool schemas, structured feedback, validation, safety limits. Get the environment right, and agents can own production workflows end-to-end.
Context
Engineering
Token budget as design parameter
The context window is a scarce resource. Perception schemas at the boundary, trajectory compression on recall, deterministically generated world snapshots. The agent sees what you let it see — nothing more.
Structured perception
JSON in, JSON out. Engine validates. No hallucination at the interface — the snapshot is built deterministically, not assembled by the model.
import { NextResponse } from 'next/server';
import { createServerClient } from '@supabase/ssr';
import { cookies } from 'next/headers';
import { z } from 'zod';

// Type-safe validation at the boundary
const Schema = z.object({ id: z.string() });

export async function POST(req: Request) {
  const body = await req.json();
  const { id } = Schema.parse(body);

  const cookieStore = await cookies();
  const supabase = createServerClient(
    process.env.NEXT_PUBLIC_SUPABASE_URL!,
    process.env.NEXT_PUBLIC_SUPABASE_ANON_KEY!,
    { cookies: { getAll: () => cookieStore.getAll() } },
  );

  const { data } = await supabase
    .from('events')
    .select('*')
    .eq('id', id);

  return NextResponse.json(data);
}
Harness Runtime
All loops healthy
Agent Memory
Working · episodic · semantic. Hybrid and persistent — the agent builds context from its own experience.
The principle: Memory is queried, not pushed into the prompt. Slices on demand instead of full dump. Skills from experience, not from the system prompt — the agent builds its own context.
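The "queried, not pushed" principle can be sketched in a few lines. This is a hypothetical, dependency-free illustration — the class name, the keyword-overlap scoring, and the three-kind split are illustrative stand-ins (a real store would rank with embeddings):

```typescript
// Hypothetical sketch: memory as a queryable store, not a prompt dump.
type MemoryKind = "working" | "episodic" | "semantic";

interface MemorySlice {
  kind: MemoryKind;
  text: string;
}

class MemoryStore {
  private slices: MemorySlice[] = [];

  add(slice: MemorySlice): void {
    this.slices.push(slice);
  }

  // Return only the top-k relevant slices for this question.
  // Naive keyword overlap here; a real system would use embeddings.
  query(question: string, k: number): MemorySlice[] {
    const terms = question.toLowerCase().split(/\s+/);
    return this.slices
      .map((s) => ({
        s,
        score: terms.filter((t) => s.text.toLowerCase().includes(t)).length,
      }))
      .filter((x) => x.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, k)
      .map((x) => x.s);
  }
}
```

The point is the interface shape: the agent asks for slices on demand, so the context window only ever carries what the current step needs.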
Tools · Sandbox · Feedback
Structured tool calls, ephemeral sandboxes, errors as feedback signal.
The result: 30s per call, 20min per loop, ephemeral sandboxes throughout. Verification + retry with backoff. When 20 steps cascade at 95% each you land at 36% — the harness catches that.
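The verify-then-retry loop above can be sketched like this — a minimal, self-contained illustration, not the production implementation; the function name and defaults are hypothetical:

```typescript
// Hypothetical sketch of a verify-then-retry loop with exponential backoff.
async function withRetry<T>(
  call: () => Promise<T>,
  verify: (result: T) => boolean,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const result = await call();
      // Verification is part of the loop: a response that parses but
      // fails the check is treated the same as a thrown error.
      if (verify(result)) return result;
      lastError = new Error("verification failed");
    } catch (err) {
      lastError = err;
    }
    // Exponential backoff: 100ms, 200ms, 400ms, ...
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  // Hard stop: surface the failure instead of looping forever.
  throw lastError;
}
```

The design choice that matters: verification failures and exceptions take the same path, so a plausible-but-wrong tool result never leaks downstream.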
Stack & Ecosystem.
Anthropic, MCP, OpenRouter, Pi.dev — the layer I work in. Production plumbing on top: TypeScript, Vercel, Supabase.
From schema to production code.
I work the full stack — schemas, harness logic, deployment, frontend. No hand-offs between specialists, no integration drift. One engineer, every layer.
Schema → UI
Agent output types flow directly into React props. Same TypeScript end-to-end — from tool-call schema to rendered element. No drift between backend and UI.
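The "same type end-to-end" idea reduces to this shape — a hedged, dependency-free sketch (in the real stack this would be a zod schema with z.infer feeding React props; here a plain interface and a string-returning "component" stand in):

```typescript
// Hypothetical sketch: one type travels from tool-call output to UI props.
interface ToolResult {
  tool: string;
  status: "ok" | "error";
  summary: string;
}

// The "component" consumes the tool-call type directly as its props —
// no hand-written DTO in between, so backend and UI cannot drift.
function StatusBadge(props: ToolResult): string {
  return `[${props.status.toUpperCase()}] ${props.tool}: ${props.summary}`;
}
```

If the tool schema gains or renames a field, the compiler flags every consuming component — drift becomes a type error, not a runtime surprise.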
Streaming State
Agent thinking, tool calls, partial edits — live in the UI. SSE streams, Canvas rendering, structured status events. The user sees what the agent does, while it does it.
Interface Layer
The frontend isn't a coat of paint — it's the interface between agent and human. Where approvals happen, where edits get made, where trust is won or lost. Production surface, not decoration.
Day Job
Software & Integration Engineer.
Production middleware on Microsoft Dynamics 365, Webflow CMS, HubSpot, and adjacent enterprise systems — real-time sync, distributed locking, EU-pinned deployment. Plus an internal agentic fleet for SEO and paid-channel automation across the agency's client roster. The harness expertise itself comes from the open-source work above — Brunnfeld and the other agentic systems, built from the inside.
Full Ownership
Requirements → architecture → implementation → deployment. Direct technical contact with the client, no hand-offs between specialists.
Real-time Sync Middleware
Webhook-driven sync between Microsoft Dynamics 365 and Webflow CMS. Async processing via Vercel waitUntil(), Redis-based distributed locks to prevent race conditions on concurrent webhook deliveries.
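The lock pattern behind this can be sketched in memory — a hypothetical illustration only (production uses Redis `SET key value NX PX ttl` plus a check-and-delete release script; the class and method names here are invented):

```typescript
// Hypothetical in-memory sketch of a TTL-guarded distributed lock.
class LockManager {
  private locks = new Map<string, { owner: string; expiresAt: number }>();

  acquire(key: string, owner: string, ttlMs: number, now = Date.now()): boolean {
    const held = this.locks.get(key);
    // A lock past its TTL counts as free, so a crashed worker
    // cannot block concurrent webhook deliveries forever.
    if (held && held.expiresAt > now) return false;
    this.locks.set(key, { owner, expiresAt: now + ttlMs });
    return true;
  }

  release(key: string, owner: string): boolean {
    const held = this.locks.get(key);
    // Only the current owner may release — this mirrors the Redis
    // compare-and-delete script that guards against stale releases.
    if (!held || held.owner !== owner) return false;
    this.locks.delete(key);
    return true;
  }
}
```

The TTL is the crash-safety mechanism: a worker that dies mid-sync loses the lock automatically instead of deadlocking every subsequent webhook for that record.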
Lead-Capture & Qualification API
Stateless serverless API powering multi-step quiz funnels. Auto-qualification, CRM upsert, Google Ads attribution bridge — closes the attribution gap from non-native form submissions.
Smart Translation Caching
DeepL calls fingerprinted and cached on Vercel KV (Redis) — calls only when text actually changes. Responses rebuilt from fresh CRM data plus cached translations, near-zero API cost. Exponential-backoff on outbound clients to absorb rate limits.
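The fingerprint-and-cache pattern reduces to a small wrapper. A hedged sketch with an in-process Map — the real system stores fingerprints in Vercel KV and calls DeepL; the function names here are hypothetical:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: fingerprint source text, call the API only on change.
type Translate = (text: string, target: string) => Promise<string>;

function cachedTranslator(translate: Translate) {
  const cache = new Map<string, string>();
  return async (text: string, target: string): Promise<string> => {
    // Fingerprint the exact source text plus target language.
    const key = createHash("sha256").update(`${target}:${text}`).digest("hex");
    const hit = cache.get(key);
    if (hit !== undefined) return hit; // unchanged text: zero API cost
    const result = await translate(text, target);
    cache.set(key, result);
    return result;
  };
}
```

Because the key is derived from content, any edit to the source text changes the fingerprint and forces a fresh call — stale translations cannot survive silently.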
Multi-System Integration
Dynamics 365 (OAuth2 + custom actions), Webflow CMS v2, HubSpot (EU1), WhatConverts, DeepL, Upstash. JWT-authenticated webhooks, end-to-end attribution pipelines.
EU Production Infrastructure
Vercel with region-pinned deployments for EU latency. Sandbox/prod separation, distributed caching and locking. Custom booking engines, seasonal logic, multi-locale (including German umlaut slug normalization).
Agentic · Internal tool
SEO & Paid-Channel Fleet
Multi-tenant agent system for the agency's client roster. Persistent task board the agent manages itself across runs. 13 in-process tools — GSC queries, paid-channel audits, content briefs, article generation. End-to-end Webflow publishing on a weekly cadence. Same harness patterns from the open-source work, applied internally.
Featured System · Production live
Marketing platform
↔ Dynamics 365.
The Challenge
Production middleware between a modern marketing platform and Microsoft Dynamics 365 CRM. Distributed locking, realtime sync, EU-pinned deployment on Vercel. Runs daily in production.
The Architecture
POST /api/ingest
{ "key": "sys_123_lock", "ttl": 60, "status": "acquired" }
{ "text": "Complex Entity", "source": "RAW", "target": "NORM" }
await crm.create(data)

FAQ
The obvious questions.
Common questions on agent harnesses, production reliability, and the gap between demo and prod.
MCP wins on standardization: one wire format, dropping new agents into existing tool surfaces, swapping models without rewriting integrations. Direct API integration wins on reliability — when the MCP server doesn't expose certain fields or endpoints, when you need custom retry/error semantics, when tool-call latency is on the critical path, or when vendor-specific edge cases (pagination quirks, idempotency keys, partial responses) get swallowed by the MCP wrapper. Rule of thumb: MCP for breadth, direct integration for the 2-3 critical tools that can't afford to flake. Most production setups end up mixed.
The harness is everything around the LLM that lets it do real work — tool schemas, validation, structured feedback, retry logic, safety limits. The model picks moves; the harness defines what moves are even possible and what happens when one fails. Two systems on the same model perform completely differently based on harness quality. That's where the engineering leverage actually sits.
RAG retrieves context to inform a single LLM call. An agent loops: it acts, observes, decides, and acts again — usually with multiple tool calls and external state changes. If the problem is "answer questions over our docs" → RAG. If it's "execute a multi-step workflow with our systems" → agent. Most production setups end up using both — RAG as a tool inside the agent's surface.
Compounding error. A 95% per-step success rate cascades to 36% over 20 steps. Demo paths run 3-5 steps under controlled conditions; production loops run 20+ steps over messy real data. The fix is in the harness: verification at each step, structured retries with backoff, hard stops on uncertain branches, escalation to humans at risk surfaces. Models don't get reliable — harnesses make them reliable.
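The arithmetic behind that 36% is just per-step reliability raised to the step count — a one-line sketch:

```typescript
// Compounding error: end-to-end success of a chain of independent steps.
function endToEndSuccess(perStep: number, steps: number): number {
  return Math.pow(perStep, steps);
}
// 0.95^20 ≈ 0.358 — a "reliable" step rate collapses over a long loop.
```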
MCP is a standard for how agents connect to tools and data sources. Think USB-C for agents — one wire format, many backends. Anthropic, OpenAI, and major frameworks now support it. If you're building agents that need to access multiple internal systems, yes — standardize on MCP. If you're building one tightly-scoped agent against one API, native integration is still fine. Don't migrate working code without a reason.
Three layers. (1) Trajectory tests — replay full agent runs against canonical inputs, assert on intermediate states, not just final outputs. (2) Tool-call evals — for each tool: happy-path, structured-error-path, adversarial inputs. (3) Production telemetry — log every tool call, retry, escalation, and per-loop cost. Production-ready means: trajectory tests pass at >95%, no unhandled tool errors in a 7-day prod window, cost-per-task within the budget envelope. Vibes are not a definition of done.
Engineering Logs
Notes from shipping.
Agentic Orchestration: AI That Writes Integration Code
I stopped writing static endpoints. Now agents powered by Gemini 3 Pro generate integration code at runtime. From weeks-to-insight to seconds-to-answer.
Generalist vs Specialist: Why Silos Kill Startups
Frontend silos and backend ivory towers are 2015 relics. One Product Engineer with Supabase and AI replaces three specialists. Fewer handoffs, more shipped.
Glass Box Trust: UX for AI Agents That Show Their Work
Trust in AI is built on transparency, not magic. I explain how to break open the black box by visualizing an agent's System 2 reasoning, from SQL queries to consensus.

Get in touch
Let's connect.
If you're working on something serious in agentic infrastructure — tool design, harness engineering, orchestration loops — drop me a line.