Software Engineer · AI Harness Layer
Marco Patzelt — AI Agent Engineer.
Tools, schemas, sandboxes, feedback loops — the layer that wraps the LLM.

Harness Layer.
Context · Tools · Memory · Sandbox · Feedback.
What I work on sits around the LLM: tool schemas, structured feedback, validation, safety limits. Get that layer right and the agent can carry a real workflow end-to-end.
Context
Engineering
Token budget as design parameter
The context window is a scarce resource. Perception schemas at the boundary, trajectory compression on recall, deterministically generated world snapshots. The agent sees what you let it see — nothing more.
Structured perception
JSON in, JSON out. Engine validates. No hallucination at the interface — the snapshot is built deterministically, not assembled by the model.
import { NextResponse } from 'next/server';
import { createClient } from '@supabase/supabase-js';
import { z } from 'zod';

const Schema = z.object({ id: z.string() });

export async function POST(req: Request) {
  // Type-safe validation at the boundary
  const body = await req.json();
  const { id } = Schema.parse(body);

  const supabase = createClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_ANON_KEY!,
  );
  const { data } = await supabase
    .from('events')
    .select('*')
    .eq('id', id);

  return NextResponse.json(data);
}
Harness Runtime
All loops healthy
Agent Memory
Working · episodic · semantic. Hybrid and persistent — the agent builds context from its own experience.
The principle: Memory is queried, not pushed into the prompt. Slices on demand instead of full dump. Skills from experience, not from the system prompt — the agent builds its own context.
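The query-not-push principle can be sketched as follows — an in-memory stand-in, not the production store, and the names (`MemoryEntry`, `querySlice`) are illustrative:

```typescript
type MemoryKind = 'working' | 'episodic' | 'semantic';

interface MemoryEntry {
  kind: MemoryKind;
  tags: string[];
  text: string;
  ts: number; // unix ms, used for recency ordering
}

const store: MemoryEntry[] = [];

function remember(entry: MemoryEntry): void {
  store.push(entry);
}

// The agent asks for a slice relevant to the current step —
// never the full dump — bounded by `limit` to respect the token budget.
function querySlice(kind: MemoryKind, tag: string, limit = 5): MemoryEntry[] {
  return store
    .filter((e) => e.kind === kind && e.tags.includes(tag))
    .sort((a, b) => b.ts - a.ts) // most recent first
    .slice(0, limit);
}
```

The point of the shape: the prompt only ever sees the slice that `querySlice` returns, so memory growth doesn't inflate the context window.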
Tools · Sandbox · Feedback
Structured tool calls, ephemeral sandboxes, errors as feedback signal.
The result: 30s per call, 20min per loop, every sandbox ephemeral. Verification + retry with backoff. When 20 steps cascade at 95% reliability each, you land at 36% end-to-end — the harness catches that.
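The verification-plus-retry pattern can be sketched like this — a minimal version with illustrative names, not the production harness:

```typescript
// Retry a tool call with exponential backoff, accepting a result only
// after it passes verification. `call` and `verify` are supplied by the
// harness per tool.
async function withRetry<T>(
  call: () => Promise<T>,
  verify: (result: T) => boolean,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const result = await call();
      if (verify(result)) return result; // verified — hand back to the loop
      lastError = new Error('verification failed');
    } catch (err) {
      lastError = err; // tool error becomes a feedback signal, not a crash
    }
    // exponential backoff: 500ms, 1s, 2s, ...
    await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
  }
  throw lastError; // hard stop — escalate instead of guessing
}
```

Verification is the half that matters: without it, a retry loop happily accepts a wrong-but-parseable result.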
Stack & Ecosystem
Stack & Ecosystem.
Anthropic, MCP, OpenRouter, Pi.dev — the layer I work in. Production plumbing on top: TypeScript, Vercel, Supabase.
From schema to production code.
I work the full stack — schemas, harness logic, deployment, frontend. No hand-offs between specialists, no integration drift. One engineer, every layer.
Schema → UI
Agent output types flow directly into React props. Same TypeScript end-to-end — from tool-call schema to rendered element. No drift between backend and UI.
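A dependency-free sketch of that shared contract — in production this would typically be a zod schema with `z.infer`; the type and field names here are illustrative:

```typescript
// One type serves as both the agent-output contract and the React prop type.
interface EventCardProps {
  title: string;
  status: 'pending' | 'done';
}

// Runtime guard mirroring the type — agent output is validated once at the
// boundary, then flows into the component unchanged. Any drift throws here
// instead of rendering garbage.
function parseAgentOutput(raw: unknown): EventCardProps {
  const o = raw as Record<string, unknown>;
  if (typeof o?.title !== 'string') throw new Error('invalid title');
  if (o.status !== 'pending' && o.status !== 'done') throw new Error('invalid status');
  return { title: o.title, status: o.status };
}

// A React component would take EventCardProps directly:
// function EventCard({ title, status }: EventCardProps) { ... }
```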
Streaming State
Agent thinking, tool calls, partial edits — live in the UI. SSE streams, Canvas rendering, structured status events. The user sees what the agent does, while it does it.
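The structured status events might look like this — a sketch with illustrative event names, serialized as standard SSE frames:

```typescript
// Discriminated union of agent status events streamed to the UI.
type AgentEvent =
  | { type: 'thinking'; text: string }
  | { type: 'tool_call'; tool: string; args: unknown }
  | { type: 'partial_edit'; file: string; diff: string }
  | { type: 'done' };

// Serialize one event as an SSE frame: "data: <json>\n\n".
// The client switches on `type` to render thinking, tool calls,
// and partial edits as distinct UI states.
function toSseFrame(event: AgentEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}
```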
Interface Layer
The frontend isn't a coat of paint — it's the interface between agent and human. Where approvals happen, where edits get made, where trust is won or lost. Production surface, not decoration.
Day Job
Software & Integration Engineer.
Production middleware on Microsoft Dynamics 365, Webflow CMS, HubSpot, and adjacent enterprise systems — real-time sync, distributed locking, EU-pinned deployment. Plus an internal agentic fleet for SEO and paid-channel automation across the agency's client roster. A few open-source side projects below where I push the same patterns past what the day job calls for.
Full Ownership
Requirements → architecture → implementation → deployment. Direct technical contact with the client, no hand-offs between specialists.
Real-time Sync Middleware
Webhook-driven sync between Microsoft Dynamics 365 and Webflow CMS. Async processing via Vercel waitUntil(), Redis-based distributed locks to prevent race conditions on concurrent webhook deliveries.
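The locking pattern is SET-NX-with-TTL, so a crashed worker can't hold a lock forever. A self-contained sketch — a `Map` stands in for Redis/Vercel KV, and the check-then-set below is only atomic in the real thing via `SET key NX PX ttl`:

```typescript
const locks = new Map<string, number>(); // key -> expiry (unix ms)

function acquireLock(key: string, ttlMs: number, now = Date.now()): boolean {
  const expiry = locks.get(key);
  if (expiry !== undefined && expiry > now) return false; // held elsewhere
  locks.set(key, now + ttlMs); // in Redis: SET key <owner> NX PX <ttlMs>
  return true;
}

function releaseLock(key: string): void {
  locks.delete(key);
}
```

Concurrent webhook deliveries for the same record then race on `acquireLock`; the loser returns early or re-queues instead of double-writing.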
Lead-Capture & Qualification API
Stateless serverless API powering multi-step quiz funnels. Auto-qualification, CRM upsert, Google Ads attribution bridge — closes the attribution gap from non-native form submissions.
Smart Translation Caching
DeepL calls fingerprinted and cached on Vercel KV (Redis) — calls only when text actually changes. Responses rebuilt from fresh CRM data plus cached translations, near-zero API cost. Exponential-backoff on outbound clients to absorb rate limits.
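The fingerprint-and-cache idea, sketched with a `Map` in place of Vercel KV and a caller-supplied `translate` function standing in for the DeepL client:

```typescript
import { createHash } from 'node:crypto';

const cache = new Map<string, string>();

// Content fingerprint: identical source text always maps to the same key.
function fingerprint(text: string): string {
  return createHash('sha256').update(text).digest('hex');
}

async function cachedTranslate(
  text: string,
  translate: (t: string) => Promise<string>,
): Promise<string> {
  const key = fingerprint(text);
  const hit = cache.get(key);
  if (hit !== undefined) return hit; // text unchanged — zero API cost
  const result = await translate(text); // cache miss — pay for the call once
  cache.set(key, result);
  return result;
}
```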
Multi-System Integration
Dynamics 365 (OAuth2 + custom actions), Webflow CMS v2, HubSpot (EU1), WhatConverts, DeepL, Upstash. JWT-authenticated webhooks, end-to-end attribution pipelines.
EU Production Infrastructure
Vercel with region-pinned deployments for EU latency. Sandbox/prod separation, distributed caching and locking. Custom booking engines, seasonal logic, multi-locale (including German umlaut slug normalization).
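The umlaut normalization is the kind of detail naive slugifiers miss — transliterate before slugifying, so "Müller" becomes "mueller" rather than "m-ller". A minimal sketch:

```typescript
// German transliteration table applied before generic slug rules.
const UMLAUTS: Record<string, string> = {
  ä: 'ae', ö: 'oe', ü: 'ue', ß: 'ss',
  Ä: 'Ae', Ö: 'Oe', Ü: 'Ue',
};

function slugify(input: string): string {
  return input
    .replace(/[äöüßÄÖÜ]/g, (ch) => UMLAUTS[ch]) // umlauts first
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-') // non-alphanumerics -> hyphen
    .replace(/^-+|-+$/g, '');    // trim leading/trailing hyphens
}
```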
Agentic · Internal tool
SEO & Paid-Channel Fleet
Multi-tenant agent system for the agency's client roster. Persistent task board the agent manages itself across runs. 13 in-process tools — GSC queries, paid-channel audits, content briefs, article generation. End-to-end Webflow publishing on a weekly cadence. Same harness patterns from the open-source work, applied internally.
Featured System · Production live
Marketing platform
↔ Dynamics 365.
The Challenge
Production middleware between a modern marketing platform and Microsoft Dynamics 365 CRM. Distributed locking, realtime sync, EU-pinned deployment on Vercel. Runs daily in production.
The Architecture
POST /api/ingest
→ acquire distributed lock: { "key": "sys_123_lock", "ttl": 60, "status": "acquired" }
→ translate: { "text": "Complex Entity", "source": "RAW", "target": "NORM" }
→ await crm.create(data)

Open Source
Side projects, in public.
Four repos where I push the harness patterns further than the day job calls for. Public so I can refer back to them.
Brunnfeld
A medieval-village simulation with 19 LLM agents running on structured tool calls inside a deterministic game engine. Minimal instructions, rich environmental feedback. Same patterns I lean on commercially — just a more fun sandbox to push them in.

Reception
Brunnfeld got some discussion on r/Anthropic and r/BlackboxAI_. Most of it focused on the structured-tool-call setup rather than the medieval surface.
Sales Agent
Skill-based outbound automation on MCP. Pluggable CRM adapters, per-channel rate limits, never-invent-details rule, hard error stops, human-in-the-loop feedback.
Agent Factory
Autonomous system that picks real problems off Reddit/HN/GitHub and ships small specialized agents at them. Loop: discovery → scoring → build → ship.
Code Commander
Desktop command center for managing multiple AI coding agent sessions across codebases. Multi-agent orchestration over MCP. Built because I needed it.
FAQ
The obvious questions.
Common questions on agent harnesses, production reliability, and the gap between demo and prod.
MCP or direct API integration — which wins?
MCP wins on standardization: one wire format, drop new agents into existing tool surfaces, swap models without rewriting integrations. Direct API integration wins on reliability — when the MCP server doesn't expose certain fields or endpoints, when you need custom retry/error semantics, when tool-call latency is on the critical path, or when vendor-specific edge cases (pagination quirks, idempotency keys, partial responses) get swallowed by the MCP wrapper. Rule of thumb: MCP for breadth, direct integration for the 2-3 critical tools that can't afford to flake. Most production setups end up mixed.
What exactly is an agent harness?
The harness is everything around the LLM that lets it do real work — tool schemas, validation, structured feedback, retry logic, safety limits. The model picks moves; the harness defines what moves are even possible and what happens when one fails. Two systems on the same model perform completely differently based on harness quality. That's where the engineering leverage actually sits.
When do I need an agent instead of RAG?
RAG retrieves context to inform a single LLM call. An agent loops: it acts, observes, decides, and acts again — usually with multiple tool calls and external state changes. If the problem is "answer questions over our docs" → RAG. If it's "execute a multi-step workflow with our systems" → agent. Most production setups end up using both — RAG as a tool inside the agent's surface.
Why do agents that demo well fail in production?
Compounding error. A 95% per-step success rate cascades to 36% over 20 steps. Demo paths run 3-5 steps under controlled conditions; production loops run 20+ steps over messy real data. The fix is in the harness: verification at each step, structured retries with backoff, hard stops on uncertain branches, escalation to humans at risk surfaces. Models don't get reliable — harnesses make them reliable.
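The compounding arithmetic, spelled out:

```typescript
// End-to-end success is per-step reliability raised to the loop length.
function endToEndSuccess(perStep: number, steps: number): number {
  return Math.pow(perStep, steps);
}

// 0.95^20 ≈ 0.358 — a chain of "95% reliable" steps lands at ~36%.
// A 5-step demo path at the same rate still clears ~77%.
```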
What is MCP, and should I build on it?
MCP is a standard for how agents connect to tools and data sources. Think USB-C for agents — one wire format, many backends. Anthropic, OpenAI, and major frameworks now support it. If you're building agents that need to access multiple internal systems, yes — standardize on MCP. If you're building one tightly-scoped agent against one API, native integration is still fine. Don't migrate working code without a reason.
How do you test an agent, and what counts as production-ready?
Three layers. (1) Trajectory tests — replay full agent runs against canonical inputs, assert on intermediate states, not just final outputs. (2) Tool-call evals — for each tool: happy-path, structured-error-path, adversarial inputs. (3) Production telemetry — log every tool call, retry, escalation, and per-loop cost. Production-ready means: trajectory tests pass at >95%, no unhandled tool errors in a 7-day prod window, cost-per-task within the budget envelope. Vibes are not a definition of done.
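Layer (1) can be sketched as a trajectory assertion — types and names are illustrative, not a real test framework:

```typescript
interface Step {
  tool: string;
  ok: boolean;
}

interface Trajectory {
  steps: Step[];
  finalAnswer: string;
}

// A replayed run passes only if every intermediate tool call succeeded AND
// the expected tools ran in order — catching runs that stumble into the
// right final answer through failed or skipped calls.
function assertTrajectory(t: Trajectory, expectedTools: string[]): boolean {
  const toolsInOrder = t.steps.map((s) => s.tool);
  return (
    t.steps.every((s) => s.ok) &&
    expectedTools.every((tool, i) => toolsInOrder[i] === tool)
  );
}
```

Asserting on intermediate state is the whole point: final-output checks alone pass trajectories that only worked by accident.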
Engineering Logs
Notes from shipping.
Qwen3-Coder-Next Review: 70% SWE-Bench, Free, Runs Local
Free, open-source AI that scores 70.6% on SWE-Bench with only 3B active params. Runs on a Mac Mini, no API key needed. Tested against Opus 4.6 and GPT-5.
Serverless Race Conditions: Redis Locking (Next.js)
Serverless scales infinitely but your database does not. How to prevent race conditions and overselling using Redis and Vercel KV. No Kubernetes needed here.
Claude Opus 4.6 Fast Mode: 2.5x Faster, 6x More Expensive
Opus 4.6 Fast Mode costs $30/$150 per MTok versus $5/$25 standard. You pay 6x more for 2.5x speed. Full cost breakdown and competitive analysis inside.

Get in touch
Let's connect.
If you're working on something serious in agentic infrastructure — tool design, harness engineering, orchestration loops — drop me a line.