
Marco Patzelt — Software Engineer · AI Harness Layer

I build the harness. So agents own the workflow.

Tools, schemas, environments, feedback loops — the layer that wraps the LLM. Environment design over prompt engineering.

Marco Patzelt
Model · Claude
Protocol · MCP

Open Source · Receipts

Methodology, shipped publicly.

Four repositories. Same thesis — environment design over prompt engineering — tested across four layers.

01 · Multi-agent simulation as methodology proof

Brunnfeld

A multi-agent simulation testing environment-driven alignment. 19 LLM agents operate inside a deterministic game engine through structured tool calls — minimal instructions, rich environmental feedback. Methodology proof: the same patterns translate to commercial workflows.

2026 · TypeScript · Game Engine · Open repository

Reception

146K views.
Two threads.

Brunnfeld was picked up on r/Anthropic and r/BlackboxAI_. The discussion landed on the methodology — environment-driven alignment, structured tool calls — not the medieval surface.

Invitation

Invited by Anthropic's Head of Community to the Code w/ Claude event off the back of this project.

Harness Layer.
Context · Tools · Memory · Sandbox · Feedback.

The real engineering work in agentic systems isn't writing better prompts — it's building the harness around the LLM. Tool schemas, structured feedback, validation, safety limits. Get the environment right, and agents can own production workflows end-to-end.

Context
Engineering

Token budget as design parameter

The context window is a scarce resource. Perception schemas at the boundary, trajectory compression on recall, deterministically generated world snapshots. The agent sees what you let it see — nothing more.

Structured perception

JSON in, JSON out. Engine validates. No hallucination at the interface — the snapshot is built deterministically, not assembled by the model.
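A minimal sketch of that boundary in TypeScript. The field names (tick, position, visibleEntities) are illustrative assumptions, not Brunnfeld's actual schema, and the hand-rolled guard stands in for a schema library; the shape of the guarantee is the point:

```typescript
type AgentSnapshot = {
  tick: number;
  position: { x: number; y: number };
  visibleEntities: string[];
};

// Throws at the boundary instead of letting a malformed snapshot reach the agent.
function parseSnapshot(raw: unknown): AgentSnapshot {
  const o = raw as Partial<AgentSnapshot> | null;
  const ok =
    o !== null &&
    typeof o === 'object' &&
    typeof o.tick === 'number' &&
    typeof o.position?.x === 'number' &&
    typeof o.position?.y === 'number' &&
    Array.isArray(o.visibleEntities) &&
    o.visibleEntities.every((e) => typeof e === 'string');
  if (!ok) throw new Error('Invalid snapshot: rejected at the boundary');
  return o as AgentSnapshot;
}
```

Nothing past this function has to re-check anything: downstream code works with a typed snapshot, not with whatever the wire delivered.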

import { NextResponse } from 'next/server';
import { createClient } from '@supabase/supabase-js';
import { z } from 'zod';

// Type-safe validation of the request body
const Schema = z.object({ id: z.string() });

export async function POST(req: Request) {
  const body = await req.json();
  const { id } = Schema.parse(body);

  const supabase = createClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_ANON_KEY!,
  );

  const { data, error } = await supabase
    .from('events')
    .select('*')
    .eq('id', id);

  if (error) {
    return NextResponse.json({ error: error.message }, { status: 500 });
  }
  return NextResponse.json(data);
}

Harness Runtime

Tool Bus · eu-central-1 · 24ms
Memory Store · vector + episodic · 12ms
Feedback Loop

Agent Memory

Working · episodic · semantic. Hybrid and persistent — the agent builds context from its own experience.

The principle: Memory is queried, not pushed into the prompt. Slices on demand instead of full dump. Skills from experience, not from the system prompt — the agent builds its own context.
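A toy sketch of the query-not-push shape: an in-memory store with recency ordering standing in for the real vector-plus-episodic backend (types and field names are illustrative):

```typescript
type Episode = { tick: number; tags: string[]; summary: string };

class EpisodicStore {
  private episodes: Episode[] = [];

  record(e: Episode): void {
    this.episodes.push(e);
  }

  // A slice on demand: only the k most recent episodes matching a tag,
  // never the full history dumped into the prompt.
  query(tag: string, k = 3): Episode[] {
    return this.episodes
      .filter((e) => e.tags.includes(tag))
      .sort((a, b) => b.tick - a.tick)
      .slice(0, k);
  }
}
```

The prompt cost of a memory lookup is bounded by k, not by how long the agent has been running.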

Tools · Sandbox · Feedback

Structured tool calls, ephemeral sandboxes, errors as feedback signal.

The limits: 30s per tool call, 20min per loop, one ephemeral sandbox per run. Verification plus retry with backoff. When 20 steps cascade at 95% reliability each, you land at 36% end-to-end — the harness catches that.
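A sketch of the retry half, with illustrative defaults rather than the production limits:

```typescript
// Exponential backoff around a flaky tool call. At 95% per-step reliability,
// 0.95 ** 20 ≈ 0.36 end-to-end without retries — this is the piece that
// claws that back.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt < maxAttempts - 1) {
        // Backoff schedule: 100ms, 200ms, 400ms, ...
        await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  throw lastErr;
}
```

The verification half is separate: a retry only helps if something checks the tool result before the loop moves on.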

03
End-to-end Engineering

From schema to production code.

I work the full stack — schemas, harness logic, deployment, frontend. No hand-offs between specialists, no integration drift. One engineer, every layer.

Schema → UI

Agent output types flow directly into React props. Same TypeScript end-to-end — from tool-call schema to rendered element. No drift between backend and UI.

Schema → UI
Type-safe end-to-end
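A sketch of the pattern with illustrative names (AgentCard is not a real component from the codebase): the tool-call result type and the component props share one declaration, so they cannot drift:

```typescript
// The type the agent's tool call produces...
type AgentCard = { title: string; status: 'running' | 'done'; steps: number };

// ...is the same type the UI consumes. No mapping layer, nothing to drift.
type AgentCardProps = { card: AgentCard };

function renderLabel({ card }: AgentCardProps): string {
  return `${card.title} · ${card.status} (${card.steps} steps)`;
}
```

If the tool schema gains a field or changes a status value, the compiler flags every render site in the same pass.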

Streaming State

Agent thinking, tool calls, partial edits — live in the UI. SSE streams, Canvas rendering, structured status events. The user sees what the agent does, while it does it.

Streaming + Tool-Call State
Real-time agent rendering
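For the wire format, a minimal sketch: structured status events serialized as SSE frames. The event shape here is an assumption, not the production schema:

```typescript
type AgentEvent =
  | { type: 'thinking'; text: string }
  | { type: 'tool_call'; tool: string; args: unknown }
  | { type: 'done' };

// SSE wire format: each frame is "data: <payload>\n\n".
function toSseFrame(event: AgentEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}
```

The client parses one typed event per frame and updates the UI as the agent acts, instead of waiting for a final blob.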

Interface Layer

The frontend isn't a coat of paint — it's the interface between agent and human. Where approvals happen, where edits get made, where trust is won or lost. Production surface, not decoration.

Frontend as Production Surface
Agent-to-human interface

Day Job

Software & Integration Engineer.

Production middleware on Microsoft Dynamics 365, Webflow CMS, HubSpot, and adjacent enterprise systems — real-time sync, distributed locking, EU-pinned deployment. Plus an internal agentic fleet for SEO and paid-channel automation across the agency's client roster. The harness expertise comes from the open-source work above: Brunnfeld and the other agentic systems, built from the inside.

Full Ownership

Requirements → architecture → implementation → deployment. Direct technical contact with the client, no hand-offs between specialists.

Real-time Sync Middleware

Webhook-driven sync between Microsoft Dynamics 365 and Webflow CMS. Async processing via Vercel waitUntil(), Redis-based distributed locks to prevent race conditions on concurrent webhook deliveries.

Dynamics 365 · Webflow · Vercel · Upstash
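The locking pattern, sketched against an in-memory map standing in for Redis; in production this is a single atomic SET NX PX call against Upstash, and key names here are illustrative:

```typescript
// key -> expiry timestamp (ms). A real deployment uses Redis so the lock
// is shared across serverless instances; the semantics are the same.
const locks = new Map<string, number>();

function acquireLock(key: string, ttlMs: number, now = Date.now()): boolean {
  const expiry = locks.get(key);
  if (expiry !== undefined && expiry > now) return false; // held elsewhere
  locks.set(key, now + ttlMs); // TTL so a crashed holder can't wedge the system
  return true;
}

function releaseLock(key: string): void {
  locks.delete(key);
}
```

Concurrent webhook deliveries for the same entity then serialize: the second delivery sees the lock, backs off, and retries instead of writing a stale state.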

Lead-Capture & Qualification API

Stateless serverless API powering multi-step quiz funnels. Auto-qualification, CRM upsert, Google Ads attribution bridge — closes the attribution gap from non-native form submissions.

Vercel · Google Ads · WhatConverts

Smart Translation Caching

DeepL calls fingerprinted and cached on Vercel KV (Redis) — calls only when text actually changes. Responses rebuilt from fresh CRM data plus cached translations, near-zero API cost. Exponential-backoff on outbound clients to absorb rate limits.

DeepL · Vercel · Upstash
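The caching pattern in miniature. The translate callback stands in for the DeepL call, and the fingerprinting and cache layout are simplified from the production version:

```typescript
import { createHash } from 'node:crypto';

const cache = new Map<string, string>();

// Fingerprint source text + target language; identical input hits the cache.
function fingerprint(text: string, targetLang: string): string {
  return createHash('sha256').update(`${targetLang}:${text}`).digest('hex');
}

function translateCached(
  text: string,
  targetLang: string,
  translate: (t: string) => string, // stands in for the DeepL API call
): { result: string; hit: boolean } {
  const key = fingerprint(text, targetLang);
  const cached = cache.get(key);
  if (cached !== undefined) return { result: cached, hit: true };
  const result = translate(text);
  cache.set(key, result);
  return { result, hit: false };
}
```

The expensive call fires only when the text actually changes; unchanged content costs one hash.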

Multi-System Integration

Dynamics 365 (OAuth2 + custom actions), Webflow CMS v2, HubSpot (EU1), WhatConverts, DeepL, Upstash. JWT-authenticated webhooks, end-to-end attribution pipelines.

Dynamics 365 · Webflow · HubSpot · WhatConverts · DeepL · Upstash

EU Production Infrastructure

Vercel with region-pinned deployments for EU latency. Sandbox/prod separation, distributed caching and locking. Custom booking engines, seasonal logic, multi-locale (including German umlaut slug normalization).

Vercel
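A minimal sketch of the German umlaut mapping; the production slug logic covers more cases, but this is the locale-specific part:

```typescript
// "Müller-Straße" becomes "mueller-strasse" instead of dropping characters.
function slugify(input: string): string {
  return input
    .toLowerCase()
    .replace(/ä/g, 'ae')
    .replace(/ö/g, 'oe')
    .replace(/ü/g, 'ue')
    .replace(/ß/g, 'ss')
    .replace(/[^a-z0-9]+/g, '-') // collapse everything else to single hyphens
    .replace(/^-+|-+$/g, '');    // trim leading/trailing hyphens
}
```

Naive ASCII-only slugification would turn "Müller" into "mller" or "m-ller"; transliterating first keeps the slug readable and stable.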

Agentic · Internal tool

SEO & Paid-Channel Fleet

Multi-tenant agent system for the agency's client roster. Persistent task board the agent manages itself across runs. 13 in-process tools — GSC queries, paid-channel audits, content briefs, article generation. End-to-end Webflow publishing on a weekly cadence. Same harness patterns from the open-source work, applied internally.

Next.js · Supabase · OpenRouter · Webflow

Featured System · Production live

Marketing platform
↔ Dynamics 365.

The Challenge

Production middleware between a modern marketing platform and Microsoft Dynamics 365 CRM. Distributed locking, realtime sync, EU-pinned deployment on Vercel. Runs daily in production.

The Architecture

Middleware Engineering (Vercel)
Node.js on Vercel — decoupling frontend from CRM logic.

Data Synchronization
Dynamics 365 entity reconciliation in real-time.

High-Performance Caching (Upstash)
Upstash (Redis) to minimize latency.

Processing Pipeline (DeepL)
Automated content handling.

Flow: Frontend → Middleware → Enterprise Core

Webflow
→ Vercel Serverless · POST /api/ingest
→ Upstash · Redis Lock
  { "key": "sys_123_lock", "ttl": 60, "status": "acquired" }
→ DeepL · Processing API
  { "text": "Complex Entity", "source": "RAW", "target": "NORM" }
→ Core Sync · Dynamics 365
  await crm.create(data)

FAQ

The obvious questions.

Common questions on agent harnesses, production reliability, and the gap between demo and prod.

MCP or direct API integration?

MCP wins on standardization: one wire format, dropping new agents into existing tool surfaces, swapping models without rewriting integrations. Direct API integration wins on reliability — when the MCP server doesn't expose certain fields or endpoints, when you need custom retry/error semantics, when tool-call latency is on the critical path, or when vendor-specific edge cases (pagination quirks, idempotency keys, partial responses) get swallowed by the MCP wrapper. Rule of thumb: MCP for breadth, direct integration for the 2-3 critical tools that can't afford to flake. Most production setups end up mixed.

What is the harness, exactly?

The harness is everything around the LLM that lets it do real work — tool schemas, validation, structured feedback, retry logic, safety limits. The model picks moves; the harness defines what moves are even possible and what happens when one fails. Two systems on the same model perform completely differently based on harness quality. That's where the engineering leverage actually sits.

What's the difference between RAG and an agent?

RAG retrieves context to inform a single LLM call. An agent loops: it acts, observes, decides, and acts again — usually with multiple tool calls and external state changes. If the problem is "answer questions over our docs" → RAG. If it's "execute a multi-step workflow with our systems" → agent. Most production setups end up using both — RAG as a tool inside the agent's surface.

Why do agents that demo well fail in production?

Compounding error. A 95% per-step success rate cascades to 36% over 20 steps. Demo paths run 3-5 steps under controlled conditions; production loops run 20+ steps over messy real data. The fix is in the harness: verification at each step, structured retries with backoff, hard stops on uncertain branches, escalation to humans at risk surfaces. Models don't get reliable — harnesses make them reliable.

What is MCP, and should we standardize on it?

MCP is a standard for how agents connect to tools and data sources. Think USB-C for agents — one wire format, many backends. Anthropic, OpenAI, and major frameworks now support it. If you're building agents that need to access multiple internal systems, yes — standardize on MCP. If you're building one tightly-scoped agent against one API, native integration is still fine. Don't migrate working code without a reason.

How do you test agents, and what does production-ready mean?

Three layers. (1) Trajectory tests — replay full agent runs against canonical inputs, assert on intermediate states, not just final outputs. (2) Tool-call evals — for each tool: happy-path, structured-error-path, adversarial inputs. (3) Production telemetry — log every tool call, retry, escalation, and per-loop cost. Production-ready means: trajectory tests pass at >95%, no unhandled tool errors in a 7-day prod window, cost-per-task within the budget envelope. Vibes are not a definition of done.

Engineering Logs

Notes from shipping.

All posts

Get in touch

Let's connect.

If you're working on something serious in agentic infrastructure — tool design, harness engineering, orchestration loops — drop me a line.