Building applications that make agents more successful at tasks
Claude Code · Skills & Plugins
AI agents are general-purpose — they can attempt anything.
That's also the problem.
Without specific instructions, agents wing it. They'll produce something plausible. Skills are how you get something good.
Skills range from deterministic operations (take a screenshot, format for Teams) to open-ended exploration (find missing edge cases, help a human think).
A skill is a markdown file that programs Claude Code for a specific task.
---
name: my-skill
description: What it does and when to use it.
---
## Instructions for the agent
Tables, workflows, decision trees, gates —
whatever the task needs.
No SDK. No build step. No runtime.
Just a SKILL.md and the knowledge to write it well.
Distribution is a git repo. Install with /plugin install.
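To make the file layout concrete, here is a minimal Python sketch of separating the frontmatter from the instruction body. The function name `split_skill` is hypothetical, and the parser is an assumption for illustration: it only understands single-line `key: value` pairs, not folded YAML values such as `description: >`. It is not how Claude Code itself reads skills.

```python
def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Split SKILL.md text into (frontmatter fields, markdown body)."""
    lines = text.splitlines()
    stripped = [line.strip() for line in lines]
    # Frontmatter must open on the first line and close with a second '---'.
    if not stripped or stripped[0] != "---" or "---" not in stripped[1:]:
        return {}, text
    end = stripped.index("---", 1)
    fields: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:  # skip anything that is not a simple 'key: value' line
            fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return fields, body


SKILL_MD = """---
name: my-skill
description: What it does and when to use it.
---
## Instructions for the agent
Tables, workflows, decision trees, gates.
"""

fields, body = split_skill(SKILL_MD)
print(fields["name"])         # my-skill
print(body.splitlines()[0])   # ## Instructions for the agent
```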
A skill is one markdown file. A plugin is how it ships.
Skill: a single markdown file with instructions that program the AI for one task.
skills/
my-skill/
SKILL.md
Plugin: the distribution package. It bundles skills together, optionally with code, MCP servers, agents, and hooks.
plugin/
.claude-plugin/plugin.json
commands/
skills/
agents/
references/
scripts/
src/
---
name: design-for-ai
description: >
Visual design principles from
Design for Hackers. Use when
building or improving UI/frontend
design — choosing fonts, building
color systems, establishing design
direction, auditing existing designs,
or polishing before shipping.
user-invocable: true
argument-hint: "[design|fonts|color|audit|polish]"
---
The description does the most work.
It's the only part always in the agent's context. It determines whether the skill triggers at all.
Standard fields: name, description
Claude Code extensions: when_to_use, argument-hint, user-invocable, model, context, hooks, paths, allowed-tools
Eight patterns. Which one are you building?
Router. One entry point, many handlers. The routing table maps natural language to structured tool calls.
ado:ask routes 30+ operations across 4 MCP tools:
## Routing
Parse the user's intent and call the matching tool.
| User intent | Tool | Action | Args | Flags |
|--------------------------|-----------------|-----------|--------------|--------------------------------|
| Show work item 12345 | `ado_boards` | `show` | `["12345"]` | |
| My work items | `ado_boards` | `mine` | `[]` | |
| Search for "login bug" | `ado_boards` | `search` | `["login…"]` | |
| Create a bug | `ado_boards` | `create` | `[]` | `--type Bug --title "desc"` |
| List pipelines | `ado_pipelines` | `list` | `[]` | |
| Trigger a pipeline | `ado_pipelines` | `run` | `["pipe-id"]`| `--branch main` |
| Read wiki page | `ado_wiki` | `read` | `["name",…]` | |
## Critical Rules
**Write operations require user confirmation before calling the tool.**
- `ado_boards`: create, update, link, bulk-update
- `ado_pipelines`: run, cancel
- `ado_wiki`: create, update, delete
30+ operations across 4 MCP tools, all sharing one interface: { action, args, flags }
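The shared `{ action, args, flags }` interface and the confirmation rule can be sketched in Python. In the skill, the agent does the intent parsing. This sketch only models the contract the routing table targets. `ToolCall`, `needs_confirmation`, and `dispatch` are hypothetical names, and representing flags as a dict is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    action: str
    args: list[str] = field(default_factory=list)
    flags: dict[str, str] = field(default_factory=dict)


# Write operations, copied from the Critical Rules list above.
WRITE_ACTIONS = {
    "ado_boards": {"create", "update", "link", "bulk-update"},
    "ado_pipelines": {"run", "cancel"},
    "ado_wiki": {"create", "update", "delete"},
}


def needs_confirmation(call: ToolCall) -> bool:
    """True when the call is a write operation for its tool."""
    return call.action in WRITE_ACTIONS.get(call.tool, set())


def dispatch(call: ToolCall, confirmed: bool = False) -> dict:
    """Build the shared payload, refusing writes the user has not confirmed."""
    if needs_confirmation(call) and not confirmed:
        raise PermissionError(f"{call.tool}.{call.action} requires user confirmation")
    return {"action": call.action, "args": call.args, "flags": call.flags}


print(dispatch(ToolCall("ado_boards", "show", ["12345"])))
```

Every tool receives the same payload shape, so adding an operation means adding a table row rather than a new calling convention.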
Workflow. Tasks aren't tracking here. They're enforcement. build creates every phase task upfront with dependency chains, then the execution loop is state-locked. Nothing moves until the predecessor completes.
### Create Phase Tasks Upfront
For each phase N, detect gate policy, then create tasks:
**Full gate (2 tasks):**
TaskCreate("Phase N.1: BUILD - [name]", activeForm: "Building Phase N")
TaskCreate("Phase N.2: REVIEW - [name]", activeForm: "Reviewing Phase N")
**Standard gate (1 task):**
TaskCreate("Phase N.1: BUILD - [name]", activeForm: "Building Phase N")
**Chain dependencies:**
- Full gate: N.2 blockedBy N.1. Next phase blockedBy N.2.
- Standard/Minimal: Next phase blockedBy N.1.
## Execution Loop
For each task:
1. TaskGet(task_id) → verify blockedBy list is empty
2. TaskUpdate(task_id, status: "in_progress")
3. Dispatch subagent (build-agent or post-gate-agent)
4. Wait for completion
5. If FAIL:
→ Do NOT mark completed
→ Follow Gate Failure Protocol (max 3 retries, then escalate)
6. If success:
→ TaskUpdate(task_id, status: "completed")
→ Commit (trailers: Gate-Policy, Review outcome, AI-Epistemic-Status)
7. Proceed to next task
The task stays in_progress on failure — blockedBy prevents downstream tasks from starting. Three failures → mandatory user escalation. Only the orchestrator manages task state; subagents never touch it.
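The dependency chain and the state-locked loop can be modeled in a few lines of Python. This is a sketch under assumptions, not the TaskCreate/TaskUpdate tools themselves: `Task`, `create_phase_tasks`, and `run` are hypothetical names, and the subagent dispatch is reduced to a callback that returns True on success.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_RETRIES = 3  # three failures, then escalate to the user


@dataclass
class Task:
    name: str
    blocked_by: list[str] = field(default_factory=list)
    status: str = "pending"


def create_phase_tasks(phases: list[tuple[str, bool]]) -> dict[str, Task]:
    """Create every task upfront. Each phase is (name, full_gate)."""
    tasks: dict[str, Task] = {}
    previous = None
    for n, (name, full_gate) in enumerate(phases, start=1):
        build_id = f"{n}.1"
        blockers = [previous] if previous else []
        tasks[build_id] = Task(f"Phase {build_id}: BUILD - {name}", blockers)
        previous = build_id
        if full_gate:  # full gate adds a REVIEW task blocked by BUILD
            review_id = f"{n}.2"
            tasks[review_id] = Task(f"Phase {review_id}: REVIEW - {name}", [build_id])
            previous = review_id
    return tasks


def run(tasks: dict[str, Task], dispatch: Callable[[Task], bool]) -> str:
    """Run tasks in creation order; a failed task stays in_progress."""
    for task_id, task in tasks.items():
        open_blockers = [b for b in task.blocked_by if tasks[b].status != "completed"]
        if open_blockers:
            return f"blocked: {task_id}"
        task.status = "in_progress"
        for _ in range(MAX_RETRIES):
            if dispatch(task):
                task.status = "completed"
                break
        else:  # no attempt succeeded
            return f"escalate: {task_id}"
    return "done"
```

Because only `run` mutates `status`, the sketch mirrors the rule that the orchestrator alone manages task state.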
Validator. Loads other skills as checklists, then runs multi-dimensional reviews demanding evidence for every line item.
The post-gate-agent review sequence:
## STOP — Load Standards and Checklists
Read the post-gate review standards:
1. `Read($CLAUDE_PLUGIN_ROOT/references/post-gate-standards.md)`
Then follow every `Read()` directive in that file.
## STOP — Load Skills as Checklists
If the dispatch prompt includes `## Additional Skills`, load each:
1. `Skill([skill-name])` — loads SKILL.md content
2. If `checklists.md` exists → `Read()` it
3. If `checklists/` directory exists → `Read()` every file
## Review Steps
### 1. Requirement Fulfillment (Done-When Verification)
For each DW item:
- Find concrete evidence (file:line, test, observable behavior)
- Mark: SATISFIED (with evidence) or NOT_SATISFIED (with what's missing)
| DW-ID | Done-When Item | Status | Evidence |
|--------|-------------------------|---------------|------------------|
| DW-1.1 | API returns 200 on GET | SATISFIED | api.test.ts:42 |
| DW-1.2 | Rate limiting at 100/hr | NOT_SATISFIED | No test coverage |
**ANY item NOT_SATISFIED → FAIL.**
### 2. Test-DW Coverage
### 3. Correctness Verification
### 4. Defensive Programming
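The Done-When gate reduces to a small rule: an item without evidence is NOT_SATISFIED, and one such item fails the review. A Python sketch, where `DoneWhen` and `verdict` are hypothetical names and treating an empty review as a failure is an assumption of this sketch:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DoneWhen:
    dw_id: str
    item: str
    evidence: Optional[str] = None  # e.g. "api.test.ts:42"; None means nothing found

    @property
    def status(self) -> str:
        return "SATISFIED" if self.evidence else "NOT_SATISFIED"


def verdict(items: list[DoneWhen]) -> str:
    """ANY item NOT_SATISFIED -> FAIL. Assumption: an empty review also fails."""
    if items and all(i.status == "SATISFIED" for i in items):
        return "PASS"
    return "FAIL"


review = [
    DoneWhen("DW-1.1", "API returns 200 on GET", "api.test.ts:42"),
    DoneWhen("DW-1.2", "Rate limiting at 100/hr"),
]
print(verdict(review))  # FAIL
```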
Transformer. Input goes in, rules get applied, output comes out. One pass. The ruleset does all the work.
write tackles AI prose detection with five structural rules:
## The Problem
Next-token prediction selects against surprise. RLHF narrows output
toward a bland center. The result is prose that reads like a committee
voted on every sentence.
## Core Rules (always active)
These five rules address the structural signals that blind-test
research identified as hardest to fake and most robust for detection.
### 1. Lurch
Vary sentence length violently. Shortest under five words.
Longest over thirty. Never three consecutive sentences within
five words of each other.
### 2. Spike
Vary information density across paragraphs. Pack one tight.
Let the next breathe — one idea, circled slowly.
### 3. Wander
Don't follow the outline. Start with what's interesting.
Circle back. Digress.
### 4. Shift Register
Move between precise and casual within a piece.
Technical for a sentence, then conversational.
### 5. Get Specific
Never write for everyone. Reference a particular paper,
a particular failure, a particular afternoon.
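Of the five rules, Lurch is the one that can be checked mechanically. A rough Python sketch of that check follows. The function names are hypothetical, and splitting sentences on `.`, `!`, and `?` is a naive assumption that will miscount abbreviations and decimals.

```python
import re


def sentence_lengths(text: str) -> list[int]:
    """Word count per sentence, splitting naively on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def lurch_violations(text: str) -> list[str]:
    """Return the Lurch rules the text breaks (empty list means it passes)."""
    lengths = sentence_lengths(text)
    if not lengths:
        return ["no sentences found"]
    problems = []
    if min(lengths) >= 5:
        problems.append("shortest sentence is not under five words")
    if max(lengths) <= 30:
        problems.append("longest sentence is not over thirty words")
    # Slide a window of three sentences and compare their lengths.
    for a, b, c in zip(lengths, lengths[1:], lengths[2:]):
        if max(a, b, c) - min(a, b, c) <= 5:
            problems.append("three consecutive sentences within five words of each other")
            break
    return problems
```

The other four rules need judgment, which is why they live in the skill as prose for the agent rather than as code.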
Diagnostic. The classification IS the work. Is the goal unclear, the assumption wrong, or the details missing? Each fault type triggers a different question.
clarify diagnoses the request itself:
## Classifying What's Unclear
### Fault Types
**Intention faults** — The real goal isn't recoverable from the request.
- Indirect intent: "Can you check if this is possible?" (means "do this")
- Vague objectives: "Make it better" (better how? for whom?)
**Premise faults** — An assumption in the request is wrong.
- False presupposition: "Fix the race condition" (no race condition exists)
**Parameter faults** — Required details are missing or conflicting.
- "Build a login page" (OAuth? email/password? SSO?)
**Expression faults** — Language prevents unique interpretation.
- "Update that component" (which one?)
### Ambiguity Direction
| Direction | Signal | Action |
|----------------|------------------------------|-------------------------------------|
| **Semantic** | Key terms have multiple meanings | "do you mean A or B?" |
| **Too broad** | Clear intent but scope is huge | "which part matters most now?" |
| **Too narrow** | Oddly specific for the goal | "what's the broader outcome?" |
## Generating Questions
### Think in Hypotheses
1. Generate 2-4 competing interpretations of the request
2. Identify the axis of disagreement
3. Ask about that axis
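The three hypothesis steps can be illustrated in Python. This sketch assumes each competing interpretation is written down as a small dict of attributes, and it picks the attribute on which the interpretations differ most. `axis_of_disagreement` and the attribute names are hypothetical, and in the skill this reasoning is done by the agent, not by code.

```python
from typing import Optional


def axis_of_disagreement(interpretations: list[dict[str, str]]) -> Optional[str]:
    """Return the attribute with the most distinct values, or None if all agree."""
    keys = {key for interp in interpretations for key in interp}
    spread = {k: len({interp.get(k) for interp in interpretations}) for k in keys}
    contested = {k: n for k, n in spread.items() if n > 1}
    if not contested:
        return None
    # Sort first so ties resolve alphabetically and the result is stable.
    return max(sorted(contested), key=lambda k: contested[k])


# Step 1: competing readings of "Build a login page".
readings = [
    {"auth": "OAuth", "scope": "one page"},
    {"auth": "email/password", "scope": "one page"},
    {"auth": "SSO", "scope": "one page"},
]
# Steps 2 and 3: find the axis, then ask about it.
axis = axis_of_disagreement(readings)
print(f"Which {axis} do you want?")  # Which auth do you want?
```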
Tool Wrapper. Zero domain logic in the skill. The MCP server does the actual work. All the skill does is translate "make this look good in Teams" into the right call with the right parameters.
Here's the entire penman skill: 37 lines.
---
description: Convert markdown to platform-styled rich text and copy
to clipboard. Pass the platform as a flag (e.g. `--slack`, `--teams`,
`--notion`) and the source as a file path, inline text, or nothing.
argument-hint: --<platform> [--dark|--light] [file-or-text]
---
## How to handle the request
1. **Parse `$ARGUMENTS`** into:
- `platform` — the first `--<name>` flag that isn't `--dark`/`--light`
- `theme` — `--dark` or `--light` if present, otherwise unset
- `source` — everything else. File path, inline markdown, or absent.
2. **Resolve the platform.** If no `--<platform>` flag, call
`mcp__penman__penman_platforms` to get the live list, then ask.
3. **Resolve the markdown content:**
- If `source` is an existing file path → Read it.
- If `source` is non-empty inline text → use it directly.
- If `source` is empty → AskUserQuestion:
"Clipboard" | "File path" | "Paste inline"
4. **Call the MCP.** Invoke `mcp__penman__penman` with
`markdown`, `platform`, and `theme` (only if explicitly set).
5. **Report** the tool's response verbatim.
## Examples
- `/penman:pen --teams ./notes.md`
- `/penman:pen --slack --dark`
- `/penman:pen --notion # Hello\n\nworld`
"Make this look good in Teams" → right MCP call, right params, clipboard. That's the whole skill.
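Step 1 of the skill, parsing `$ARGUMENTS`, is the only part with real logic, and the agent performs it from the prose description. For comparison, here is what the same rule looks like as a Python sketch. `parse_arguments` is a hypothetical name, and the sketch assumes all flags come before the source text.

```python
from typing import Optional

THEMES = {"--dark", "--light"}


def parse_arguments(arguments: str) -> dict[str, Optional[str]]:
    """Split the argument string into platform, theme, and source."""
    platform: Optional[str] = None
    theme: Optional[str] = None
    rest = arguments.strip()
    while rest.startswith("--"):
        parts = rest.split(None, 1)  # first token, then the untouched remainder
        token = parts[0]
        if token in THEMES:
            theme = token[2:]
        elif platform is None:
            platform = token[2:]  # first flag that is not a theme
        else:
            break  # a second non-theme flag is treated as part of the source
        rest = parts[1] if len(parts) > 1 else ""
    return {"platform": platform, "theme": theme, "source": rest or None}


print(parse_arguments("--teams ./notes.md"))
```

A missing `platform` or `source` comes back as `None`, which corresponds to the branches where the skill asks the user instead of guessing.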
Reference Frame. Not a workflow. A lens. After loading, the agent thinks differently about every decision it makes for the rest of the session.
design-for-ai loads typography, color theory, composition, and proportions into context:
## Routing
| Mode | User says something like |
|--------|-------------------------------------------------------------|
| design | "Starting a project" / "what direction" / "who is this for" |
| fonts | "Pick fonts" / "typography" / "type scale" |
| color | "Colors" / "palette" / "color scheme" |
| audit | "Something's off" / "review this" / "why does it look AI" |
| polish | "Almost done" / "final pass" / "make it less generic" |
## audit
Read `${CLAUDE_SKILL_DIR}/references/checklists.md`
Work through each section. For the **top 2-3 with worst findings**,
load the chapter reference to ground the diagnosis:
| Section | Reference file |
|----------------------|-----------------------------------------------------|
| Typography | `chapter-03-typography.md`, `appendix-fonts.md` |
| Proportions & layout | `chapter-05-proportions.md` |
| Composition | `chapter-06-composition.md` |
| Color | `chapter-08-color-science.md`, `chapter-09-theory.md`|
| Design identity | `ai-tells.md` |
Output: findings table by severity with principle citations.
This presentation was designed with it. The font pairing, color system, spacing — all grounded in theory.
Facilitator. The output isn't code or a report. It's discovered requirements that didn't exist before the conversation. Ask, listen, reflect, ask deeper.
research uses progressive narrowing to help users articulate what they want:
Help the user figure out what they want and get it written down.
They might arrive with a vague vision, a half-formed idea, or just
a problem they feel. Your job is facilitation.
## How You Talk
**Short turns.** A sentence or two of observation, then a question.
Not a paragraph of analysis.
**Have opinions.** "That sounds like a notification problem more than
a feed problem" is useful. "There are several ways to think about
this" is not. Be wrong sometimes — it's faster than being neutral.
**Match their energy.** If they're terse, be terse. If they're
thinking out loud, think with them.
**No preamble.** Don't announce what you're about to do.
## Progressive Narrowing
Each question should make the problem space smaller.
**Purpose** — Why does this need to exist?
**Actors** — Who uses it? Who benefits? Who pays?
**Context** — What exists today? What's the current pain?
**Boundaries** — What's explicitly out of scope?
**Needs** — What must it do? Priority order?
**Risks** — What must be true for this to work?
The only pattern no framework literature has named. They all focus on agents that do work. Nobody writes about agents that help humans think.
| Question | Pattern |
|---|---|
| Does the task need input classified before acting? | Router |
| Is it multi-step with dependencies? | Workflow |
| Does it need to verify output quality? | Validator |
| Is it rules → input → output in one pass? | Transformer |
| Does it diagnose problems? | Diagnostic |
| Does it make an external tool usable? | Tool Wrapper |
| Does it load a mental model into context? | Reference Frame |
| Does it help the user articulate what they want? | Facilitator |
Patterns compose inside a skill. Skills compose with each other.
| Skill | Patterns |
|---|---|
| web-research | Router + Workflow |
| post-gate-agent | Validator + Ref Frame |
| design-for-ai | Ref Frame + Router |
| plan | Workflow + Diagnostic |
| diagnose | Diagnostic + Tool Wrapper |
Skills can load other skills at runtime. Combine them and you get capabilities neither has alone.
| Combination | What you get |
|---|---|
debug + performance | Debugging guided by profiling methodology |
build + post-gate + ref frames | Implementation reviewed against domain checklists |
plan + clarify | Requirements discovered before architecture begins |
The research on what actually works.
Dump everything into context and 11 out of 13 models drop below 50% of their baseline at 32K tokens. Load selectively and you get 3.5x better results. Three levels is all you need.
| Level | What | Size | When loaded |
|---|---|---|---|
| 1 | description | ~100 tokens | Always in context |
| 2 | SKILL.md body | <500 lines | On trigger |
| 3 | references/ | Unlimited | On demand |
Real example from design-for-ai:
# Level 1 — always in context (~100 tokens)
description: "Visual design principles from Design for Hackers.
Use when building or improving UI/frontend design..."
# Level 2 — SKILL.md body, loaded on trigger (~200 lines)
## Routing
| Mode | User says |
|-------|------------------------------|
| audit | "something's off" / "review" |
| fonts | "pick fonts" / "typography" |
## audit
Read `${CLAUDE_SKILL_DIR}/references/checklists.md`
For top 2-3 worst sections, load the chapter reference...
# Level 3 — references/, loaded on demand (unlimited)
# references/chapter-03-typography.md (~350 lines)
# references/chapter-09-color-theory.md (~380 lines)
# references/ai-tells.md (~360 lines)
# ... 12 reference files, loaded ONLY when needed
MEM1 (2025) — selective loading, 3.5x performance at 3.7x less memory. NoLiMa (ICML 2025) — 11/13 models below 50% baseline at 32K. Anthropic — compaction preserves first 5,000 tokens per skill, 25,000 budget across all loaded skills.
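The three levels can be modeled as a context that grows only when something asks for more. This Python sketch is illustrative: `Skill` and `Context` are hypothetical names, and counting whitespace-separated words stands in for real token counts.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    description: str  # level 1: always in context
    body: str         # level 2: loaded on trigger
    references: dict[str, str] = field(default_factory=dict)  # level 3: on demand


class Context:
    def __init__(self, skills: list[Skill]):
        self.skills = {s.name: s for s in skills}
        # Only names and descriptions are loaded upfront.
        self.loaded = [f"{s.name}: {s.description}" for s in skills]

    def trigger(self, name: str) -> None:
        """Level 2: pull in the SKILL.md body when the skill fires."""
        self.loaded.append(self.skills[name].body)

    def read_reference(self, name: str, filename: str) -> None:
        """Level 3: pull in one reference file, only when asked."""
        self.loaded.append(self.skills[name].references[filename])

    def size(self) -> int:
        """Rough context size in words (a stand-in for tokens)."""
        return sum(len(chunk.split()) for chunk in self.loaded)


demo = Skill("demo", "short description", "body " * 10, {"ref.md": "ref " * 50})
ctx = Context([demo])
print(ctx.size())  # 3
ctx.trigger("demo")
print(ctx.size())  # 13
ctx.read_reference("demo", "ref.md")
print(ctx.size())  # 63
```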
Tell the agent how to answer before asking it to reason. It skips the reasoning and jumps straight to the answer. VISTA (2026)
Structure your skill so the model thinks before it answers.
The scarier finding: when LLMs self-reflect on failures, they produce zero structural attributions across all configurations. They can't see their own structural problems; only humans can audit the skeleton.
FSE 2025, production templates at Uber/Microsoft
This applies everywhere: workflow phases build understanding before decisions, output schemas have reasoning before conclusions, gates gather evidence before verdicts.
Write your constraints in a paragraph. By paragraph three, the model has forgotten them. NLD-P (2026)
# ❌ Prose (dissolves)
If the user wants a quick answer, use scan mode. If they want a standard
research question, use brief mode. For a deep dive with verification...
# ✅ Table (holds) — from web-research
## Depth Modes
| Mode | When | Output |
|-----------|-------------------------------|-------------------------------------|
| **scan** | Quick answer, sanity check | 1-pager: top 3 findings + URLs |
| **brief** | Standard research question | Synthesized brief w/ recommendations |
| **breadth** | Map a space, survey options | Landscape: categories, players, gaps |
| **deep** | Decision-critical, needs proof| Full report, confidence levels |
# ✅ Gates (sequence constraints) — from post-gate-agent
## STOP — Load Standards and Checklists
Read `$CLAUDE_PLUGIN_ROOT/references/post-gate-standards.md`
Then follow every `Read()` directive in that file.
## STOP — Load Skills as Checklists
If dispatch includes `## Additional Skills`, load each listed skill.
## Review Steps ← only now does the actual work begin
Mittal (2026) — each additional simultaneous constraint reduces compliance 2–21%.
FSE 2025 — explicit exclusion constraints: format-following 40% → 100%.
Move a critical instruction from position 1 to position 10. 30%+ accuracy drop. The model pays attention to the beginning and the end. The middle is a graveyard. Liu et al. (TACL 2024), confirmed by OpenAI, Anthropic, Google.
From write — rules at the top, operationalized at the bottom:
# ↑ TOP OF SKILL — state the rules with research backing
## Core Rules (always active)
These five rules address the structural signals that blind-test
research identified as hardest to fake and most robust for detection.
### 1. Lurch — Vary sentence length violently.
### 2. Spike — Vary information density across paragraphs.
### 3. Wander — Don't follow the outline.
### 4. Shift Register — Move between precise and casual.
### 5. Get Specific — Reference a particular paper, failure, afternoon.
# ... 60 lines of surface rules, deep craft, examples ...
# ↓ BOTTOM OF SKILL — operationalize the same rules as a checklist
## Self-Check (run silently before finalizing)
1. Sentence length range — shortest vs longest. <20-word gap? Fix.
2. Three consecutive same-length sentences? Break one.
3. Register — did you shift at least twice?
4. Kill list — scan for banned words from surface rules.
5. Density — every paragraph same density? Compress one, stretch another.
6. Specificity — at least one concrete reference a generic model wouldn't?
7. Structure — could someone predict the org from paragraph 1? Rearrange.
State the rules at the top. Operationalize them as a checklist at the bottom. Skimmers hit the rules. Finishers hit the checklist. Nobody escapes.
Mix your rules into the task instructions. Update the model. Watch the rules vanish. NLD-P (2026), HIPO (2026)
From build-agent — governance sections you could extract and reuse on any task:
# GOVERNANCE — applies to ANY phase, ANY codebase
## STOP — Load Standards and Checklists ← Layer 1: Identity
Before any work, read both standards files:
1. `Read($CLAUDE_PLUGIN_ROOT/references/pre-gate-standards.md)`
2. `Read($CLAUDE_PLUGIN_ROOT/references/implement-standards.md)`
Then follow every `Read()` directive in those files.
## STOP — Read Input Files First ← Layer 2: Constraints
| Source | Purpose | Required |
|----------------------|--------------------------|----------|
| Discovery + Design | What exists, gaps, decisions | YES |
| Plan file | Requirements context | YES |
## STOP — Load Skills and Checklists ← Layer 2: Constraints
If dispatch includes `## Additional Skills`, load each.
# ──────────────────────────────────────────────────
# TASK — the actual work, isolated below
## Phase 1: Discovery + Design ← Layer 3: Task
Apply the pre-gate standards (design-it-twice, depth eval, skip criteria).
### Scope the Phase
- [ ] Do the files listed in the plan exist?
- [ ] Read each file. Note current state.
## Phase 2: TDD Implementation ← Layer 3: Task
...
The test: can you pull out the governance sections and apply them to a completely different task? If yes, they're properly separated.
Testing that skills work — and getting them to people.
Skills exist to make the agent more successful at a task.
Test that it does the thing.
| Task type | Goal | Verifiability |
|---|---|---|
| Deterministic (screenshot, deploy, format) | Same result every time | Easiest |
| Guided (code review, debugging) | Follows the methodology | Medium |
| Exploratory (edge cases, research) | Covers ground the agent wouldn't find alone | Hardest |
Manual testing is the current answer. Eval frameworks are emerging.
| Method | How it works | Best for |
|---|---|---|
| Project skills (`.claude/skills/`) | Drop a SKILL.md in the repo | Project-specific knowledge, team conventions |
| Plugin install (`/plugin install`) | Git repo with manifest, /plugin install author@name | Reusable tools, shared across projects |
Project skills are fast. Plugins are distributable.
Commands are namespaced (/plugin:command). Skills auto-trigger by description match.
| Pattern | Typical contents |
|---|---|
| Facilitator, Reference Frame | SKILL.md only — no code, no MCP |
| Router, Workflow, Validator | SKILL.md + references/ + maybe agents/ |
| Transformer, Diagnostic | SKILL.md + references/ (rulesets, taxonomies) |
| Tool Wrapper | SKILL.md + MCP server + src/ + deps |
Simple patterns ship easily. The powerful patterns — MCP servers, hooks, subagent dispatch — need the full plugin system.
Penman is 37 lines. The build pipeline enforces multi-phase workflows through task dependencies. Design-for-ai changed how this presentation was designed. A markdown file can do all of that.