Skills:
Programming the AI

Building applications that make agents more successful at tasks

Claude Code · Skills & Plugins

The problem

AI agents are general-purpose — they can attempt anything.
That's also the problem.

Without specific instructions, agents wing it. They'll produce something plausible. Skills are how you get something good.

Skills range from deterministic operations (take a screenshot, format for Teams) to open-ended exploration (find missing edge cases, help a human think).

What you're building

A skill is a markdown file that programs Claude Code for a specific task.

---
name: my-skill
description: What it does and when to use it.
---

## Instructions for the agent

Tables, workflows, decision trees, gates —
whatever the task needs.

No SDK. No build step. No runtime.
Just a SKILL.md and the knowledge to write it well.

Distribution is a git repo. Install with /plugin install.

01

Anatomy

A skill is one markdown file. A plugin is how it ships.

Skill vs Plugin

Skill

A single markdown file — instructions that program the AI for one task.

skills/
  my-skill/
    SKILL.md

Plugin

The distribution package — bundles skills together, optionally with code, MCP servers, agents, and hooks.

plugin/
  .claude-plugin/plugin.json
  commands/
  skills/
  agents/
  references/
  scripts/
  src/

Frontmatter — the contract

---
name: design-for-ai
description: >
  Visual design principles from
  Design for Hackers. Use when
  building or improving UI/frontend
  design — choosing fonts, building
  color systems, establishing design
  direction, auditing existing designs,
  or polishing before shipping.
user-invocable: true
argument-hint: "[design|fonts|color|audit|polish]"
---

The description does the most work.

It's the only part always in the agent's context. It determines whether the skill triggers at all.

Standard fields: name, description
Claude Code extensions: when_to_use, argument-hint, user-invocable, model, context, hooks, paths, allowed-tools
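
A minimal sketch of reading that contract, assuming the frontmatter is a block of simple `key: value` lines between two `---` markers. The helper name `read_frontmatter` and its hand-rolled parsing are illustrative assumptions, not how Claude Code itself loads skills; a real loader would use a full YAML parser.

```python
def read_frontmatter(text: str, required=("name", "description")) -> dict:
    """Parse simple `key: value` frontmatter between two `---` lines.

    Hypothetical helper for illustration. It handles plain scalars and
    folded (`>`) blocks only; anything richer needs a real YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening --- marker")
    fields = {}
    key = None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter, the skill body starts here
        if not line.strip():
            continue
        if line[0] in " \t" and key is not None:
            # Indented continuation line: fold it into the current value.
            fields[key] = f"{fields[key]} {line.strip()}".strip()
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {line!r}")
        key = name.strip()
        value = value.strip()
        fields[key] = "" if value == ">" else value.strip('"')
    else:
        raise ValueError("missing closing --- marker")
    missing = [f for f in required if f not in fields]
    if missing:
        raise ValueError(f"missing standard fields: {missing}")
    return fields
```

Run against the design-for-ai frontmatter above, this would return the folded description as one line, which is the text the agent always has in context.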

02

Patterns

Eight patterns. Which one are you building?

Eight patterns

Router — Pick the right path
Workflow — Sequence work
Validator — Check work against criteria
Transformer — Apply rules to input
Diagnostic — Classify a problem
Tool Wrapper — Use a tool properly
Reference Frame — Think with a framework
Facilitator — Help the human think

Router — Pick the right path

One entry point, many handlers. The routing table maps natural language to structured tool calls.

ado:ask routes 30+ operations across 4 MCP tools:

## Routing

Parse the user's intent and call the matching tool.

| User intent              | Tool            | Action    | Args         | Flags                          |
|--------------------------|-----------------|-----------|--------------|--------------------------------|
| Show work item 12345     | `ado_boards`    | `show`    | `["12345"]`  |                                |
| My work items            | `ado_boards`    | `mine`    | `[]`         |                                |
| Search for "login bug"   | `ado_boards`    | `search`  | `["login…"]` |                                |
| Create a bug             | `ado_boards`    | `create`  | `[]`         | `--type Bug --title "desc"`    |
| List pipelines           | `ado_pipelines` | `list`    | `[]`         |                                |
| Trigger a pipeline       | `ado_pipelines` | `run`     | `["pipe-id"]`| `--branch main`               |
| Read wiki page           | `ado_wiki`      | `read`    | `["name",…]` |                                |

## Critical Rules

**Write operations require user confirmation before calling the tool.**
- `ado_boards`: create, update, link, bulk-update
- `ado_pipelines`: run, cancel
- `ado_wiki`: create, update, delete

30+ operations across 4 MCP tools, all sharing one interface: { action, args, flags }
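
A minimal sketch of the structured half of that router, assuming the agent has already mapped the user's intent to a row of the table. `ToolCall` and `dispatch` are hypothetical names, and the write-action sets are copied from the Critical Rules list above; the real MCP tools are not called here.

```python
from dataclasses import dataclass, field

# Write actions per tool, copied from the Critical Rules list above.
WRITE_ACTIONS = {
    "ado_boards": {"create", "update", "link", "bulk-update"},
    "ado_pipelines": {"run", "cancel"},
    "ado_wiki": {"create", "update", "delete"},
}


@dataclass
class ToolCall:
    """One row of the routing table after intent parsing (hypothetical type)."""
    tool: str
    action: str
    args: list = field(default_factory=list)
    flags: str = ""

    @property
    def needs_confirmation(self) -> bool:
        return self.action in WRITE_ACTIONS.get(self.tool, set())


def dispatch(call: ToolCall, confirmed: bool = False) -> dict:
    """Build the shared { action, args, flags } payload, gating writes."""
    if call.needs_confirmation and not confirmed:
        raise PermissionError(f"{call.tool}.{call.action} needs user confirmation")
    return {"action": call.action, "args": call.args, "flags": call.flags}
```

Reads pass straight through; a write such as `ToolCall("ado_pipelines", "run", ["pipe-id"], "--branch main")` raises until the caller passes `confirmed=True`.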

Workflow — Sequence work

Tasks aren't tracking here. They're enforcement. The build skill creates every phase task upfront with dependency chains, then runs a state-locked execution loop. Nothing moves until its predecessor completes.

### Create Phase Tasks Upfront

For each phase N, detect gate policy, then create tasks:

**Full gate (2 tasks):**
  TaskCreate("Phase N.1: BUILD - [name]", activeForm: "Building Phase N")
  TaskCreate("Phase N.2: REVIEW - [name]", activeForm: "Reviewing Phase N")

**Standard gate (1 task):**
  TaskCreate("Phase N.1: BUILD - [name]", activeForm: "Building Phase N")

**Chain dependencies:**
- Full gate: N.2 blockedBy N.1. Next phase blockedBy N.2.
- Standard/Minimal: Next phase blockedBy N.1.

## Execution Loop

For each task:
1. TaskGet(task_id) → verify blockedBy list is empty
2. TaskUpdate(task_id, status: "in_progress")
3. Dispatch subagent (build-agent or post-gate-agent)
4. Wait for completion
5. If FAIL:
   → Do NOT mark completed
   → Follow Gate Failure Protocol (max 3 retries, then escalate)
6. If success:
   → TaskUpdate(task_id, status: "completed")
   → Commit (trailers: Gate-Policy, Review outcome, AI-Epistemic-Status)
7. Proceed to next task

The task stays in_progress on failure — blockedBy prevents downstream tasks from starting. Three failures → mandatory user escalation. Only the orchestrator manages task state; subagents never touch it.
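
The same enforcement idea can be sketched outside Claude Code. This is an illustration under stated assumptions, not the build skill's implementation: `Task`, `create_phase_tasks`, `run`, and `Escalation` are hypothetical names, and the `work` callback stands in for dispatching a subagent.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_RETRIES = 3  # three failures, then escalate to the user


@dataclass
class Task:
    name: str
    blocked_by: list = field(default_factory=list)
    status: str = "pending"


class Escalation(Exception):
    """Raised when a task fails MAX_RETRIES times and a human must step in."""


def create_phase_tasks(phases: list) -> list:
    """Create every task upfront. Each phase is a (name, full_gate) pair."""
    tasks = []
    previous = None
    for n, (name, full_gate) in enumerate(phases, start=1):
        steps = ["BUILD", "REVIEW"] if full_gate else ["BUILD"]
        for i, step in enumerate(steps, start=1):
            blockers = [previous] if previous else []
            task = Task(f"Phase {n}.{i}: {step} - {name}", blockers)
            tasks.append(task)
            previous = task.name  # the next task is chained to this one
    return tasks


def run(tasks: list, work: Callable[[Task], bool]) -> None:
    """Execute tasks in order; a task starts only once its blockers completed."""
    done = set()
    for task in tasks:
        if any(blocker not in done for blocker in task.blocked_by):
            raise RuntimeError(f"{task.name} is still blocked")
        task.status = "in_progress"
        for _attempt in range(MAX_RETRIES):
            if work(task):
                task.status = "completed"
                done.add(task.name)
                break
        else:
            # The task stays in_progress, so everything downstream stays blocked.
            raise Escalation(task.name)
```

Only `run` changes task state, which mirrors the rule that the orchestrator owns it and subagents never touch it.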

Validator — Check work against criteria

Loads other skills as checklists, then runs multi-dimensional reviews demanding evidence for every line item.

The post-gate-agent review sequence:

## STOP — Load Standards and Checklists

Read the post-gate review standards:
1. `Read($CLAUDE_PLUGIN_ROOT/references/post-gate-standards.md)`

Then follow every `Read()` directive in that file.

## STOP — Load Skills as Checklists

If the dispatch prompt includes `## Additional Skills`, load each:
1. `Skill([skill-name])` — loads SKILL.md content
2. If `checklists.md` exists → `Read()` it
3. If `checklists/` directory exists → `Read()` every file

## Review Steps

### 1. Requirement Fulfillment (Done-When Verification)

For each DW item:
- Find concrete evidence (file:line, test, observable behavior)
- Mark: SATISFIED (with evidence) or NOT_SATISFIED (with what's missing)

| DW-ID  | Done-When Item          | Status        | Evidence         |
|--------|-------------------------|---------------|------------------|
| DW-1.1 | API returns 200 on GET  | SATISFIED     | api.test.ts:42   |
| DW-1.2 | Rate limiting at 100/hr | NOT_SATISFIED | No test coverage |

**ANY item NOT_SATISFIED → FAIL.**

### 2. Test-DW Coverage
### 3. Correctness Verification
### 4. Defensive Programming
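
The Done-When rule in step 1 reduces to a small check. A sketch, assuming each row records whether it was satisfied plus a note (the proof, or what is missing); `DoneWhen` and `review` are hypothetical names, and treating an empty checklist as a failure is an added assumption.

```python
from dataclasses import dataclass


@dataclass
class DoneWhen:
    dw_id: str
    item: str
    satisfied: bool
    evidence: str  # proof if satisfied, otherwise what is missing


def review(items: list) -> str:
    """Return PASS only if every Done-When item is satisfied with evidence."""
    for item in items:
        if item.satisfied and not item.evidence.strip():
            raise ValueError(f"{item.dw_id} marked SATISFIED without evidence")
    # Assumption: an empty checklist counts as FAIL, since nothing was verified.
    if items and all(item.satisfied for item in items):
        return "PASS"
    return "FAIL"
```

With the two rows from the table above, `review` returns `"FAIL"` because DW-1.2 has no test coverage.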

Transformer — Apply rules to input

Input goes in, rules get applied, output comes out. One pass. The ruleset does all the work.

The write skill tackles AI prose detection with five structural rules:

## The Problem

Next-token prediction selects against surprise. RLHF narrows output
toward a bland center. The result is prose that reads like a committee
voted on every sentence.

## Core Rules (always active)

These five rules address the structural signals that blind-test
research identified as hardest to fake and most robust for detection.

### 1. Lurch
Vary sentence length violently. Shortest under five words.
Longest over thirty. Never three consecutive sentences within
five words of each other.

### 2. Spike
Vary information density across paragraphs. Pack one tight.
Let the next breathe — one idea, circled slowly.

### 3. Wander
Don't follow the outline. Start with what's interesting.
Circle back. Digress.

### 4. Shift Register
Move between precise and casual within a piece.
Technical for a sentence, then conversational.

### 5. Get Specific
Never write for everyone. Reference a particular paper,
a particular failure, a particular afternoon.
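
Of the five rules, Lurch is the one a script can check. A rough sketch, assuming sentences end at `.`, `!`, or `?` and reading "within five words of each other" as a spread of five words or fewer; both are interpretations, and `lurch_violations` is a hypothetical helper, not part of the write skill.

```python
import re


def sentence_lengths(text: str) -> list:
    """Word counts per sentence, splitting naively on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def lurch_violations(text: str) -> list:
    """List the ways a passage breaks the Lurch rule (empty list means OK)."""
    lengths = sentence_lengths(text)
    if not lengths:
        return ["no sentences found"]
    problems = []
    if min(lengths) >= 5:
        problems.append("no sentence under five words")
    if max(lengths) <= 30:
        problems.append("no sentence over thirty words")
    # Slide a window of three sentences and compare their word counts.
    for a, b, c in zip(lengths, lengths[1:], lengths[2:]):
        if max(a, b, c) - min(a, b, c) <= 5:
            problems.append("three consecutive sentences within five words")
            break
    return problems
```

The naive splitter miscounts abbreviations and decimals, so treat the output as a prompt for a human read, not a verdict.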

Diagnostic — Classify a problem

The classification IS the work. Is the goal unclear, the assumption wrong, or the details missing? Each fault type triggers a different question.

The clarify skill diagnoses the request itself:

## Classifying What's Unclear

### Fault Types

**Intention faults** — The real goal isn't recoverable from the request.
- Indirect intent: "Can you check if this is possible?" (means "do this")
- Vague objectives: "Make it better" (better how? for whom?)

**Premise faults** — An assumption in the request is wrong.
- False presupposition: "Fix the race condition" (no race condition exists)

**Parameter faults** — Required details are missing or conflicting.
- "Build a login page" (OAuth? email/password? SSO?)

**Expression faults** — Language prevents unique interpretation.
- "Update that component" (which one?)

### Ambiguity Direction

| Direction      | Signal                       | Action                              |
|----------------|------------------------------|-------------------------------------|
| **Semantic**   | Key terms have multiple meanings | "do you mean A or B?"           |
| **Too broad**  | Clear intent but scope is huge   | "which part matters most now?"  |
| **Too narrow** | Oddly specific for the goal      | "what's the broader outcome?"   |

## Generating Questions

### Think in Hypotheses

1. Generate 2-4 competing interpretations of the request
2. Identify the axis of disagreement
3. Ask about that axis

Tool Wrapper — Use a tool properly

Zero domain logic in the skill. The MCP server does the actual work. All the skill does is translate "make this look good in Teams" into the right call with the right parameters.

Here's the entire penman skill, all 37 lines:

---
description: Convert markdown to platform-styled rich text and copy
  to clipboard. Pass the platform as a flag (e.g. `--slack`, `--teams`,
  `--notion`) and the source as a file path, inline text, or nothing.
argument-hint: --<platform> [--dark|--light] [file-or-text]
---
## How to handle the request

1. **Parse `$ARGUMENTS`** into:
   - `platform` — the first `--<name>` flag that isn't `--dark`/`--light`
   - `theme` — `--dark` or `--light` if present, otherwise unset
   - `source` — everything else. File path, inline markdown, or absent.

2. **Resolve the platform.** If no `--<platform>` flag, call
   `mcp__penman__penman_platforms` to get the live list, then ask.

3. **Resolve the markdown content:**
   - If `source` is an existing file path → Read it.
   - If `source` is non-empty inline text → use it directly.
   - If `source` is empty → AskUserQuestion:
     "Clipboard" | "File path" | "Paste inline"

4. **Call the MCP.** Invoke `mcp__penman__penman` with
   `markdown`, `platform`, and `theme` (only if explicitly set).

5. **Report** the tool's response verbatim.

## Examples
- `/penman:pen --teams ./notes.md`
- `/penman:pen --slack --dark`
- `/penman:pen --notion # Hello\n\nworld`
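
Step 1 above is ordinary argument parsing, done by the agent from prose instructions. A sketch of the same logic, assuming flags always come before the source text; `parse_arguments` is a hypothetical helper and not part of penman.

```python
def parse_arguments(arguments: str) -> dict:
    """Split an argument string into platform, theme, and source."""
    platform = None
    theme = None
    rest = arguments.strip()
    while rest.startswith("--"):
        parts = rest.split(None, 1)  # peel off one leading flag
        flag = parts[0]
        remainder = parts[1] if len(parts) > 1 else ""
        if flag in ("--dark", "--light"):
            theme = flag[2:]
        elif platform is None and len(flag) > 2:
            platform = flag[2:]  # first flag that is not a theme
        else:
            break  # unknown extra flag: leave it in the source text
        rest = remainder
    return {"platform": platform, "theme": theme, "source": rest}
```

A result with `platform` set to `None` corresponds to step 2, where the skill asks the user, and an empty `source` corresponds to the AskUserQuestion branch in step 3.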

"Make this look good in Teams" → right MCP call, right params, clipboard. That's the whole skill.

Reference Frame — Think with a framework

Not a workflow. A lens. After loading, the agent thinks differently about every decision it makes for the rest of the session.

design-for-ai loads typography, color theory, composition, and proportions into context:

## Routing

| Mode   | User says something like                                   |
|--------|-------------------------------------------------------------|
| design | "Starting a project" / "what direction" / "who is this for" |
| fonts  | "Pick fonts" / "typography" / "type scale"                  |
| color  | "Colors" / "palette" / "color scheme"                       |
| audit  | "Something's off" / "review this" / "why does it look AI"  |
| polish | "Almost done" / "final pass" / "make it less generic"       |

## audit

Read `${CLAUDE_SKILL_DIR}/references/checklists.md`

Work through each section. For the **top 2-3 with worst findings**,
load the chapter reference to ground the diagnosis:

| Section              | Reference file                                      |
|----------------------|-----------------------------------------------------|
| Typography           | `chapter-03-typography.md`, `appendix-fonts.md`     |
| Proportions & layout | `chapter-05-proportions.md`                         |
| Composition          | `chapter-06-composition.md`                         |
| Color                | `chapter-08-color-science.md`, `chapter-09-theory.md`|
| Design identity      | `ai-tells.md`                                       |

Output: findings table by severity with principle citations.

This presentation was designed with it. The font pairing, color system, spacing — all grounded in theory.

Facilitator — Help the human think

The output isn't code or a report. It's discovered requirements that didn't exist before the conversation. Ask, listen, reflect, ask deeper.

The research skill uses progressive narrowing to help users articulate what they want:

Help the user figure out what they want and get it written down.
They might arrive with a vague vision, a half-formed idea, or just
a problem they feel. Your job is facilitation.

## How You Talk

**Short turns.** A sentence or two of observation, then a question.
Not a paragraph of analysis.

**Have opinions.** "That sounds like a notification problem more than
a feed problem" is useful. "There are several ways to think about
this" is not. Be wrong sometimes — it's faster than being neutral.

**Match their energy.** If they're terse, be terse. If they're
thinking out loud, think with them.

**No preamble.** Don't announce what you're about to do.

## Progressive Narrowing

Each question should make the problem space smaller.

**Purpose**    — Why does this need to exist?
**Actors**     — Who uses it? Who benefits? Who pays?
**Context**    — What exists today? What's the current pain?
**Boundaries** — What's explicitly out of scope?
**Needs**      — What must it do? Priority order?
**Risks**      — What must be true for this to work?

The only pattern no framework literature has named. They all focus on agents that do work. Nobody writes about agents that help humans think.

Choosing a pattern

| Question                                            | Pattern         |
|-----------------------------------------------------|-----------------|
| Does the task need input classified before acting?  | Router          |
| Is it multi-step with dependencies?                 | Workflow        |
| Does it need to verify output quality?              | Validator       |
| Is it rules → input → output in one pass?           | Transformer     |
| Does it diagnose problems?                          | Diagnostic      |
| Does it make an external tool usable?               | Tool Wrapper    |
| Does it load a mental model into context?           | Reference Frame |
| Does it help the user articulate what they want?    | Facilitator     |

Everything composes

Patterns compose inside a skill. Skills compose with each other.

Patterns inside a skill

| Skill           | Patterns                 |
|-----------------|--------------------------|
| web-research    | Router + Workflow        |
| post-gate-agent | Validator + Ref Frame    |
| design-for-ai   | Ref Frame + Router       |
| plan            | Workflow + Diagnostic    |
| diagnose        | Diagnostic + Tool Wrapper |

Skills with each other

Skills can load other skills at runtime. Combine them and you get capabilities neither has alone.

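| Combination                    | What you get                                       |
|--------------------------------|----------------------------------------------------|
| debug + performance            | Debugging guided by profiling methodology          |
| build + post-gate + ref frames | Implementation reviewed against domain checklists  |
| plan + clarify                 | Requirements discovered before architecture begins |

03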

Craft

The research on what actually works.

Progressive Disclosure

Dump everything into context and 11 out of 13 models fall below 50% of their short-context baseline at 32K tokens. Load selectively and you get 3.5x better results. Three levels is all you need.

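| Level | What          | Size        | When loaded       |
|-------|---------------|-------------|-------------------|
| 1     | description   | ~100 tokens | Always in context |
| 2     | SKILL.md body | <500 lines  | On trigger        |
| 3     | references/   | Unlimited   | On demand         |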

Real example from design-for-ai:

# Level 1 — always in context (~100 tokens)
description: "Visual design principles from Design for Hackers.
  Use when building or improving UI/frontend design..."

# Level 2 — SKILL.md body, loaded on trigger (~200 lines)
## Routing
| Mode  | User says                    |
|-------|------------------------------|
| audit | "something's off" / "review" |
| fonts | "pick fonts" / "typography"  |

## audit
Read `${CLAUDE_SKILL_DIR}/references/checklists.md`
For top 2-3 worst sections, load the chapter reference...

# Level 3 — references/, loaded on demand (unlimited)
# references/chapter-03-typography.md    (~350 lines)
# references/chapter-09-color-theory.md  (~380 lines)
# references/ai-tells.md                (~360 lines)
# ... 12 reference files, loaded ONLY when needed

MEM1 (2025): selective loading, 3.5x performance at 3.7x less memory.
NoLiMa (ICML 2025): 11/13 models below 50% of baseline at 32K tokens.
Anthropic: compaction preserves the first 5,000 tokens per skill, with a 25,000-token budget across all loaded skills.
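
The first two levels are easy to lint before shipping. A sketch, assuming roughly four characters per token (a crude heuristic, not a real tokenizer) and treating the table's "~100 tokens" and "<500 lines" as hard limits; `estimate_tokens` and `budget_warnings` are hypothetical helpers.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def budget_warnings(description: str, body: str,
                    max_description_tokens: int = 100,
                    max_body_lines: int = 500) -> list:
    """Flag a skill whose Level 1 or Level 2 content exceeds its budget."""
    warnings = []
    tokens = estimate_tokens(description)
    if tokens > max_description_tokens:
        warnings.append(
            f"description is ~{tokens} tokens (budget ~{max_description_tokens})"
        )
    lines = len(body.splitlines())
    if lines >= max_body_lines:
        warnings.append(
            f"SKILL.md body is {lines} lines (budget <{max_body_lines}); "
            "move detail into references/"
        )
    return warnings
```

Level 3 has no limit to check, since reference files are only read when the body asks for them.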

Structure Before Wording

Tell the agent how to answer before asking it to reason. It skips the reasoning and jumps straight to the answer. VISTA (2026)

think → answer: 87.6%
answer → think: 13.5%

Structure your skill so the model thinks before it answers.

The scarier finding: when LLMs self-reflect on failures, they make zero structural attributions across all configurations. They can't see their own structural problems; only humans can audit the skeleton.

Validated ordering

FSE 2025, production templates at Uber/Microsoft

1. Role / Identity
2. Directive
3. Context / Workflow
4. Output Format
5. Constraints

This applies everywhere: workflow phases build understanding before decisions, output schemas have reasoning before conclusions, gates gather evidence before verdicts.

Tables for Decisions, Gates for Steps

Write your constraints in a paragraph. By paragraph three, the model has forgotten them. NLD-P (2026)

# ❌ Prose (dissolves)
If the user wants a quick answer, use scan mode. If they want a standard
research question, use brief mode. For a deep dive with verification...

# ✅ Table (holds) — from web-research
## Depth Modes

| Mode      | When                          | Output                              |
|-----------|-------------------------------|-------------------------------------|
| **scan**  | Quick answer, sanity check    | 1-pager: top 3 findings + URLs      |
| **brief** | Standard research question    | Synthesized brief w/ recommendations |
| **breadth** | Map a space, survey options | Landscape: categories, players, gaps |
| **deep**  | Decision-critical, needs proof| Full report, confidence levels       |

# ✅ Gates (sequence constraints) — from post-gate-agent

## STOP — Load Standards and Checklists
Read `$CLAUDE_PLUGIN_ROOT/references/post-gate-standards.md`
Then follow every `Read()` directive in that file.

## STOP — Load Skills as Checklists
If dispatch includes `## Additional Skills`, load each listed skill.

## Review Steps   ← only now does the actual work begin

Mittal (2026) — each additional simultaneous constraint reduces compliance 2–21%.

FSE 2025 — explicit exclusion constraints: format-following 40% → 100%.

Critical Rules at Beginning AND End

Move a critical instruction from position 1 to position 10. 30%+ accuracy drop. The model pays attention to the beginning and the end. The middle is a graveyard. Liu et al. (TACL 2024), confirmed by OpenAI, Anthropic, Google.

From write — rules at the top, operationalized at the bottom:

# ↑ TOP OF SKILL — state the rules with research backing

## Core Rules (always active)

These five rules address the structural signals that blind-test
research identified as hardest to fake and most robust for detection.

### 1. Lurch — Vary sentence length violently.
### 2. Spike — Vary information density across paragraphs.
### 3. Wander — Don't follow the outline.
### 4. Shift Register — Move between precise and casual.
### 5. Get Specific — Reference a particular paper, failure, afternoon.

# ... 60 lines of surface rules, deep craft, examples ...

# ↓ BOTTOM OF SKILL — operationalize the same rules as a checklist

## Self-Check (run silently before finalizing)

1. Sentence length range — shortest vs longest. <20-word gap? Fix.
2. Three consecutive same-length sentences? Break one.
3. Register — did you shift at least twice?
4. Kill list — scan for banned words from surface rules.
5. Density — every paragraph same density? Compress one, stretch another.
6. Specificity — at least one concrete reference a generic model wouldn't?
7. Structure — could someone predict the org from paragraph 1? Rearrange.

State the rules at the top. Operationalize them as a checklist at the bottom. Skimmers hit the rules. Finishers hit the checklist. Nobody escapes.

Separate Governance from Task

Mix your rules into the task instructions. Update the model. Watch the rules vanish. NLD-P (2026), HIPO (2026)

From build-agent — governance sections you could extract and reuse on any task:

# GOVERNANCE — applies to ANY phase, ANY codebase

## STOP — Load Standards and Checklists            ← Layer 1: Identity
Before any work, read both standards files:
1. `Read($CLAUDE_PLUGIN_ROOT/references/pre-gate-standards.md)`
2. `Read($CLAUDE_PLUGIN_ROOT/references/implement-standards.md)`
Then follow every `Read()` directive in those files.

## STOP — Read Input Files First                   ← Layer 2: Constraints
| Source               | Purpose                  | Required |
|----------------------|--------------------------|----------|
| Discovery + Design   | What exists, gaps, decisions | YES  |
| Plan file            | Requirements context     | YES      |

## STOP — Load Skills and Checklists               ← Layer 2: Constraints
If dispatch includes `## Additional Skills`, load each.

# ──────────────────────────────────────────────────
# TASK — the actual work, isolated below

## Phase 1: Discovery + Design                     ← Layer 3: Task
Apply the pre-gate standards (design-it-twice, depth eval, skip criteria).
### Scope the Phase
- [ ] Do the files listed in the plan exist?
- [ ] Read each file. Note current state.

## Phase 2: TDD Implementation                     ← Layer 3: Task
...

The test: can you pull out the governance sections and apply them to a completely different task? If yes, they're properly separated.

04

Evaluation & Distribution

Testing that skills work — and getting them to people.

Testing skills

Skills exist to make the agent more successful at a task.
Test that it does the thing.

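| Task type                                  | Goal                                        | Verifiability |
|--------------------------------------------|---------------------------------------------|---------------|
| Deterministic (screenshot, deploy, format) | Same result every time                      | Easiest       |
| Guided (code review, debugging)            | Follows the methodology                     | Medium        |
| Exploratory (edge cases, research)         | Covers ground the agent wouldn't find alone | Hardest       |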

Manual testing is the current answer. Eval frameworks are emerging.

Two ways to ship

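| Method                           | How it works                                        | Best for                                     |
|----------------------------------|-----------------------------------------------------|----------------------------------------------|
| Project skills (.claude/skills/) | Drop a SKILL.md in the repo                         | Project-specific knowledge, team conventions |
| Plugin install (/plugin install) | Git repo with manifest, /plugin install author@name | Reusable tools, shared across projects       |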

Project skills are fast. Plugins are distributable.
Commands are namespaced (/plugin:command). Skills auto-trigger by description match.

What ships depends on the pattern

| Pattern                      | Typical contents                              |
|------------------------------|-----------------------------------------------|
| Facilitator, Reference Frame | SKILL.md only — no code, no MCP               |
| Router, Workflow, Validator  | SKILL.md + references/ + maybe agents/        |
| Transformer, Diagnostic      | SKILL.md + references/ (rulesets, taxonomies) |
| Tool Wrapper                 | SKILL.md + MCP server + src/ + deps           |

Simple patterns ship easily. The powerful patterns — MCP servers, hooks, subagent dispatch — need the full plugin system.

Additional Resources

Documentation

  • Extend Claude with skills
    code.claude.com/docs/en/skills
  • The Complete Guide to Building Skills for Claude
    anthropic.com/research/building-skills-for-claude
  • Equipping Agents for the Real World
    anthropic.com/engineering/equipping-agents-for-the-real-world

Prompting guides

  • OpenAI GPT-4.1 Prompting Guide
    cookbook.openai.com/examples/gpt4-1_prompting_guide
  • Google Gemini Prompting Strategies
    ai.google.dev/gemini-api/docs/prompting-strategies

Research cited

  • Liu et al., "Lost in the Middle," TACL 2024
    arxiv.org/abs/2307.03172
  • Modarressi et al., "NoLiMa," ICML 2025
    arxiv.org/abs/2502.05167
  • Zhou et al., "MEM1," 2025
    arxiv.org/abs/2506.15841
  • "From Prompts to Templates," FSE 2025
    arxiv.org/abs/2504.02052
  • Liu et al., "VISTA," 2026
    arxiv.org/abs/2603.18388
  • NLD-P, 2026
    arxiv.org/abs/2602.22790
  • Mittal, "Prospective Memory Failures," 2026
    arxiv.org/abs/2603.23530
  • Chen et al., "HIPO," 2026
    arxiv.org/abs/2603.16152


Penman is 37 lines. The build pipeline enforces multi-phase workflows through task dependencies. Design-for-ai changed how this presentation was designed. A markdown file can do all of that.