How Scoring Works

Every instruction file on MarkedDown is scored with deterministic, reproducible tests. No subjective ratings. No vibes. The same file, the same test suite, the same scoring engine — run against every model we support. Here's how it works.

The problem we're solving

An instruction file that works perfectly on Claude Sonnet might fail on GPT-4o. A rule set that GPT-4o follows to the letter might confuse Gemma. Models interpret instructions differently — different attention patterns, different training data, different alignment tuning.

Before MarkedDown, there was no way to know how well an instruction file would work on your model without manually testing it. Copy the file, paste it into your tool, run some prompts, eyeball the output, and hope your sample was representative. That's not testing — that's guessing.

MarkedDown replaces guessing with measurement.

The test pipeline

When a file is tested against a model, it goes through a structured pipeline with up to three tiers of increasing difficulty:

Sanity Check (5 cases)

Basic instruction adherence. Does the model acknowledge the role? Does it respond in the expected format? Does it refuse to break character when pushed? If a file fails sanity checks, it's fundamentally broken for that model — no point running deeper tests.

Tier 1 (20 cases)

Functional coverage. Tests whether the model actually uses the skills, rules, and constraints defined in the instruction file. Each test case targets a specific behavior: "Does the code reviewer flag security issues?" "Does the persona stay in character under adversarial prompts?" Scored deterministically — pattern matching, keyword extraction, structural analysis. No LLM judges at this tier.
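To make "scored deterministically" concrete, here is a minimal sketch of what a Tier 1 check could look like. Everything in it is illustrative: the function names, the JSON-format rule, and the required keyword are assumptions for the example, not MarkedDown's actual implementation.

```python
import json
import re

def check_structure(output: str) -> bool:
    """Structural compliance: output must parse as JSON (hypothetical rule)."""
    try:
        json.loads(output)
        return True
    except ValueError:
        return False

def check_keywords(output: str, required: list[str]) -> bool:
    """Keyword presence: every required term must appear (case-insensitive)."""
    return all(re.search(re.escape(k), output, re.IGNORECASE) for k in required)

# Example: a code-review instruction file might require the term "SQL injection"
sample = '{"finding": "Possible SQL injection in query builder", "severity": "high"}'
print(check_structure(sample))                    # True
print(check_keywords(sample, ["SQL injection"]))  # True
```

Because checks like these are plain string and structure operations, the same output always produces the same verdict, which is what makes the scores reproducible.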

Tier 2 — Escalation (variable cases)

Difficulty ratchet. If a model scores 0.9 or higher on Tier 1, Tier 2 fires automatically. This tier uses a student/tutor/oracle loop: a "student" model attempts harder tasks guided by the instruction file, a "tutor" evaluates whether the output follows the instructions, and an "oracle" judge scores nuanced rubric criteria that can't be captured by pattern matching.

Tier 2 exists because a 0.95 on easy tests and a 0.95 on hard tests are very different claims. The escalation ratchet separates models that superficially follow instructions from models that deeply internalize them.
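The tier gating described above can be sketched as a small control-flow function. The callables are hypothetical stand-ins for the real tier runners, and the rule that any sanity failure halts the run is an assumption; only the 0.9 escalation threshold comes from the text.

```python
ESCALATION_THRESHOLD = 0.9  # Tier 2 fires at a Tier 1 score of 0.9 or higher

def run_pipeline(run_sanity, run_tier1, run_tier2):
    """Gate each tier on the one before it (callables return scores in [0, 1])."""
    sanity = run_sanity()
    if sanity < 1.0:
        # Assumed rule: any sanity failure means the file is fundamentally
        # broken for this model, so deeper tiers are skipped.
        return {"sanity": sanity, "tier1": None, "tier2": None}
    tier1 = run_tier1()
    tier2 = run_tier2() if tier1 >= ESCALATION_THRESHOLD else None
    return {"sanity": sanity, "tier1": tier1, "tier2": tier2}

# A file that aces sanity and scores 0.95 on Tier 1 triggers escalation:
print(run_pipeline(lambda: 1.0, lambda: 0.95, lambda: 0.88))
```

A Tier 1 score of 0.85, by contrast, would leave `tier2` as `None`: the difficulty ratchet only engages for files that have already proven themselves on the easier cases.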

The scoring engine

Scores are computed by a deterministic scoring engine — no randomness, no model-in-the-loop (except for the Tier 2 oracle judge on rubric criteria). The same file + same model + same test suite = the same score every time. This is critical for comparability.

The engine checks:

  • Structural compliance — does the output match the expected format (JSON, Markdown, bullet points)?
  • Keyword presence — does the output contain required technical terms, constraints, or role indicators?
  • Constraint adherence — does the model respect "never" and "always" rules without exception?
  • Role stability — does the persona hold under follow-up prompts and edge cases?
  • Rubric depth (Tier 2 only) — does the output demonstrate genuine understanding vs. surface-level pattern matching?

Each test case returns pass or fail. The final score is the ratio of passed cases to total cases, expressed as a decimal between 0 and 1. A score of 0.85 means 85% of test cases passed.
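The final-score arithmetic is simple enough to show directly. A sketch, where `compute_score` is a hypothetical name and each boolean is one test-case verdict:

```python
def compute_score(results: list[bool]) -> float:
    """Final score = passed cases / total cases, as a decimal in [0, 1]."""
    if not results:
        raise ValueError("no test cases to score")
    return sum(results) / len(results)

# 17 of 20 Tier 1 cases passed -> 0.85
print(compute_score([True] * 17 + [False] * 3))  # 0.85
```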

What the scores mean

0.90 — 1.00

Excellent compatibility. The model follows this instruction file reliably. Tier 2 escalation has been triggered, meaning performance holds under harder conditions. Safe to use in production workflows.

0.70 — 0.89

Good compatibility. The model follows most instructions but may miss edge cases or occasionally break constraints. Usable, but review output for the specific behaviors you care about.

0.50 — 0.69

Partial compatibility. The model understands the role but doesn't consistently follow rules or constraints. Consider a different model for this file, or simplify the instructions.

Below 0.50

Poor compatibility. The model doesn't reliably follow this instruction file. The file may need restructuring for this model, or the model may simply lack the instruction-following capability required.
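The four bands above reduce to a simple threshold function. This is a sketch with cutoffs taken from the bands listed here; `compatibility_band` is a hypothetical name, not part of MarkedDown's API.

```python
def compatibility_band(score: float) -> str:
    """Map a score in [0, 1] to the compatibility bands described in the docs."""
    if score >= 0.90:
        return "Excellent"
    if score >= 0.70:
        return "Good"
    if score >= 0.50:
        return "Partial"
    return "Poor"

print(compatibility_band(0.85))  # Good
```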

BYOK — Bring Your Own Key

MarkedDown doesn't subsidize API costs. When you run a test, you provide your own API key for the model you want to test. Your key is used for the test run only — it's never stored, logged, or transmitted anywhere except directly to the model provider's API.

Pre-cached scores (the ones you see on file pages without running a test) are seeded by the MarkedDown team using our own keys. These cover the most common models so you can browse compatibility without spending anything.

See it in action

The best way to understand scoring is to look at real results:

  • Browse the library — every file shows its compatibility scores on the detail page
  • Compare two files — see a head-to-head compatibility matrix across models
  • View all models — see which models are supported and how many files they've been tested against