Structured Outputs with AI 2026: JSON, XML and Reliable Parsing

For production apps, LLM outputs must be machine-readable. The 2026 guide to structured outputs: native JSON modes, XML tags, schema validation and robust parsing strategies.



Update history (2)
  1. Native JSON modes in GPT-4.5, Claude 3.5 and Gemini 2.0 documented, JSON Schema Draft 2020-12 as the standard, Pydantic and Zod workflows updated.
  2. Original publication with practical guide to structured outputs including schema validation, repair strategies and the invoice-extraction example.

Structured outputs in 2026: why JSON prompting is no longer a hack

Two years ago, persuading a large language model to return parseable JSON was a mix of prayer and regex. You wrote a careful system prompt, begged the model not to wrap its answer in Markdown code fences, added a few few-shot examples, and then you still wrote a forgiving parser that could strip explanations, repair dangling commas and reconstruct the occasional truncated array. Teams that built on top of GPT-3.5 and early Claude versions accumulated entire libraries of defensive parsing code just to keep production pipelines alive, and most of them carried at least one open incident per week that traced back to a model deciding to be helpful by adding a trailing sentence like “Hope this helps!” after the closing brace.

In May 2026 that era is finally over. Every major provider ships a first-class mechanism for schema-constrained output. OpenAI’s Structured Outputs feature, introduced in the API in 2024 and available on GPT-4.5, enforces a supplied JSON Schema at the token sampling level so the model is unable to emit JSON that violates it. Anthropic’s Claude exposes the same capability through Tool Use with input_schema, where Claude 3.5 Sonnet and Opus validate tool arguments against the schema on the server side before returning them to you. Google’s Gemini 2.0 accepts a responseSchema parameter that is enforced by the same constrained decoding pipeline used for function calling. The three approaches differ in syntax and in a few subtle guarantees, but they converge on the same promise: you describe the shape of the answer you want, and the model delivers that shape.

Structured output is one of the seven core techniques from our Prompt Engineering 2026 guide — this article goes deep on it for production use.

That changes how you build. A robust extraction pipeline no longer needs a retry loop with exponential backoff around a flaky parser. A tool-using agent does not need a second model call to clean up malformed arguments. A data ingestion script that turns PDFs into database rows can finally skip the “manual review” queue for structural errors and focus on semantic correctness. If you are still writing repair prompts as your first line of defence, you are solving a problem that the API has already solved for you — and paying for the privilege in latency, tokens and on-call pages.

The rest of this guide walks through the three provider implementations, the schema authoring conventions that make them reliable, the validation layers you still need on the client side even with strict mode, and a full end-to-end example that extracts structured invoice data from a PDF. Along the way we look at XML prompting, which remains the best choice for a specific class of problems Claude handles especially well, and at the failure modes that still trip up teams migrating from the old world of free-text parsing.

Native JSON mode: what GPT-4.5, Claude 3.5 and Gemini 2.0 offer

All three frontier providers now ship native JSON output, but the feature space differs enough to matter when you pick a default model for a pipeline. OpenAI distinguishes two levels: a legacy response_format: { type: "json_object" } that guarantees syntactically valid JSON without constraining the keys, and the newer response_format: { type: "json_schema", json_schema: { name, strict: true, schema } } that binds the output to a supplied JSON Schema at the decoder. The strict mode is the one you want in production. Once strict: true is set, the model cannot omit a required field, cannot introduce extra keys when additionalProperties is false, and cannot violate an enum or a numeric constraint. The cost is a small warm-up latency on the first call with a new schema, because the server compiles the schema into a finite-state machine for constrained decoding and then caches it per model.

Anthropic takes a slightly different route. Claude 3.5 does not expose a top-level response_format but instead leans on Tool Use. You declare a tool whose input_schema is the JSON Schema of the object you want, you instruct the model to call that tool, and you read the arguments from the resulting tool_use block. Internally the mechanism is the same constrained generation — Claude validates the tool arguments against the schema before returning them — but the developer surface is expressed in the tool-calling idiom. That has two practical consequences: first, you can combine schema-bound output with actual tool execution in the same turn, which is elegant for agentic workflows; second, you sometimes need a tiny wrapper to make pure extraction feel natural, because you are nominally calling a “tool” that is really just your output contract.

Gemini 2.0 offers the most declarative API of the three. You set response_mime_type: "application/json" and provide a responseSchema in the generation config, and Google’s decoder enforces both the MIME type and the schema. Gemini’s schema dialect is a subset of OpenAPI 3.0 rather than pure JSON Schema 2020-12, which means a few constructs — notably $ref, oneOf with discriminators, and some format validators — behave differently or need to be flattened before submission. For most straightforward extraction schemas you will not notice, but for deeply recursive structures you may find yourself writing a small normaliser that converts your canonical Zod or Pydantic schema into Gemini-friendly form.

Under the hood, all three providers now use some variant of constrained decoding: at each token step, the set of allowed next tokens is masked to those that can still lead to a valid document under the grammar derived from your schema. That is what makes “100% valid outputs” a defensible claim. It is not a guarantee about semantic correctness — the model can still return the wrong number for a price — but the structure is no longer a source of production errors. When you feel the urge to keep a defensive parser around because “you never know”, ask yourself whether you have observed a single structural failure from strict mode in the last thousand calls. The honest answer is almost always no.
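To make the masking step concrete, here is a toy sketch of constrained decoding. It is not how any provider implements it: the "grammar" is just a list of allowed outputs, and the tokens and scores in STEPS are invented for illustration. At each step the decoder discards every token that could no longer lead to an allowed output and picks the best of the rest.

```python
def constrained_pick(prefix: str, scores: dict[str, float], allowed: list[str]) -> str:
    """Return the highest-scoring token that keeps the output a prefix of an allowed value."""
    viable = [
        tok for tok in scores
        if any(candidate.startswith(prefix + tok) for candidate in allowed)
    ]
    if not viable:
        raise ValueError("no token can extend the prefix to a valid output")
    return max(viable, key=lambda tok: scores[tok])


def decode(step_scores: list[dict[str, float]], allowed: list[str]) -> str:
    """Greedy decoding with the mask applied at every step."""
    out = ""
    for scores in step_scores:
        if out in allowed:  # a complete, valid document has been produced
            break
        out += constrained_pick(out, scores, allowed)
    return out


# The "schema": a JSON string restricted to three enum values.
ALLOWED = ['"pending"', '"paid"', '"cancelled"']

# Hypothetical per-step model preferences. Without the mask, step 1 would
# pick "Sure" and step 2 would pick "!".
STEPS = [
    {"Sure": 0.9, '"pa': 0.6, '"pen': 0.3},
    {"id": 0.5, "ding": 0.4, "!": 0.8},
    {'"': 0.7, " ok": 0.9},
]

result = decode(STEPS, ALLOWED)
print(result)
```

The mask only constrains structure. The toy model still chose "paid" over "pending" on its own preferences, which is the semantic part no decoder can check.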

JSON Schema in the prompt: forcing 100% valid outputs

Even with strict mode, the schema itself is the contract, and badly written schemas still produce badly shaped answers. The first rule is to be explicit and narrow. A field that should only ever be "pending", "paid" or "cancelled" belongs in an enum, not as string. A date should carry format: "date" and, if you want ISO 8601 timestamps, format: "date-time". A price in cents should be an integer with a minimum: 0, not a number that invites the model to emit 19.9900000001. Every field that you consider mandatory should appear in required, and additionalProperties should be explicitly false for extraction schemas — if you allow extras, some models will occasionally invent helpful metadata like "confidence_note" that your downstream code will then have to tolerate.

The second rule is to put descriptions everywhere the model needs them. A JSON Schema field can carry a description string, and providers that support strict mode pass that description into the prompt context. A description: "Invoice total in cents, excluding tax" on an integer field is worth five lines of instructions in the system prompt, because it is attached to the exact point where the model decides what to output. Teams that moved from instruction-heavy system prompts to description-heavy schemas typically see extraction accuracy rise by several percentage points without any other change, and the system prompt shrinks to a few sentences of context.
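The first two rules can be folded into a pair of small helpers, sketched below. The helper names field and closed_object are hypothetical, as are the example fields. The point is that every field must carry a description, and every object is closed and fully required by construction.

```python
def field(type_: str, description: str, **constraints) -> dict:
    """Build one schema property; the description is mandatory by signature."""
    return {"type": type_, "description": description, **constraints}


def closed_object(properties: dict[str, dict]) -> dict:
    """Build an object schema where every property is required and extras are rejected."""
    return {
        "type": "object",
        "properties": properties,
        "required": list(properties),
        "additionalProperties": False,
    }


payment = closed_object({
    "status": field(
        "string",
        "Payment state as shown on the invoice",
        enum=["pending", "paid", "cancelled"],
    ),
    "due_date": field("string", "Due date in ISO 8601 format, YYYY-MM-DD", format="date"),
    "total_cents": field("integer", "Invoice total in cents, excluding tax", minimum=0),
})
```

Because required is derived from the property names, a field added later cannot be left optional by accident.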

The third rule is to keep the schema closed under iteration. As you add fields over the lifetime of a project, resist the temptation to reuse the same schema for slightly different tasks. A schema that works for German invoices and a schema that works for US invoices should be two schemas, even if they share ninety percent of their fields, because a single schema with a growing list of optional fields slowly erodes the model’s ability to focus. If you find yourself writing conditional if/then/else clauses in JSON Schema to cover regional variants, split the schema and route by locale before the call.
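Routing by locale before the call can be as small as a dictionary lookup. The sketch below assumes two hypothetical schemas keyed by country code; the vat_id and sales_tax_cents fields are invented to show where regional variants diverge.

```python
INVOICE_SCHEMAS = {
    "DE": {
        "title": "invoice_de",
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "vat_id": {"type": "string", "description": "Seller VAT ID as printed"},
        },
        "required": ["invoice_number", "vat_id"],
        "additionalProperties": False,
    },
    "US": {
        "title": "invoice_us",
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "sales_tax_cents": {"type": "integer", "minimum": 0},
        },
        "required": ["invoice_number", "sales_tax_cents"],
        "additionalProperties": False,
    },
}


def schema_for(country_code: str) -> dict:
    """Pick the schema for one region; unknown regions fail loudly instead of falling back."""
    try:
        return INVOICE_SCHEMAS[country_code.upper()]
    except KeyError:
        raise ValueError(f"no invoice schema for {country_code!r}") from None
```

Failing on an unknown region is deliberate: a silent fallback to the "closest" schema reintroduces the optional-field sprawl the split was meant to avoid.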

The fourth rule is to take Draft 2020-12 seriously. Most providers now accept the 2020-12 dialect, which introduces $defs for reusable fragments, prefixItems for tuple-like arrays, and better handling of unevaluatedProperties. Using $defs to factor out repeated sub-objects (a postal address, a currency amount, a person with first and last name) makes large schemas much more maintainable and keeps them smaller when they are sent with the request, because each shared definition is written once and referenced by name instead of being repeated at every point of use.
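A small illustration of the $defs point, using a hypothetical money sub-object referenced three times. Serialized character count is used here only as a rough proxy for size; actual token counts depend on the provider's tokenizer.

```python
import json

# A reusable currency amount, defined once.
money = {
    "type": "object",
    "properties": {
        "amount_cents": {"type": "integer", "minimum": 0},
        "currency": {"type": "string", "enum": ["EUR", "USD", "GBP", "CHF"]},
    },
    "required": ["amount_cents", "currency"],
    "additionalProperties": False,
}

with_defs = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "$defs": {"money": money},
    "properties": {
        "subtotal": {"$ref": "#/$defs/money"},
        "tax": {"$ref": "#/$defs/money"},
        "total": {"$ref": "#/$defs/money"},
    },
    "required": ["subtotal", "tax", "total"],
    "additionalProperties": False,
}

# The same schema with the definition copied into every property.
inlined = {key: value for key, value in with_defs.items() if key != "$defs"}
inlined["properties"] = {name: money for name in with_defs["properties"]}

size_with_defs = len(json.dumps(with_defs))
size_inlined = len(json.dumps(inlined))
print(size_with_defs, size_inlined)
```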

OpenAI Structured Outputs API with response_format and strict mode

The canonical OpenAI pattern looks like this in Python:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4.5",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice_extraction",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "invoice_number": {
                        "type": "string",
                        "description": "Unique invoice identifier as printed on the document",
                    },
                    "issue_date": {
                        "type": "string",
                        "format": "date",
                        "description": "Issue date in ISO 8601 format, YYYY-MM-DD",
                    },
                    "total_cents": {
                        "type": "integer",
                        "minimum": 0,
                        "description": "Gross total in cents, no decimals",
                    },
                    "currency": {
                        "type": "string",
                        "enum": ["EUR", "USD", "GBP", "CHF"],
                    },
                    "line_items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "quantity": {"type": "integer", "minimum": 1},
                                "unit_price_cents": {"type": "integer", "minimum": 0},
                            },
                            "required": ["description", "quantity", "unit_price_cents"],
                            "additionalProperties": False,
                        },
                    },
                },
                "required": ["invoice_number", "issue_date", "total_cents", "currency", "line_items"],
                "additionalProperties": False,
            },
        },
    },
    messages=[
        {"role": "system", "content": "You extract invoice data from raw OCR text."},
        {"role": "user", "content": ocr_text},
    ],
)

payload = response.choices[0].message.content

The returned content is always a JSON string that parses cleanly and validates against the schema. Notice the absence of any instruction along the lines of “respond only in valid JSON, do not wrap in Markdown”. That instruction is obsolete when strict: true is set; the decoder literally cannot produce anything else. A common migration mistake is to keep those guardrails in the system prompt and then wonder why responses feel a touch stiffer than before — the instructions are now competing with the constraint rather than reinforcing it.

Two practical notes. First, not every JSON Schema construct is supported in strict mode. OpenAI documents a subset: anyOf is supported (though not at the root of the schema), oneOf is not, $ref is supported within the same document, and advanced validators like patternProperties are rejected. Strict mode also requires every object to set additionalProperties to false and to list all of its properties in required. The API returns a descriptive error on the first call if your schema violates the subset, and the fix is usually to flatten or rewrite. Second, strict mode counts the schema against your input tokens. A 2,000-token schema on every call adds up; we return to that cost question further down.
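Rather than discovering subset violations on the first API call, you can lint the schema locally. The sketch below checks two requirements OpenAI's documentation states for strict mode (every object closed with additionalProperties set to false, every property listed in required) plus a configurable list of banned keywords. The function name and the default banned list are assumptions for illustration, not an exhaustive copy of the provider's rules.

```python
def strict_mode_problems(
    schema: dict,
    path: str = "$",
    banned: tuple[str, ...] = ("patternProperties",),
) -> list[str]:
    """Walk a schema and report constructs that strict mode is expected to reject."""
    problems = []
    for keyword in banned:
        if keyword in schema:
            problems.append(f"{path}: unsupported keyword {keyword}")

    if schema.get("type") == "object":
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = sorted(set(props) - set(schema.get("required", [])))
        if missing:
            problems.append(f"{path}: not listed in required: {missing}")
        for name, sub in props.items():
            problems += strict_mode_problems(sub, f"{path}.{name}", banned)

    if schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems += strict_mode_problems(schema["items"], f"{path}[]", banned)

    return problems


# Example: the nested line-item object forgot to close itself.
schema = {
    "type": "object",
    "properties": {
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"description": {"type": "string"}},
                "required": ["description"],
            },
        },
    },
    "required": ["line_items"],
    "additionalProperties": False,
}

problems = strict_mode_problems(schema)
print(problems)
```

Running this in CI keeps schema errors out of the request path. The sketch does not follow $ref or anyOf branches, so treat a clean result as necessary rather than sufficient.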

Anthropic Claude: Tool Use for schema-validated JSON responses

With Claude 3.5 you reach the same outcome through the Tool Use API. The mental model shifts from “ask the model to respond in a format” to “offer the model a tool whose only purpose is to structure the answer”. In practice it looks like this:

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "record_invoice",
        "description": "Record the extracted invoice fields.",
        "input_schema": {
            "type": "object",
            "properties": {
                "invoice_number": {"type": "string"},
                "issue_date": {"type": "string", "format": "date"},
                "total_cents": {"type": "integer", "minimum": 0},
                "currency": {"type": "string", "enum": ["EUR", "USD", "GBP", "CHF"]},
                "line_items": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "description": {"type": "string"},
                            "quantity": {"type": "integer", "minimum": 1},
                            "unit_price_cents": {"type": "integer", "minimum": 0},
                        },
                        "required": ["description", "quantity", "unit_price_cents"],
                    },
                },
            },
            "required": ["invoice_number", "issue_date", "total_cents", "currency", "line_items"],
        },
    }
]

message = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=2048,
    tools=tools,
    tool_choice={"type": "tool", "name": "record_invoice"},
    messages=[
        {"role": "user", "content": f"Extract invoice fields from:\n\n{ocr_text}"},
    ],
)

tool_use = next(block for block in message.content if block.type == "tool_use")
payload = tool_use.input

The tool_choice parameter is what turns this into a pure extraction call: by forcing the model to use a specific tool, you guarantee that the response is exactly one tool_use block whose input matches the schema. Claude’s server-side validation means that by the time you reach tool_use.input, the object has already been checked. You still want a client-side validator, because the server cannot enforce semantic invariants (for example, that the sum of quantity × unit_price_cents over all line items equals total_cents), but the shape is solid.

Claude’s Tool Use pattern has one notable ergonomic advantage over OpenAI’s response_format: you can define several tools and let the model choose. In an agentic setting that might be record_invoice alongside record_receipt and record_credit_note, and Claude will pick the one that matches the document in front of it. That branching capability is why many teams building mixed extraction and action pipelines standardise on Claude for the routing layer.
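On the receiving side, letting the model choose among several tools means dispatching on the tool name. A minimal sketch follows, with plain dictionaries standing in for the SDK's content blocks (the real SDK returns objects with attributes rather than dicts). The handlers and the receipt_id field are hypothetical.

```python
from typing import Callable


def save_invoice(data: dict) -> str:
    return f"invoice {data['invoice_number']}"


def save_receipt(data: dict) -> str:
    return f"receipt {data['receipt_id']}"


# One handler per declared tool name.
HANDLERS: dict[str, Callable[[dict], str]] = {
    "record_invoice": save_invoice,
    "record_receipt": save_receipt,
}


def dispatch(content_blocks: list[dict]) -> str:
    """Route the first tool_use block to the handler registered for its tool name."""
    for block in content_blocks:
        if block.get("type") == "tool_use":
            handler = HANDLERS.get(block["name"])
            if handler is None:
                raise ValueError(f"model chose an unknown tool: {block['name']}")
            return handler(block["input"])
    raise ValueError("response contained no tool_use block")


# Simulated response content: some text followed by the chosen tool call.
example = [
    {"type": "text", "text": "This looks like a receipt."},
    {"type": "tool_use", "name": "record_receipt", "input": {"receipt_id": "R-1001"}},
]

routed = dispatch(example)
print(routed)
```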

Google Gemini: responseSchema and automatic type enforcement

Gemini 2.0 collapses the whole story into generation config. You declare the response MIME type and the schema, and the rest of the request looks like a normal content call:

import google.generativeai as genai

genai.configure(api_key=API_KEY)

schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "total_cents": {"type": "integer"},
        "currency": {"type": "string", "enum": ["EUR", "USD", "GBP", "CHF"]},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "integer"},
                    "unit_price_cents": {"type": "integer"},
                },
                "required": ["description", "quantity", "unit_price_cents"],
            },
        },
    },
    "required": ["invoice_number", "issue_date", "total_cents", "currency", "line_items"],
}

model = genai.GenerativeModel("gemini-2.0-pro")

response = model.generate_content(
    f"Extract invoice fields from:\n\n{ocr_text}",
    generation_config={
        "response_mime_type": "application/json",
        "response_schema": schema,
    },
)

payload = response.text  # always valid JSON under the schema

Gemini’s schema dialect is based on OpenAPI 3.0, which trips up teams who paste a Zod- or Pydantic-generated JSON Schema directly. Two issues come up most often. First, basic constraints such as integer with minimum are accepted, but some format validators (such as uuid) are silently ignored, so you need application-level validation for them. Second, nullable fields are expressed via nullable: true on the field, not via "type": ["string", "null"]. Write a small adapter once (it is usually under thirty lines) and you can keep a single canonical schema per domain object that you normalise for each provider at the edge.
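Such an adapter might look like the sketch below. It handles only two rewrites, inlining local $defs references and turning a ["type", "null"] union into nullable: true, and it assumes non-recursive schemas. The function name to_gemini_schema is hypothetical, and which other keywords need treatment depends on the Gemini API version you target.

```python
import copy


def to_gemini_schema(schema: dict) -> dict:
    """Return a copy with local $refs inlined and null unions rewritten as nullable."""
    defs = schema.get("$defs", {})

    def convert(node):
        if isinstance(node, list):
            return [convert(item) for item in node]
        if not isinstance(node, dict):
            return node
        if "$ref" in node:
            # Only local references of the form "#/$defs/<name>" are handled.
            name = node["$ref"].removeprefix("#/$defs/")
            return convert(copy.deepcopy(defs[name]))
        out = {
            key: convert(value)
            for key, value in node.items()
            if key not in ("$defs", "$schema")
        }
        type_ = out.get("type")
        if isinstance(type_, list) and "null" in type_:
            remaining = [t for t in type_ if t != "null"]
            if len(remaining) != 1:
                raise ValueError(f"cannot express type union {type_} as nullable")
            out["type"] = remaining[0]
            out["nullable"] = True
        return out

    return convert(schema)


canonical = {
    "$defs": {
        "address": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
    "type": "object",
    "properties": {
        "billing": {"$ref": "#/$defs/address"},
        "note": {"type": ["string", "null"]},
    },
}

converted = to_gemini_schema(canonical)
print(converted)
```

The canonical schema is left untouched, so the same object can still be sent to the other providers.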

On the plus side, Gemini’s constrained decoding is fast, the MIME-type guarantee means no stray Markdown, and the response includes no additional explanation unless you explicitly allow it. For large-batch document processing workloads, Gemini 2.0 often comes in at the lowest cost per extracted record of the three providers, especially when you route the simpler documents to the Flash variant and reserve Pro for the harder ones.

XML prompting: when XML beats JSON

Despite the JSON renaissance, there is a class of problems where XML still wins, and Claude in particular responds beautifully to it. The pattern is: structured output that mixes machine-parseable scaffolding with long free-text content. Think of a cover letter with <introduction>, <body>, <closing> sections, or a legal document summary with <facts>, <legal_reasoning> and <conclusion>, or a multi-section product description with <benefits>, <technical_specs> and <warranty_terms>. In JSON these would become objects with string values that contain large paragraphs, and the escaping becomes painful: literal quotes, backslashes and line breaks all need to survive round-tripping.

XML avoids that friction. You ask the model to return:

<cover_letter recipient="ACME Corp" role="Senior Backend Engineer">
  <introduction>
    Dear hiring team,
    I read your posting for a backend engineer...
  </introduction>
  <body>
    In my last role at ...
  </body>
  <closing>
    I would be delighted to discuss ...
  </closing>
</cover_letter>

and Claude reliably produces it, including attributes on the root element. A small XML parser — lxml in Python, fast-xml-parser in TypeScript — turns the result into a dict you can iterate over. For text-heavy outputs you save on escaping, you gain readability during prompt development, and you sidestep the edge cases where JSON string mode struggles with embedded code blocks that contain curly braces.
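Parsing the result takes only a few lines. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra dependencies; the parse_sections helper is a hypothetical name.

```python
import xml.etree.ElementTree as ET

raw = """<cover_letter recipient="ACME Corp" role="Senior Backend Engineer">
  <introduction>
    Dear hiring team,
    I read your posting for a backend engineer...
  </introduction>
  <body>
    In my last role at ...
  </body>
  <closing>
    I would be delighted to discuss ...
  </closing>
</cover_letter>"""


def parse_sections(xml_text: str) -> dict:
    """Turn a flat sectioned document into a dict, collapsing indentation whitespace."""
    root = ET.fromstring(xml_text)
    sections = {child.tag: " ".join((child.text or "").split()) for child in root}
    return {"tag": root.tag, "attributes": dict(root.attrib), "sections": sections}


letter = parse_sections(raw)
print(letter["attributes"]["recipient"])
print(letter["sections"]["introduction"])
```

One caveat: ET.fromstring raises ParseError on output that is not well-formed, for instance when the model writes a bare & or < inside the prose. Catch that error and retry or repair rather than assuming the parse always succeeds.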

The practical rule: reach for JSON when the fields are short and structured (IDs, dates, amounts, enums), and reach for XML when at least one field is a paragraph of prose that a human might still read in the raw form. For hybrid cases, you can combine: a JSON envelope with a single field that carries an XML string, or the other way around. On Claude, XML still triggers a noticeably more precise response style because the training data leaned heavily on XML-style scaffolding for agent traces.

Pydantic, Zod and zod-to-json-schema in production workflows

Hand-writing JSON Schema in a dictionary gets tedious fast. In TypeScript and Python you already have expressive, typed schema libraries, and both of them now ship first-class JSON Schema emission. In Python, Pydantic v2 is the default:

from pydantic import BaseModel, Field
from typing import Literal

class LineItem(BaseModel):
    description: str
    quantity: int = Field(ge=1)
    unit_price_cents: int = Field(ge=0)

class Invoice(BaseModel):
    invoice_number: str
    issue_date: str  # Pydantic can also use date, but providers expect ISO strings
    total_cents: int = Field(ge=0)
    currency: Literal["EUR", "USD", "GBP", "CHF"]
    line_items: list[LineItem]

schema = Invoice.model_json_schema()

model_json_schema() returns a dictionary that serves as the starting point for OpenAI’s json_schema, Claude’s input_schema or Gemini’s response_schema, subject to the provider-specific adjustments discussed in the sections above (closed objects for OpenAI strict mode, the OpenAPI-style dialect for Gemini). The same model class then validates the response on the way back, giving you a typed Invoice object throughout the rest of your pipeline. The symmetry eliminates an entire category of bugs where the schema sent to the model drifts out of sync with the parser that reads the response.

TypeScript has a similarly neat loop with Zod. You define the schema once and use zod-to-json-schema to emit the JSON Schema dialect each provider expects:

import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const LineItem = z.object({
  description: z.string(),
  quantity: z.number().int().min(1),
  unit_price_cents: z.number().int().min(0),
});

const Invoice = z.object({
  invoice_number: z.string(),
  issue_date: z.string(),
  total_cents: z.number().int().min(0),
  currency: z.enum(["EUR", "USD", "GBP", "CHF"]),
  line_items: z.array(LineItem),
});

const jsonSchema = zodToJsonSchema(Invoice, { target: "openApi3" });

// After the LLM call:
const parsed = Invoice.parse(JSON.parse(payload));

The target option handles provider differences: pick jsonSchema2019-09 or openApi3 depending on whether you are talking to OpenAI/Anthropic or Gemini. The OpenAI SDK for TypeScript now also offers a zodResponseFormat helper that wires Zod schemas directly into response_format without the manual conversion, which shrinks the glue code further.

The single biggest reason to go through Pydantic or Zod rather than writing JSON Schema by hand is refactoring. When a schema changes — a new field, a stricter constraint, a type migration from integer to decimal — the type-checker catches every downstream consumer before you ship. In handwritten schemas you find out at runtime, on a customer record, in production.

The 7 most common JSON parsing failures and how to catch them

Even in the strict-mode era, pipelines fail. Seven patterns cover the vast majority of incidents we have seen across teams migrating to structured outputs. First, legacy model fallbacks: a team uses GPT-4.5 in strict mode but has a fallback to an older model that does not support it, and the fallback path silently returns JSON-ish text. The fix is to make the fallback model a different schema call entirely, not an implicit retry with the same parameters. Second, provider schema-dialect mismatches: a schema works on OpenAI, breaks on Gemini because of an unsupported format. The fix is the normalisation adapter mentioned earlier.

Third, semantic nulls. null is a valid JSON value, and a lenient schema will let the model return it whenever it is unsure. If null is not actually acceptable in your pipeline, reflect that in the schema with explicit required and no nullable union. Fourth, empty arrays as a fallback. When the model cannot find any line items in an invoice, a permissive schema invites an empty line_items: [], and your downstream code treats that as “this is a valid zero-item invoice” rather than “the extraction failed”. Use minItems: 1 where zero is not a meaningful answer, and catch the validation error as a signal.

Fifth, Unicode trip-ups. Some OCR sources produce characters that round-trip oddly through JSON string escapes, particularly pre-composed versus decomposed accented characters and unusual currency symbols. Normalise inputs before the call and validate outputs with a simple UTF-8 check; surface any oddities as warnings rather than failures. Sixth, long-tail numeric representation. Even with integer types, very large numbers sneak in via IEEE 754 loss when another layer (a JavaScript JSON.parse on the server, for instance) touches the payload. Keep money in cents, keep identifiers as strings, and never store a credit card number in JSON at all.

Seventh, the classic one: truncation mid-generation. With a very large schema or a very long document, a request can hit the output token limit before the full document is produced. Constrained decoding keeps every emitted token consistent with the schema, but a truncated response is still only a prefix of a valid document: it may fail to parse or be missing required fields. Catch this by setting max_tokens well above the expected output size, by checking the finish reason the API reports (for example finish_reason "length" on OpenAI or stop_reason "max_tokens" on Anthropic), and by treating schema validation errors as transient for the first retry.
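Two of these guards fit in a few lines of standard-library Python: Unicode normalisation of the input text, and a classifier that separates truncated responses from complete ones. This is a sketch; the function names and status labels are hypothetical, and "length" is the finish reason OpenAI's Chat Completions API reports for truncation (other providers use different values).

```python
import json
import unicodedata


def normalise_text(text: str) -> str:
    """Compose decomposed characters (e + combining accent) into their NFC form."""
    return unicodedata.normalize("NFC", text)


def classify_completion(finish_reason: str, content: str, required: list[str]) -> str:
    """Label a response as truncated, unparseable, incomplete or ok."""
    if finish_reason == "length":
        return "truncated"  # hit the token limit; retry with a larger budget
    try:
        doc = json.loads(content)
    except json.JSONDecodeError:
        return "unparseable"
    if not isinstance(doc, dict):
        return "unparseable"
    missing = [key for key in required if key not in doc]
    return "incomplete" if missing else "ok"


REQUIRED = ["invoice_number", "total_cents"]

status_cut = classify_completion("length", '{"invoice_number": "A-1', REQUIRED)
status_partial = classify_completion("stop", '{"invoice_number": "A-1"}', REQUIRED)
status_ok = classify_completion("stop", '{"invoice_number": "A-1", "total_cents": 1999}', REQUIRED)
print(status_cut, status_partial, status_ok)
```

Keeping "truncated" distinct from "incomplete" matters for the retry policy: the first calls for a larger token budget, the second for a look at the schema or the source document.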

Schema validation with JSON Schema Draft 2020-12 against LLM output

Server-side strict mode is not a substitute for client-side validation. It tells you the document is structurally valid against the schema at generation time, but the schema you send to the provider and the schema your application really depends on are not always the same thing. Providers often accept a subset, some formats are advisory, and semantic invariants (sums, cross-field constraints, referential integrity) are not expressed in JSON Schema at all. A good pipeline therefore runs two validation passes: a structural one using jsonschema in Python or ajv in TypeScript with full Draft 2020-12 support, and a semantic one that asserts business rules.

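import json
import logging
import jsonschema

log = logging.getLogger(__name__)
validator = jsonschema.Draft202012Validator(full_schema_with_all_constraints)
document = json.loads(payload) if isinstance(payload, str) else payload  # OpenAI returns a string
# Error paths can mix keys and list indices, so sort on their string form.
errors = sorted(validator.iter_errors(document), key=lambda e: [str(p) for p in e.path])
for err in errors:
    log.warning("Schema violation at %s: %s", list(err.path), err.message)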

The critical habit is to log structural violations at warning level even when strict mode should have prevented them. On the day a provider ships a subtle regression — and it will happen again, as it did in several episodes during 2025 — you want to know immediately, not three weeks later when a customer notices.

For semantic validation, hand-written assertions are usually clearer than squeezing them into JSON Schema. An invoice check reads like “if discount is non-zero, discount must be less than subtotal”, expressed as a few lines of Pydantic validators or Zod refinements, and lives next to the schema it guards. That combination — strict mode at the decoder, structural validation at the boundary, semantic validation in domain code — is what takes a pipeline from “works on the happy path” to “we have not had a parsing incident this quarter”.
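As a dependency-free illustration of such business rules, the function below checks the discount rule and the total against a plain dictionary. The discount_cents field is a hypothetical addition to the invoice shape used in this article, and tax is ignored to keep the sketch short.

```python
def invoice_violations(invoice: dict) -> list[str]:
    """Return human-readable business-rule violations; an empty list means the invoice passes."""
    problems = []
    subtotal = sum(
        item["quantity"] * item["unit_price_cents"] for item in invoice["line_items"]
    )
    discount = invoice.get("discount_cents", 0)

    # Rule 1: a non-zero discount must be smaller than the subtotal.
    if discount and discount >= subtotal:
        problems.append("discount must be less than subtotal")

    # Rule 2: the stated total must equal subtotal minus discount.
    expected = subtotal - discount
    if invoice["total_cents"] != expected:
        problems.append(f"total_cents is {invoice['total_cents']}, expected {expected}")

    return problems


good = {
    "line_items": [{"quantity": 2, "unit_price_cents": 500}],
    "discount_cents": 100,
    "total_cents": 900,
}
bad = {
    "line_items": [{"quantity": 2, "unit_price_cents": 500}],
    "discount_cents": 1000,
    "total_cents": 0,
}

print(invoice_violations(good))
print(invoice_violations(bad))
```

Returning a list instead of raising on the first failure lets a review queue show every broken rule at once.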

Cost: token overhead for schema definitions versus error prevention

Schemas cost tokens. A thirty-field invoice schema with descriptions on every field can run to 1,500 or 2,000 input tokens, and you pay for them on every call. Teams sometimes look at that line on the monthly bill and wonder whether the old, looser prompts were cheaper. They were not, once you account for the full picture.

The comparison is roughly this. Without strict mode, you run a free-text prompt and a lightweight parser. Parsing fails on, say, three percent of calls, each of which triggers a repair prompt — another full call with the original context plus the error — and a second parse. You also carry a manual review queue for the long tail of parses that succeed syntactically but return the wrong shape. Each step has a latency cost, a token cost and an operator cost. With strict mode, you pay a larger fixed token cost per call in exchange for a negligible failure rate and no repair loop.

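A concrete mental model: suppose your schema costs 1,000 additional input tokens per call and you were previously retrying three percent of calls with a repair cycle that roughly doubled their cost. The old approach then carries an expected overhead of three percent of a call, so on tokens alone the two options break even when a typical call is around 33,000 tokens (1,000 / 0.03). For smaller calls the loose prompt is slightly cheaper on the token bill; the case for strict mode rests on everything the bill does not show, namely the retry latency, the manual review queue and the on-call pages, and those operational benefits start to dominate long before the token break-even. For any pipeline with meaningful volume they decide the question on day one. The exception is genuinely exploratory prompting (a research notebook, a one-off script), where a loose prompt plus a forgiving parser is still fine, because you are the one reading the output.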
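The per-call arithmetic behind this comparison can be sketched as follows. It counts tokens only and assumes a failed call is repaired by exactly one extra call of roughly the same size. Latency, manual review and operator time are deliberately left out, and they are what usually tips the decision.

```python
def expected_tokens_loose(base_tokens: int, retry_rate: float) -> float:
    """Expected tokens per call when a fraction of calls needs one repair call."""
    return base_tokens * (1 + retry_rate)


def expected_tokens_strict(base_tokens: int, schema_tokens: int) -> float:
    """Tokens per call when the schema is sent every time and no repairs are needed."""
    return base_tokens + schema_tokens


def break_even_base_tokens(schema_tokens: int, retry_rate: float) -> float:
    """Call size at which both approaches cost the same: base * retry_rate == schema_tokens."""
    return schema_tokens / retry_rate


# The figures from the text: a 1,000-token schema and a 3% repair rate.
break_even = break_even_base_tokens(1_000, 0.03)
loose = expected_tokens_loose(4_000, 0.03)
strict = expected_tokens_strict(4_000, 1_000)
print(round(break_even), round(loose), round(strict))
```

For a 4,000-token call the loose approach comes out ahead on tokens alone, which is why the operational costs need to be part of the comparison.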

Three levers reduce schema cost further. First, move stable field descriptions to the system prompt only where they describe the task, and keep the per-field descriptions focused on what is unique about each field. Second, use $defs for recurring sub-objects; a single Address definition referenced three times costs once. Third, on OpenAI and Gemini, schemas are cached on the server by hash; the latency penalty on the first call with a new schema disappears on subsequent calls, and on high-volume routes the effective overhead is minimal.

Example: invoice extraction from PDFs with structured output

A full pipeline for PDF invoice extraction ties the threads together. The input is a PDF, the output is a validated Invoice record written to a database. The steps are: convert the PDF to text with an OCR layer (pdfplumber for native-text PDFs, Tesseract or a cloud OCR for scans), clean up the text, send it to the model with a schema, validate semantically, and commit.

import pdfplumber
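from pydantic import BaseModel, ConfigDict, Field, model_validator
from typing import Literal
from openai import OpenAI

class LineItem(BaseModel):
    # extra="forbid" emits additionalProperties: false, which strict mode requires
    model_config = ConfigDict(extra="forbid")
    description: str
    quantity: int = Field(ge=1)
    unit_price_cents: int = Field(ge=0)

class Invoice(BaseModel):
    model_config = ConfigDict(extra="forbid")
    invoice_number: str
    issue_date: str
    total_cents: int = Field(ge=0)
    currency: Literal["EUR", "USD", "GBP", "CHF"]
    line_items: list[LineItem]

    # A model-level validator runs after all fields are parsed. A field
    # validator on total_cents would run before line_items is available,
    # because fields are validated in declaration order.
    @model_validator(mode="after")
    def total_matches_line_items(self):
        computed = sum(it.quantity * it.unit_price_cents for it in self.line_items)
        if self.line_items and abs(computed - self.total_cents) > 100:  # 100-cent tolerance
            raise ValueError(f"Total {self.total_cents} does not match line items {computed}")
        return self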

client = OpenAI()

def extract_invoice(pdf_path: str) -> Invoice:
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)

    response = client.chat.completions.create(
        model="gpt-4.5",
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "invoice_extraction",
                "strict": True,
                "schema": Invoice.model_json_schema(),
            },
        },
        messages=[
            {"role": "system", "content": "Extract invoice fields from OCR text. Return amounts in cents."},
            {"role": "user", "content": text},
        ],
    )

    return Invoice.model_validate_json(response.choices[0].message.content)

Three things to notice. First, the schema and the parser are the same Pydantic class — there is no possibility of drift. Second, the field_validator catches a business rule that JSON Schema cannot express: the total should equal the sum of the line items within a tolerance. If the OCR missed a line and the model produced a plausible but wrong total, this rule catches it and you can route the document to manual review. Third, the response_format uses Invoice.model_json_schema() directly, so any time you add a field or tighten a constraint, both the model and the validator learn about it in one change.

The same structure works with Claude Tool Use and with Gemini responseSchema; the only moving part is the provider-specific envelope around the schema. Teams typically wrap this in a small abstraction, select the provider per document type based on cost and accuracy benchmarks, and carry the same Pydantic models through the database layer. At that point, the distinction between “free-text LLM output” and “typed domain object” has disappeared entirely, which is the quiet revolution that strict-mode structured outputs have brought to production AI in 2026.
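A minimal sketch of such an abstraction, assuming the request shapes I understand the three providers to accept: response_format for OpenAI, a forced tool with input_schema for Claude Tool Use, and generationConfig.responseSchema for the Gemini REST API. The wrap_schema helper and the tool description text are hypothetical. Gemini's responseSchema accepts a subset of JSON Schema, so a schema that passes on one provider may need adjusting for another; check the current provider documentation before relying on these keys.

```python
from typing import Any, Literal

Provider = Literal["openai", "anthropic", "gemini"]


def wrap_schema(provider: Provider, name: str, schema: dict[str, Any]) -> dict[str, Any]:
    """Return the provider-specific request fragment that carries `schema`.

    The result is meant to be merged into the rest of the request
    (model, messages, and so on), which is not built here.
    """
    if provider == "openai":
        return {
            "response_format": {
                "type": "json_schema",
                "json_schema": {"name": name, "strict": True, "schema": schema},
            }
        }
    if provider == "anthropic":
        # Forcing a single tool makes the tool input act as the structured output.
        return {
            "tools": [
                {
                    "name": name,
                    "description": f"Record the extracted {name} data.",
                    "input_schema": schema,
                }
            ],
            "tool_choice": {"type": "tool", "name": name},
        }
    if provider == "gemini":
        return {
            "generationConfig": {
                "responseMimeType": "application/json",
                "responseSchema": schema,
            }
        }
    raise ValueError(f"unknown provider: {provider}")


# Usage with a tiny hand-written schema; a Pydantic model_json_schema() dict works the same way.
demo_schema = {"type": "object", "properties": {"invoice_number": {"type": "string"}}}
for provider in ("openai", "anthropic", "gemini"):
    print(provider, sorted(wrap_schema(provider, "invoice_extraction", demo_schema)))
```

Keeping the envelope in one function means a provider switch touches a single call site, while the schema and the validator stay unchanged.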

This article is part of our prompt-engineering series. The complete guide with all five core techniques: Prompt Engineering 2026. Related in-depth pieces:

Do structured outputs pay off in every project in 2026? Our concrete recommendation

In 2026, structured output generation is no longer a nice-to-have but the prerequisite for any serious LLM integration. With native JSON schemas, Pydantic or Zod validation and a small semantic-validation layer, you reach a success rate that was unthinkable two years ago — and the operational benefits of that reliability compound into every downstream system you build.

Sources and further reading

Schema-API and validation recommendations rest on the primary sources: the OpenAI Structured Outputs documentation describes strict mode and the JSON Schema subset, the Anthropic Tool Use documentation explains the tool-schema pattern, and the Gemini Structured Output documentation covers responseSchema. For the validation frameworks themselves see the Pydantic documentation and the Zod documentation.

Update note (as of 13.04.2026)

This guide is continuously reconciled with the schema APIs of the three leading vendors. Particular attention goes to the expansion of the JSON Schema subset in OpenAI’s Structured Outputs, standardisation of the Anthropic Tool Use pattern, new responseSchema variants in Gemini 2.5, and possible EU AI Act audit-log requirements for structured LLM outputs from 02.08.2026. Market-relevant interim events appear first as cluster updates on the hub.

Frequently Asked Questions

What are structured outputs in LLMs?

Instead of free text, the model produces a defined format — usually JSON, sometimes XML or CSV — that your code can parse directly. A must for production integrations that need to process LLM answers further.

What is JSON mode and which models support it?

JSON mode guarantees syntactically valid JSON in the output: no unclosed brackets, no Markdown wrapping. On its own it does not guarantee that the JSON matches your schema. Supported by GPT-4 Turbo, GPT-4o, GPT-4.5, Claude 3.5 (experimental) and Gemini 2.0. Mistral and open-source models still support it less reliably in 2026.

When do I use JSON, when XML for the output?

JSON for code post-processing (almost always the choice). XML for long structured texts with clearly separated sections (e.g. cover letters with <introduction>, <body>, <conclusion>). Claude responds particularly well to XML structures because its training data was structured that way.
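For the XML case, a minimal parsing sketch using only the standard library. It uses a regular expression instead of an XML parser, on the assumption that model output may contain unescaped characters such as & that a strict parser would reject. The extract_sections helper and the sample reply are hypothetical.

```python
import re


def extract_sections(text: str, tags: list[str]) -> dict[str, str]:
    """Pull the content of each expected tag out of a model reply.

    Text outside the tags (greetings, explanations) is ignored.
    Raises ValueError if an expected section is missing.
    """
    sections: dict[str, str] = {}
    for tag in tags:
        pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
        match = re.search(pattern, text, flags=re.DOTALL)  # DOTALL: sections may span lines
        if match is None:
            raise ValueError(f"missing <{tag}> section")
        sections[tag] = match.group(1).strip()
    return sections


reply = (
    "Sure, here is the letter.\n"
    "<introduction>Dear hiring team,</introduction>\n"
    "<body>I am applying\nfor the role.</body>\n"
    "<conclusion>Kind regards</conclusion>"
)
sections = extract_sections(reply, ["introduction", "body", "conclusion"])
print(sections["body"])
```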

What is JSON Schema and why does it matter?

JSON Schema describes the allowed structure of a JSON document. OpenAI's Structured Outputs and Google Gemini use JSON Schema to guarantee structure compliance at the token level — no deviations. This replaces manual post-parsing and retry logic.

How do I handle broken LLM outputs?

Three-step strategy: (1) Enable JSON mode. (2) Validate with Pydantic/Zod. (3) On failure, send a 'repair prompt': the model receives the validation error and is asked to correct its output. In production, cap retries at 2–3 attempts.
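The repair step can be sketched as a small loop. To stay dependency-free, this version swaps in a hand-written validate_invoice function where Pydantic or Zod would normally sit, and call_model is a hypothetical callable that stands in for the provider request (prompt in, raw text out).

```python
import json
from typing import Callable


def validate_invoice(data: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record is valid."""
    errors = []
    if not isinstance(data.get("invoice_number"), str):
        errors.append("invoice_number must be a string")
    total = data.get("total_cents")
    if not isinstance(total, int) or isinstance(total, bool) or total < 0:
        errors.append("total_cents must be a non-negative integer")
    return errors


def extract_with_repair(
    call_model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model, validate, and re-prompt with the errors until valid or exhausted."""
    current_prompt = prompt
    errors: list[str] = []
    for _ in range(max_attempts):
        raw = call_model(current_prompt)
        try:
            data = json.loads(raw)
            if isinstance(data, dict):
                errors = validate_invoice(data)
            else:
                errors = ["top-level value must be a JSON object"]
        except json.JSONDecodeError as exc:
            errors = [f"invalid JSON: {exc.msg}"]
        if not errors:
            return data
        # Repair prompt: original task, the rejected output, and the concrete errors.
        current_prompt = (
            f"{prompt}\n\nYour previous output was rejected:\n{raw}\n\n"
            "Problems:\n- " + "\n- ".join(errors) + "\n\nReturn only the corrected JSON."
        )
    raise ValueError(f"no valid output after {max_attempts} attempts: {errors}")


# Usage with a fake model that fails once and then returns a valid record.
replies = iter([
    '{"invoice_number": "R-1", "total_cents": "12"}',
    '{"invoice_number": "R-1", "total_cents": 1200}',
])
result = extract_with_repair(lambda _prompt: next(replies), "Extract the invoice.")
print(result)
```

Raising after the last attempt, instead of returning a partial record, lets the caller route the document to manual review.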

What is function calling and how does it relate to JSON mode?

Function calling is the special case: you define functions with parameters (JSON schema) and the model decides when to 'call' one. Internally it produces a JSON object with the function name and arguments. In 2026, this is the standard for AI agents and tool use.
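A minimal sketch of the receiving side, assuming the model's tool call arrives as a JSON object with a name and an arguments field. The exact wire format differs by provider (some send the arguments as a JSON-encoded string), and get_weather, TOOL_SPECS and dispatch are hypothetical names used for illustration.

```python
import json
from typing import Any, Callable


def get_weather(city: str, unit: str = "celsius") -> dict[str, Any]:
    """Hypothetical tool; a real one would call a weather service."""
    return {"city": city, "unit": unit, "temperature": 21}


# What the model is told about: name, description, and a JSON Schema for the parameters.
TOOL_SPECS = [
    {
        "name": "get_weather",
        "description": "Look up the current temperature for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }
]

# What your code actually runs when the model asks for a tool.
TOOLS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def dispatch(tool_call_json: str) -> Any:
    """Parse a tool call emitted by the model and run the matching function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"model requested unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
result = dispatch(model_output)
print(result)
```

In an agent loop, the result would be sent back to the model as the tool's answer; validating the arguments against the schema before the call is a sensible extra step that this sketch omits.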

Does JSON mode work on local open-source models?

With limitations. Ollama with Llama 3.2 has supported JSON mode since late 2024. Reliability is lower than with OpenAI, though: expect 85–95% correct JSON on Llama versus 99.9% on GPT-4o. For production-critical use cases, use closed-source models or Llama 3.3 70B+.

How do I validate LLM JSON outputs in Python and TypeScript?

Python: Pydantic (pydantic.BaseModel), the de facto standard across Python LLM frameworks in 2026. TypeScript: Zod (z.object), which is very popular. Both allow strict validation and produce error messages that can be fed directly into a re-prompt.
