Update history (2)
- Reasoning models (OpenAI o1/o3, Claude Thinking) integrated, impact on classic CoT documented, Tree-of-Thought and Graph-of-Thought added as extensions.
- Original publication with Zero-Shot-CoT, Few-Shot-CoT, Self-Consistency and concrete examples from math, code debugging and legal reasoning.
Chain-of-thought prompting was the breakthrough technique of 2022 that made language models genuinely useful for reasoning-heavy work. It is one of the seven core techniques from our Prompt Engineering 2026 guide, and this article goes deep on it. Four years later, the landscape has shifted: reasoning models handle much of this implicitly, while the classic techniques remain indispensable for anyone working with standard models or tight budgets. This guide walks through what still works in 2026, where the boundaries now sit, and how to pick the right approach for each kind of task.
Chain-of-thought prompting in 2026: why “think aloud” still works
Language models are autoregressive: each token they produce is conditioned on every token that came before it, including the ones they generated themselves. That architectural fact is the entire reason chain-of-thought works. When you force the model to write intermediate steps before the final answer, those steps become part of the context that produces that answer. A model that has already written “5 − 2 = 3” on the page is far more likely to continue with “3 + 4 = 7” than a model asked to jump straight from the question to the number.
In 2022 this was a revelation. A team from the University of Tokyo and Google Research showed that a single activator phrase, “Let’s think step by step”, could lift accuracy on the MultiArith arithmetic benchmark from around 17% to above 78% on InstructGPT (text-davinci-002). Follow-up work across GSM8K, MultiArith, and the BIG-Bench suite reproduced gains of 20 to 50 percentage points on tasks that required multi-step reasoning. The effect was strongest on arithmetic, symbolic manipulation and logical inference, and weakest on tasks that were already well-calibrated in the training data, such as short factual recall.
By 2026 the picture has become more nuanced. Modern general-purpose models like GPT-4o and Claude 3.5 Sonnet often reason in intermediate steps by default when a question looks complex, so the explicit activator sometimes adds little. On the other end of the spectrum, purpose-built reasoning models such as OpenAI o1 and o3 and Claude Thinking mode perform their chain-of-thought internally and hide the trace from the user. For a working professional this means the question is no longer “should I use CoT” but “which flavor of CoT, and do I need it at all?” The answer depends on the model you can afford, the task you are solving, and how much latency you can tolerate.
There is one property of CoT that explains why it refuses to go away: interpretability. When the model writes its reasoning, you can read it. You can catch the step where it confused “per month” with “per year”, or where it silently switched the units, or where it accepted a premise that was never in the source document. Reasoning models deliver only the conclusion, which is faster but harder to audit. In regulated environments — finance, medicine, law, safety-critical engineering — a visible reasoning trace is often a compliance requirement, not a stylistic preference.
The three CoT variants: few-shot, zero-shot and self-consistency
Three patterns form the core vocabulary of chain-of-thought prompting, and almost every more exotic technique is a variation on these three. They differ in how much you put into the prompt and how many times you call the model.
Zero-shot CoT is the minimalist version. You ask your question and add an activator — “Let’s think step by step”, “Walk through your reasoning before answering”, or a structured instruction like “First analyze, then conclude”. The model is expected to know the pattern from its training data, and with any model released after GPT-3.5, it does. This is the technique with the best effort-to-value ratio. It costs nothing to add, and on tasks where CoT helps, the lift is immediate and measurable.
Few-shot CoT gives the model three to five worked examples with their reasoning already written out, then asks a fresh question. The examples calibrate three things at once: the format of the reasoning (free text vs. numbered steps vs. XML tags), the depth of reasoning (one sentence per step or a paragraph), and the style of conclusion (a single number, a JSON object, a natural-language verdict). Few-shot is the technique you reach for when zero-shot produces inconsistent output that you cannot parse programmatically.
Self-consistency is the heaviest of the three. You run the same CoT prompt at non-zero temperature — typically 0.7 — five or more times, then take a majority vote on the final answers. Because the reasoning paths diverge but the correct answer tends to be reached from multiple angles, the majority often converges on the truth even when individual runs fail. Self-consistency costs a multiple of a single query and adds no interpretability benefit, so it lives in a narrow niche: high-stakes problems with a single correct answer where a 2 to 5 percentage point accuracy bump matters more than the bill.
These three are not mutually exclusive. A real production prompt often combines few-shot calibration for format, a zero-shot activator for the final question, and self-consistency on the critical decisions. Everything that follows is a deeper look at how to wire each one up correctly.
Zero-shot CoT with “let’s think step by step” — what still works in 2026?
The original activator still works, but it is no longer the best phrasing in most cases. “Let’s think step by step” was designed for models that did not yet reason by default. Modern models often interpret it as permission to monologue, and you end up with a page of reasoning before a one-word answer. The activator has mutated into a family of more precise phrasings, each suited to a particular class of problem.
For arithmetic and word problems, the classic still works. A prompt like the following produces a clean trace on almost any model from GPT-3.5 onward:
A train leaves station A at 14:20 travelling at 80 km/h.
A second train leaves station B, 240 km away, at 14:50 travelling at
100 km/h toward A. At what time do they meet?
Let's think step by step.
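A trace for this prompt is only useful if you can check its final answer, so it helps to compute the ground truth independently. The following is a minimal sketch that derives the meeting time from the numbers in the prompt; the function name `meeting_time` and the calendar date are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta

def meeting_time():
    """Ground truth for the two-train prompt above."""
    depart_a = datetime(2026, 1, 1, 14, 20)    # the date itself is arbitrary
    depart_b = datetime(2026, 1, 1, 14, 50)
    speed_a, speed_b, distance = 80, 100, 240  # km/h, km/h, km

    # Train A travels alone until train B departs.
    head_start_h = (depart_b - depart_a).total_seconds() / 3600
    gap = distance - speed_a * head_start_h

    # From then on the trains close the gap at their combined speed.
    seconds_to_meet = gap * 3600 / (speed_a + speed_b)
    return depart_b + timedelta(seconds=seconds_to_meet)

print(meeting_time().strftime("%H:%M:%S"))  # 15:56:40
```

Any reasoning trace that ends on a different time has gone wrong somewhere, and the intermediate values (40 km head start, 200 km gap, 180 km/h closing speed) show where to look.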
For legal or policy reasoning, a more structured activator works better because it primes the model to separate facts, rules and application:
Task: decide whether the described conduct constitutes a breach of
Article 5(1)(b) GDPR.
Reason in three stages:
1. Identify the personal data and the processing purpose.
2. Apply the purpose-limitation principle from Article 5(1)(b).
3. Conclude with a yes/no plus one-sentence justification.
For data analysis, the activator should nudge the model to state assumptions before crunching numbers:
Here is a CSV sample. Before you answer the question, list every
assumption you are making about the data (date format, currency,
missing values). Then compute the answer.
Question: what was the year-over-year revenue change for Q1?
What no longer works reliably in 2026 is the bare “Let’s think step by step” appended to a vague question. If the prompt itself is underspecified, the reasoning trace will be confident and wrong — the model fills gaps with plausible-sounding assumptions and then reasons from them. The rule to internalize: CoT amplifies whatever clarity is already in your prompt. It does not create clarity on its own.
A second 2026-specific caveat concerns models that ship with a “concise mode” or an internal system prompt that discourages long answers. Some API variants of GPT-4o and the free tier of Claude will cut reasoning short even when asked for it. The fix is to either make the reasoning step an explicit part of the output schema (a JSON field called reasoning) or to call the model in the API variant that removes that constraint.
Few-shot CoT with 3–5 reasoning examples in the prompt
Few-shot CoT is the technique to reach for when you need the model’s reasoning to land in a specific format every time. Three to five examples is the sweet spot: one is too few to establish a pattern, and ten start to eat into the context window and often harm performance. Each example pairs a question with a fully-written-out solution, and the final line is an unanswered question that the model completes.
A worked example from a real support-triage system:
You classify support tickets by root cause. For each ticket, reason
through the symptoms, then output one of: billing, auth, performance,
feature-request, bug.
Ticket: "I can't log in, it says invalid password but I just reset it."
Reasoning: the user attempted login, received an invalid-password
error after a password reset. This points to a stale session or
password propagation issue, both of which are authentication problems.
Category: auth
Ticket: "My dashboard takes 40 seconds to load since yesterday."
Reasoning: a sudden slowdown after previously working fast indicates
a performance regression, not a feature gap or a billing issue.
Category: performance
Ticket: "Why am I being charged twice this month?"
Reasoning: the user reports an unexpected charge. This is a billing
dispute, not a technical fault.
Category: billing
Ticket: "The export button does nothing when I click it on Firefox."
Reasoning:
The model now knows exactly where to write the reasoning, how long it should be, and which category vocabulary to draw from. Without the examples, it might write three paragraphs of reasoning or invent categories like “browser-bug”.
Example selection matters more than example count. The examples should span the variety of the task. If you are classifying into five categories, show at least three different ones. If the reasoning depth varies (some tickets are one-liners, some require reading logs), show both extremes. A common mistake is to pick three easy examples, which trains the model to expect easy cases and fail on hard ones.
Order also matters. Research from 2023 showed that later examples in the prompt are weighted more heavily by the model, so the most representative example should be last. For consistency, keep the format of every example identical down to punctuation: one inconsistent newline will prompt the model to “fix” the format on its own.
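One way to guarantee identical formatting is to never type the examples by hand and render them from data instead. The sketch below assumes the ticket-triage layout shown earlier; `build_few_shot_prompt` and its tuple format are hypothetical names chosen for illustration.

```python
def build_few_shot_prompt(instruction, examples, new_ticket):
    """Render every example with one template so the format never drifts.

    `examples` is a list of (ticket, reasoning, category) tuples; put the
    most representative one last.
    """
    blocks = [instruction.strip()]
    for ticket, reasoning, category in examples:
        blocks.append(
            f'Ticket: "{ticket}"\n'
            f"Reasoning: {reasoning}\n"
            f"Category: {category}"
        )
    # The final block stops at "Reasoning:" so the model continues from there.
    blocks.append(f'Ticket: "{new_ticket}"\nReasoning:')
    return "\n\n".join(blocks)


examples = [
    ("Why am I being charged twice this month?",
     "the user reports an unexpected charge. This is a billing dispute.",
     "billing"),
    ("My dashboard takes 40 seconds to load since yesterday.",
     "a sudden slowdown indicates a performance regression.",
     "performance"),
]
prompt = build_few_shot_prompt(
    "You classify support tickets by root cause.",
    examples,
    "The export button does nothing when I click it on Firefox.",
)
```

Because the examples live in a list, reordering them or swapping one out is a data change rather than a prompt rewrite.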
Self-consistency: 5 sampling runs and majority voting
Self-consistency was introduced by a Google team in 2022 and remains the cheapest reliable way to buy extra accuracy on problems with a single correct answer. The mechanism is straightforward: run the same CoT prompt N times at a moderate temperature, extract the final answer from each run, and return the answer that appears most often. On the GSM8K math benchmark, self-consistency added roughly 18 percentage points on top of plain CoT with PaLM 540B, and gains have also been measured on more recent models, although the absolute uplift shrinks as the base model improves.
A minimal implementation in Python:
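from collections import Counter

def self_consistent_answer(prompt, n=5, temperature=0.7):
    # call_llm and extract_final_answer are placeholders for your own
    # API client and answer parser; they are not defined here.
    answers = []
    for _ in range(n):
        response = call_llm(prompt, temperature=temperature)
        answers.append(extract_final_answer(response))
    # Majority vote: most_common(1) returns [(answer, count)].
    return Counter(answers).most_common(1)[0][0]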
The extract_final_answer function is the unglamorous half of self-consistency. If the model answers in free text, you need a parser that pulls out the final number, verdict or classification. This is where structured outputs earn their keep — ask for a JSON object with a final_answer field and the parser becomes one line.
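As a rough illustration of that parser, the sketch below assumes the model was asked to reply with a JSON object containing a `final_answer` field. The normalization step is our own addition, and returning `None` for unusable replies is one possible convention, not a requirement of the technique.

```python
import json

def extract_final_answer(response):
    """Return the normalized final_answer field, or None if the reply is unusable."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "final_answer" not in data:
        return None
    # Normalize so that "Auth", " auth " and "auth" count as the same vote.
    return str(data["final_answer"]).strip().lower()
```

With this convention the caller should drop `None` values before counting votes, otherwise a batch of malformed replies could win the majority.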
Self-consistency shines in three scenarios. The first is quantitative reasoning with many valid intermediate paths: a math problem where five different algebraic manipulations all lead to the same number. The second is classification with ambiguous edge cases, where a single run might go either way but five runs plus a vote stabilize the decision. The third is safety-critical yes/no gating, where one wrong “yes” is more expensive than five API calls.
It fails in two scenarios that are worth naming. For open-ended generation — essays, emails, code — there is no “majority” to vote on, because every run produces a different text. And for tasks where the model is systematically biased toward one answer, self-consistency just confirms the bias more confidently; all five runs agree, and all five are wrong. The cure for the second case is to rephrase the prompt, not to add more runs.
Cost is the obvious tradeoff. At five runs, you pay five times the token bill and wait five times as long if you call sequentially. Running in parallel brings latency back to roughly one run, but the token cost is unchanged. For a typical CoT prompt producing 500 output tokens, five runs on GPT-4o in 2026 cost around half a cent. That is trivial for a critical business decision and prohibitive for a consumer chatbot at scale.
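The parallel variant can be sketched with a thread pool from the standard library. Here `call_llm` and `extract` are passed in by the caller and stand in for your own API client and parser; the function name and the returned vote share are illustrative choices.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def self_consistent_parallel(prompt, call_llm, extract, n=5, temperature=0.7):
    """Run n samples concurrently and return (winning answer, vote share).

    call_llm(prompt, temperature=...) and extract(response) are supplied
    by the caller; they are assumptions, not part of any library.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(call_llm, prompt, temperature=temperature)
                   for _ in range(n)]
        responses = [f.result() for f in futures]

    # Discard runs whose answer could not be parsed.
    answers = [a for a in map(extract, responses) if a is not None]
    if not answers:
        raise ValueError("no run produced a parseable answer")
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)
```

Returning the vote share alongside the answer gives a cheap confidence signal: a 3-of-5 win deserves more scrutiny than a 5-of-5 win.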
Reasoning models o1, o3, Claude Thinking: when classic CoT becomes obsolete
OpenAI’s o1 model shipped in late 2024, followed by o3 in April 2025. Anthropic introduced extended-thinking mode with Claude 3.7 Sonnet and carried it into later generations. Google added a thinking mode starting with Gemini 2.0 Flash Thinking. These models share a common design: they generate a long internal chain of thought before producing a user-visible answer, and the user pays for the hidden reasoning tokens even though they never see them.
On benchmarks that reward multi-step reasoning, the results are striking. On AIME 2024 competition math problems, OpenAI reported 83% for o1 against 13% for GPT-4o. On the GPQA Diamond graduate-level science benchmark, o3 exceeded 87% against PhD-level human performance around 70%. Claude Thinking shows similar gains on ARC-AGI and on complex coding benchmarks like SWE-Bench Verified, where it now solves over 49% of real open-source bugs end-to-end.
What this means practically: for tasks in the class where CoT historically gave the biggest lift, a reasoning model now delivers the same or better accuracy without any prompt engineering on your side. You ask the question directly, wait 10 to 60 seconds, and receive a polished answer. This has reshaped the decision tree.
In 2026, the rule of thumb looks like this. If your task is a hard reasoning problem — competitive math, formal logic, scientific derivation, architectural design — and your budget tolerates the latency and per-token cost, use a reasoning model and skip CoT entirely. If your task is medium-complex reasoning on a volume workload — thousands of support tickets, bulk document analysis, moderate-difficulty math tutoring — use a fast model like GPT-4o or Claude 3.5 Sonnet with explicit CoT. If your task is simple — lookup, summarization, reformatting — skip CoT altogether and use the fastest available model.
There is a hybrid mode worth knowing about: explicit CoT as an audit trail for regulated use cases. Reasoning models hide their internal trace; if your compliance regime requires an inspectable reasoning log, you may still want to run a fast model with visible CoT even when the reasoning model would be more accurate. Several financial-services teams in early 2026 reported exactly this pattern: o3 for internal analysis, GPT-4o with structured CoT for anything that goes into a file regulators might read.
A concrete comparison, using rough 2026 API prices for a reasoning-heavy 400-token-input query:
| Approach | Accuracy on hard math | Output tokens | Cost per query | Latency |
|---|---|---|---|---|
| Plain prompt on GPT-4o | ~40% | ~80 | $0.0003 | 1–2 s |
| Zero-shot CoT on GPT-4o | ~60% | ~400 | $0.0013 | 3–5 s |
| Self-consistency ×5 on GPT-4o | ~68% | ~2000 | $0.006 | 3–8 s parallel |
| o3 | ~85% | ~80 visible + ~3000 hidden | ~$0.04 | 20–60 s |
The reasoning model is roughly 30× more expensive per query than zero-shot CoT on a fast model, for a 25 percentage point accuracy gain on hard problems. Whether that trade is worth it depends entirely on how valuable a correct answer is in your context.
CoT for math, logic and planning tasks with concrete prompt examples
Math is the canonical CoT task because it has unambiguous ground truth. A prompt that reliably solves word problems on GPT-4o looks like this:
Solve the problem below. State each step as a short equation.
Finish with a line: Final answer: <number>.
Problem: A tank holds 1800 litres. Pump A fills it at 40 l/min, pump B
drains it at 15 l/min. If both run, how long until the tank is full,
starting from empty?
The trailing “Final answer:” line is the parsing anchor. Your extractor reads everything after those two words on the final line and converts it to a number. This format survives self-consistency cleanly: extract five numbers, take the mode.
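A minimal sketch of such an extractor, assuming the model follows the “Final answer: <number>” instruction on its last line; the names `parse_final_number` and `vote` are ours.

```python
import re
from statistics import mode

FINAL_LINE = re.compile(r"Final answer:\s*(-?\d+(?:\.\d+)?)")

def parse_final_number(response):
    """Read the number after 'Final answer:' on the last non-empty line."""
    lines = [ln for ln in response.strip().splitlines() if ln.strip()]
    if not lines:
        return None
    match = FINAL_LINE.search(lines[-1])
    return float(match.group(1)) if match else None

def vote(responses):
    """Most frequent parsed number across runs, ignoring unparseable ones."""
    numbers = [n for n in map(parse_final_number, responses) if n is not None]
    return mode(numbers) if numbers else None


runs = [
    "40 - 15 = 25\n1800 / 25 = 72\nFinal answer: 72",
    "Net rate 25 l/min, so 1800 / 25.\nFinal answer: 72.0",
    "1800 / 40 = 45\nFinal answer: 45",
    "I am not sure about this one.",
]
print(vote(runs))  # 72.0
```

Converting to `float` before voting means “72” and “72.0” count as the same answer, which a plain string comparison would miss.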
Logic puzzles benefit from a slightly different framing because the model tends to rush. For a Knights-and-Knaves problem, the effective prompt structure is:
You will solve a logic puzzle. Before answering, list every inhabitant
and write down "knight" or "knave" as a hypothesis. Test each
hypothesis against every statement. Only after all statements have
been checked do you state the final assignment.
Puzzle: A says "B is a knave." B says "A and I are both knights." ...
Explicitly asking the model to test hypotheses rather than produce them is the key. Without that, the model commits to an early guess and then writes reasoning that rationalises it — the same failure mode humans have.
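The hypothesis-testing procedure is small enough to brute-force, which gives you a ground truth to compare the model’s assignment against. The sketch below encodes only the two statements visible in the prompt above; the puzzle text is truncated there, so any further statements would have to be added as extra conditions.

```python
from itertools import product

def solve_two_statements():
    """Enumerate every knight/knave assignment for A and B.

    Encodes only these two statements:
    A says "B is a knave."  B says "A and I are both knights."
    """
    solutions = []
    for a_knight, b_knight in product([True, False], repeat=2):
        a_claim = not b_knight             # "B is a knave"
        b_claim = a_knight and b_knight    # "A and I are both knights"
        # A knight's claim must be true, a knave's claim must be false.
        if a_claim == a_knight and b_claim == b_knight:
            solutions.append({
                "A": "knight" if a_knight else "knave",
                "B": "knight" if b_knight else "knave",
            })
    return solutions

print(solve_two_statements())  # [{'A': 'knight', 'B': 'knave'}]
```

This is the same test-every-hypothesis loop the prompt asks the model to perform in prose, so a mismatch between the two points directly at a reasoning error.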
Planning tasks — project plans, travel itineraries, research outlines — are trickier because they have no single correct answer. Here CoT helps by forcing the model to enumerate constraints first. The pattern:
Plan a three-day research trip to Kyoto for two travellers.
Step 1: list every hard constraint (dates, budget, mobility needs,
must-see sites).
Step 2: list every soft preference (food, pace, cultural vs. nature).
Step 3: allocate sites to days, respecting opening hours and travel
time between locations.
Step 4: output the final itinerary as a table.
For deep planning work, pairing CoT with ChatGPT in a multi-turn conversation where you correct the constraints after step 1 tends to produce dramatically better plans than a single-shot prompt.
CoT for code generation and bug analysis
Code is where CoT interacts most interestingly with 2026 tooling. Generating working code from a specification is a reasoning problem: the model must infer types, anticipate edge cases and keep state consistent across many lines. A bare prompt like “write a function that merges two sorted lists” works for trivia, but fails on anything real.
The pattern that lifts quality noticeably on both GPT-4o and Claude 3.5 Sonnet is to ask for a plan before the code:
Task: write a Python function that parses a CSV export from Bank X,
filters transactions by category, and returns a pandas DataFrame with
a running balance column.
Before writing code:
1. List the expected CSV columns and their types.
2. Describe the filtering logic in plain English.
3. Note any edge cases (empty file, negative amounts, dates in two
formats).
Then write the function with inline comments that match the plan.
The plan section is CoT. The code section benefits from having just read that plan as its own context.
Bug analysis is CoT’s strongest domain in programming. Paste a stack trace plus the offending function, and ask the model to walk through the execution:
Here is a stack trace and the function that produced it. Trace the
execution line by line, naming the value of every variable at each
step, and identify the first line where state diverges from what the
caller expects.
[stack trace]
[function body]
This reliably finds off-by-one errors, null-handling bugs and misinterpreted API contracts, and it does so in a form that a human can verify. For the same task, a reasoning model will often find the bug faster but show only the conclusion. If the bug is subtle and you want to learn from the analysis, CoT on a fast model beats a black-box reasoning answer.
The limits of CoT: when more reasoning delivers less
CoT is not free accuracy. There are task classes where it actively hurts, and recognizing them early saves wasted effort.
Simple factual recall is the clearest case. “What is the capital of France?” does not benefit from reasoning steps, and asking for them can prompt the model to manufacture spurious justifications. Worse, on tasks where the model is confidently wrong, CoT produces a confident-looking wrong reasoning trace that reads more credibly than a blunt wrong answer would.
Creative writing is the second case. A short story, a marketing tagline or a poem does not have intermediate steps in the sense CoT assumes. Asking the model to reason through a poem line by line before writing it tends to produce stilted output.
Style transfer and tone adjustment are a third case. “Rewrite this email in a friendlier tone” is a single-step task the model performs holistically. CoT adds noise without improving the result.
There is also a more subtle failure mode called “reasoning hallucination”. On genuinely hard problems that are slightly beyond the model’s capability, a CoT prompt will produce a long, plausible-sounding trace that looks right but reaches a wrong conclusion. Because each intermediate step seems locally valid, the output is harder to spot-check than a blunt wrong answer. The defence is to verify against a ground truth — run the generated code, check the arithmetic with a calculator, cross-reference the cited source.
Finally, over-engineered CoT prompts with six nested sections can degrade output. The model spends output budget navigating your scaffold instead of solving the problem. A good diagnostic: if your prompt is longer than the answer you expect, the prompt is probably too long.
Cost: tokens per CoT prompt and when to switch to reasoning models
Tokens are the ledger in which CoT is paid for. A plain answer to a reasoning question is typically 50 to 150 output tokens. A zero-shot CoT answer is 300 to 800. A few-shot CoT answer adds the example tokens to the input bill — usually another 500 to 1500 depending on example length. Self-consistency multiplies the output bill by the number of runs.
On GPT-4o in 2026, output tokens cost about $2.50 per million, and input tokens about $0.625 per million. A typical zero-shot CoT query at 400 output tokens costs $0.001. Run it five times for self-consistency and you pay $0.005. Few-shot adds around $0.001 per 1500 input tokens. For comparison, an o3 query of similar complexity, with hidden reasoning tokens included, runs around $0.03 to $0.05 per query at current pricing.
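The arithmetic behind those figures can be reproduced in a few lines. The prices below are the ones quoted in the paragraph above, not values fetched from any pricing API, and `query_cost` is a hypothetical helper.

```python
# Per-million-token prices quoted in the paragraph above (USD).
INPUT_PRICE = 0.625
OUTPUT_PRICE = 2.50

def query_cost(input_tokens, output_tokens, runs=1):
    """Dollar cost of `runs` identical calls at the prices above."""
    per_call = (input_tokens * INPUT_PRICE
                + output_tokens * OUTPUT_PRICE) / 1_000_000
    return per_call * runs

zero_shot = query_cost(input_tokens=0, output_tokens=400)          # about $0.001
five_runs = query_cost(input_tokens=0, output_tokens=400, runs=5)  # about $0.005
few_shot_extra = query_cost(input_tokens=1500, output_tokens=0)    # just under $0.001
```

Keeping the prices as named constants makes it easy to rerun the comparison when a vendor changes its rates.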
Three cost-aware patterns have become standard. The first is “cheap first, expensive on fallback”: run GPT-4o with CoT, and if the answer fails validation (the code does not compile, the math does not check out), retry on o3. Most queries resolve on the cheap tier. The second is “batch reasoning”: collect questions offline and run them overnight on o3 rather than paying for real-time reasoning per user click. The third is “prompt caching”: for few-shot prompts where the examples are stable, some APIs charge as little as 10% of the normal rate on cached input tokens, which makes long example blocks economically viable.
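The first pattern is mostly control flow, which a short sketch makes concrete. The two model callables and the validator are supplied by the caller and are assumptions for illustration; nothing here calls a real API.

```python
def answer_with_fallback(question, cheap_model, strong_model, validate):
    """Try the cheap tier first; escalate only when validation fails.

    cheap_model and strong_model take the question and return an answer;
    validate returns True for an acceptable answer. All three are
    hypothetical callables provided by the caller.
    """
    answer = cheap_model(question)
    if validate(answer):
        return answer, "cheap"
    # Validation failed (for example, the code did not compile): pay for the retry.
    return strong_model(question), "strong"


# Toy usage: the validator accepts only purely numeric answers.
result = answer_with_fallback(
    "What is 6 * 7?",
    cheap_model=lambda q: "forty-two",
    strong_model=lambda q: "42",
    validate=lambda a: a.isdigit(),
)
print(result)  # ('42', 'strong')
```

Returning the tier alongside the answer lets you log how often the fallback fires, which is the number that determines whether the pattern actually saves money.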
Latency often matters more than dollar cost. Zero-shot CoT on GPT-4o adds 2 to 4 seconds over a plain answer. Self-consistency in parallel stays in the same range. A reasoning model adds 20 to 60 seconds, and for some complex o3 queries several minutes. A user waiting live on a screen tolerates 5 seconds; they do not tolerate a minute. This alone is often the deciding constraint in product design.
The cost comparison also affects where you put reasoning in a multi-step pipeline. A common 2026 pattern: use cheap CoT on a fast model for the many routine decisions, and reserve the reasoning model for the one or two nodes in the pipeline where accuracy dominates. This is discussed in more depth in the prompt engineering 2026 guide, which treats prompt cost as a first-class design variable.
Tree-of-Thought and Graph-of-Thought as extensions for complex problems
Chain-of-thought assumes a linear reasoning path: step one, step two, step three, answer. Many real problems are not linear. A chess position has multiple plausible moves, each leading to branching counter-moves. A research question may have several hypotheses worth exploring in parallel. Two extensions generalize CoT for these structures.
Tree-of-Thought, introduced by Yao et al. in 2023, treats each reasoning step as a node that can branch into several alternatives. The model generates three or five candidate next-steps, evaluates them against a criterion, keeps the promising ones and expands them further. The result is a search tree, and the final answer is the best leaf. In practice, Tree-of-Thought is implemented either through multiple API calls orchestrated by application code, or through a single very long prompt that asks the model to simulate the tree internally. The former is expensive but gives you explicit control; the latter is cheaper but relies on the model not losing the thread.
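The orchestrated variant can be sketched as a breadth-first search that keeps only the most promising states at each level. This is a simplified illustration under our own assumptions, not the reference implementation from the paper: `propose`, `score` and `is_solution` are caller-supplied callables that would each wrap a model call in practice.

```python
def tree_of_thought(root, propose, score, is_solution, beam_width=3, max_depth=4):
    """Breadth-first search over partial solutions, keeping the best few.

    propose(state)     -> list of successor states
    score(state)       -> number, higher means more promising
    is_solution(state) -> bool
    """
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        solved = [c for c in candidates if is_solution(c)]
        if solved:
            return max(solved, key=score)
        # Prune: only the most promising states are expanded further.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return None


# Toy stand-in for model calls: reach 8 from 1 using "+1" or "*2".
found = tree_of_thought(
    root=1,
    propose=lambda s: [s + 1, s * 2],
    score=lambda s: -abs(8 - s),
    is_solution=lambda s: s == 8,
)
print(found)  # 8
```

With real model calls, every `propose` and `score` invocation costs tokens, so `beam_width` and `max_depth` are effectively the budget controls.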
On the Game of 24 puzzle benchmark, Tree-of-Thought raised GPT-4’s success rate from around 4% with plain CoT to 74%. Smaller gains appear on creative writing evaluations, where branching helps explore narrative options. The cost, unsurprisingly, is high: a Tree-of-Thought run with branching factor five and depth four evaluates dozens of candidate paths.
Graph-of-Thought generalizes further. Instead of a tree, nodes can merge — two reasoning paths converge on the same subproblem, and the answer is reused. This matches how humans actually solve complex problems: we make lemmas, prove them once and apply them in several places. Graph-of-Thought implementations typically require an external graph structure managed by code, with the model invoked for each node expansion.
In 2026, Tree-of-Thought and Graph-of-Thought remain research-grade techniques for most teams. The operational overhead is high, the library ecosystem is thin, and reasoning models have absorbed many of the use cases where ToT used to dominate. The places where ToT still earns its keep are long-horizon planning (multi-day project roadmaps, multi-step game strategy, chemical-synthesis routing) and agent loops where an LLM controls many tool calls. For everything else, zero-shot CoT plus optional self-consistency remains the practical default.
When does Chain-of-Thought still pay off in 2026? Our concrete recommendation
Chain-of-thought is no longer the holy grail it was in 2022, but it is also not obsolete. In 2026 it occupies a specific and valuable niche: the best accuracy-per-dollar for reasoning tasks on general-purpose models, the most auditable reasoning trace for regulated work, and the fastest path to consistent structured output when paired with JSON schemas. Reasoning models like o1, o3 and Claude Thinking take over at the top end, where accuracy matters more than cost or latency. Self-consistency remains the right tool when one wrong answer is more expensive than five API calls. And Tree-of-Thought sits on the horizon for the minority of problems that cannot be expressed as a single line of reasoning. Pick the level of reasoning that matches your task, your budget and your latency tolerance — and resist the reflex to add reasoning to every prompt just because you can.
Sources and further reading
Claims about CoT effect and reasoning models rest on the primary literature: the original “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” paper is on arXiv, “Self-Consistency Improves Chain of Thought Reasoning” also on arXiv. On the vendor side, the OpenAI reasoning documentation covers the o1/o3 models and the Anthropic extended-thinking documentation covers Claude Thinking. A compact overview of all CoT variants lives in the Prompt Engineering Guide (DAIR.AI).
The full introduction with all core techniques is in the parent Prompt Engineering 2026 guide. How CoT relates to the example strategy is covered in Few-Shot vs. Zero-Shot Prompting; for model-specific tuning the ChatGPT, Claude and Gemini tool pages document the idioms each model responds to best.
Update note (as of 12.04.2026)
This guide is continuously reconciled with the reasoning-model moves of the three leading vendors. Particular attention goes to the expansion of o1/o3 to further tiers, the extended thinking mode in Claude Opus 4, possible Tree-of-Thought APIs and pricing changes for reasoning tokens, and EU AI Act requirements for reproducible reasoning traces from 02.08.2026. Market-relevant interim events appear first as cluster updates on the hub.
Frequently Asked Questions
What is Chain-of-Thought prompting in simple terms?
Instead of a direct answer, you ask the model to think out loud and show intermediate steps — similar to how a human does a calculation on paper before stating the final number. This significantly reduces errors on logical and mathematical tasks.
When is Chain-of-Thought useful — and when not?
Very effective: math, logic puzzles, legal argumentation, multi-step data analysis. Low value: simple facts, creative texts, stylistic adaptation. Rule of thumb: useful whenever a human would also need a sheet of paper to solve it.
How do I phrase a Zero-Shot-CoT prompt?
Add an activator at the end of your prompt: 'Let's think step by step.' On modern LLMs (GPT-4, Claude 3.5) this alone is often enough to measurably improve accuracy.
What is the difference between CoT and a reasoning model?
CoT is a prompt technique: the model externalizes its intermediate steps. Reasoning models (OpenAI o1, o3, Claude Thinking) have reasoning baked in internally, often running for several seconds, and show you only the final result. With reasoning models, CoT is often redundant.
What is Self-Consistency and when should I use it?
You run the same CoT prompt three to five times and take the most frequent answer. That compensates for the probabilistic nature of LLMs — especially for critical decisions with one clear solution.
How much more expensive does CoT get in API usage?
Significantly, because more tokens are generated. Expect 3–5x output tokens; with Self-Consistency, multiply that again by the number of runs. Concrete math: GPT-4o in 2026 costs around $2.50 per million output tokens, so five CoT runs at 500 tokens each cost about $0.006 per query.
What are the most common CoT mistakes?
The top three: (1) 'Let's think step by step' without concrete framing — often no longer needed, and sometimes leads to endless monologues. (2) CoT on simple factual queries — creates noise. (3) No structured closing prompt ('summarize in one sentence') — output becomes rambling.
What comes after CoT — which technique will replace it?
Tree-of-Thoughts (ToT) is the next step: the model explores multiple reasoning paths in parallel and picks the best one. Still too expensive for daily enterprise use. In 2026 we are in the phase where CoT is the standard, ToT reserved for critical individual decisions.









