
How to Build Efficient Agentic Reasoning Systems by Dynamically Pruning Multiple Chain-of-Thought Paths Without Losing Accuracy


In this tutorial, we implement an agentic chain-of-thought pruning framework that generates multiple reasoning paths in parallel and dynamically reduces them using consensus signals and early stopping. We focus on improving reasoning efficiency by reducing unnecessary token usage while preserving answer correctness, demonstrating that self-consistency and lightweight graph-based agreement can serve as effective proxies for reasoning quality. We design the entire pipeline using a compact instruction-tuned model and progressive sampling to simulate how an agent can decide when it has reasoned “enough.” Check out the FULL CODES here.

!pip -q install -U transformers accelerate bitsandbytes networkx scikit-learn


import re, time, random, math
import numpy as np
import torch
import networkx as nx
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig, BitsAndBytesConfig
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


SEED = 7
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)


MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    quantization_config=bnb_config,
)
model.eval()


SYSTEM = "You are a careful problem solver. Keep reasoning brief and output a final numeric answer."
FINAL_RE = re.compile(r"Final:\s*([-\d]+(?:\.\d+)?)")

We set up the Colab environment and load all required libraries for efficient agentic reasoning. We initialize a lightweight instruction-tuned language model with quantization to ensure stable execution on limited GPU resources. We also define global configuration, randomness control, and the core prompting pattern used throughout the tutorial. Check out the FULL CODES here.
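
As a quick smoke test (a hypothetical addition, not part of the original tutorial), we can confirm the quantized model actually generates text before building anything on top of it:

# Hypothetical smoke test: greedy-decode a tiny prompt to verify the model responds.
ids = tokenizer("2+2=", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=8, do_sample=False,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))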

def make_prompt(q):
   return (
       f"{SYSTEM}\n\n"
       f"Problem: {q}\n"
       f"Reasoning: (brief)\n"
       f"Final: "
   )


def parse_final_number(text):
   m = FINAL_RE.search(text)
   if m:
       return m.group(1).strip()
   nums = re.findall(r"[-]?\d+(?:\.\d+)?", text)
   return nums[-1] if nums else None


def is_correct(pred, gold):
    if pred is None:
        return 0
    try:
        return int(abs(float(pred) - float(gold)) < 1e-9)
    except (TypeError, ValueError):
        return int(str(pred).strip() == str(gold).strip())


def tok_len(text):
   return len(tokenizer.encode(text))

We define helper functions that structure prompts, extract final numeric answers, and evaluate correctness against ground truth. We standardize how answers are parsed so that different reasoning paths can be compared consistently. We also introduce token-counting utilities that allow us to later measure reasoning efficiency. Check out the FULL CODES here.
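
As a sanity check (hypothetical, not in the original code), the helpers behave as follows on a hand-written completion:

# Hypothetical sanity check for the parsing helpers above.
sample = "Reasoning: 3 notebooks cost $12, so one costs 12/3.\nFinal: 4"
pred = parse_final_number(sample)
print(pred)                   # "4", extracted via FINAL_RE
print(is_correct(pred, "4"))  # 1
print(tok_len(sample))        # token count under the loaded tokenizer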

@torch.no_grad()
def generate_paths(question, n, max_new_tokens=64, temperature=0.7, top_p=0.9):
   prompt = make_prompt(question)
   inputs = tokenizer(prompt, return_tensors="pt").to(model.device)


   gen_cfg = GenerationConfig(
       do_sample=True,
       temperature=temperature,
       top_p=top_p,
       max_new_tokens=max_new_tokens,
       pad_token_id=tokenizer.eos_token_id,
       eos_token_id=tokenizer.eos_token_id,
       num_return_sequences=n
   )


   out = model.generate(**inputs, generation_config=gen_cfg)
   prompt_tok = inputs["input_ids"].shape[1]


   paths = []
   for i in range(out.shape[0]):
       seq = out[i]
       gen_ids = seq[prompt_tok:]
       completion = tokenizer.decode(gen_ids, skip_special_tokens=True)
       paths.append({
           "prompt_tokens": int(prompt_tok),
           "gen_tokens": int(gen_ids.shape[0]),
           "completion": completion
       })
   return paths

We implement fast multi-sample generation that produces several reasoning paths in a single model call. We extract only the generated continuation to isolate the reasoning output for each path. We store token usage and completions in a structured format to support downstream pruning decisions. Check out the FULL CODES here.
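
A minimal usage sketch (the question text here is illustrative, not from the original):

# Illustrative call: sample three short reasoning paths for one question.
demo_paths = generate_paths("What is 12*4?", n=3, max_new_tokens=48)
for p in demo_paths:
    print(p["gen_tokens"], "tokens ->", parse_final_number(p["completion"]))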

def consensus_strength(completions, sim_threshold=0.22):
   if len(completions) <= 1:
       return [0.0] * len(completions)


   vec = TfidfVectorizer(ngram_range=(1,2), max_features=2500)
   X = vec.fit_transform(completions)
   S = cosine_similarity(X)


   G = nx.Graph()
   n = len(completions)
   G.add_nodes_from(range(n))


   for i in range(n):
       for j in range(i+1, n):
           w = float(S[i, j])
           if w >= sim_threshold:
               G.add_edge(i, j, weight=w)


   strength = [0.0] * n
   for u, v, d in G.edges(data=True):
       w = float(d.get("weight", 0.0))
       strength[u] += w
       strength[v] += w


   return strength

We construct a lightweight consensus mechanism using a similarity graph over generated reasoning paths. We compute pairwise similarity scores and convert them into a graph-based strength signal for each path. This allows us to approximate agreement between reasoning trajectories without expensive model calls. Check out the FULL CODES here.
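
On a toy input (hypothetical), paths that share wording reinforce each other, while an unrelated path should score near zero:

# Hypothetical toy input for consensus_strength.
toy = [
    "12 divided by 3 is 4. Final: 4",
    "Each notebook costs 12/3 = 4. Final: 4",
    "The area is length times width, 9*4. Final: 36",
]
print(consensus_strength(toy))  # e.g. higher strength for the first two paths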

def pick_final_answer(paths):
   answers = [parse_final_number(p["completion"]) for p in paths]
   strengths = consensus_strength([p["completion"] for p in paths])


   groups = {}
   for i, a in enumerate(answers):
       if a is None:
           continue
       groups.setdefault(a, {"idx": [], "strength": 0.0, "tokens": 0})
       groups[a]["idx"].append(i)
       groups[a]["strength"] += strengths[i]
       groups[a]["tokens"] += paths[i]["gen_tokens"]


   if not groups:
       return None, {"answers": answers, "strengths": strengths}


   ranked = sorted(
       groups.items(),
       key=lambda kv: (len(kv[1]["idx"]), kv[1]["strength"], -kv[1]["tokens"]),
       reverse=True
   )


   best_answer = ranked[0][0]
   best_indices = ranked[0][1]["idx"]
   best_i = sorted(best_indices, key=lambda i: (paths[i]["gen_tokens"], -strengths[i]))[0]


   return best_answer, {"answers": answers, "strengths": strengths, "best_i": best_i}


def pruned_agent_answer(
   question,
   batch_size=2,
   k_max=10,
   max_new_tokens=64,
   temperature=0.7,
   top_p=0.9,
   stop_min_samples=4,
   stop_ratio=0.67,
   stop_margin=2
):
   paths = []
   prompt_tokens_once = tok_len(make_prompt(question))
   total_gen_tokens = 0


   while len(paths) < k_max:
       n = min(batch_size, k_max - len(paths))
       new_paths = generate_paths(
           question,
           n=n,
           max_new_tokens=max_new_tokens,
           temperature=temperature,
           top_p=top_p
       )
       paths.extend(new_paths)
       total_gen_tokens += sum(p["gen_tokens"] for p in new_paths)


       if len(paths) >= stop_min_samples:
           answers = [parse_final_number(p["completion"]) for p in paths]
           counts = {}
           for a in answers:
               if a is None:
                   continue
               counts[a] = counts.get(a, 0) + 1
           if counts:
               sorted_counts = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
               top_a, top_c = sorted_counts[0]
               second_c = sorted_counts[1][1] if len(sorted_counts) > 1 else 0
               if top_c >= math.ceil(stop_ratio * len(paths)) and (top_c - second_c) >= stop_margin:
                   final, dbg = pick_final_answer(paths)
                   return {
                       "final": final,
                       "paths": paths,
                       "early_stopped_at": len(paths),
                       "tokens_total": int(prompt_tokens_once * len(paths) + total_gen_tokens),
                       "debug": dbg
                   }


   final, dbg = pick_final_answer(paths)
   return {
       "final": final,
       "paths": paths,
       "early_stopped_at": None,
       "tokens_total": int(prompt_tokens_once * len(paths) + total_gen_tokens),
       "debug": dbg
   }

We implement the core agentic pruning logic that groups reasoning paths by final answers and ranks them using consensus and efficiency signals. We introduce progressive sampling with early stopping to terminate generation once sufficient confidence emerges. We then select a final answer that balances agreement strength and minimal token usage. Check out the FULL CODES here.
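
A single-question demo (hypothetical invocation, using the defaults defined above):

# Run the pruned agent on one question and inspect its bookkeeping.
res = pruned_agent_answer("What is 144 divided by 12?", max_new_tokens=56)
print("final:", res["final"])
print("paths sampled:", len(res["paths"]), "| early stop at:", res["early_stopped_at"])
print("tokens_total:", res["tokens_total"])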

def baseline_answer(question, k=10, max_new_tokens=64):
   paths = generate_paths(question, n=k, max_new_tokens=max_new_tokens)
   prompt_tokens_once = tok_len(make_prompt(question))
   total_gen_tokens = sum(p["gen_tokens"] for p in paths)


   answers = [parse_final_number(p["completion"]) for p in paths]
   counts = {}
   for a in answers:
       if a is None:
           continue
       counts[a] = counts.get(a, 0) + 1
   final = max(counts.items(), key=lambda kv: kv[1])[0] if counts else None


   return {
       "final": final,
       "paths": paths,
       "tokens_total": int(prompt_tokens_once * k + total_gen_tokens)
   }


DATA = [
   {"q": "If a store sells 3 notebooks for $12, how much does 1 notebook cost?", "a": "4"},
   {"q": "What is 17*6?", "a": "102"},
   {"q": "A rectangle has length 9 and width 4. What is its area?", "a": "36"},
   {"q": "If you buy 5 apples at $2 each, how much do you pay?", "a": "10"},
   {"q": "What is 144 divided by 12?", "a": "12"},
   {"q": "If x=8, what is 3x+5?", "a": "29"},
   {"q": "A jar has 30 candies. You eat 7. How many remain?", "a": "23"},
   {"q": "If a train travels 60 km in 1.5 hours, what is its average speed (km/h)?", "a": "40"},
   {"q": "Compute: (25 - 9) * 3", "a": "48"},
   {"q": "What is the next number in the pattern: 2, 4, 8, 16, ?", "a": "32"},
]


base_acc, base_tok = [], []
prun_acc, prun_tok = [], []


for item in DATA:
   b = baseline_answer(item["q"], k=8, max_new_tokens=56)
   base_acc.append(is_correct(b["final"], item["a"]))
   base_tok.append(b["tokens_total"])


   p = pruned_agent_answer(item["q"], max_new_tokens=56)
   prun_acc.append(is_correct(p["final"], item["a"]))
   prun_tok.append(p["tokens_total"])


print("Baseline accuracy:", float(np.mean(base_acc)))
print("Baseline avg tokens:", float(np.mean(base_tok)))
print("Pruned accuracy:", float(np.mean(prun_acc)))
print("Pruned avg tokens:", float(np.mean(prun_tok)))

We compare the pruned agentic approach against a fixed self-consistency baseline. We evaluate both methods on accuracy and token consumption to quantify the efficiency gains from pruning. We conclude by reporting aggregate metrics that demonstrate how dynamic pruning preserves correctness while reducing reasoning cost.
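
One optional follow-up (a hypothetical addition) turns the raw token counts into a relative savings figure:

# Hypothetical follow-up: express token counts as relative savings from pruning.
savings = 1.0 - float(np.mean(prun_tok)) / float(np.mean(base_tok))
print(f"Token savings from pruning: {savings:.1%}")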

In conclusion, we demonstrated that agentic pruning can significantly reduce effective token consumption without sacrificing accuracy by stopping reasoning once sufficient consensus emerges. We showed that combining self-consistency, similarity-based consensus graphs, and early-stop heuristics provides a practical and scalable approach to reasoning efficiency in agentic systems. This framework serves as a foundation for more advanced agentic behaviors, such as mid-generation pruning, budget-aware reasoning, and adaptive control over reasoning depth in real-world AI agents.


Check out the FULL CODES here. Also, feel free to follow us on Twitter, and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter. You can also join us on Telegram.



