Friday, April 24, 2026
Mobile Offer

🎁 You've Got 1 Reward Left

Check if your device is eligible for instant bonuses.

Unlock Now
Survey Cash

🧠 Discover the Simple Money Trick

This quick task could pay you today — no joke.

See It Now
Top Deals

📦 Top Freebies Available Near You

Get hot mobile rewards now. Limited time offers.

Get Started
Game Offer

🎮 Unlock Premium Game Packs

Boost your favorite game with hidden bonuses.

Claim Now
Money Offers

💸 Earn Instantly With This Task

No fees, no waiting — your earnings could be 1 click away.

Start Earning
Crypto Airdrop

🚀 Claim Free Crypto in Seconds

Register & grab real tokens now. Zero investment needed.

Get Tokens
Food Offers

🍔 Get Free Food Coupons

Claim your free fast food deals instantly.

Grab Coupons
VIP Offers

🎉 Join Our VIP Club

Access secret deals and daily giveaways.

Join Now
Mystery Offer

🎁 Mystery Gift Waiting for You

Click to reveal your surprise prize now!

Reveal Gift
App Bonus

📱 Download & Get Bonus

New apps giving out free rewards daily.

Download Now
Exclusive Deals

💎 Exclusive Offers Just for You

Unlock hidden discounts and perks.

Unlock Deals
Movie Offer

🎬 Watch Paid Movies Free

Stream your favorite flicks with no cost.

Watch Now
Prize Offer

🏆 Enter to Win Big Prizes

Join contests and win amazing rewards.

Enter Now
Life Hack

💡 Simple Life Hack to Save Cash

Try this now and watch your savings grow.

Learn More
Top Apps

📲 Top Apps Giving Gifts

Download & get rewards instantly.

Get Gifts
Summer Drinks

🍹 Summer Cocktails Recipes

Make refreshing drinks at home easily.

Get Recipes

Latest Posts

How to Build an Agentic Deep Reinforcement Learning System with Curriculum Progression, Adaptive Exploration, and Meta-Level UCB Planning


In this tutorial, we build an advanced agentic Deep Reinforcement Learning system that guides an agent to learn not only actions within an environment but also how to choose its own training strategies. We design a Dueling Double DQN learner, introduce a curriculum with increasing difficulty, and integrate multiple exploration modes that adapt as training evolves. Most importantly, we construct a meta-agent that plans, evaluates, and regulates the entire learning process, allowing us to experience how agency transforms reinforcement learning into a self-directed, strategic workflow. Check out the FULL CODES here.

!pip install -q gymnasium[classic-control] torch matplotlib


import gymnasium as gym
import numpy as np
import torch, torch.nn as nn, torch.optim as optim
from collections import deque, defaultdict
import math, random, matplotlib.pyplot as plt


random.seed(0); np.random.seed(0); torch.manual_seed(0)


class DuelingQNet(nn.Module):
   def __init__(self, obs_dim, act_dim):
       super().__init__()
       hidden = 128
       self.feature = nn.Sequential(
           nn.Linear(obs_dim, hidden),
           nn.ReLU(),
       )
       self.value_head = nn.Sequential(
           nn.Linear(hidden, hidden),
           nn.ReLU(),
           nn.Linear(hidden, 1),
       )
       self.adv_head = nn.Sequential(
           nn.Linear(hidden, hidden),
           nn.ReLU(),
           nn.Linear(hidden, act_dim),
       )


   def forward(self, x):
       h = self.feature(x)
       v = self.value_head(h)
       a = self.adv_head(h)
       return v + (a - a.mean(dim=1, keepdim=True))


class ReplayBuffer:
   def __init__(self, capacity=100000):
       self.buffer = deque(maxlen=capacity)
   def push(self, s,a,r,ns,d):
       self.buffer.append((s,a,r,ns,d))
   def sample(self, batch_size):
       batch = random.sample(self.buffer, batch_size)
       s,a,r,ns,d = zip(*batch)
       def to_t(x, dt): return torch.tensor(x, dtype=dt, device=device)
       return to_t(s,torch.float32), to_t(a,torch.long), to_t(r,torch.float32), to_t(ns,torch.float32), to_t(d,torch.float32)
   def __len__(self): return len(self.buffer)

We set up the core structure of our deep reinforcement learning system. We initialize the environment, create the dueling Q-network, and prepare the replay buffer to store transitions efficiently. As we establish these foundations, we prepare everything our agent needs to begin learning. Check out the FULL CODES here.

class DQNAgent:
   def __init__(self, obs_dim, act_dim, gamma=0.99, lr=1e-3, batch_size=64):
       self.q = DuelingQNet(obs_dim, act_dim).to(device)
       self.tgt = DuelingQNet(obs_dim, act_dim).to(device)
       self.tgt.load_state_dict(self.q.state_dict())
       self.buf = ReplayBuffer()
       self.opt = optim.Adam(self.q.parameters(), lr=lr)
       self.gamma = gamma
       self.batch_size = batch_size
       self.global_step = 0


   def _eps_value(self, step, start=1.0, end=0.05, decay=8000):
       return end + (start - end) * math.exp(-step/decay)


   def select_action(self, state, mode, strategy, softmax_temp=1.0):
       s = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
       with torch.no_grad():
           q_vals = self.q(s).cpu().numpy()[0]
       if mode == "eval":
           return int(np.argmax(q_vals)), None
       if strategy == "epsilon":
           eps = self._eps_value(self.global_step)
           if random.random() < eps:
               return random.randrange(len(q_vals)), eps
           return int(np.argmax(q_vals)), eps
       if strategy == "softmax":
           logits = q_vals / softmax_temp
           p = np.exp(logits - np.max(logits))
           p /= p.sum()
           return int(np.random.choice(len(q_vals), p=p)), None
       return int(np.argmax(q_vals)), None


   def train_step(self):
       if len(self.buf) < self.batch_size:
           return None
       s,a,r,ns,d = self.buf.sample(self.batch_size)
       with torch.no_grad():
           next_q_online = self.q(ns)
           next_actions = next_q_online.argmax(dim=1, keepdim=True)
           next_q_target = self.tgt(ns).gather(1, next_actions).squeeze(1)
           target = r + self.gamma * next_q_target * (1 - d)
       q_vals = self.q(s).gather(1, a.unsqueeze(1)).squeeze(1)
       loss = nn.MSELoss()(q_vals, target)
       self.opt.zero_grad()
       loss.backward()
       nn.utils.clip_grad_norm_(self.q.parameters(), 1.0)
       self.opt.step()
       return float(loss.item())


   def update_target(self):
       self.tgt.load_state_dict(self.q.state_dict())


   def run_episodes(self, env, episodes, mode, strategy):
       returns = []
       for _ in range(episodes):
           obs,_ = env.reset()
           done = False
           ep_ret = 0.0
           while not done:
               self.global_step += 1
               a,_ = self.select_action(obs, mode, strategy)
               nobs, r, term, trunc, _ = env.step(a)
               done = term or trunc
               if mode == "train":
                   self.buf.push(obs, a, r, nobs, float(done))
                   self.train_step()
               obs = nobs
               ep_ret += r
           returns.append(ep_ret)
       return float(np.mean(returns))


   def evaluate_across_levels(self, levels, episodes=5):
       scores = {}
       for name, max_steps in levels.items():
           env = gym.make("CartPole-v1", max_episode_steps=max_steps)
           avg = self.run_episodes(env, episodes, mode="eval", strategy="epsilon")
           env.close()
           scores[name] = avg
       return scores

We define how our agent observes the environment, chooses actions, and updates its neural network. We implement Double DQN logic, gradient updates, and exploration strategies that let the agent balance learning and discovery. As we finish this snippet, we equip our agent with its full low-level learning capabilities. Check out the FULL CODES here.

class MetaAgent:
   def __init__(self, agent):
       self.agent = agent
       self.levels = {
           "EASY": 100,
           "MEDIUM": 300,
           "HARD": 500,
       }
       self.plans = []
       for diff in self.levels.keys():
           for mode in ["train", "eval"]:
               for expl in ["epsilon", "softmax"]:
                   self.plans.append((diff, mode, expl))
       self.counts = defaultdict(int)
       self.values = defaultdict(float)
       self.t = 0
       self.history = []


   def _ucb_score(self, plan, c=2.0):
       n = self.counts[plan]
       if n == 0:
           return float("inf")
       return self.values[plan] + c * math.sqrt(math.log(self.t+1) / n)


   def select_plan(self):
       self.t += 1
       scores = [self._ucb_score(p) for p in self.plans]
       return self.plans[int(np.argmax(scores))]


   def make_env(self, diff):
       max_steps = self.levels[diff]
       return gym.make("CartPole-v1", max_episode_steps=max_steps)


   def meta_reward_fn(self, diff, mode, avg_return):
       r = avg_return
       if diff == "MEDIUM": r += 20
       if diff == "HARD": r += 50
       if mode == "eval" and diff == "HARD": r += 50
       return r


   def update_plan_value(self, plan, meta_reward):
       self.counts[plan] += 1
       n = self.counts[plan]
       mu = self.values[plan]
       self.values[plan] = mu + (meta_reward - mu) / n


   def run(self, meta_rounds=30):
       eval_log = {"EASY":[], "MEDIUM":[], "HARD":[]}
       for k in range(1, meta_rounds+1):
           diff, mode, expl = self.select_plan()
           env = self.make_env(diff)
           avg_ret = self.agent.run_episodes(env, 5 if mode=="train" else 3, mode, expl if mode=="train" else "epsilon")
           env.close()
           if k % 3 == 0:
               self.agent.update_target()
           meta_r = self.meta_reward_fn(diff, mode, avg_ret)
           self.update_plan_value((diff,mode,expl), meta_r)
           self.history.append((k, diff, mode, expl, avg_ret, meta_r))
           if mode == "eval":
               eval_log[diff].append((k, avg_ret))
           print(f"{k} {diff} {mode} {expl} {avg_ret:.1f} {meta_r:.1f}")
       return eval_log

We design the agentic layer that decides how the agent should train. We use a UCB bandit to select difficulty levels, modes, and exploration styles based on past performance. As we repeatedly run these choices, we observe the meta-agent strategically guiding the entire training process. Check out the FULL CODES here.

tmp_env = gym.make("CartPole-v1", max_episode_steps=100)
obs_dim, act_dim = tmp_env.observation_space.shape[0], tmp_env.action_space.n
tmp_env.close()


agent = DQNAgent(obs_dim, act_dim)
meta = MetaAgent(agent)


eval_log = meta.run(meta_rounds=36)


final_scores = agent.evaluate_across_levels(meta.levels, episodes=10)
print("Final Evaluation")
for k, v in final_scores.items():
   print(k, v)

We bring everything together by launching meta-rounds where the meta-agent selects plans and the DQN agent executes them. We track how performance evolves and how the agent adapts to increasingly difficult tasks. As this snippet runs, we see the emergence of long-horizon self-directed learning. Check out the FULL CODES here.

plt.figure(figsize=(9,4))
for diff, color in [("EASY","tab:blue"), ("MEDIUM","tab:orange"), ("HARD","tab:red")]:
   if eval_log[diff]:
       x, y = zip(*eval_log[diff])
       plt.plot(x, y, marker="o", label=f"{diff}")
plt.xlabel("Meta-Round")
plt.ylabel("Avg Return")
plt.title("Agentic Meta-Control Evaluation")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

We visualize how the agent performs across Easy, Medium, and Hard tasks over time. We observe learning trends, improvements, and the effects of agentic planning reflected in the curves. As we analyze these plots, we gain insight into how strategic decisions shape the agent’s overall progress.

In conclusion, we observe our agent evolve into a system that learns on multiple levels, refining its policies, adjusting its exploration, and strategically selecting how to train itself. We observe the meta-agent refine its decisions through UCB-based planning and guide the low-level learner toward more challenging tasks and improved stability. With a deeper understanding of how agentic structures amplify reinforcement learning, we can create systems that plan, adapt, and optimize their own improvement over time.


Check out the FULL CODES here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.


Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts of over 2 million monthly views, illustrating its popularity among audiences.

🙌 Follow MARKTECHPOST: Add us as a preferred source on Google.



Source link

Mobile Offer

🎁 You've Got 1 Reward Left

Check if your device is eligible for instant bonuses.

Unlock Now
Survey Cash

🧠 Discover the Simple Money Trick

This quick task could pay you today — no joke.

See It Now
Top Deals

📦 Top Freebies Available Near You

Get hot mobile rewards now. Limited time offers.

Get Started
Game Offer

🎮 Unlock Premium Game Packs

Boost your favorite game with hidden bonuses.

Claim Now
Money Offers

💸 Earn Instantly With This Task

No fees, no waiting — your earnings could be 1 click away.

Start Earning
Crypto Airdrop

🚀 Claim Free Crypto in Seconds

Register & grab real tokens now. Zero investment needed.

Get Tokens
Food Offers

🍔 Get Free Food Coupons

Claim your free fast food deals instantly.

Grab Coupons
VIP Offers

🎉 Join Our VIP Club

Access secret deals and daily giveaways.

Join Now
Mystery Offer

🎁 Mystery Gift Waiting for You

Click to reveal your surprise prize now!

Reveal Gift
App Bonus

📱 Download & Get Bonus

New apps giving out free rewards daily.

Download Now
Exclusive Deals

💎 Exclusive Offers Just for You

Unlock hidden discounts and perks.

Unlock Deals
Movie Offer

🎬 Watch Paid Movies Free

Stream your favorite flicks with no cost.

Watch Now
Prize Offer

🏆 Enter to Win Big Prizes

Join contests and win amazing rewards.

Enter Now
Life Hack

💡 Simple Life Hack to Save Cash

Try this now and watch your savings grow.

Learn More
Top Apps

📲 Top Apps Giving Gifts

Download & get rewards instantly.

Get Gifts
Summer Drinks

🍹 Summer Cocktails Recipes

Make refreshing drinks at home easily.

Get Recipes

Latest Posts

Don't Miss

Stay in touch

To be updated with all the latest news, offers and special announcements.