Wednesday, April 8, 2026
Mobile Offer

🎁 You've Got 1 Reward Left

Check if your device is eligible for instant bonuses.

Unlock Now
Survey Cash

🧠 Discover the Simple Money Trick

This quick task could pay you today — no joke.

See It Now
Top Deals

📦 Top Freebies Available Near You

Get hot mobile rewards now. Limited time offers.

Get Started
Game Offer

🎮 Unlock Premium Game Packs

Boost your favorite game with hidden bonuses.

Claim Now
Money Offers

💸 Earn Instantly With This Task

No fees, no waiting — your earnings could be 1 click away.

Start Earning
Crypto Airdrop

🚀 Claim Free Crypto in Seconds

Register & grab real tokens now. Zero investment needed.

Get Tokens
Food Offers

🍔 Get Free Food Coupons

Claim your free fast food deals instantly.

Grab Coupons
VIP Offers

🎉 Join Our VIP Club

Access secret deals and daily giveaways.

Join Now
Mystery Offer

🎁 Mystery Gift Waiting for You

Click to reveal your surprise prize now!

Reveal Gift
App Bonus

📱 Download & Get Bonus

New apps giving out free rewards daily.

Download Now
Exclusive Deals

💎 Exclusive Offers Just for You

Unlock hidden discounts and perks.

Unlock Deals
Movie Offer

🎬 Watch Paid Movies Free

Stream your favorite flicks with no cost.

Watch Now
Prize Offer

🏆 Enter to Win Big Prizes

Join contests and win amazing rewards.

Enter Now
Life Hack

💡 Simple Life Hack to Save Cash

Try this now and watch your savings grow.

Learn More
Top Apps

📲 Top Apps Giving Gifts

Download & get rewards instantly.

Get Gifts
Summer Drinks

🍹 Summer Cocktails Recipes

Make refreshing drinks at home easily.

Get Recipes

Latest Posts

Implementing Softmax From Scratch: Avoiding the Numerical Stability Trap


In deep learning, classification models don’t just need to make predictions—they need to express confidence. That’s where the Softmax activation function comes in. Softmax takes the raw, unbounded scores produced by a neural network and transforms them into a well-defined probability distribution, making it possible to interpret each output as the likelihood of a specific class. 

This property makes Softmax a cornerstone of multi-class classification tasks, from image recognition to language modeling. In this article, we’ll build an intuitive understanding of how Softmax works and why its implementation details matter more than they first appear. Check out the FULL CODES here.

Implementing Naive Softmax

import torch

def softmax_naive(logits):
    exp_logits = torch.exp(logits)
    return exp_logits / exp_logits.sum(dim=1, keepdim=True)

This function implements the Softmax activation in its most straightforward form. It exponentiates each logit and normalizes it by the sum of all exponentiated values across classes, producing a probability distribution for each input sample. 

While this implementation is mathematically correct and easy to read, it is numerically unstable—large positive logits can cause overflow, and large negative logits can underflow to zero. As a result, this version should be avoided in real training pipelines. Check out the FULL CODES here.

Sample Logits and Target Labels

This example defines a small batch with three samples and three classes to illustrate both normal and failure cases. The first and third samples contain reasonable logit values and behave as expected during Softmax computation. The second sample intentionally includes extreme values (1000 and -1000) to demonstrate numerical instability—this is where the naive Softmax implementation breaks down. 

The targets tensor specifies the correct class index for each sample and will be used to compute the classification loss and observe how instability propagates during backpropagation. Check out the FULL CODES here.

# Batch of 3 samples, 3 classes
logits = torch.tensor([
    [2.0, 1.0, 0.1],      
    [1000.0, 1.0, -1000.0],  
    [3.0, 2.0, 1.0]
], requires_grad=True)

targets = torch.tensor([0, 2, 1])

Forward Pass: Softmax Output and the Failure Case

During the forward pass, the naive Softmax function is applied to the logits to produce class probabilities. For normal logit values (first and third samples), the output is a valid probability distribution where values lie between 0 and 1 and sum to 1. 

However, the second sample clearly exposes the numerical issue: exponentiating 1000 overflows to infinity, while -1000 underflows to zero. This results in invalid operations during normalization, producing NaN values and zero probabilities. Once NaN appears at this stage, it contaminates all subsequent computations, making the model unusable for training. Check out the FULL CODES here.

# Forward pass
probs = softmax_naive(logits)

print("Softmax probabilities:")
print(probs)

Target Probabilities and Loss Breakdown

Here, we extract the predicted probability corresponding to the true class for each sample. While the first and third samples return valid probabilities, the second sample’s target probability is 0.0, caused by numerical underflow in the Softmax computation. When the loss is calculated using -log(p), taking the logarithm of 0.0 results in +∞

This makes the overall loss infinite, which is a critical failure during training. Once the loss becomes infinite, gradient computation becomes unstable, leading to NaNs during backpropagation and effectively halting learning. Check out the FULL CODES here.

# Extract target probabilities
target_probs = probs[torch.arange(len(targets)), targets]

print("\nTarget probabilities:")
print(target_probs)

# Compute loss
loss = -torch.log(target_probs).mean()
print("\nLoss:", loss)

Backpropagation: Gradient Corruption

When backpropagation is triggered, the impact of the infinite loss becomes immediately visible. The gradients for the first and third samples remain finite because their Softmax outputs were well-behaved. However, the second sample produces NaN gradients across all classes due to the log(0) operation in the loss. 

These NaNs propagate backward through the network, contaminating weight updates and effectively breaking training. This is why numerical instability at the Softmax–loss boundary is so dangerous—once NaNs appear, recovery is nearly impossible without restarting training. Check out the FULL CODES here.

loss.backward()

print("\nGradients:")
print(logits.grad)

Numerical Instability and Its Consequences

Separating Softmax and cross-entropy creates a serious numerical stability risk due to exponential overflow and underflow. Large logits can push probabilities to infinity or zero, causing log(0) and leading to NaN gradients that quickly corrupt training. At production scale, this is not a rare edge case but a certainty—without stable, fused implementations, large multi-GPU training runs would fail unpredictably. 

The core numerical problem comes from the fact that computers cannot represent infinitely large or infinitely small numbers. Floating-point formats like FP32 have strict limits on how big or small a value can be stored. When Softmax computes exp(x), large positive values grow so fast that they exceed the maximum representable number and turn into infinity, while large negative values shrink so much that they become zero. Once a value becomes infinity or zero, subsequent operations like division or logarithms break down and produce invalid results. Check out the FULL CODES here.

Implementing Stable Cross-Entropy Loss Using LogSumExp

This implementation computes cross-entropy loss directly from raw logits without explicitly calculating Softmax probabilities. To maintain numerical stability, the logits are first shifted by subtracting the maximum value per sample, ensuring exponentials stay within a safe range. 

The LogSumExp trick is then used to compute the normalization term, after which the original (unshifted) target logit is subtracted to obtain the correct loss. This approach avoids overflow, underflow, and NaN gradients, and mirrors how cross-entropy is implemented in production-grade deep learning frameworks. Check out the FULL CODES here.

def stable_cross_entropy(logits, targets):

    # Find max logit per sample
    max_logits, _ = torch.max(logits, dim=1, keepdim=True)

    # Shift logits for numerical stability
    shifted_logits = logits - max_logits

    # Compute LogSumExp
    log_sum_exp = torch.log(torch.sum(torch.exp(shifted_logits), dim=1)) + max_logits.squeeze(1)

    # Compute loss using ORIGINAL logits
    loss = log_sum_exp - logits[torch.arange(len(targets)), targets]

    return loss.mean()

Stable Forward and Backward Pass

Running the stable cross-entropy implementation on the same extreme logits produces a finite loss and well-defined gradients. Even though one sample contains very large values (1000 and -1000), the LogSumExp formulation keeps all intermediate computations in a safe numerical range. As a result, backpropagation completes successfully without producing NaNs, and each class receives a meaningful gradient signal. 

This confirms that the instability seen earlier was not caused by the data itself, but by the naive separation of Softmax and cross-entropy—an issue fully resolved by using a numerically stable, fused loss formulation. Check out the FULL CODES here.

logits = torch.tensor([
    [2.0, 1.0, 0.1],
    [1000.0, 1.0, -1000.0],
    [3.0, 2.0, 1.0]
], requires_grad=True)

targets = torch.tensor([0, 2, 1])

loss = stable_cross_entropy(logits, targets)
print("Stable loss:", loss)

loss.backward()
print("\nGradients:")
print(logits.grad)

Conclusion

In practice, the gap between mathematical formulas and real-world code is where many training failures originate. While Softmax and cross-entropy are mathematically well-defined, their naive implementation ignores the finite precision limits of IEEE 754 hardware, making underflow and overflow inevitable. 

The key fix is simple but critical: shift logits before exponentiation and operate in the log domain whenever possible. Most importantly, training rarely requires explicit probabilities—stable log-probabilities are sufficient and far safer. When a loss suddenly turns into NaN in production, it’s often a signal that Softmax is being computed manually somewhere it shouldn’t be.


Check out the FULL CODES here. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Check out our latest release of ai2025.dev, a 2025-focused analytics platform that turns model launches, benchmarks, and ecosystem activity into a structured dataset you can filter, compare, and export


I am a Civil Engineering Graduate (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in Data Science, especially Neural Networks and their application in various areas.



Source link

Latest Posts

Don't Miss

Stay in touch

To be updated with all the latest news, offers and special announcements.