Saturday, April 4, 2026
Mobile Offer

🎁 You've Got 1 Reward Left

Check if your device is eligible for instant bonuses.

Unlock Now
Survey Cash

🧠 Discover the Simple Money Trick

This quick task could pay you today — no joke.

See It Now
Top Deals

📦 Top Freebies Available Near You

Get hot mobile rewards now. Limited time offers.

Get Started
Game Offer

🎮 Unlock Premium Game Packs

Boost your favorite game with hidden bonuses.

Claim Now
Money Offers

💸 Earn Instantly With This Task

No fees, no waiting — your earnings could be 1 click away.

Start Earning
Crypto Airdrop

🚀 Claim Free Crypto in Seconds

Register & grab real tokens now. Zero investment needed.

Get Tokens
Food Offers

🍔 Get Free Food Coupons

Claim your free fast food deals instantly.

Grab Coupons
VIP Offers

🎉 Join Our VIP Club

Access secret deals and daily giveaways.

Join Now
Mystery Offer

🎁 Mystery Gift Waiting for You

Click to reveal your surprise prize now!

Reveal Gift
App Bonus

📱 Download & Get Bonus

New apps giving out free rewards daily.

Download Now
Exclusive Deals

💎 Exclusive Offers Just for You

Unlock hidden discounts and perks.

Unlock Deals
Movie Offer

🎬 Watch Paid Movies Free

Stream your favorite flicks with no cost.

Watch Now
Prize Offer

🏆 Enter to Win Big Prizes

Join contests and win amazing rewards.

Enter Now
Life Hack

💡 Simple Life Hack to Save Cash

Try this now and watch your savings grow.

Learn More
Top Apps

📲 Top Apps Giving Gifts

Download & get rewards instantly.

Get Gifts
Summer Drinks

🍹 Summer Cocktails Recipes

Make refreshing drinks at home easily.

Get Recipes

Latest Posts

5 Types of Loss Functions in Machine Learning


A loss function is what guides a model during training, translating predictions into a signal it can improve on. But not all losses behave the same—some amplify large errors, others stay stable in noisy settings, and each choice subtly shapes how learning unfolds.

Modern libraries add another layer with reduction modes and scaling effects that influence optimization. In this article, we break down the major loss families and how to choose the right one for your task. 

Mathematical Foundations of Loss Functions

In supervised learning, the objective is typically to minimize the empirical risk,

 (often with optional sample weights and regularization).  

where ℓ is the loss function, fθ(xi) is the model prediction, and yi is the true target. In practice, this objective may also include sample weights and regularization terms. Most machine learning frameworks follow this formulation by computing per-example losses and then applying a reduction such as mean, sum, or none. 

When discussing mathematical properties, it is important to state the variable with respect to which the loss is analyzed. Many loss functions are convex in the prediction or logit for a fixed label, although the overall training objective is usually non-convex in neural network parameters. Important properties include convexity, differentiability, robustness to outliers, and scale sensitivity. Common implementation of pitfalls includes confusing logits with probabilities and using a reduction that does not match the intended mathematical definition. 

Flowchart

Regression Losses

Mean Squared Error 

Mean Squared Error, or MSE, is one of the most widely used loss functions for regression. It is defined as the average of the squared differences between predicted values and true targets: 

Mean Squared Error

Because the error term is squared, large residuals are penalized more heavily than small ones. This makes MSE useful when large prediction errors should be strongly discouraged. It is convex in the prediction and differentiable everywhere, which makes optimization straightforward. However, it is sensitive to outliers, since a single extreme residual can strongly affect the loss. 

import numpy as np
import matplotlib.pyplot as plt

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)
print("MSE:", mse)
Mean Squared Error

Mean Absolute Error 

Mean Absolute Error, or MAE, measures the average absolute difference between predictions and targets: 

Mean Absolute Error

Unlike MSE, MAE penalizes errors linearly rather than quadratically. As a result, it is more robust to outliers. MAE is convex in the prediction, but it is not differentiable at zero residual, so optimization typically uses subgradients at that point. 

import numpy as np  

y_true = np.array([3.0, -0.5, 2.0, 7.0])  
y_pred = np.array([2.5, 0.0, 2.0, 8.0])  

mae = np.mean(np.abs(y_true - y_pred))  

print("MAE:", mae)
Mean Absolute Error

Huber Loss 

Huber loss combines the strengths of MSE and MAE by behaving quadratically for small errors and linearly for large ones. For a threshold δ>0, it is defined as:

Huber Loss 

This makes Huber loss a good choice when the data are mostly well behaved but may contain occasional outliers. 

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

error = y_pred - y_true
delta = 1.0

huber = np.mean(
    np.where(
        np.abs(error) <= delta,
        0.5 * error**2,
        delta * (np.abs(error) - 0.5 * delta)
    )
)

print("Huber Loss:", huber)
Huber Loss 

Smooth L1 Loss 

Smooth L1 loss is closely related to Huber loss and is commonly used in deep learning, especially in object detection and regression heads. It transitions from a squared penalty near zero to an absolute penalty beyond a threshold. It is differentiable everywhere and less sensitive to outliers than MSE. 

import torch
import torch.nn.functional as F

y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])

smooth_l1 = F.smooth_l1_loss(y_pred, y_true, beta=1.0)

print("Smooth L1 Loss:", smooth_l1.item())
Huber Loss 

Log-Cosh Loss 

Log-cosh loss is a smooth alternative to MAE and is defined as 

Log-Cosh Loss 

Near zero residuals, it behaves like squared loss, while for large residuals it grows almost linearly. This gives it a good balance between smooth optimization and robustness to outliers. 

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

error = y_pred - y_true

logcosh = np.mean(np.log(np.cosh(error)))

print("Log-Cosh Loss:", logcosh)
Log-Cosh Loss 

Quantile Loss 

Quantile loss, also called pinball loss, is used when the goal is to estimate a conditional quantile rather than a conditional mean. For a quantile level τ∈(0,1) and residual  u=y−y^  it is defined as 

Quantile Loss 

It penalizes overestimation and underestimation asymmetrically, making it useful in forecasting and uncertainty estimation. 

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

tau = 0.8

u = y_true - y_pred

quantile_loss = np.mean(np.where(u >= 0, tau * u, (tau - 1) * u))

print("Quantile Loss:", quantile_loss)
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

tau = 0.8

u = y_true - y_pred

quantile_loss = np.mean(np.where(u >= 0, tau * u, (tau - 1) * u))

print("Quantile Loss:", quantile_loss)
Quantile Loss 

MAPE 

Mean Absolute Percentage Error, or MAPE, measures relative error and is defined as 

Mean Absolute Percentage Error

It is useful when relative error matters more than absolute error, but it becomes unstable when target values are zero or very close to zero. 

import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([90.0, 210.0, 290.0])

mape = np.mean(np.abs((y_true - y_pred) / y_true))

print("MAPE:", mape)
Mean Absolute Percentage Error

MSLE 

Mean Squared Logarithmic Error, or MSLE, is defined as 

Mean Squared Logarithmic Error

It is useful when relative differences matter and the targets are nonnegative. 

import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([90.0, 210.0, 290.0])

msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)

print("MSLE:", msle)
Mean Squared Logarithmic Error

Poisson Negative Log-Likelihood 

Poisson negative log-likelihood is used for count data. For a rate parameter λ>0, it is typically written as

Poisson Negative Log-Likelihood 

In practice, the constant term may be omitted. This loss is appropriate when targets represent counts generated from a Poisson process. 

import numpy as np

y_true = np.array([2.0, 0.0, 4.0])
lam = np.array([1.5, 0.5, 3.0])

poisson_nll = np.mean(lam - y_true * np.log(lam))

print("Poisson NLL:", poisson_nll)
Poisson Negative Log-Likelihood 

Gaussian Negative Log-Likelihood 

Gaussian negative log-likelihood allows the model to predict both the mean and the variance of the target distribution. A common form is 

Gaussian negative log-likelihood

This is useful for heteroscedastic regression, where the noise level varies across inputs. 

import numpy as np

y_true = np.array([0.0, 1.0])
mu = np.array([0.0, 1.5])
var = np.array([1.0, 0.25])

gaussian_nll = np.mean(0.5 * (np.log(var) + (y_true - mu) ** 2 / var))

print("Gaussian NLL:", gaussian_nll)
Gaussian negative log-likelihood

Classification and Probabilistic Losses

Binary Cross-Entropy and Log Loss 

Binary cross-entropy, or BCE, is used for binary classification. It compares a Bernoulli label y∈{0,1} with a predicted probability p∈(0,1): 

Binary Cross-Entropy

In practice, many libraries prefer logits rather than probabilities and compute the loss in a numerically stable way. This avoids instability caused by applying sigmoid separately before the logarithm. BCE is convex in the logit for a fixed label and differentiable, but it is not robust to label noise because confidently wrong predictions can produce very large loss values. It is widely used for binary classification, and in multi-label classification it is applied independently to each label. A common pitfall is confusing probabilities with logits, which can silently degrade training. 

import torch

logits = torch.tensor([2.0, -1.0, 0.0])
y_true = torch.tensor([1.0, 0.0, 1.0])

bce = torch.nn.BCEWithLogitsLoss()
loss = bce(logits, y_true)

print("BCEWithLogitsLoss:", loss.item())
Binary Cross-Entropy

Softmax Cross-Entropy for Multiclass Classification 

Softmax cross-entropy is the standard loss for multiclass classification. For a class index y and logits vector z, it combines the softmax transformation with cross-entropy loss: 

Softmax cross-entropy

This loss is convex in the logits and differentiable. Like BCE, it can heavily penalize confident wrong predictions and is not inherently robust to label noise. It is commonly used in standard multiclass classification and also in pixelwise classification tasks such as semantic segmentation. One important implementation detail is that many libraries, including PyTorch, expect integer class indices rather than one-hot targets unless soft-label variants are explicitly used. 

import torch
import torch.nn.functional as F

logits = torch.tensor([
    [2.0, 0.5, -1.0],
    [0.0, 1.0, 0.0]
], dtype=torch.float32)

y_true = torch.tensor([0, 2], dtype=torch.long)

loss = F.cross_entropy(logits, y_true)

print("CrossEntropyLoss:", loss.item())
Softmax cross-entropy

Label Smoothing Variant 

Label smoothing is a regularized form of cross-entropy in which a one-hot target is replaced by a softened target distribution. Instead of assigning full probability mass to the correct class, a small portion is distributed across the remaining classes. This discourages overconfident predictions and can improve calibration. 

The method remains differentiable and often improves generalization, especially in large-scale classification. However, too much smoothing can make the targets overly ambiguous and lead to underfitting. 

import torch
import torch.nn.functional as F

logits = torch.tensor([
    [2.0, 0.5, -1.0],
    [0.0, 1.0, 0.0]
], dtype=torch.float32)

y_true = torch.tensor([0, 2], dtype=torch.long)

loss = F.cross_entropy(logits, y_true, label_smoothing=0.1)

print("CrossEntropyLoss with label smoothing:", loss.item())
Label Smoothing Variant 

Margin Losses: Hinge Loss 

Hinge loss is a classic margin-based loss used in support vector machines. For binary classification with label y∈{−1,+1} and score s, it is defined as  

Hinge Loss 

Hinge loss is convex in the score but not differentiable at the margin boundary. It produces zero loss for examples that are correctly classified with sufficient margin, which leads to sparse gradients. Unlike cross-entropy, hinge loss is not probabilistic and does not directly provide calibrated probabilities. It is useful when a max-margin property is desired. 

import numpy as np

y_true = np.array([1.0, -1.0, 1.0])
scores = np.array([0.2, 0.4, 1.2])

hinge_loss = np.mean(np.maximum(0, 1 - y_true * scores))

print("Hinge Loss:", hinge_loss)
Hinge Loss 

KL Divergence 

Kullback-Leibler divergence compares two probability distributions P and Q: 

KL Divergence 

It is nonnegative and becomes zero only when the two distributions are identical. KL divergence is not symmetric, so it is not a true metric. It is widely used in knowledge distillation, variational inference, and regularization of learned distributions toward a prior. In practice, PyTorch expects the input distribution in log-probability form, and using the wrong reduction can change the reported value. In particular, batchmean matches the mathematical KL definition more closely than mean. 

import torch
import torch.nn.functional as F

P = torch.tensor([[0.7, 0.2, 0.1]], dtype=torch.float32)
Q = torch.tensor([[0.6, 0.3, 0.1]], dtype=torch.float32)

kl_batchmean = F.kl_div(Q.log(), P, reduction="batchmean")

print("KL Divergence (batchmean):", kl_batchmean.item())
KL Divergence 

KL Divergence Reduction Pitfall 

A common implementation issue with KL divergence is the choice of reduction. In PyTorch, reduction=”mean” scales the result differently from the true KL expression, while reduction=”batchmean” better matches the standard definition. 

import torch
import torch.nn.functional as F

P = torch.tensor([[0.7, 0.2, 0.1]], dtype=torch.float32)
Q = torch.tensor([[0.6, 0.3, 0.1]], dtype=torch.float32)

kl_batchmean = F.kl_div(Q.log(), P, reduction="batchmean")
kl_mean = F.kl_div(Q.log(), P, reduction="mean")

print("KL batchmean:", kl_batchmean.item())
print("KL mean:", kl_mean.item())
KL Divergence Reduction

Variational Autoencoder ELBO 

The variational autoencoder, or VAE, is trained by maximizing the evidence lower bound, commonly called the ELBO: 

Variational Autoencoder

This objective has two parts. The reconstruction term encourages the model to explain the data well, while the KL term regularizes the approximate posterior toward the prior. The ELBO is not convex in neural network parameters, but it is differentiable under the reparameterization trick. It is widely used in generative modeling and probabilistic representation learning. In practice, many variants introduce a weight on the KL term, such as in beta-VAE. 

import torch

reconstruction_loss = torch.tensor(12.5)
kl_term = torch.tensor(3.2)

elbo = reconstruction_loss + kl_term

print("VAE-style total loss:", elbo.item())
Variational Autoencoder

Imbalance-Aware Losses

Class Weights 

Class weighting is a common strategy for handling imbalanced datasets. Instead of treating all classes equally, higher loss weight is assigned to minority classes so that their errors contribute more strongly during training. In multiclass classification, weighted cross-entropy is often used: 

Class Weights 

where wy  is the weight for the true class. This approach is simple and effective when class frequencies differ substantially. However, excessively large weights can make optimization unstable. 

import torch
import torch.nn.functional as F

logits = torch.tensor([
    [2.0, 0.5, -1.0],
    [0.0, 1.0, 0.0],
    [0.2, -0.1, 1.5]
], dtype=torch.float32)

y_true = torch.tensor([0, 1, 2], dtype=torch.long)
class_weights = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)

loss = F.cross_entropy(logits, y_true, weight=class_weights)

print("Weighted Cross-Entropy:", loss.item())
Class Weights 

Positive Class Weight for Binary Loss 

For binary or multi-label classification, many libraries provide a pos_weight parameter that increases the contribution of positive examples in binary cross-entropy. This is especially useful when positive labels are rare. In PyTorch, BCEWithLogitsLoss supports this directly. 

This method is often preferred over naive resampling because it preserves all examples while adjusting the optimization signal. A common mistake is to confuse weight and pos_weight, since they affect the loss differently. 

import torch

logits = torch.tensor([2.0, -1.0, 0.5], dtype=torch.float32)
y_true = torch.tensor([1.0, 0.0, 1.0], dtype=torch.float32)

criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))
loss = criterion(logits, y_true)

print("BCEWithLogitsLoss with pos_weight:", loss.item())
Positive Class Weight for Binary Loss 

Focal Loss 

Focal loss is designed to address class imbalance by down-weighting easy examples and focusing training on harder ones. For binary classification, it is commonly written as 

Focal Loss 

where pt  is the model probability assigned to the true class, α is a class-balancing factor, and γ controls how strongly easy examples are down-weighted. When γ=0, focal loss reduces to ordinary cross-entropy. 

Focal loss is widely used in dense object detection and highly imbalanced classification problems. Its main hyperparameters are α and γ, both of which can significantly affect training behavior. 

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5], dtype=torch.float32)
y_true = torch.tensor([1.0, 0.0, 1.0], dtype=torch.float32)

bce = F.binary_cross_entropy_with_logits(logits, y_true, reduction="none")

probs = torch.sigmoid(logits)
pt = torch.where(y_true == 1, probs, 1 - probs)

alpha = 0.25
gamma = 2.0

focal_loss = (alpha * (1 - pt) ** gamma * bce).mean()

print("Focal Loss:", focal_loss.item())
Focal Loss 

Class-Balanced Reweighting 

Class-balanced reweighting improves on simple inverse-frequency weighting by using the effective number of samples rather than raw counts. A common formula for the class weight is 

Class-Balanced Reweighting 

where nc  is the number of samples in class c and β is a parameter close to 1. This gives smoother and often more stable reweighting than direct inverse counts. 

This method is useful when class imbalance is severe but naive class weights would be too extreme. The main hyperparameter is β, which determines how strongly rare classes are emphasized. 

import numpy as np

class_counts = np.array([1000, 100, 10], dtype=np.float64)
beta = 0.999

effective_num = 1.0 - np.power(beta, class_counts)
class_weights = (1.0 - beta) / effective_num

class_weights = class_weights / class_weights.sum() * len(class_counts)

print("Class-Balanced Weights:", class_weights)
Class-Balanced Reweighting 

Segmentation and Detection Losses

Dice Loss 

Dice loss is widely used in image segmentation, especially when the target region is small relative to the background. It is based on the Dice coefficient, which measures overlap between the predicted mask and the ground-truth mask: 

Dice Loss 

The corresponding loss is 

Dice Loss 

Dice loss directly optimizes overlap and is therefore well suited to imbalanced segmentation tasks. It is differentiable when soft predictions are used, but it can be sensitive to small denominators, so a smoothing constant ϵ is usually added. 

import torch

y_true = torch.tensor([1, 1, 0, 0], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.8, 0.2, 0.1], dtype=torch.float32)

eps = 1e-6

intersection = torch.sum(y_pred * y_true)
dice = (2 * intersection + eps) / (torch.sum(y_pred) + torch.sum(y_true) + eps)

dice_loss = 1 - dice

print("Dice Loss:", dice_loss.item())

IoU Loss 

Intersection over Union, or IoU, also called Jaccard index, is another overlap-based measure commonly used in segmentation and detection. It is defined as 

IoU Loss 

The loss form is 

IoU Loss 

IoU loss is stricter than Dice loss because it penalizes disagreement more strongly. It is useful when accurate region overlap is the main objective. As with Dice loss, a small constant is added for stability. 

import torch

y_true = torch.tensor([1, 1, 0, 0], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.8, 0.2, 0.1], dtype=torch.float32)

eps = 1e-6

intersection = torch.sum(y_pred * y_true)
union = torch.sum(y_pred) + torch.sum(y_true) - intersection

iou = (intersection + eps) / (union + eps)
iou_loss = 1 - iou

print("IoU Loss:", iou_loss.item())
IoU Loss 

Tversky Loss 

Tversky loss generalizes Dice and IoU style overlap losses by weighting false positives and false negatives differently. The Tversky index is 

Tversky Loss 

and the loss is 

Tversky Loss 

This makes it especially useful in highly imbalanced segmentation problems, such as medical imaging, where missing a positive region may be much worse than including extra background. The choice of α and β controls this tradeoff. 

import torch

y_true = torch.tensor([1, 1, 0, 0], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.8, 0.2, 0.1], dtype=torch.float32)

eps = 1e-6
alpha = 0.3
beta = 0.7

tp = torch.sum(y_pred * y_true)
fp = torch.sum(y_pred * (1 - y_true))
fn = torch.sum((1 - y_pred) * y_true)

tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
tversky_loss = 1 - tversky

print("Tversky Loss:", tversky_loss.item())
Tversky Loss 

Generalized IoU Loss 

Generalized IoU, or GIoU, is an extension of IoU designed for bounding-box regression in object detection. Standard IoU becomes zero when two boxes do not overlap, which gives no useful gradient. GIoU addresses this by incorporating the smallest enclosing box CCC: 

Generalized IoU Loss 

The loss is 

Generalized IoU Loss 

GIoU is useful because it still provides a training signal even when predicted and true boxes do not overlap. 

import torch

def box_area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def intersection_area(box1, box2):
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

pred_box = [1.0, 1.0, 3.0, 3.0]
true_box = [2.0, 2.0, 4.0, 4.0]

inter = intersection_area(pred_box, true_box)
area_pred = box_area(pred_box)
area_true = box_area(true_box)

union = area_pred + area_true - inter
iou = inter / union

c_box = [
    min(pred_box[0], true_box[0]),
    min(pred_box[1], true_box[1]),
    max(pred_box[2], true_box[2]),
    max(pred_box[3], true_box[3]),
]

area_c = box_area(c_box)
giou = iou - (area_c - union) / area_c

giou_loss = 1 - giou

print("GIoU Loss:", giou_loss)
Generalized IoU Loss 

Distance IoU Loss 

Distance IoU, or DIoU, extends IoU by adding a penalty based on the distance between box centers. It is defined as 

Distance IoU Loss 

where ρ2(b,bgt) is the squared distance between the centers of the predicted and ground-truth boxes, and c2 is the squared diagonal length of the smallest enclosing box. The loss is 

Distance IoU Loss 

DIoU improves optimization by encouraging both overlap and spatial alignment. It is commonly used in bounding-box regression for object detection. 

import math

def box_center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def intersection_area(box1, box2):
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

pred_box = [1.0, 1.0, 3.0, 3.0]
true_box = [2.0, 2.0, 4.0, 4.0]

inter = intersection_area(pred_box, true_box)

area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
area_true = (true_box[2] - true_box[0]) * (true_box[3] - true_box[1])

union = area_pred + area_true - inter
iou = inter / union

cx1, cy1 = box_center(pred_box)
cx2, cy2 = box_center(true_box)

center_dist_sq = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2

c_x1 = min(pred_box[0], true_box[0])
c_y1 = min(pred_box[1], true_box[1])
c_x2 = max(pred_box[2], true_box[2])
c_y2 = max(pred_box[3], true_box[3])

diag_sq = (c_x2 - c_x1) ** 2 + (c_y2 - c_y1) ** 2

diou = iou - center_dist_sq / diag_sq
diou_loss = 1 - diou

print("DIoU Loss:", diou_loss)
Distance IoU Loss 

Representation Learning Losses

Contrastive Loss 

Contrastive loss is used to learn embeddings by bringing similar samples closer together and pushing dissimilar samples farther apart. It is commonly used in Siamese networks. For a pair of embeddings with distance d and label y∈{0,1}, where y=1 indicates a similar pair, a common form is 

Contrastive Loss 

where m is the margin. This loss encourages similar pairs to have small distance and dissimilar pairs to be separated by at least the margin. It is useful in face verification, signature matching, and metric learning. 

import torch
import torch.nn.functional as F

z1 = torch.tensor([[1.0, 2.0]], dtype=torch.float32)
z2 = torch.tensor([[1.5, 2.5]], dtype=torch.float32)

label = torch.tensor([1.0], dtype=torch.float32)  # 1 = similar, 0 = dissimilar

distance = F.pairwise_distance(z1, z2)

margin = 1.0

contrastive_loss = (
    label * distance.pow(2)
    + (1 - label) * torch.clamp(margin - distance, min=0).pow(2)
)

print("Contrastive Loss:", contrastive_loss.mean().item())
Contrastive Loss 

Triplet Loss 

Triplet loss extends pairwise learning by using three examples: an anchor, a positive sample from the same class, and a negative sample from a different class. The objective is to make the anchor closer to the positive than to the negative by at least a margin: 

Triplet Loss 

where d(⋅, ⋅) is a distance function and m is the margin. Triplet loss is widely used in face recognition, person re-identification, and retrieval of tasks. Its success depends strongly on how informative triplets are selected during training. 

import torch
import torch.nn.functional as F

anchor = torch.tensor([[1.0, 2.0]], dtype=torch.float32)
positive = torch.tensor([[1.1, 2.1]], dtype=torch.float32)
negative = torch.tensor([[3.0, 4.0]], dtype=torch.float32)

margin = 1.0

triplet = torch.nn.TripletMarginLoss(margin=margin, p=2)
loss = triplet(anchor, positive, negative)

print("Triplet Loss:", loss.item())
Triplet Loss 

InfoNCE and NT-Xent Loss 

InfoNCE is a contrastive objective widely used in self-supervised representation learning. It encourages an anchor embedding to be close to its positive pair while being far from other samples in the batch, which act as negatives. A standard form is 

InfoNCE

where sim is a similarity measure such as cosine similarity and τ is a temperature parameter. NT-Xent is a normalized temperature-scaled variant commonly used in methods such as SimCLR. These losses are powerful because they learn rich representations without manual labels, but they depend strongly on batch composition, augmentation strategy, and temperature choice. 

import torch
import torch.nn.functional as F

z_anchor = torch.tensor([[1.0, 0.0]], dtype=torch.float32)
z_positive = torch.tensor([[0.9, 0.1]], dtype=torch.float32)
z_negative1 = torch.tensor([[0.0, 1.0]], dtype=torch.float32)
z_negative2 = torch.tensor([[-1.0, 0.0]], dtype=torch.float32)

embeddings = torch.cat([z_positive, z_negative1, z_negative2], dim=0)

z_anchor = F.normalize(z_anchor, dim=1)
embeddings = F.normalize(embeddings, dim=1)

similarities = torch.matmul(z_anchor, embeddings.T).squeeze(0)

temperature = 0.1
logits = similarities / temperature

labels = torch.tensor([0], dtype=torch.long)  # positive is first

loss = F.cross_entropy(logits.unsqueeze(0), labels)

print("InfoNCE / NT-Xent Loss:", loss.item())
InfoNCE

Comparison Table and Practical Guidance

The table below summarizes key properties of commonly used loss functions. Here, convexity refers to convexity with respect to the model output, such as prediction or logit, for fixed targets, not convexity in neural network parameters. This distinction is important because most deep learning objectives are non-convex in parameters, even when the loss is convex in the output. 

Loss Typical Task Convex in Output Differentiable Robust to Outliers Scale / Units
MSE Regression Yes Yes No Squared target units
MAE Regression Yes No (kink) Yes Target units
Huber Regression Yes Yes Yes (controlled by δ) Target units
Smooth L1 Regression / Detection Yes Yes Yes Target units
Log-cosh Regression Yes Yes Moderate Target units
Pinball (Quantile) Regression / Forecast Yes No (kink) Yes Target units
Poisson NLL Count Regression Yes (λ>0) Yes Not primary focus Nats
Gaussian NLL Uncertainty Regression Yes (mean) Yes Not primary focus Nats
BCE (logits) Binary / Multilabel Yes Yes Not applicable Nats
Softmax Cross-Entropy Multiclass Yes Yes Not applicable Nats
Hinge Binary / SVM Yes No (kink) Not applicable Margin units
Focal Loss Imbalanced Classification Generally No Yes Not applicable Nats
KL Divergence Distillation / Variational Context-dependent Yes Not applicable Nats
Dice Loss Segmentation No Almost (soft) Not primary focus Unitless
IoU Loss Segmentation / Detection No Almost (soft) Not primary focus Unitless
Tversky Loss Imbalanced Segmentation No Almost (soft) Not primary focus Unitless
GIoU Box Regression No Piecewise Not primary focus Unitless
DIoU Box Regression No Piecewise Not primary focus Unitless
Contrastive Loss Metric Learning No Piecewise Not primary focus Distance units
Triplet Loss Metric Learning No Piecewise Not primary focus Distance units
InfoNCE / NT-Xent Contrastive Learning No Yes Not primary focus Nats

Conclusion

Loss functions define how models measure error and learn during training. Different tasks—regression, classification, segmentation, detection, and representation learning—require different loss types. Choosing the right one depends on the problem, data distribution, and error sensitivity. Practical considerations like numerical stability, gradient scale, reduction methods, and class imbalance also matter. Understanding loss functions leads to better training and more informed model design decisions.

Frequently Asked Questions

Q1. What does a loss function do in machine learning?

A. It measures the difference between predictions and true values, guiding the model to improve during training.

Q2. How do I choose the right loss function?

A. It depends on the task, data distribution, and which errors you want to prioritize or penalize.

Q3. Why do reduction methods matter?

A. They affect gradient scale, influencing learning rate, stability, and overall training behavior.

Janvi Kumari

Hi, I am Janvi, a passionate data science enthusiast currently working at Analytics Vidhya. My journey into the world of data began with a deep curiosity about how we can extract meaningful insights from complex datasets.

Login to continue reading and enjoy expert-curated content.



Source link

Mobile Offer

🎁 You've Got 1 Reward Left

Check if your device is eligible for instant bonuses.

Unlock Now
Survey Cash

🧠 Discover the Simple Money Trick

This quick task could pay you today — no joke.

See It Now
Top Deals

📦 Top Freebies Available Near You

Get hot mobile rewards now. Limited time offers.

Get Started
Game Offer

🎮 Unlock Premium Game Packs

Boost your favorite game with hidden bonuses.

Claim Now
Money Offers

💸 Earn Instantly With This Task

No fees, no waiting — your earnings could be 1 click away.

Start Earning
Crypto Airdrop

🚀 Claim Free Crypto in Seconds

Register & grab real tokens now. Zero investment needed.

Get Tokens
Food Offers

🍔 Get Free Food Coupons

Claim your free fast food deals instantly.

Grab Coupons
VIP Offers

🎉 Join Our VIP Club

Access secret deals and daily giveaways.

Join Now
Mystery Offer

🎁 Mystery Gift Waiting for You

Click to reveal your surprise prize now!

Reveal Gift
App Bonus

📱 Download & Get Bonus

New apps giving out free rewards daily.

Download Now
Exclusive Deals

💎 Exclusive Offers Just for You

Unlock hidden discounts and perks.

Unlock Deals
Movie Offer

🎬 Watch Paid Movies Free

Stream your favorite flicks with no cost.

Watch Now
Prize Offer

🏆 Enter to Win Big Prizes

Join contests and win amazing rewards.

Enter Now
Life Hack

💡 Simple Life Hack to Save Cash

Try this now and watch your savings grow.

Learn More
Top Apps

📲 Top Apps Giving Gifts

Download & get rewards instantly.

Get Gifts
Summer Drinks

🍹 Summer Cocktails Recipes

Make refreshing drinks at home easily.

Get Recipes

Latest Posts

Don't Miss

Stay in touch

To be updated with all the latest news, offers and special announcements.