DisturbLabel is yet another regularization technique that prevents overfitting. Unlike other regularization methods such as *weight decay*, *dropout*, or *DropConnect*, *DisturbLabel* regularizes the loss layer. The TL;DR of the CVPR 2016 paper is: **deliberately add noise (wrong labels) to a small portion of the ground-truth labels in each minibatch during training, and it helps against overfitting.** As a result, the generalization of the network improves, and so does the test-set accuracy.

Adding label noise contributes to: 1. A noisy loss. 2. Noisy gradients during backpropagation.

The authors state that DisturbLabel is equivalent to combining a large number of models, each trained with differently noised data.

## Algorithmic details:

DisturbLabel operates on each minibatch independently.

In other words, at each minibatch, DisturbLabel randomly draws a new label for each sample from a Multinoulli distribution $P(\alpha)$:

If $c$ is the ground-truth class of the sample, then $P(\alpha)$ is defined as:

$$ p_c(\alpha) = 1 - \frac{C - 1}{C}\,\alpha $$

$$ p_i(\alpha) = \frac{\alpha}{C} \quad \text{for } i \neq c $$

in which $\alpha$ is the noise rate and $C$ is the total number of classes. For each sample, the new label is drawn from $P(\alpha)$, which puts most of its mass on the ground-truth class. Here is an example for one sample with $C = 10$ classes and $\alpha = 0.1$:

$$
\text{ground-truth label} = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
$$
$$
P(\alpha) = [0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
$$
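Under these definitions, the example distribution above can be built directly in NumPy (a small sketch; the variable names are mine, not from the paper):

```python
import numpy as np

# Build P(alpha) for the example above: C = 10 classes,
# ground-truth class c = 2 (zero-indexed), noise rate alpha = 0.1.
C, alpha, c = 10, 0.1, 2

p = np.full(C, alpha / C)          # p_i = alpha / C for every class
p[c] = 1.0 - (C - 1) / C * alpha   # the true class gets the remaining mass

print(np.round(p, 2).tolist())
# [0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
```

Note that the probabilities sum to 1, so $P(\alpha)$ is a valid distribution.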
So the new label for that particular sample is drawn from $P(\alpha)$, e.g. with `numpy.random.choice`. Consequently, the label will match the ground truth most of the time, but occasionally a sample is assigned a wrong label. In other words, if $\alpha = 0.1$, you would expect roughly 10% of the labels in each minibatch to be redrawn (slightly fewer actually end up wrong, since the redrawn label can coincide with the ground truth: the expected wrong-label fraction is $\frac{C-1}{C}\alpha = 9\%$ here).
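Sampling from $P(\alpha)$ is equivalent to keeping the true label with probability $1 - \alpha$ and otherwise drawing a class uniformly at random (which may happen to be the true class again). A minimal per-minibatch sketch in NumPy, assuming integer class labels; the function name and signature are mine, not from the paper:

```python
import numpy as np

def disturb_labels(labels, num_classes, alpha, rng=None):
    """Return a disturbed copy of one minibatch's integer labels.

    Equivalent to redrawing each label from P(alpha): keep the true
    label with probability 1 - alpha, otherwise draw a class uniformly
    at random (the uniform draw may coincide with the true label).
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    disturb = rng.random(labels.shape) < alpha           # which samples to disturb
    random_labels = rng.integers(0, num_classes, size=labels.shape)
    return np.where(disturb, random_labels, labels)

# Example: disturb a minibatch of 8 labels with alpha = 0.1.
batch = np.array([3, 1, 4, 1, 5, 9, 2, 6])
print(disturb_labels(batch, num_classes=10, alpha=0.1,
                     rng=np.random.default_rng(0)))
```

With $\alpha = 0$ this is a no-op, so the usual training loop is recovered by a single hyperparameter.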

## Results:

### Effect of noise:

- The algorithm performs better with small noise rates such as $\alpha = 10\%$ or $20\%$. Its performance degrades for larger noise rates such as $\alpha = 50\%$.