Image credit: Will Myers

A little bit of label noise won't hurt


The DisturbLabel paper proposes yet another regularization technique to prevent overfitting. Unlike other regularization methods such as weight decay, dropout, or DropConnect, DisturbLabel regularizes the loss layer. The TL;DR of this CVPR 2016 paper: deliberately add noise (wrong labels) to a small portion of the ground-truth labels in each minibatch during training, which helps prevent overfitting. As a result, the generalization of the network improves and test-set accuracy increases.

Adding label noise contributes to: 1. a noisy loss, and 2. noisy gradients during backpropagation.

The authors state that DisturbLabel is equivalent to combining a large number of models trained with different noisy data.

Algorithmic details:

DisturbLabel operates on each minibatch independently.

In other words, at each minibatch, DisturbLabel randomly generates a label for each sample from a Multinoulli distribution $P(\alpha)$:

If $c$ is the ground-truth class for the sample, then $P(\alpha)$ is defined as:

$$ p_c(\alpha) = 1 - \frac{C - 1}{C}\alpha $$

$$ p_i(\alpha) = \frac{\alpha}{C}, \quad i \neq c $$

in which $\alpha$ is the noise rate and $C$ is the total number of classes. For each sample, a new label is then drawn from $P(\alpha)$: most of the probability mass stays on the true class, and the remainder is spread evenly over all classes. Here is an example for one sample with $C = 10$ classes and $\alpha = 0.1$:
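The distribution above can be built directly from the two formulas. This is a minimal sketch (the function name is mine, not from the paper):

```python
import numpy as np

def disturb_distribution(c, C, alpha):
    """Multinoulli distribution P(alpha) for ground-truth class c.

    Every class gets probability alpha / C; the true class keeps
    the rest, 1 - (C - 1) / C * alpha.
    """
    p = np.full(C, alpha / C)
    p[c] = 1.0 - (C - 1) / C * alpha
    return p

# C = 10 classes, noise rate alpha = 0.1, ground-truth class c = 2:
# the true-class entry gets 0.91, every other class 0.01.
print(disturb_distribution(2, 10, 0.1))
```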

$$ \text{ground-truth label} = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] $$

$$ P(\alpha) = [0.01, 0.01, 0.91, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01] $$

So the new label for that particular sample is drawn from $P(\alpha)$ with a command like `numpy.random.choice`. Consequently, the label will match the ground-truth label most of the time, but occasionally during training a sample is assigned a wrong label. In other words, with $\alpha = 0.1$ (a 10% noise rate) you would expect roughly 10% of the labels in each minibatch (more precisely, $\frac{C-1}{C}\alpha = 9\%$ here) to be replaced by a wrong one.
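Putting the pieces together, the whole per-minibatch step can be sketched in a few lines of numpy. This is my own illustration of the procedure described above, not the authors' code:

```python
import numpy as np

def disturb_labels(labels, C, alpha, seed=None):
    """Replace each label in a minibatch with a draw from its P(alpha).

    labels: integer class indices, shape (N,).
    C: total number of classes; alpha: noise rate (e.g. 0.1 for 10%).
    """
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    for i, c in enumerate(labels):
        # Build the Multinoulli distribution for this sample's true class.
        p = np.full(C, alpha / C)
        p[c] = 1.0 - (C - 1) / C * alpha
        # Draw the (possibly disturbed) label.
        noisy[i] = rng.choice(C, p=p)
    return noisy
```

Calling `disturb_labels` on each minibatch before computing the loss reproduces the behavior described above: with $C = 10$ and $\alpha = 0.1$, about 9% of the labels come out different from the ground truth.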

## Results:

### Effect of noise:

- The algorithm performs better for small noise rates such as $\alpha = 10\%$ or $20\%$.
- Performance degrades for larger noise rates such as $\alpha = 50\%$.

Amir Hossein Farzaneh
Computer Vision, Deep Learning, and Artificial Intelligence Researcher