Demystifying Face Recognition III: Noise

In recent years many datasets for Face Recognition have been introduced, to name several:

The main difference between them is the number of images and identities, but they also differ in something like a signal-to-noise ratio, by which we mean the ratio of correctly labeled images to all collected images. We do not know this value exactly for any of the provided databases, but some researchers have published clean-lists: subsets of images verified to be correctly labeled (by a human or by a face-recognition algorithm). For example, CASIA-WebFace has ~90% correct images, and VGG also has ~90% (assuming that 15% of its noisy labels are real noise). But what does this mean for the accuracy of a Face Recognition model? Does noise really hurt performance in the verification/identification protocols?
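To make the "signal-to-noise ratio" concrete, here is a minimal sketch of how purity could be estimated from a published clean-list; the function name and file-name format are our own illustration, not part of any dataset's tooling:

```python
def label_purity(all_images, clean_list):
    """Fraction of collected images that appear on the verified clean-list."""
    clean = set(clean_list)
    return sum(1 for img in all_images if img in clean) / len(all_images)

# Toy numbers: 10 collected images, 9 verified correct -> 90% purity,
# roughly the figure reported for CASIA-WebFace.
images = ["id_%04d.jpg" % i for i in range(10)]
clean = images[:9]
print(label_purity(images, clean))  # 0.9
```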

Noisy Database

First of all, let's do a short literature review.
One of the best-known publications about noise in training data for Deep Learning is The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition. The idea was simple: instead of labeling pictures manually, use information gathered from the Web (metadata and search engines). This made it possible to collect many more images than existed in the training datasets of several benchmarks. The results indicate that training on the original images plus the web images always yields higher final accuracy, despite the fact that the web data is noisy.

The first public face-recognition database with a significant amount of noise was VGG. We will describe their method in more detail. They collected 5M images from both Google and Bing Image Search. Then, using a Machine Learning algorithm with a Fisher Vector Faces descriptor, half of the images were removed. The last stage was manual filtering, which took 10 days (really hard work) and ended with ~1M images. Additionally, the images from before the manual filtering were also released, which brings the total to 2.6M. In their experiments, the Oxford team used both datasets (named full and curated), and the bigger one led to better results in all of their experiments, although it contains more noisy images than the curated one. Before jumping to any conclusion, let's take a closer look at the manual filtering process by reading the paper:

The aim of this final stage is to increase the purity (precision) of the data using human annotations. However, in order to make the annotation task less burdensome and hence avoid high annotation costs, annotators are aided by using automatic ranking once more. This time, however, a multi-way CNN is trained to discriminate between the 2,622 face identities using the AlexNet architecture; then the softmax scores are used to rank images within each identity set by decreasing likelihood of being an inlier. In order to accelerate the work of the annotators, the ranked images of each identity are displayed in blocks of 200 and annotators are asked to validate blocks as a whole. In particular, a block is declared good if approximate purity is greater than 95%.

So filtering was done on blocks of images, which were accepted or rejected based on their purity. It means that if a block contains even a little noise (more than 5%), the whole block is removed. To sum up, within the 1.6M "noisy" images there could be as many as ~1.4M correct labels. The situation is similar for other datasets like CASIA or MsCeleb: we are not sure that the noise is really noise, because most of the time human annotators do not inspect each image independently. Still, for both of them the researchers provided a list of clean images, obtained by removing the noise manually (CASIA) or automatically (MsCeleb).
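The ~1.4M figure follows from a back-of-envelope calculation. A block of 200 is only rejected when its purity falls below 95%, so rejected blocks can still be mostly correct; the average purity of rejected blocks assumed below is our own guess, not a number from the paper:

```python
# Back-of-envelope: how many correct labels may hide in the rejected part of VGG.
# Blocks are rejected when purity < 95%; assume rejected blocks still average
# ~87.5% purity (our assumption, not a figure reported by the authors).
rejected_images = 1_600_000
assumed_purity = 0.875
hidden_correct = int(rejected_images * assumed_purity)
print(hidden_correct)  # 1400000, i.e. ~1.4M correct labels in the "noisy" split
```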

Experiments on CASIA-WebFace

The first test will be based on the UMD database (as a very clean database) and different versions of the CASIA-WebFace database:

The idea behind using these databases is to test the algorithm on a very clean but smaller database (UMD) versus a bigger and noisier database with some cleaning effort put into it. Is it really worth cleaning a dataset, which costs a lot of hard manual work?

For training the network, we will use the same settings as for the baseline. Here are the results.

There is no clear conclusion from these experiments. First of all, we can see that the delivered clean-lists for CASIA-WebFace really do have lower noise, because networks trained on them have lower loss and higher validation accuracy. But that is the main difference: all networks achieve similar scores on the LFW and BLUFR benchmarks. The highest score is achieved by the cleanest one, clean-v2, but the difference is only ~1%. Maybe if the dataset contained more images, a bigger change would be noticeable. The UMD database, despite being very clean, achieves the lowest score. From this set of experiments it looks like there is no big advantage in cleaning a dataset if we assume that no more than 10% of the labels are incorrect.

Experiments on VGG-Database

As the first experiments did not lead to any clear conclusion, let's use the VGG database. As the number of noisy labels in VGG is much bigger, here we will conduct experiments using different ratios of noisy labels (precisely: images labeled as noisy to all images in the training set). We hope this will show how different amounts of noisy labels impact the final accuracy (remembering that in the case of VGG the noisy labels may contain only 10-15% real noise). For coherence of results, the same validation dataset is used in each experiment.
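A training set with a given noisy-label ratio can be built by topping up the curated images with a sample from the noisy pool. The helper below is a hypothetical sketch of that bookkeeping, not the original experiment code:

```python
import random

def mix_noisy(clean, noisy, noise_ratio, seed=0):
    """Build a training set in which `noise_ratio` of the images come from
    the pool labeled as noisy (hypothetical helper)."""
    rng = random.Random(seed)
    # Solve n_noisy / (n_clean + n_noisy) = noise_ratio for n_noisy.
    n_noisy = int(len(clean) * noise_ratio / (1.0 - noise_ratio))
    return clean + rng.sample(noisy, min(n_noisy, len(noisy)))

curated = ["curated_%d.jpg" % i for i in range(900)]
flagged = ["flagged_%d.jpg" % i for i in range(600)]
train = mix_noisy(curated, flagged, noise_ratio=0.25)
print(len(train))  # 1200 images, 300 (25%) of them from the noisy pool
```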

The conclusion from the above experiment is the same as in the original VGGFace paper: a large dataset is more important than a low level of label noise. However, we do not know the real amount of noise in the VGG dataset. Let's check this under controlled conditions. We will always use a dataset of the same size (0.75M images), but with a different percentage of real noise in the labels (so this simulates the situation where we have a smaller training set plus noisy data). We hope this experiment will let us establish a range of noise that is not harmful to the final model accuracy.
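The controlled setup amounts to flipping a known fraction of labels while keeping the dataset size fixed. A minimal sketch of such label-noise injection (the interface is our assumption):

```python
import random

def inject_label_noise(labels, noise_frac, n_classes, seed=0):
    """Reassign `noise_frac` of the labels to a different random class while
    keeping the dataset size fixed (sketch of the controlled setup)."""
    rng = random.Random(seed)
    labels = list(labels)
    for i in rng.sample(range(len(labels)), int(len(labels) * noise_frac)):
        wrong = rng.randrange(n_classes - 1)
        if wrong >= labels[i]:
            wrong += 1  # make sure the new label really differs from the old one
        labels[i] = wrong
    return labels

original = [i % 10 for i in range(1000)]
corrupted = inject_label_noise(original, noise_frac=0.15, n_classes=10)
print(sum(o != c for o, c in zip(original, corrupted)))  # 150 flipped labels
```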

First of all, it looks like the k-fold idea should be used to get meaningful results: relying on just one model makes the result more random (maybe this is connected with the % of noise in the data?). From the current results we can see that even 15% of noise gives only a slightly lower score than the pure dataset. Based on the VGG paper, this is more than they estimate for their own dataset.
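The k-fold idea here simply means training several models on different folds and averaging the metric, so a single lucky or unlucky run does not dominate. A sketch of that loop, with a dummy evaluator standing in for actual network training (the interface is our assumption):

```python
from statistics import mean, stdev

def kfold_metric(train_and_eval, data, k=5):
    """Train k models on k train/held-out splits and report mean and std of
    the metric, smoothing out run-to-run randomness."""
    fold = len(data) // k
    scores = []
    for i in range(k):
        held_out = data[i * fold:(i + 1) * fold]
        train = data[:i * fold] + data[(i + 1) * fold:]
        scores.append(train_and_eval(train, held_out))
    return mean(scores), stdev(scores)

# Dummy evaluator so the sketch runs; a real one would train a face network
# and return its verification accuracy.
dummy = lambda train, test: len(train) / (len(train) + len(test))
m, s = kfold_metric(dummy, list(range(100)), k=5)
print(m, s)  # 0.8 0.0
```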

Experiments using Real Noise

We decided to carry out one last experiment. Using the cleanest version of CASIA-WebFace, we will inject real noise into the data, with much more noise than before. The noise will be formed from images taken from the VGG database with random labels. As we would like to have a small amount of noise in the base training data, we will use the clean-v2 list. The idea behind it is making the dataset bigger with just pure noise: how does this affect the final accuracy?
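Unlike the earlier mixing experiment, here the added images come from a different database entirely and get random labels from the existing identities, so they are pure noise by construction. A hypothetical sketch of that injection step:

```python
import random

def add_pure_noise(base, distractors, identities, noise_ratio, seed=0):
    """Grow `base` with distractor images carrying random existing labels,
    so that `noise_ratio` of the final set is pure noise (hypothetical helper)."""
    rng = random.Random(seed)
    n_noise = int(len(base) * noise_ratio / (1.0 - noise_ratio))
    noise = [(img, rng.choice(identities))
             for img in rng.sample(distractors, n_noise)]
    return base + noise

casia = [("casia_%d.jpg" % i, i % 100) for i in range(900)]
vgg = ["vgg_%d.jpg" % i for i in range(5000)]
train = add_pure_noise(casia, vgg, identities=list(range(100)), noise_ratio=0.1)
print(len(train))  # 1000 samples, 100 of them (10%) pure noise
```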

Finally, these experiments provide us with a clear conclusion: real noise, even in small amounts, hurts the final accuracy. Although the model with 10% noise achieves a better score on LFW, the BLUFR protocol shows a bigger difference (but both models are comparable). So we can conclude that it is good to have less than 10% noise. A strange situation also happened at a noise ratio of 50%: the network diverged, so we had to lower the learning rate.

On the other hand, it is really intriguing that even with 80% noisy labels the network can still learn something useful. A similar result was achieved by the researchers in the paper Deep Learning is Robust to Massive Label Noise, where the experiments went as high as 99% noise. The researchers draw some very bold conclusions:

  • Deep neural networks are able to learn from data that has been diluted by an arbitrary amount of noise.
  • A sufficiently large training set is more important than a lower level of noise.
  • Choosing good hyperparameters can allow conventional neural networks to operate in the regime of very high label noise.

For sure Deep Learning is amazing at handling noisy data, but we cannot agree with the second conclusion when we already own ~0.5M data examples. Having 0.4M images with clean labels is better than any database with a larger number of images but with noise. Another advantage of a clean, small database is faster convergence (as there are fewer training examples). On the other hand, when we are collecting our own dataset it is inevitable that it will contain noisy labels. Our experiments show that the data should not have more than 10% real noise.


Summarizing the above experiments is harder than we expected. It looks like there is no clear conclusion from them, because the impact of noise depends on many variables (e.g. data size, number of identities). We would need many more experiments in a controlled environment to really analyse the impact of noisy labels on the final model accuracy. But here are our thoughts:

  • the training dataset should not contain more than 10% real noise
  • when we can easily gather a dataset with less than 10% noise, it is worth doing (it may even be a better option than cleaning the dataset)
  • making a dataset bigger by just adding random images with random labels is not a good idea

OK, so let's test another aspect of preparing a dataset for Face Recognition. The next question we will ask about the pipeline is:
How should the input image be aligned before feeding it to the network?