Facial Keypoint Detection with Neural Networks
By Ajay Bhargava
Nose Tip Detection
This project utilized the IMM Face Database, a set of 240 annotated facial images.
The first step was to detect the nose tip using a convolutional neural network. Example images from the dataset and their annotated nose tip are shown below.


Using the first 192 images as the training set and the final 48 images as the validation set, I trained a CNN with 3 convolutional layers, each with 20 hidden channels.
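A minimal sketch of such a network, assuming 3×3 kernels, max pooling, a 128-unit hidden layer, and grayscale input (the writeup specifies only three convolutional layers with 20 channels each):

```python
import torch.nn as nn

# Nose-tip regressor sketch. Layer details beyond "3 conv layers with
# 20 channels each" are assumptions.
class NoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 20, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 20, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(20, 20, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128), nn.ReLU(),  # infers the flattened size on first call
            nn.Linear(128, 2),              # (x, y) of the nose tip
        )

    def forward(self, x):
        return self.fc(self.convs(x))
```

The training and validation losses across the 20 epochs are shown below.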

The model performed fairly well. Shown below are two example outputs where the model performed well, along with the two outputs where it performed the worst. The red dots represent where the model predicted the nose tip, and the blue dots are the actual annotated points.




A possible explanation for the failure cases is the facial structure and orientation of the people in the images, as these affect how well the model is able to predict the nose location.
Full Facial Keypoints Detection
With the nose tip detection working, the model can be expanded to predict all of the facial keypoints for an image. There are 58 keypoints annotated in the dataset. Some examples of the images and their corresponding annotations are shown below.


The convolutional neural network I trained had 5 convolutional layers, each with between 20 and 30 channels and a kernel size of 5. This was followed by two fully connected layers that brought the output to size 58×2.
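A minimal sketch under those constraints (the exact channel progression, padding, and fully connected width are my assumptions):

```python
import torch.nn as nn

# Full-keypoint regressor sketch: 5 conv layers of 20-30 channels with
# kernel size 5, then two fully connected layers producing 58 (x, y) pairs.
class KeypointNet(nn.Module):
    def __init__(self, num_points=58):
        super().__init__()
        self.num_points = num_points
        channels = [1, 20, 22, 25, 28, 30]  # assumed progression within 20-30
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=5, padding=2),
                       nn.ReLU(), nn.MaxPool2d(2)]
        self.convs = nn.Sequential(*layers)
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(256), nn.ReLU(),
            nn.Linear(256, num_points * 2),
        )

    def forward(self, x):
        # Reshape the flat output into (batch, 58, 2) keypoint coordinates.
        return self.fc(self.convs(x)).view(-1, self.num_points, 2)
```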
The loss across the epochs is shown below.

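For reference, a minimal sketch of the training loop used throughout this project (the MSE loss, Adam optimizer, and learning rate are my assumptions; the writeup does not state them):

```python
import torch

def train(model, train_loader, val_loader, epochs=20, lr=1e-3):
    """Sketch of a coordinate-regression training loop. The loaders are
    assumed to yield (image, keypoints) batches; the loss and optimizer
    choices are assumptions, not stated in the writeup."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()
    for epoch in range(epochs):
        model.train()
        for images, keypoints in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), keypoints)
            loss.backward()
            optimizer.step()
        # Track validation loss without updating weights.
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
        print(f"epoch {epoch}: val loss {val_loss / len(val_loader):.4f}")
```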
Some examples of successful detections, and some unsuccessful detections, are shown below. The unsuccessful ones may be caused by more unusual poses, orientations, or facial features.




The following are some examples of what the filters look like – specifically in the first convolutional layer:

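These filter images can be read straight off the trained network; a quick sketch, assuming the KeypointNet class above (whose first convolutional layer is model.convs[0], with 20 filters):

```python
import matplotlib.pyplot as plt

# Visualize the learned first-layer filters. Assumes `model` is a
# trained KeypointNet instance from the sketch above.
weights = model.convs[0].weight.detach().cpu()  # shape: (20, 1, 5, 5)
fig, axes = plt.subplots(4, 5, figsize=(10, 8))
for ax, w in zip(axes.flat, weights):
    ax.imshow(w[0], cmap="gray")  # each filter's single input channel
    ax.axis("off")
plt.show()
```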
Training on a Larger Set
The above examples used only a small set of data for training and testing. Training a larger model on a larger dataset should yield better results. This section used a dataset of annotated faces from the Intelligent Behaviour Understanding Group (iBUG) at Imperial College London. It contains over 6,000 annotated faces, captured from all angles and against varied backgrounds. The following are some example images with their annotations from the dataset:



The model I used in this section was a premade architecture in PyTorch, ResNet-50, a large convolutional network that is 50 layers deep. It was only slightly altered to take in the correct size images and output the correct number of points for the facial annotations.
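A sketch of the kind of change involved (single-channel input and the 68-point iBUG annotation format are my assumptions):

```python
import torch.nn as nn
import torchvision.models as models

# Adapt ResNet-50 for keypoint regression (a sketch; grayscale input
# and 68 keypoints per face, the iBUG annotation count, are assumptions).
model = models.resnet50(weights=None)
# Replace the stem to accept single-channel images instead of 3-channel RGB.
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Replace the classification head to regress 68 (x, y) coordinate pairs.
model.fc = nn.Linear(model.fc.in_features, 68 * 2)
```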
The loss for training this model is shown below.

Below are some results from the model, both on images from the dataset and on my own images that I gave to the model.






Pixelwise Classification
Another task that neural networks can accomplish is pixelwise classification: predicting how likely each pixel in an image is to be a keypoint on the face. I again used the ResNet-50 model, this time predicting these per-pixel likelihoods.
To create the training targets, a 2D Gaussian with a standard deviation of 12 was placed around each keypoint, and these values were summed to generate the heatmaps; a sketch of this step follows. Results are shown below, both on images from the dataset and on my own.
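A minimal sketch of the heatmap construction (the rendering details are my assumptions; only the per-keypoint Gaussian with standard deviation 12 and the summing are stated above):

```python
import torch

def keypoint_heatmap(keypoints, height, width, sigma=12.0):
    """Sum one 2D Gaussian per keypoint into a single target heatmap.

    keypoints: (N, 2) tensor of (x, y) pixel coordinates.
    Returns a (height, width) heatmap.
    """
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)
    heatmap = torch.zeros(height, width)
    for x, y in keypoints:
        # Gaussian bump centered on this keypoint, added into the map.
        heatmap += torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmap
```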





