IBM to Release Huge Facial Analytics Dataset For Studying Bias

This fall, IBM will make publicly available a facial attribute and identity training dataset of over 1 million images, built by IBM Research scientists, to help improve the training of facial analysis systems.

The dataset will be annotated with attributes and identity, using geo-tags from Flickr images to balance data from multiple countries and active learning tools to reduce sample selection bias. Currently, the largest facial attribute dataset available contains 200,000 images, so a dataset of a million images will be a substantial improvement. Additionally, datasets available today include either attributes (hair color, facial hair, etc.) or identity (identifying that 5 images are of the same person), but not both. The new dataset combines the two, making it possible to match attributes to an individual within a single dataset.

IBM will also release a dataset of 36,000 facial images, equally distributed across all ethnicities, genders, and ages, to provide a more diverse dataset for people to use in evaluating their technologies. This will specifically help algorithm designers identify and address bias in their facial analysis systems. The first step in addressing bias is knowing that it exists, and that is what this dataset will enable.
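To illustrate how a balanced evaluation set can surface bias, here is a minimal sketch in Python that compares a classifier's error rate across demographic groups. It is not IBM's method or tooling: the function names, the group labels, and the toy records are all hypothetical, and a real evaluation would use a system's actual predictions on the balanced images.

```python
from collections import defaultdict

def error_rate_by_group(records):
    """Return {group: error rate} for (group, true_label, predicted_label) records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, predicted in records:
        totals[group] += 1
        if truth != predicted:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

def max_disparity(rates):
    """Gap between the worst and best per-group error rates."""
    return max(rates.values()) - min(rates.values())

# Hypothetical toy data: equal numbers of images per group, as in a balanced set.
records = [
    ("group_a", "female", "female"),
    ("group_a", "male", "male"),
    ("group_a", "female", "female"),
    ("group_a", "male", "male"),
    ("group_b", "female", "male"),
    ("group_b", "male", "male"),
    ("group_b", "female", "female"),
    ("group_b", "male", "female"),
]

rates = error_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: error rate {rate:.2f}")
print(f"largest gap between groups: {max_disparity(rates):.2f}")
```

Because each group contributes the same number of images, a gap in error rates reflects the system's behavior rather than the makeup of the test set.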

Earlier this year, IBM improved the accuracy of the Watson Visual Recognition service for facial analysis, demonstrating a nearly ten-fold decrease in error rate. On Sept 14, 2018, IBM Research, in collaboration with the University of Maryland, will hold a technical workshop on identifying and reducing bias in facial analysis, in conjunction with ECCV 2018. IBM will announce the results of the competition using the IBM facial image dataset at the workshop.

Society is paying more attention than ever to the question of bias in artificial intelligence systems, particularly in those used to recognize and analyze images of faces.

"AI holds significant power to improve the way we live and work, but only if AI systems are developed and trained responsibly, and produce outcomes we trust. Making sure that the system is trained on balanced data, and rid of biases is critical to achieving such trust," says IBM.