Navigating the Potholes in Demographic Data ML Models


Machine learning algorithms can considerably improve the accuracy of human decision-making. Such algorithms take multiple parameters into account to analyse a problem and reach a decision.

Cognitive biases influence human decisions. And when fallible humans build machine learning models, biases creep into algorithms. The decisions made by biased machines can have far-reaching consequences. The repercussions are more severe when models use demographic data like age, gender, race, or zip codes since they can impact communities as a whole.

by Kashyap Raibagi

In this article, we try to analyse the use of demographic data when building ML models to produce fair AI.


Machine learning is used to develop decision-making systems across sectors. Models could help with diagnosis in healthcare, perform market segmentation in retail, or estimate the risk of recidivism in criminal justice.

In some cases, demographic data is essential, especially when building diagnosis or prediction models in healthcare. For many illnesses, age, gender or socio-demographic factors like income or neighbourhood become crucial decision-making parameters.

For instance, age is a risk factor for many diseases, including cancer and cardiovascular conditions. Gender can play an important role in obesity dimorphism or coronary artery disease. Economically weaker neighbourhoods are at a higher risk of infectious diseases like dengue or tuberculosis.

However, introducing demographic characteristics has led to discrimination against people, predominantly those from minority or socio-economically weaker communities. For instance, a recidivism model used in the US consistently rated Black people as a higher risk than white people, even when the former’s crimes were significantly less severe.

In another instance, Amazon’s recruitment model did not rate candidates in a gender-neutral way, as the model was trained on resumes that came mostly from men. This resulted in the system penalising resumes that contained the word ‘women’.

The discrimination stems from introducing demographic details in models, which then reflect the inherent bias in human beings. From an ethical perspective, using demographic information to make decisions, like assigning recidivism scores based on race or allocating a bank loan, is prejudicial.

Biases can also sneak in through demographic details without the machine learning developer’s knowledge, as in Amazon’s recruitment model. Developers should therefore take extra caution when deploying such an algorithm in the real world. Third-party audits should be compulsory for any algorithm that makes decisions about human beings.
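One check such an audit might include is a comparison of how often a model selects candidates from each group. The snippet below is a minimal sketch of that idea, not a description of any real audit procedure. The function names and the audit log are invented for illustration, and the 0.8 threshold is the 'four-fifths' rule of thumb from US employment guidelines, used here only as an assumed warning level.

```python
from collections import defaultdict

def selection_rates(decisions):
    """Return the share of positive decisions per group.

    `decisions` is an iterable of (group, selected) pairs, where
    `selected` is True when the model recommended the candidate.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, selected in decisions:
        totals[group] += 1
        positives[group] += int(selected)
    return {g: positives[g] / totals[g] for g in totals}

def disparate_impact_ratio(rates):
    """Lowest selection rate divided by the highest one."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was selected, so no group is favoured
    return min(rates.values()) / highest

# Hypothetical audit log: invented numbers, not data from any real system.
audit_log = (
    [("men", True)] * 60 + [("men", False)] * 40
    + [("women", True)] * 30 + [("women", False)] * 70
)

rates = selection_rates(audit_log)
ratio = disparate_impact_ratio(rates)
WARNING_THRESHOLD = 0.8  # assumed rule-of-thumb level, not a legal test

print(rates)
print(f"disparate impact ratio: {ratio:.2f}")
if ratio < WARNING_THRESHOLD:
    print("Selection rates differ enough to warrant a closer look.")
```

A low ratio does not prove discrimination on its own, but it tells the auditor where to look first.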

Handle With Care

On the flip side, some machine learning models have shown the need to use inclusive demographic representation to mitigate bias.

For instance, Timnit Gebru, the AI ethicist who was recently fired from Google, co-authored a paper in 2018 that found significant disparities in facial recognition systems developed by Big Tech companies. The study revealed that all the classifiers tested performed best for lighter-skinned men and worst for darker-skinned women. Flawed facial recognition systems have also led to a Black US citizen being wrongly arrested due to misidentification.

While the question of whether algorithms like facial recognition should be deployed in the first place is outside this article’s scope, the study showed that algorithm development needs more inclusivity and more analysis of features specific to demography; racial traits in this case.
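Disparities of this kind stay hidden when a model is judged by a single overall accuracy figure. The following is a minimal sketch of a per-group evaluation, assuming each test record carries a subgroup tag. The group names, labels and counts are invented for illustration and are not taken from the study.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Compute classification accuracy separately for each subgroup.

    `records` is an iterable of (subgroup, true_label, predicted_label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subgroup, truth, predicted in records:
        total[subgroup] += 1
        correct[subgroup] += int(truth == predicted)
    return {g: correct[g] / total[g] for g in total}

# Hypothetical evaluation results: invented counts for two subgroups.
records = (
    [("group_a", 1, 1)] * 99 + [("group_a", 1, 0)] * 1
    + [("group_b", 1, 1)] * 70 + [("group_b", 1, 0)] * 30
)

per_group = accuracy_by_group(records)
overall = sum(truth == pred for _, truth, pred in records) / len(records)
gap = max(per_group.values()) - min(per_group.values())

print(f"overall accuracy: {overall:.3f}")
for group, acc in per_group.items():
    print(f"{group}: {acc:.2f}")
print(f"largest gap between groups: {gap:.2f}")
```

In this made-up example the overall figure looks respectable, while the breakdown shows that one group is served far worse than the other.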

Inclusive demographic data could help mitigate bias, but the decision as to when and how to use it for bias mitigation is critical. Partnership on AI addressed such concerns in a report in 2020.

The first concern is how demographic data should be defined. While the US and the EU have made an effort to categorise demographic data as ‘protected class data’ or ‘sensitive personal data’, many countries, including India, have weak data protection laws. In such cases, collecting demographic data might do more harm than good.

Further, decision-makers should be careful that their approach to mitigating bias is not itself biased. For instance, self-selection bias (collecting data only from those who want to give it to you) can compound the problem.
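One simple way to spot self-selection is to compare the make-up of the collected sample with a trusted reference, such as census figures. The sketch below assumes such reference shares are available. The age bands, shares and respondent counts are invented for illustration.

```python
from collections import Counter

def representation_gaps(sample_groups, population_shares):
    """Return sample share minus population share for each group.

    Positive values mean the group is over-represented in the sample,
    negative values mean it is under-represented.
    """
    if not sample_groups:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {
        group: counts.get(group, 0) / n - share
        for group, share in population_shares.items()
    }

# Hypothetical reference shares and voluntary survey respondents.
population_shares = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}
respondents = ["18-34"] * 50 + ["35-54"] * 35 + ["55+"] * 15

gaps = representation_gaps(respondents, population_shares)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

Large gaps suggest that the people who volunteered their data differ from the population the model is meant to serve, so any bias correction built on that sample should be treated with caution.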

Lastly, once the data is collected, it is essential to ensure that it is used towards the original objective.

Wrapping Up

Some models present the absolute need for demographic details, especially in healthcare. In such cases, extra caution should be applied to mitigate biases. Further, we need stricter policies and regulations to enable the fair use of demographic data to build ML models or mitigate biases in existing models.

Kashyap currently works as a Tech Journalist at Analytics India Magazine (AIM).

This article originally appeared in Analytics India magazine. Photo by Marc-Olivier Jodoin on Unsplash.
