Posted on April 4, 2023 by Codec Team

Addressing the problem of data bias in AI models


AI has been used by humans for decades. The field first emerged in the 1950s, and its usage has grown massively in recent years, thanks in part to popular programmes such as ChatGPT and DALL-E.

At Codec, we have been working in AI for over 8 years, so our own experiences have shaped how we view the tool. Now, with seemingly all eyes on AI, more people are starting to become aware of its limitations, alongside its uses.

Like all tech developments, AI is going to have teething problems. While great things are seemingly just on the horizon, we took a look at one of the main areas currently being debated - the problem of bias. 

In order to gain its 'intelligence', AI is trained on very large datasets. The larger the dataset, the more accurate (in theory) your AI will be. But simply ensuring that a data source is large does not mean it is reliable, or more likely to be free from bias.

Machine learning models can end up perpetuating dominant viewpoints found in the dataset, further encoding these biases and perspectives. This can then lead to increased power imbalances and social inequality. 

For image-based applications like DALL-E, built on computer vision models, the datasets available for training are dominated by white faces. This leads to poor model accuracy, particularly for people from non-white ethnic groups. As these models are used more and more, from surveillance to security, the negative impact of this inaccuracy on people continues to grow.

Similarly, text-based applications like ChatGPT are built on Large Language Models (LLMs), which are trained on vast text corpora. LLMs are often considered to be black boxes - it's hard to tell what viewpoints or references are captured within them, and so it's impossible to guarantee that the text they generate is free from harmful, biased or prejudiced content.

So - what can be done?

Preventing data bias in AI requires a multi-faceted approach. Firstly, it's essential to ensure that the data used to train AI models is representative of the diverse populations those models are intended to serve. This can be achieved through collecting and labelling data that includes a broad range of demographics and social backgrounds (which - when training huge models - can be easier said than done).
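A first step towards checking representativeness can be as simple as comparing the demographic mix of a dataset against reference population shares. The sketch below is illustrative only - the group labels and reference shares are hypothetical, and real demographic auditing is considerably more involved:

```python
from collections import Counter

def representation_gap(samples, reference_shares):
    """Compare a dataset's group mix against reference population
    shares, returning the absolute gap for each group."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {
        group: abs(counts.get(group, 0) / total - share)
        for group, share in reference_shares.items()
    }

# Hypothetical group labels attached to training samples
samples = ["a"] * 70 + ["b"] * 20 + ["c"] * 10
reference = {"a": 0.5, "b": 0.3, "c": 0.2}
gaps = representation_gap(samples, reference)
# group "a" is over-represented; "b" and "c" are under-represented
```

Flagging groups whose gap exceeds a chosen threshold gives a cheap early warning before any model is trained.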

Additionally, it's crucial to implement thorough testing and evaluation processes to detect and mitigate bias in the data and the AI models. Again - as the size of AI models increases rapidly, so do the time and resources needed to carry this out thoroughly.
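One common evaluation of this kind is to break model accuracy down by group and look at the disparity, rather than relying on a single aggregate score. A minimal sketch, using made-up labels and predictions:

```python
def accuracy_by_group(y_true, y_pred, groups):
    """Accuracy computed separately for each group label - a simple
    first check for performance disparities across groups."""
    stats = {}
    for t, p, g in zip(y_true, y_pred, groups):
        correct, total = stats.get(g, (0, 0))
        stats[g] = (correct + (t == p), total + 1)
    return {g: c / n for g, (c, n) in stats.items()}

# Hypothetical predictions tagged with a group attribute
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0]
groups = ["x", "x", "x", "y", "y", "y"]
per_group = accuracy_by_group(y_true, y_pred, groups)
gap = max(per_group.values()) - min(per_group.values())
```

A large gap between the best- and worst-served group is exactly the kind of signal an evaluation pipeline should surface before deployment.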

Ensuring a diverse development team is crucial for the creation of fair and unbiased AI systems. Your team's perspectives and experiences will play a direct role in the design and development of AI systems. A diverse team can bring a wide range of perspectives and problem-solving approaches, helping to identify potential biases and blind spots that may not be immediately obvious.

Codec’s approach to helping eliminate unwanted data bias 

Codec does not train or apply our AI models on sensitive information such as gender, race or financial status. We don't attempt to classify by any demographic attribute outside of commercial markets, and focus only on cultural interest topics and content engagements within them.

We even carry that philosophy into advertising practices with our partners, by focusing on contextual targeting - placing ads based on the substance of the page, not the demographics of the reader. In this way, our technology addresses people based on their interest in a certain topic, not their demographic or personal information, which could otherwise result in biased placements.
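At its core, contextual targeting reduces to matching the topics detected on a page against the topics a campaign targets. The sketch below is a simplified illustration, not Codec's actual system - the campaign names and topic labels are invented:

```python
def eligible_campaigns(page_topics, campaigns):
    """Return campaigns whose target topics overlap the topics
    detected on the page - no reader data is involved."""
    page = set(page_topics)
    return [
        name for name, topics in campaigns.items()
        if page & set(topics)
    ]

# Hypothetical campaigns keyed by the topics they target
campaigns = {
    "trail_shoes": ["hiking", "outdoors"],
    "espresso_kit": ["coffee"],
}
matches = eligible_campaigns(["outdoors", "travel"], campaigns)
# matches contains only "trail_shoes"
```

Because the decision depends solely on page content, no demographic or personal attribute ever enters the matching step.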

At the same time, we train models to detect topics in documents - most commonly URLs. This training data is carefully labelled by our experienced and diverse data analyst team to ensure no bias or toxic content can infiltrate at this stage. We have an active research presence in topics related to toxic content in different modalities and we participate in academic conferences and competitions related to toxic content detection.

At Codec, we work with ambitious brands globally, helping them supercharge their growth through the power of communities. 

Want to tap into your undiscovered communities and unlock your brand potential? Get in touch
