Today, multimodal and generative AI shape how media companies tell stories, helping them craft more compelling narratives. With this next-gen AI, media companies can describe what is in audiovisual media, enabling fast and relevant content searches. Yet one of the challenges of using AI for video indexing is that the datasets behind it do not reflect the diversity of the world. Most are created in Western countries and in English, regardless of the actual distribution of language speakers worldwide.
On Hugging Face, more than 65% of datasets are in English and less than 4% are in Arabic. That ratio is vastly disproportionate given that roughly 400 million people worldwide speak English as their native language and around 370 million speak Arabic. The discrepancy arises from several factors: English serves as a lingua franca for research and communication, which shapes dataset creation, and market size drives language-related effort and quality, affecting aspects such as transcription accuracy.
The bias of AI datasets affects how stories are told and how cultural narratives are conveyed by broadcasters and media companies. For example, a shot of a man wearing a kufiya, or keffiyeh, may be incorrectly detected and described by AI as a hijab. This is often termed a hallucination: the model was trained on data containing few or no examples of keffiyehs, so it lacks knowledge of this Middle Eastern garment and of who wears it, and falls back on the closest item it does know. To address this, cultural information from diverse sources must be continuously added to existing datasets.
How to Reduce Biases in AI-Generated Content
One solution is prompt engineering: adjusting the prompts given to a model to obtain better outputs. Asking Midjourney for a picture of a researcher making a discovery will likely produce four variations of a Caucasian man in his 40s in a white shirt or lab coat. Adjusting the prompt to specify other cultures and genders improves the diversity of the generated results, as sketched below. Google Gemini attempted to automate this kind of correction but faced drawbacks, including inaccuracies in some historical image depictions.
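As a rough illustration, prompt adjustment can be as simple as programmatically generating variants of a base prompt that make the subject's gender and cultural background explicit. The sketch below is a minimal example; the descriptor lists and the diversified_prompts function are illustrative assumptions, not part of any particular tool.

```python
# A minimal sketch of prompt adjustment: instead of one generic prompt,
# generate variants that state the subject's gender and background explicitly.
# The descriptor lists below are illustrative, not exhaustive.

import itertools

base_prompt = "a researcher making a discovery in a laboratory"

genders = ["a woman", "a man", "a non-binary person"]
backgrounds = ["Middle Eastern", "West African", "South Asian", "Latin American"]

def diversified_prompts(base, genders, backgrounds):
    """Yield prompt variants that make gender and cultural background explicit."""
    for gender, background in itertools.product(genders, backgrounds):
        yield f"{base}, the researcher is {gender} of {background} heritage"

for prompt in diversified_prompts(base_prompt, genders, backgrounds):
    print(prompt)
```

Each variant can then be submitted to the image generator in turn, so that the spread of results reflects a deliberate choice rather than the defaults of the training data.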
Another option is fine-tuning the model, which means adapting an existing model to a specific task. Instead of retraining a new model from scratch, extra data is added on top of the base model. This can be very efficient, but to be effective, AI engineers and business experts must work together to supply real-world use cases and the necessary data.
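As a minimal sketch of what fine-tuning can look like in practice, the example below retrains only the classification head of a pretrained image model on a hypothetical folder of labelled clothing images. The data/headwear path, its categories and the training settings are assumptions for illustration, not a description of any production pipeline.

```python
# A minimal fine-tuning sketch (PyTorch / torchvision), assuming a hypothetical
# folder of labelled images such as data/headwear/keffiyeh/..., data/headwear/hijab/...
# The pretrained backbone is frozen; only a new task-specific head is trained.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Hypothetical dataset path; each subfolder is one clothing category.
dataset = datasets.ImageFolder("data/headwear", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False  # keep the general visual knowledge of the base model
model.fc = nn.Linear(model.fc.in_features, len(dataset.classes))  # new head for our labels

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

The point of the sketch is the division of labour: the base model keeps its general visual knowledge, while the added data, chosen with input from people who know the cultural context, teaches it the distinctions it was previously missing.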
Applying a multimodal approach, one that draws on multiple data modalities, further improves video indexing results and can reduce the likelihood or frequency of some AI hallucinations. Rather than relying on a single source for indexing, multimodal AI takes into account numerous sources, such as objects, context, geolocation, text, facial recognition, Wikidata, brand logos and other visual patterns, transcriptions and translations. Much as people draw on memory, hearing and a sense of space and time, the metadata produced by multimodal indexing leads users to the exact moment in a video and gives them the precise context they need.
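The sketch below illustrates the idea in its simplest form: several independent signals are merged per time segment so that a search query can match against all of them at once and land on the exact moment. The segment fields, sample values and search function are hypothetical placeholders, not real model output or a real indexing API.

```python
# A minimal sketch of multimodal indexing: transcript, detected objects,
# recognised faces and geolocation are merged per time segment so that a
# search can combine modalities. All values below are illustrative only.

from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float                     # seconds from the start of the video
    end: float
    transcript: str = ""
    objects: list[str] = field(default_factory=list)
    faces: list[str] = field(default_factory=list)
    location: str = ""

    def keywords(self) -> set[str]:
        """Flatten every modality into one searchable keyword set."""
        words = set(self.transcript.lower().split())
        words |= {o.lower() for o in self.objects}
        words |= {f.lower() for f in self.faces}
        if self.location:
            words.add(self.location.lower())
        return words

def search(segments: list[Segment], query: str) -> list[Segment]:
    """Return segments whose combined metadata contains every query term."""
    terms = set(query.lower().split())
    return [s for s in segments if terms <= s.keywords()]

segments = [
    Segment(0.0, 12.5, transcript="opening remarks at the summit",
            objects=["podium", "flag"], location="doha"),
    Segment(12.5, 30.0, transcript="interview with the delegate",
            objects=["keffiyeh", "microphone"], faces=["delegate"], location="doha"),
]

for hit in search(segments, "keffiyeh doha"):
    print(f"{hit.start:.1f}s to {hit.end:.1f}s: {hit.transcript}")
```

Because a query like "keffiyeh doha" only matches where the visual and location signals agree, a misread from any single modality is less likely to dominate the result.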
Over the long term, sharing additional data with the open-source community and promoting diversity among research teams will go a long way toward improving cultural representation.
Frederic Petitpont is the Co-founder & CTO of Moments Lab.