Is consent being prioritized in AI training?
Artificial intelligence (AI) systems rely heavily on large, diverse datasets to become as accurate, efficient, and capable as possible. GPT-3, the large language model (LLM) behind the original ChatGPT, was reportedly trained on around 570 GB of filtered text, including web pages, books, and other sources.
Where does this data come from?
Much of this data is gathered through methods like web scraping, public data repositories, and data-access partnerships with organizations. These sources supply the data used to refine the technology, helping it emulate human behavior more closely and, ultimately, perform more effectively.
The current surge in generative AI, with these systems now used to create art, music, text, and more, poses a critical ethical question: where does consent come into play?
Understanding consent in AI training
Put simply, consent in the context of AI training data means informing individuals about how their data will be used so they understand and agree to this use.
Consent helps build trust between the public and AI developers.
It ensures that data processing complies with legal standards, allowing individuals to have control over their personal information.
Consent frameworks help hold AI developers accountable for their data practices and promote ethical AI development.
Transparent consent processes can lead to higher quality and more reliable data.
Informed consent - consent given freely and voluntarily, with awareness of the potential risks and benefits of how one's data will be used - is particularly important.
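To make this concrete, the sketch below shows, in Python, how a training pipeline might record and check informed consent before using a sample. It is a minimal illustration under assumed names - ConsentRecord, has_valid_consent, and the field set are all hypothetical, not part of any specific framework.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ConsentRecord:
    subject_id: str
    purpose: str                      # e.g. "model_training"
    informed: bool                    # the person was told how the data will be used
    freely_given: bool                # no coercion or bundled, take-it-or-leave-it terms
    granted_at: datetime
    revoked_at: Optional[datetime] = None

def has_valid_consent(record: ConsentRecord, purpose: str) -> bool:
    """Consent only counts if it is informed, freely given,
    matches the stated purpose, and has not been revoked."""
    return (
        record.informed
        and record.freely_given
        and record.purpose == purpose
        and record.revoked_at is None
    )

def filter_training_samples(samples: list, consents: list) -> list:
    """Keep only samples whose subject has valid consent for model training."""
    consented = {c.subject_id for c in consents
                 if has_valid_consent(c, "model_training")}
    return [s for s in samples if s["subject_id"] in consented]
```

A production consent framework would also need to log purpose changes, handle re-consent, and audit access, but even this toy filter shows how the principles above can be made machine-checkable.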
However, using personal data without consent can have serious negative consequences: legal liability, loss of trust, reputational damage, and security risks.
When AI systems are powered by data collected from individuals without their knowledge, those individuals become more vulnerable to personal data breaches, fraud, or worse. At a time when cyberattacks continue to rise, this security risk cannot be taken lightly.
Attacks on open-source software libraries - repositories of pre-written code, data, and documentation that companies like OpenAI rely on heavily - have increased by 742% since 2019. Moreover, in 2024, an AI Threat Landscape report found that 77% of businesses had already faced AI security breaches.
This issue also stretches beyond generative AI. In the security sector, for example, AI systems are used for surveillance, predictive policing, and threat detection. These systems often rely on large amounts of personal data, including video footage, biometric data, and other sensitive information - all of which require consent. Notably, the EU AI Act explicitly prohibits the untargeted scraping of facial images from the internet or CCTV footage to create facial recognition databases.
While consent alone cannot prevent cyberattacks, data collected and used with proper consent can help mitigate some risks by promoting better data handling practices.
Generative AI and the legal battleground
Several high-profile legal actions and controversies have highlighted the issue of consent in AI training, particularly regarding generative AI.
In 2023, Getty Images sued Stability AI over the unauthorized use of copyrighted images. In the same year, George R.R. Martin, whose novels underpin Game of Thrones, joined other authors in suing OpenAI; the authors alleged harmful infringement of their copyrights and accused ChatGPT of relying on “systematic theft on a mass scale.”
In March 2024, the Publishers Association, which represents major publishers like Penguin and HarperCollins, demanded that tech companies seek consent before using copyright-protected works to develop AI systems.
These legal battles also extend to software itself. In late 2022, programmers filed a class action against GitHub, Microsoft, and OpenAI, alleging that GitHub Copilot reproduces licensed code without consent or attribution; the case continued through 2023.
Current legislative pushes for consent
In the US, senators introduced the AI CONSENT Act in March 2024. This legislation includes key provisions such as obtaining express informed consent from individuals and enforcing these standards through the FTC. Additionally, in January 2024, FTC Chair Lina Khan announced a probe into AI models that collect data unlawfully, underscoring the regulatory focus in the US.
The AI CONSENT Act is also not limited to generative AI.
In healthcare, it ensures that patient data used for training diagnostic algorithms is handled with explicit consent.
In retail, it mandates clear and informed consent for consumer behavior analysis, ensuring shoppers are aware of how their data is used.
Beyond the United States, other legislative efforts and regulatory responses reflect a global push for ethical AI practices. In the European Union, Italy restored access to ChatGPT after OpenAI introduced a user opt-out form and the ability to object to the use of personal data. In the UK, the House of Lords Communications and Digital Committee published a report supporting licensing and transparency requirements for AI models, advocating similar ethical standards.
The challenges and practicalities of implementing consent
First and foremost, obtaining and managing consent can be labor-intensive and costly - particularly when dealing with vast amounts of data, much of which may come from publicly available sources. The debate around whether consent is necessary for public data, or whether it falls under fair use, further complicates the issue.
De-identifying data is often suggested as a solution to mitigate privacy concerns, but this approach has its own limitations. The AI CONSENT Act, for instance, acknowledges these limitations and mandates a study to explore the efficacy of data de-identification.
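To see where those limitations come from, here is a simple Python sketch of the technique under assumed field names (nothing here reflects a specific standard): direct identifiers are dropped and quasi-identifiers are pseudonymized with a salted hash, yet the remaining attributes can still give a person away.

```python
import hashlib
import os

# Hypothetical field classification, for illustration only.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}
QUASI_IDENTIFIERS = {"zip_code", "date_of_birth"}

SALT = os.urandom(16)  # per-dataset secret salt

def de_identify(record: dict) -> dict:
    """Drop direct identifiers and pseudonymize quasi-identifiers."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # remove outright
        elif field in QUASI_IDENTIFIERS:
            digest = hashlib.sha256(SALT + str(value).encode()).hexdigest()
            cleaned[field] = digest[:12]  # stable pseudonym, not reversible
        else:
            cleaned[field] = value
    return cleaned

record = {"name": "Jane Doe", "zip_code": "90210",
          "date_of_birth": "1990-01-01", "purchase": "headphones"}
print(de_identify(record))
# Stable pseudonyms can be linked across datasets, and untouched fields
# like purchase history can still single someone out - the core
# limitation noted above.
```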
In the competitive tech landscape, continuously seeking consent can sometimes be perceived as a hindrance. Companies are often driven to push the boundaries of AI capabilities, which can conflict with the slower, more deliberate processes required to obtain and manage consent effectively.
Moreover, the dynamic nature of AI models, which constantly evolve and improve through ongoing data input, makes the implementation of consent frameworks even more complex. Consent mechanisms that keep pace with technological advancements require continuous oversight and adaptation, which can be tricky for many organizations.
There are already moves in the private sector to address these issues. For instance, OpenAI's incognito mode lets users opt out of having their data collected for AI training. Additionally, the “Fairly Trained” initiative, backed by music industry bodies, runs a certification scheme to ensure AI models in the sector are trained with consent.
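Reduced to its core, the opt-out pattern behind such schemes looks something like the following Python sketch - a hypothetical registry, not OpenAI's or Fairly Trained's actual mechanism.

```python
# Hypothetical opt-out registry, for illustration only.
opted_out: set[str] = set()

def record_opt_out(user_id: str) -> None:
    """Honor a user's request to exclude their data from AI training."""
    opted_out.add(user_id)

def collect_for_training(events: list) -> list:
    """Return only events from users who have not opted out."""
    return [e for e in events if e["user_id"] not in opted_out]

record_opt_out("user_42")
events = [{"user_id": "user_42", "text": "..."},
          {"user_id": "user_7", "text": "..."}]
print(collect_for_training(events))  # only user_7's event remains
```

The hard part in practice is not this filter but applying it retroactively: once data has been absorbed into a trained model, honoring a late opt-out may require retraining, which is one reason consent is easier to respect at collection time.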
Consent in AI training data is the cornerstone of ethical AI development. It builds trust, ensures legal compliance, and promotes the responsible use of personal data. To safeguard these principles, ongoing dialogue and decisive action are crucial. Stakeholders must prioritize consent to enhance public trust and accountability and foster a culture of ethical AI practices.
A case study in responsible AI training: Secure Redact
At Pimloc, we are dedicated to leading by example in ethical AI practices. Our platform, Secure Redact, is built using publicly available datasets with permissive licenses, ensuring that our data collection methods, such as gathering customized video footage, do not compromise privacy. Our approach centers on three key principles: minimal privilege, access restricted to only what is necessary, and data anonymization.