Our take on the ethical maze of data scraping in AI

Every day, an astonishing 2.5 quintillion bytes of data are generated globally. 

AI models harness a significant portion of this data through data scraping. This practice is central to AI advancements but has recently entangled leading AI entities like OpenAI and Microsoft in legal and ethical controversies.

How can we navigate the ethical maze of data scraping in AI training? Below, we take a closer look at how we gather and use data to train the Secure Redact AI models.


Behind the AI curtain 

Data scraping in AI involves web crawling – an automated process where vast amounts of internet data, ranging from text to images, are harvested. This data undergoes extraction, aggregation, and preprocessing, forming the foundation for training AI models. 
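To make that pipeline concrete, here is a minimal, purely illustrative sketch of the crawl, extract, and preprocess steps in Python. The seed URLs, helper functions, and cleaning rules are hypothetical placeholders, not a description of any particular training system.

```python
# Illustrative only: a toy crawl -> extract -> preprocess pipeline.
# The URLs below are placeholders, not real training sources.
import re
import requests
from bs4 import BeautifulSoup

SEED_URLS = ["https://example.com/article-1", "https://example.com/article-2"]

def fetch(url: str) -> str:
    """Crawl step: download the raw HTML for one page."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text

def extract_text(html: str) -> str:
    """Extraction step: strip markup and keep only visible text."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(separator=" ", strip=True)

def preprocess(text: str) -> str:
    """Preprocessing step: normalise whitespace and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()

# Aggregation step: assemble a small corpus ready for model training.
corpus = [preprocess(extract_text(fetch(url))) for url in SEED_URLS]
print(f"Collected {len(corpus)} documents")
```

In practice this loop runs at enormous scale, which is exactly why the ethical questions below matter.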

These models are continually refined with new data in pursuit of better performance. Google's recent acknowledgement of its data scraping practices and Twitter's measures to counteract such activities (such as access restrictions and rate limiting) underscore how widespread the practice has become.


The ethical dilemma: privacy vs progress

The AI domain is caught in a stark tension: the relentless pursuit of technological progress is increasingly at odds with individual privacy. 

Some experts express concerns over data transparency and the difficulty of removing personal data from a model once it has been trained. And while web scraping is generally legal in the US, it still raises significant public and legal concerns. 

The use of publicly available data, as seen in the case of Clearview AI, raises questions about how that data is deployed and whether individuals could have anticipated such use. The debate extends to copyrighted material: does using it for AI training fall under fair use, or does it infringe on intellectual property rights? 


Transparency and accountability are key

The demand for transparency in AI data practices is growing, championed by AI experts and ethicists such as Timnit Gebru, whose 2018 paper "Datasheets for Datasets" calls for documenting how training data is collected and intended to be used. This need for openness is also evident in legal and regulatory domains, with a push for laws that balance data protection with ethical AI development. Incorporating robust citation practices and recognition of copyright into AI development (particularly for large language models) could significantly raise ethical standards.

In the UK, there are ongoing efforts to update and share codes of practice and guidance, but no such legislation is currently in place. That gap underscores the importance of accountability in AI development. By mandating source citation and copyright acknowledgement, such rules would foster transparency and ensure that creators and copyright owners are duly recognised for their contributions. This not only addresses ethical concerns but also encourages a culture of respect and responsibility towards intellectual property, ultimately leading to more trustworthy and ethically grounded AI systems.


A case study in responsible AI training: Secure Redact

At Secure Redact, we try to model ethical AI practice. We use publicly available datasets with permissive licences and responsibly gathered data (such as tailored video footage), so our team can train our AI models without infringing on privacy. Our approach is also centred on minimal privilege and data anonymisation, reflecting our commitment to ethical AI and our mission: to advance visual AI systems in the interests of people and their freedoms.
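As a purely illustrative sketch (not our production tooling), the snippet below shows how candidate datasets might be gated on licence and anonymisation status before they ever reach a training pipeline. The dataset names, the licence allow-list, and the anonymised flag are hypothetical.

```python
# Illustrative sketch: admit only permissively licensed, anonymised data.
# Dataset names, licences and the `anonymised` flag are hypothetical.
from dataclasses import dataclass

PERMISSIVE_LICENCES = {"CC0-1.0", "CC-BY-4.0", "MIT", "Apache-2.0"}

@dataclass
class Dataset:
    name: str
    licence: str
    anonymised: bool  # e.g. faces and plates blurred, identifiers removed

def approved_for_training(ds: Dataset) -> bool:
    """A dataset is admitted only if its licence is permissive and it is anonymised."""
    return ds.licence in PERMISSIVE_LICENCES and ds.anonymised

candidates = [
    Dataset("open-street-scenes", "CC-BY-4.0", anonymised=True),
    Dataset("scraped-social-media", "unknown", anonymised=False),
]

training_set = [ds for ds in candidates if approved_for_training(ds)]
print([ds.name for ds in training_set])  # -> ['open-street-scenes']
```

The point of this design is that the check happens before ingestion, so data that fails it never reaches the models in the first place.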


As the AI industry evolves, the conversation around ethical AI development becomes increasingly relevant. A sustainable path forward involves striking a balance between innovation and privacy, to ensure progress in AI does not undermine individual rights. 

The industry must prioritise transparency and responsible data usage, adhering to ethical standards that protect both the public interest and the integrity of intellectual property. Data scraping is not a trend that will slow down any time soon, so it is the responsibility of industry leaders and experts to balance these developments with ethical practices.  


How can you harness the power of AI to manage video responsibly?