The gathering and use of personal data have permeated every aspect of our existence in the current digital era. Vast datasets enable AI advancement and personalised experiences, yet apprehension about data misuse continues to mount. Every search query, location ping, purchase, and social interaction generates data points that are collected, aggregated, and, in most cases, monetised without the meaningful understanding or consent of the individuals who produced them.
The Scale of Data Collection Today
The sheer volume of personal data generated and harvested daily defies intuitive comprehension. According to industry estimates, approximately 2.5 quintillion bytes of data are created globally each day, with the vast majority originating from ordinary consumer activity: social media interactions, streaming behaviour, mobile location services, biometric sensors, and smart home devices. Technology companies, advertisers, data brokers, and, increasingly, governments operate complex infrastructures to collect, store, and analyse this information. Much of this collection occurs passively, embedded in terms of service agreements that run to tens of thousands of words and that the overwhelming majority of users never read. The result is a structural asymmetry in which organisations possess detailed, longitudinal profiles of individuals who remain largely unaware of the breadth of data held about them.
The UN Human Rights Council and Digital Privacy
The ethical dimensions of mass data collection have attracted attention at the highest levels of international governance. The UN Human Rights Council has repeatedly affirmed that the right to privacy — enshrined in Article 12 of the Universal Declaration of Human Rights — applies fully in the digital environment. A series of UN Special Rapporteur reports have documented how the commodification of personal data enables surveillance, discrimination, and the erosion of freedoms of expression and association. Particularly concerning is the use of data collection technologies by state actors to monitor and suppress dissent, a practice documented across multiple jurisdictions. The Council's framework calls on both governments and private companies to adopt robust data protection standards grounded in necessity, proportionality, and genuine informed consent — principles that current industry practice frequently falls short of meeting.
AI Training Data and the Ethics of Scraping
The rapid expansion of AI systems has introduced new and urgent dimensions to the data ethics debate. Training a large language model or a computer vision system requires enormous quantities of data — much of which is sourced by scraping publicly accessible web content, including text, images, audio, and video created by individuals who had no expectation that their contributions would be used for this purpose. This practice raises serious questions about intellectual property, consent, and representational harm. When training datasets embed historical biases — reflecting patterns of discrimination present in the source data — AI systems can perpetuate and amplify those biases at scale. The challenge is compounded by a lack of transparency: most AI developers do not disclose the composition of their training datasets, making independent auditing and accountability virtually impossible. Advocacy groups and regulatory bodies are increasingly calling for mandatory data provenance documentation and for the legal recognition of individuals' rights over their data even in publicly accessible contexts.
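The consent question has one narrow, long-established technical analogue: the Robots Exclusion Protocol, by which site operators signal which crawlers may fetch which paths. A minimal sketch using Python's standard library follows; the crawler name `ExampleTrainingBot` and the robots.txt rules are hypothetical, chosen only for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. In practice a crawler would fetch
# this from https://example.com/robots.txt (e.g. via rp.set_url() and
# rp.read()) rather than hard-coding it.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /private/

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant scraper checks permission before every fetch.
print(rp.can_fetch("ExampleTrainingBot", "https://example.com/articles/1"))  # True
print(rp.can_fetch("ExampleTrainingBot", "https://example.com/private/x"))   # False
```

Note that robots.txt compliance is a voluntary convention addressed to crawlers, not a mechanism of informed consent by the individuals who created the content, which is precisely the gap the advocacy efforts above seek to close.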
Towards Responsible Data Practices
Addressing the ethics of data collection requires action at multiple levels simultaneously. At the regulatory level, frameworks such as the EU's General Data Protection Regulation (GDPR) and emerging equivalents in other jurisdictions establish important baselines — including rights of access, correction, erasure, and data portability — but enforcement remains inconsistent and penalties often fail to act as meaningful deterrents for large organisations. Organisations themselves can adopt privacy-by-design principles, minimising data collection to what is strictly necessary, anonymising where possible, and being transparent about how data is used and shared. For individuals, practical steps include using privacy-preserving browsers and search engines, reviewing app permissions regularly, and advocating for stronger regulatory standards. Ultimately, a genuinely ethical digital environment requires a cultural shift: one in which data is treated not as a resource to be extracted but as an extension of persons who retain rights over it.
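Two of the privacy-by-design principles mentioned above, minimising collection to what is strictly necessary and pseudonymising direct identifiers, can be sketched in a few lines. The field names, the allow-list, and the salt below are hypothetical; real pseudonymisation requires a properly managed secret and a full threat-model analysis:

```python
import hashlib
import hmac

# Hypothetical secret key; in production this would live in a key store,
# never in source code.
SALT = b"hypothetical-secret-salt"

# Data minimisation: an explicit allow-list of fields the purpose requires.
NECESSARY_FIELDS = {"country", "age_band"}

def pseudonymise(record: dict) -> dict:
    """Keep only necessary fields; replace the email with a keyed hash."""
    # HMAC-SHA256 yields a stable pseudonym that cannot be reversed,
    # or brute-forced from a list of known emails, without the key.
    token = hmac.new(SALT, record["email"].encode(), hashlib.sha256).hexdigest()
    minimal = {k: v for k, v in record.items() if k in NECESSARY_FIELDS}
    minimal["user_token"] = token
    return minimal

raw = {
    "email": "alice@example.com",
    "country": "DE",
    "age_band": "30-39",
    "gps_trace": [(52.52, 13.40)],  # dropped: not on the allow-list
}
stored = pseudonymise(raw)
```

The keyed hash (rather than a bare `sha256` of the email) matters: with an unkeyed hash, anyone holding a list of candidate emails could recompute the tokens and re-identify users.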