Reading Assignment 3

Reading and Critical Review on Data Colonialism and Data Cleaning

For this assignment, please read the following book chapters and articles:
1. Data Feminism, Catherine D'Ignazio and Lauren F. Klein (ch. 5-6)
2. Data Colonialism: Rethinking Big Data's Relation to the Contemporary Subject, Nick Couldry and Ulises A. Mejias, Television & New Media, 2019
3. The TESCREAL Bundle: Eugenics and the Promise of Utopia Through Artificial General Intelligence, Timnit Gebru and Émile P. Torres, First Monday, 2024
4. Datasheets for Datasets, Timnit Gebru et al., Communications of the ACM, 2021
5. Documenting Data Production Processes: A Participatory Approach for Data Work, Milagros Miceli et al., 2022
After completing these readings, you are required to write a critical response addressing the following questions.

Based on papers [2-3], answer the following questions:
1. How do "Terms of Service" agreements shape power and consent in digital life, reproducing colonial patterns of control and making large-scale data collection seem natural or inevitable?
2. What does TESCREAL stand for, and how do its ideas reflect power, privilege, and inequality? Based on the authors' critique, what kinds of AI research should we focus on instead?
3. Do you see any overlap between data colonialism and the TESCREAL vision of AI, particularly in how both rely on ideas of progress and control that justify new forms of digital extraction and domination?

Based on papers [4-5], answer the following questions:
1. How does each of these two papers approach the goals of transparency, accountability, and ethical data documentation? Discuss the similarities and differences between the two frameworks in terms of their methodology, level of detail, and degree of community involvement.
2. Choose one public dataset (e.g., LAION-5B, ImageNet, Twitter Sentiment140, or YouTube Faces) and, drawing on papers [4-5], reconstruct the dataset's pipeline by answering the following questions:
General Overview
   - Dataset name, creators, institutions, date of release
   - Intended vs. actual use cases
   - Languages, domains, or communities represented
Data Collection
   - Where did the raw data come from (platform, source, region, etc.)?
   - How was it collected (manual, automated, scraped, crowdsourced)?
   - Who decided what to include or exclude?
Annotation and Labeling
   - Who did the annotation (crowdworkers, experts, researchers)?
   - What were their working conditions (pay, guidelines, pressure)?
   - Were they given context or discretion? How was disagreement handled?
Cleaning and Validation
   - What data was removed or modified, and why?
   - Who made these decisions?
   - Were errors, noise, or bias documented?
Storage, Access, and Circulation
   - Where is the data stored?
   - Who has access and under what conditions?
   - Are there usage restrictions or licenses?
Epistemic and Ethical Reflection
   - Whose labor and knowledge are embedded in the dataset?
   - Whose perspectives are missing or erased?
   - What risks or harms could result from its use?
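
If it helps to keep track of the pipeline questions while researching a dataset, the outline above can be held in a small structured record. The following is only a minimal sketch of one possible note-taking aid, not a required submission format: the section titles and prompts are copied from the outline, while the names PIPELINE_PROMPTS, blank_record, and unanswered are hypothetical and introduced purely for illustration.

```python
import json

# Section titles and prompts copied from the assignment outline.
PIPELINE_PROMPTS = {
    "General Overview": [
        "Dataset name, creators, institutions, date of release",
        "Intended vs. actual use cases",
        "Languages, domains, or communities represented",
    ],
    "Data Collection": [
        "Where did the raw data come from (platform, source, region, etc.)?",
        "How was it collected (manual, automated, scraped, crowdsourced)?",
        "Who decided what to include or exclude?",
    ],
    "Annotation and Labeling": [
        "Who did the annotation (crowdworkers, experts, researchers)?",
        "What were their working conditions (pay, guidelines, pressure)?",
        "Were they given context or discretion? How was disagreement handled?",
    ],
    "Cleaning and Validation": [
        "What data was removed or modified, and why?",
        "Who made these decisions?",
        "Were errors, noise, or bias documented?",
    ],
    "Storage, Access, and Circulation": [
        "Where is the data stored?",
        "Who has access and under what conditions?",
        "Are there usage restrictions or licenses?",
    ],
    "Epistemic and Ethical Reflection": [
        "Whose labor and knowledge are embedded in the dataset?",
        "Whose perspectives are missing or erased?",
        "What risks or harms could result from its use?",
    ],
}


def blank_record():
    """Return one empty answer slot per prompt, grouped by section (hypothetical helper)."""
    return {
        section: {prompt: "" for prompt in prompts}
        for section, prompts in PIPELINE_PROMPTS.items()
    }


def unanswered(record):
    """List (section, prompt) pairs whose answer is still empty (hypothetical helper)."""
    return [
        (section, prompt)
        for section, answers in record.items()
        for prompt, answer in answers.items()
        if not answer.strip()
    ]


if __name__ == "__main__":
    record = blank_record()
    # Placeholder text only; replace with what the dataset's own documentation says.
    record["Storage, Access, and Circulation"]["Where is the data stored?"] = (
        "TODO: note the hosting location and cite the source"
    )
    print(json.dumps(record, indent=2))
    print(len(unanswered(record)), "prompts still open")
```

Leaving an answer empty when the dataset's documentation is silent is itself informative: the list returned by unanswered then points to the gaps in documentation that papers [4-5] are concerned with.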