Co-liberative Computing


Reading and Critical Review on Data Colonialism and Data Cleaning

For this assignment, please read the following book chapters and articles:
1. Data Feminism, Catherine D'Ignazio and Lauren F. Klein (ch. 5-6)
2. Data Colonialism: Rethinking Big Data's Relation to the Contemporary Subject, Nick Couldry and Ulises A. Mejias, Television & New Media, 2019
3. Against Cleaning, Katie Rawson and Trevor Muñoz, Debates in the Digital Humanities 2019, 2019
4. Artificial Intelligence and Inclusion: Formerly Gang-Involved Youth as Domain Experts for Analyzing Unstructured Twitter Data, William R. Frey et al., Social Science Computer Review, 2020
5. Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes, Nikhil Garg et al., Proceedings of the National Academy of Sciences, 2018
6. Datasheets for Datasets, Timnit Gebru et al., Communications of the ACM, 2021
7. The Dataset Nutrition Label, Sarah Holland et al., Data Protection and Privacy, 2020
8. Documenting Data Production Processes: A Participatory Approach for Data Work, Milagros Miceli et al., Proceedings of the ACM on Human-Computer Interaction (CSCW), 2022

After completing these readings, write a critical response addressing the following questions.

Based on papers [2-3], answer the following questions:
1. Couldry and Mejias argue that data colonialism extends control over individuals by appropriating their data. How does the practice of data cleaning fit into this framework of control, particularly when it removes "messy" data that doesn't conform to standardized categories? Is data cleaning an extension of data colonialism, and if so, how?
2. Both papers discuss the implications of data practices for individuals and communities. How do the practices of data colonialism and data cleaning affect the construction of subjectivity, particularly for marginalized groups? What are the consequences when individuals' experiences are reduced or erased by these processes?
3. How might data cleaning contribute to the erasure of intersectional identities by simplifying or categorizing data in ways that ignore complexity? What steps could data practitioners take to preserve intersectionality in their data work?

Based on papers [4-5], answer the following questions:
1. Frey et al. show that involving formerly gang-involved youth improved the reliability of AI interpretations. In what ways might the inclusion of such domain expertise challenge or validate the reliability of AI systems trained on biased word embeddings, as seen in Garg et al.?
2. What are the potential challenges of scaling participatory approaches that involve domain experts from marginalized backgrounds, as suggested by Frey et al., across larger datasets or different cultural contexts? How can researchers address these challenges to maintain inclusivity and accuracy?
3. How can researchers balance the need to remove harmful biases from word embeddings with the importance of preserving historical context? Should certain biases be retained in language models to reflect cultural history, or is the potential for harm too great?

Based on papers [6-8], answer the following questions:
1. How do these three papers each approach the goals of transparency, accountability, and ethical data documentation? Discuss the similarities and differences among these frameworks in terms of their methodology, level of detail, and degree of community involvement.
2. How do these three documentation approaches handle the identification and disclosure of biases and limitations within datasets? Do you think that either Datasheets for Datasets or the Dataset Nutrition Label provides adequate mechanisms for addressing biases in data production, or could participatory documentation better address this issue?
3. What trade-offs exist between transparency and practicality in data documentation? Do you think it's feasible for organizations to adopt detailed documentation frameworks like Datasheets for Datasets or Dataset Nutrition Labels at scale? Why or why not?