
Bowdoin Science Journal


Computer Science and Tech

AI for Language and Cultural Preservation

December 11, 2025 by Wing Kiu Lau '26


Abstract

Nearly half of the world’s languages face extinction, threatening irreplaceable knowledge and cultural connections. This paper examines how artificial intelligence can support endangered language documentation and revitalization when guided by community priorities. Through case studies, from Hawaiian speech recognition to Cherokee learning platforms, the paper identifies both opportunities (improved access, engaging tools, cross-distance connection) and challenges (privacy risks, cultural appropriation, misinformation, sustainability). The central argument: effective preservation requires community leadership, robust consent frameworks, and sustained support rather than commodified technological quick fixes. The paper concludes with principles for responsible AI use that strengthens living languages and their cultural contexts.

Introduction

Approximately 40% of the world’s 6,700 languages risk extinction as speaker populations decline (Jampel, 2025). This crisis extends beyond communication loss. Languages embody cultural identity, historical memory, and community bonds. When languages disappear, speakers lose direct access to ancestral knowledge, particularly where oral histories predominate. Research suggests linguistic heritage connection correlates with improved adolescent mental health outcomes and reduced rates of certain chronic conditions.

Artificial intelligence has emerged as one approach to preservation, offering unprecedented documentation scale and interactive learning platforms. However, concerns persist, ranging from environmental costs to whether technology can authentically serve community needs.

This paper argues that while AI provides powerful tools for endangered language work through natural language processing and speech recognition, success depends on careful integration with Indigenous communities’ priorities, values, and active participation.

It first examines AI’s technical foundations, real-world applications and case studies, diverse stakeholder perspectives, and future promises and challenges facing the field.

Technical Foundations: How AI Works in Language Preservation 

To understand how AI contributes to language preservation, it is helpful to see how Natural Language Processing provides the foundational methods for analyzing language, while modern language models apply these methods to learn from data and produce meaningful representations and outputs that support language preservation-focused tasks.

Natural Language Processing

Natural Language Processing (NLP) sits at the core of how AI is used to process and manipulate text. Two key areas inform language documentation: Computational Linguistics (developing tools and methods for analyzing language data) and Semantics (studying how meaning operates in language).

Semantics broadly concerns deriving meaning from language. It spans the linguistic side, where it handles lexical and grammatical meaning tied to computational linguistics, and the philosophical side, which examines distinctions between fact and fiction, emotional tone (e.g., positive, neutral, negative), and relationships between different corpora (Ali et al., 2025, p. 133).

Semantics can address challenges like word ambiguity: for example, "lie" can mean a falsehood or the act of resting horizontally. Computational Linguistics tools like Bidirectional Encoder Representations from Transformers (BERT) use contextual analysis to disambiguate such terms. Other common challenges include capturing idiomatic meanings in translation (Ali et al., 2025, p. 134).
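
As a concrete illustration of contextual disambiguation (a sketch written for this article, not code from the cited work), the snippet below asks a pretrained BERT model for the vector it assigns to "lie" in two different sentences. The model name, example sentences, and helper function are assumptions chosen for the demo; the point is only that the same word receives different representations in different contexts.

```python
# Illustrative sketch: contextual embeddings let a model tell the two senses of "lie" apart.
# Assumes the `transformers` and `torch` packages; downloads the public "bert-base-uncased" weights.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector BERT assigns to `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]              # one vector per token
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index(word)]                              # first occurrence of the word

lie_falsehood = embedding_of("it is wrong to lie to your friends", "lie")
lie_resting = embedding_of("the dogs lie on the warm floor", "lie")

# The same surface form gets context-dependent vectors, so their similarity is noticeably below 1.0.
similarity = torch.nn.functional.cosine_similarity(lie_falsehood, lie_resting, dim=0)
print(similarity.item())
```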

Computational Linguistics began in the 1940s-50s, but recent advancements have driven major developments in NLP through Machine Learning and Deep Learning. Neural networks, which are data-processing structures inspired by the human brain, allow machines to learn from sample data to perform complex tasks by recognizing, classifying, and correlating patterns.

In more recent years, Generative AI has gained prominence. It relies on transformer architectures, a type of neural network that analyzes entire sequences simultaneously to determine which parts are most important, enabling effective learning from large datasets.

In short, NLP implementation involves preprocessing textual data through steps such as tokenization (breaking text into smaller units), stemming or lemmatization (reducing words to their root forms, e.g., talking → talk), and stop-word removal (eliminating common or low-value words like and, for, with). The processed data is then used to train models for specific tasks.
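
A minimal preprocessing sketch under stated assumptions: it uses NLTK (resource names can vary slightly between NLTK versions) and an invented example sentence to show tokenization, stop-word removal, and lemmatization in order.

```python
# Minimal NLTK preprocessing sketch: tokenization, stop-word removal, lemmatization.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for resource in ("punkt", "stopwords", "wordnet"):   # one-time downloads of NLTK data
    nltk.download(resource, quiet=True)

text = "The speakers were talking about recording and preserving their stories."

tokens = nltk.word_tokenize(text.lower())                             # tokenization: break text into units
stop_words = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stop_words]  # drop low-value words like "and"

lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t, pos="v") for t in content]          # e.g., "talking" -> "talk"

print(content)
print(lemmas)
```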

Common NLP applications relevant to language preservation include part-of-speech tagging, which labels words in a sentence based on their grammatical roles (e.g., nouns, verbs, adjectives, adverbs); word-sense disambiguation, which resolves multiple possible meanings of a word; speech recognition, which converts spoken language into text; machine translation, which enables translation between languages; sentiment analysis, which identifies emotional tone in text; and automatic resource mining, which involves the automated collection of linguistic resources (Amazon Web Services, n.d.).
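
Of these applications, part-of-speech tagging is the easiest to show in a few lines. The sketch below uses spaCy's small English pipeline, an assumption made purely for illustration; the model must be installed separately.

```python
# Part-of-speech tagging and lemmatization with spaCy's small English model (illustrative only).
# Assumes: pip install spacy  and  python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Elders recorded the old songs so that young learners could study them.")

for token in doc:
    # token.pos_ is the coarse grammatical role; token.lemma_ is the dictionary form.
    print(f"{token.text:10} {token.pos_:6} {token.lemma_}")
```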

Language Models 

BERT, developed by Google, is trained mainly with masked language modeling, where it predicts missing words from surrounding context. The original BERT also included a next sentence prediction task to judge whether one sentence follows another, although many modern variants modify or omit this objective (BERT, n.d.). Multilingual BERT (MBERT) extends this ability to multiple languages (Ali et al., 2025, p. 136).
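
The masked-language-modeling objective is easy to see with Hugging Face's fill-mask pipeline. The sketch below is illustrative only; the example sentence is invented, and the public bert-base-uncased checkpoint is assumed.

```python
# Masked language modeling: BERT predicts the hidden word from its surrounding context.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("Community elders help [MASK] the language to younger speakers."):
    # Each prediction carries a candidate word and the model's confidence in it.
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```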

Building on these advances, Cherokee researchers are applying and extending NLP techniques to advance language preservation and revitalization. According to Dr. David Montgomery, a citizen of the Cherokee Nation, “It would be a great service to Cherokee language learners to have a translation tool as well as an ability to draft a translation of documents for first-language Cherokee speakers to edit as part of their translation tasks” (Zhang et al., 2022, p. 1535).

To realize this potential, the research effort focuses on adapting existing NLP frameworks and creating tools specifically suited to Cherokee. Effective data collection and processing depend on capabilities such as automatic language identification and multilingual embedding models. For example, aligning Cherokee and English texts requires projecting sentences from both languages into a common semantic space to evaluate their similarity. These are capabilities that most standard NLP tools don’t provide and must be custom-built for this context (Zhang et al., 2022, p. 1535).
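
The shared-semantic-space idea can be sketched with an off-the-shelf multilingual sentence encoder. The example below uses an English-Spanish pair and the sentence-transformers library purely to show the mechanism; as the researchers note, models with this kind of coverage do not exist off the shelf for Cherokee and would have to be custom-built.

```python
# Illustrative only: project sentences from two languages into one semantic space
# and score their similarity. Off-the-shelf multilingual encoders do not cover
# Cherokee, which is exactly the gap described above.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english = "The children are learning the language from their grandparents."
spanish = "Los niños están aprendiendo el idioma de sus abuelos."
unrelated = "The stock market closed higher today."

embeddings = model.encode([english, spanish, unrelated])

# An aligned pair should score much higher than an unrelated pair.
print("EN vs ES:       ", util.cos_sim(embeddings[0], embeddings[1]).item())
print("EN vs unrelated:", util.cos_sim(embeddings[0], embeddings[2]).item())
```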

Real-World Applications and Case Studies 

Broadly speaking, researchers and developers are creating innovative AI solutions to support language preservation across communities.

For example, the First Languages A.I. Reality (FLAIR) Initiative develops adaptable AI tools for Indigenous language revitalization worldwide. Co-founder Michael Running Wolf (Northern Cheyenne Tribe) describes the project’s goal as increasing the number of active speakers through accessible technologies. One notable product, “Language in a Box,” is a portable, voice-based learning system that delivers customizable guided lessons for different languages (Jampel, 2025).

Indigenous scientists are also creating culturally grounded AI tools for youth engagement. Danielle Boyer developed Skobot, a talking robot designed to speak Indigenous languages (Jampel, 2025), while Jacqueline Brixey created Masheli, a chatbot that communicates in both English and Choctaw. Brixey notes that despite more than 220,000 enrolled Choctaw Nation members, fewer than 7,000 are fluent speakers today (Brixey, 2025).

Students with Skobots on their shoulders stand next to Danielle Boyer (The STEAM Connection, n.d.)

Hawaiian Language Revitalization – ASR 

A collaboration between The MITRE Corporation, University of Hawai‘i at Hilo, and University of Oxford explored Automatic Speech Recognition (ASR) for Hawaiian, a low-resource language. Using dozens of hours of labeled audio and millions of pages of digitized Hawaiian newspaper text, researchers fine-tuned models such as Whisper (large and large-v2), achieving a Word Error Rate (WER) of about 22% (Chaparala et al., 2024, p. 4). This is promising for research and assisted workflows, but it remains challenging for beginner and intermediate learners without human review.

The models struggled with key phonetic features, particularly the glottal stop (ʻokina ⟨ʻ⟩) and vowel length distinctions, due to their subtle acoustic properties. Occasionally, the model substituted spaces for glottal stops, potentially due to English linguistic patterns where glottal stops naturally occur before vowels that begin words. Hawaiian’s success with Whisper benefited from available training data, including 338 hours of Hawaiian and 1,381 hours of Māori, and its Latin-based alphabet. Other under-resourced languages lacking such advantages may face greater transcription challenges (Chaparala et al., 2024, p. 4).
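
Word Error Rate is the standard edit-distance-based score referenced above. The minimal implementation below (a sketch with an invented example rather than data from the study) also shows how a dropped ʻokina or macron turns an otherwise correct word into a full error.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = word-level edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + substitution)
    return dp[-1][-1] / max(len(ref), 1)

# Dropping the ʻokina and the macron makes two of the three words count as errors.
print(word_error_rate("ka ʻōlelo Hawaiʻi", "ka olelo Hawaii"))   # 2/3 ≈ 0.67
```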

Missing Scripts Initiative – Input Methods 

The Missing Scripts Initiative, led by ANRT (National School of Art and Design, France) in collaboration with UC Berkeley’s Script Encoding Initiative and the University of Applied Sciences, Mainz, addresses a major gap: nearly half of the world’s writing systems lack digital representation.

Launched in 2024 as part of the International Decade of Indigenous Languages, the initiative recognizes that beyond simply encoding these scripts into standard formats, there is the need to create functional input methods that allow users to type and interact with these writing systems. Developing these digital typefaces requires collaboration among linguists, developers, and native speakers. The initiative’s primary objectives involve encoding these scripts, a standardization process that assigns unique numerical identifiers to each character, and producing digital fonts. This work supports UNESCO’s global efforts to preserve and revitalize Indigenous linguistic heritage (UNESCO, n.d.).
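
The encoding step, assigning each character a unique numerical identifier, is the same mechanism Unicode provides for scripts that have already been encoded. Below is a small sketch using the already-encoded Cherokee syllabary (chosen because Cherokee appears elsewhere in this article); the scripts targeted by the Missing Scripts Initiative lack such assignments.

```python
# Unicode encoding assigns every character a unique code point (numerical identifier).
import unicodedata

for character in "ᏣᎳᎩ":   # Tsalagi, "Cherokee", written in the Cherokee syllabary
    print(f"U+{ord(character):04X}  {unicodedata.name(character)}")
```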

Full computational process of translating Afáka, the script of the Ndyuka language (an English-based creole of Suriname), into English at the Missing Scripts Program (The Missing Scripts, n.d.)

Cherokee Case Study – Tokenization & Community-based Language Learning

Researchers at UNC Chapel Hill found that Cherokee’s strong morphological structure, where a single word can express an entire English sentence, poses unique NLP challenges. Character-level modeling using Latin script proved more effective than traditional word-level tokenization. Moreover, because Cherokee’s word order varies depending on discourse context, translating entire documents at once may be more effective than translating one sentence at a time (Zhang et al., 2022, p. 1535-1536).
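
A toy comparison (not the authors' code) of the two tokenization strategies: at the word level, a long morphologically rich word is a single opaque, rarely seen unit, while at the character level it becomes a sequence of frequent symbols a model can share across words. The sample string is a hypothetical romanized form, not a real Cherokee word.

```python
# Word-level vs character-level tokenization for a morphologically rich language (toy sketch).
def word_tokens(text: str) -> list[str]:
    return text.split()

def char_tokens(text: str) -> list[str]:
    return [ch for ch in text if not ch.isspace()]

sample = "detsadatlohisdodi"          # hypothetical romanized stand-in for one long word
print(word_tokens(sample))            # one rare, opaque unit
print(char_tokens(sample))            # many frequent sub-units shared across the vocabulary
```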

Beyond technical modeling, researchers emphasized community-driven learning platforms that combine human input with AI. Inspired by systems like Wikipedia and Duolingo, these collaborative tools crowdsource content from speakers and learners. These platforms address two critical challenges simultaneously: the scarcity of training data for endangered languages and the resulting limitations in model performance. This approach transforms language learning from an individual task into a collective effort aimed at cultural preservation (Zhang et al., 2022, p. 1532).

Community Perspectives: Strengths and Concerns 

A study by Akdeniz University researchers examined community perspectives on AI for language preservation, highlighting both benefits and challenges. 

Strengths

Community members emphasized the transformative role of mobile apps in democratizing access: “Mobile apps have democratized access to our language, allowing learners from geographically dispersed areas to engage with it daily.” Interactive games and voice recognition tools make learning more engaging and accessible, while digital platforms foster connection and belonging among geographically dispersed speakers.

Translation tools and automated content generation have also proven valuable, with one linguist commenting that these technologies have been “game-changers in making our stories universally accessible.” Participants also underscored the value of cross-disciplinary collaboration, with one project manager noting that partnerships between tech developers and Indigenous communities have “opened new pathways for innovation.” AI’s adaptability was seen as another strength, allowing solutions to be customized for each language community, for example by prioritizing translation tools over transcription systems depending on local needs (Soylu & Şahin, 2024, p. 15).

Concerns

Participants also voiced serious concerns about ethics, privacy, and cultural sensitivity. One community leader stressed the importance of ensuring that “these technologies respect our cultural values and the integrity of our languages.” Limited internet infrastructure, funding instability, and intergenerational gaps remain ongoing barriers. As another participant observed, “Bridging the gap between our elders and technology is ongoing work.” Long-term sustainability depends on reliable funding and culturally informed consent practices (Soylu & Şahin, 2024, p. 15).

The Human and Cultural Dimensions 

Focusing on specific themes and perspectives, Indigenous innovators emphasize that AI cannot replace human elders and tradition keepers. Technology should complement traditional practices like classes and intergenerational transmission. “Language is a living thing,” requiring living speakers, cultural context, and human relationships (Jampel, 2025).

Language preservation carries profound emotional and cultural significance. It is not merely the deployment of ‘fancy technology’ but usually a response to the deep wounds caused by historical oppression, including forced assimilation, the systematic suppression of Indigenous languages, and the displacement of communities from their ancestral lands (Brixey, 2025). For many, language revitalization is not just an educational effort but an act of cultural healing and the restoration of what was forcibly taken.

Critical Concerns and Emerging Risks

Beyond community-identified challenges, broader concerns about AI’s role in language preservation have emerged, particularly regarding quality control and misinformation. In December 2024, the Montreal Gazette reported the sale of AI-generated “how-to” books for endangered languages, including Abenaki, Mi’kmaq, Mohawk, and Omok (a Siberian language extinct since the 18th century). These books contained inaccurate translations and fabricated content, which Abenaki community members described as demeaning and harmful, undermining both learners’ efforts and trust in legitimate revitalization work (Jiang, 2025). 

Many Indigenous communities also remain cautious about adopting AI. Jon Corbett, a Nehiyaw-Métis computational media artist and professor at Simon Fraser University, noted that some communities “don’t see the relevance to our culture, and they’re skeptical and wary of their contribution. Part of that is that for Indigenous people in North America, their language has been suppressed and their culture oppressed, so they’re weary of technology and what it can do” (Jiang, 2025). This caution reflects historical trauma and highlights critical questions about control, ownership, and ethical deployment of AI in cultural contexts.

Toward Ethical and Decolonized Approaches

Scholars emphasize decolonizing speech technology—respecting Indigenous knowledge systems rather than imposing Western frameworks. In 2019, Onowa McIvor and Jessica Ball, affiliated with the University of Victoria in Canada, underscored community-level initiatives supported by coherent policy and government backing (Soylu & Şahin, 2024, p. 13).

Before developing computational tools, speaker communities’ basic needs must be met: “respect, reciprocity, and understanding.” Researchers must avoid treating languages as commodities or prioritizing dataset size over community wellbeing. Common goals must be established before research begins. Only through such groundwork can AI technologies truly serve language revitalization rather than becoming another tool of extraction and exploitation (Zhang et al., 2022, p. 1531).

These perspectives reveal that while technology offers promising pathways for language revitalization, success depends fundamentally on addressing both technical and sociocultural barriers through genuinely community-centered approaches that honor the living, relational nature of language itself. 

Future Challenges and Considerations

The Low-Resource Language Challenge

A key obstacle in applying AI to endangered languages is the lack of large training datasets. High-resource languages like English and Spanish rely on millions of parallel sentence pairs for accurate translation (Jampel, 2025), but many endangered languages have limited or no written resources. Some lack a script entirely, requiring more intensive dataset curation and multimodal approaches.

To address this, Professor Jacqueline Brixey and Dr. Ron Artstein compiled a dataset combining audio, video, and text, with many texts translated into English, allowing models to leverage multiple modalities (Brixey, 2025). Similarly, Jared Coleman at Loyola Marymount University is developing translation tools for Owens Valley Paiute, a “no-resource” language with no public datasets. His system first teaches grammar and vocabulary to the model, then has it translate using this foundation, mimicking human strategies when working with limited data. Coleman emphasizes: “Our goal isn’t perfect translation but producing outputs that accurately convey the user’s intended meaning” (Jiang, 2025).  

Capturing Linguistic and Cultural Features

Major models like ChatGPT perform poorly with Indigenous languages. Brixey notes: “ChatGPT could be good in Choctaw, but it’s currently ungrammatical; it shares misinformation about the tribe” (Jampel, 2025). Models fail to understand cultural nuance or privilege dominant culture perspectives, potentially mishandling sensitive information. These failures underscore the need for better security controls and validation mechanisms to mitigate the potential harm of linguistic misinformation. 

Technical challenges extend to basic digitization processes as well. For example, most Cherokee textual materials exist as physical manuscripts or printed books, which are readable by humans but not machine-processable. This limits applications such as automated language-learning tools. Optical Character Recognition (OCR), using systems like Tesseract-OCR and Google Vision OCR, can convert these materials into machine-readable text with reasonable accuracy. However, OCR performance is highly sensitive to image quality. Texts with cluttered layouts or illustrations, common in children’s books, often yield lower recognition rates, posing ongoing challenges for digitization and digital preservation efforts (Zhang et al., 2022, p. 1536).
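
Below is a hedged sketch of the OCR step with Tesseract via pytesseract: the file path is hypothetical, and a Cherokee traineddata pack (commonly named "chr") is assumed to be installed; substitute whichever language pack is actually available.

```python
# Converting a scanned page into machine-readable text with Tesseract (illustrative sketch).
from PIL import Image
import pytesseract

page = Image.open("scanned_page.png")            # hypothetical path to a scanned manuscript page

# Recognition quality is sensitive to image quality; grayscale conversion is a simple first step.
text = pytesseract.image_to_string(page.convert("L"), lang="chr")
print(text)
```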

Ethical and Governance Issues

The exploitation of Indigenous languages has deep historical roots that continue to shape debates on AI development. In 1890, anthropologist Jesse Walter Fewkes recorded Passamaquoddy stories and songs, some sacred and meant to remain private, but the community was denied access for nearly a century, highlighting longstanding issues of linguistic sovereignty (Jampel, 2025).

More recently, in late 2024, the Standing Rock Sioux Tribe sued an educational company for exploiting Lakota recordings without consent, profiting from tribal knowledge, and demanding extra fees to restore access (Jampel, 2025). 

In response, researchers like Brixey and Boyer implement protective measures, allowing participants to withdraw recordings and exclude their knowledge from AI development. These practices uphold data sovereignty, ensuring Indigenous communities retain control over their cultural knowledge and limiting commercialization. There is also a strong emphasis on keeping these technologies within Indigenous communities, preventing them from being commercialized or sold externally (Jampel, 2025).

As such, AI for language preservation requires clear policies for data governance and ethics. Some projects illustrate how AI can be ethically applied. New Zealand’s Te Hiku Media “Kōrero Māori” project uses AI for Māori language preservation under the Kaitiakitanga license, which forbids misuse of local data. CTO Keoni Mahelona emphasizes working with elders to record voices for transcription, demonstrating that AI tools can support Indigenous languages while respecting cultural values and community control (Jiang, 2025). Balancing technological openness with cultural sensitivity remains essential.

Resource and Infrastructure Needs

Beyond technical and ethical challenges, practical resource constraints significantly limit the scope and sustainability of language preservation initiatives. Securing funding for long-term projects remains one of the most persistent obstacles, as language revitalization requires sustained commitment over decades rather than short-term grant cycles. Training represents another critical need: communities require skilled teachers, technology experts, and materials developers who understand both the technical systems and the cultural context.

Infrastructure gaps pose fundamental barriers to participation. Many Indigenous communities lack reliable internet access and technology availability, limiting who can engage with digital language tools. Even when technologies are developed, communities need training to use and maintain AI tools independently, ensuring that these systems serve rather than create dependencies. Addressing these resource and infrastructure needs is essential for moving from pilot projects to sustainable, community-controlled language preservation ecosystems.

Conclusion

AI and NLP technologies hold significant promise for language preservation, addressing a critical need as many languages approach extinction due to declining numbers of speakers.

However, these technologies face inherent technical limitations. Low-resource languages often lack sufficient written materials or even a formal script, making model training difficult. LLMs trained primarily on English and other major languages struggle to capture the lexical, grammatical, and semantic nuances of endangered languages.

Equally important is the role of communities. Successful preservation depends on Indigenous leadership, ethical oversight, sustained collaboration, and adequate funding. AI should not be seen as a replacement for human knowledge but as one tool among many in a broader preservation toolkit.

Ultimately, digital preservation empowers communities to maintain and revitalize their linguistic heritage. Languages are living systems that thrive through active human relationships, and technology’s role is to support, not replace, these connections between people, language, and culture.

Bibliography 

Ali, M., Bhatti, Z. I., & Abbas, T. (2025). Exploring the Linguistic Capabilities and Limitations of AI for Endangered Language preservation. Journal of Development and Social Sciences, 6(2), 132–140. https://doi.org/10.47205/jdss.2025(6-II)12

BERT. (n.d.). Retrieved November 6, 2025, from https://huggingface.co/docs/transformers/en/model_doc/bert

Brixey, J. (Lina). (2025, January 22). Using Artificial Intelligence to Preserve Indigenous Languages—Institute for Creative Technologies. https://ict.usc.edu/news/essays/using-artificial-intelligence-to-preserve-indigenous-languages/

Chaparala, K., Zarrella, G., Fischer, B. T., Kimura, L., & Jones, O. P. (2024). Mai Ho’omāuna i ka ’Ai: Language Models Improve Automatic Speech Recognition in Hawaiian (arXiv:2404.03073). arXiv. https://doi.org/10.48550/arXiv.2404.03073

Digital preservation of Indigenous languages: At the intersection of technology and culture. (n.d.). UNESCO. Retrieved November 6, 2025, from https://www.unesco.org/en/articles/digital-preservation-indigenous-languages-intersection-technology-and-culture

Jampel, S. (2025, July 31). Can A.I. Help Revitalize Indigenous Languages? Smithsonian Magazine. https://www.smithsonianmag.com/science-nature/can-ai-help-revitalize-indigenous-languages-180987060/

Jiang, M. (2025, February 22). Preserving the Past: AI in Indigenous Language Preservation. Viterbi Conversations in Ethics. https://vce.usc.edu/weekly-news-profile/preserving-the-past-ai-in-indigenous-language-preservation/

Soylu, D., & Şahin, A. (2024). The Role of AI in Supporting Indigenous Languages. AI and Tech in Behavioral and Social Sciences, 2(4), 11–18. https://doi.org/10.61838/kman.aitech.2.4.2

Students with Skobots on their shoulders stand next to Danielle Boyer. The STEAM Connection. (n.d.). [Graphic]. Retrieved November 6, 2025, from https://th-thumbnailer.cdn-si-edu.com/8iThtG8bZkWMxUq0goedXRXlzio=/fit-in/1072×0/filters:focal(616×411:617×412)/https://tf-cmsv2-smithsonianmag-media.s3.amazonaws.com/filer_public/a5/f3/a5f3877c-f738-423d-bcff-8a44efcbe48f/danielle-boyer-and-student-wearing-skobots_web.jpg

The Missing Scripts. (n.d.). [Graphic]. Retrieved November 6, 2025, from https://sei.berkeley.edu/the-missing-scripts/

What is NLP? – Natural Language Processing Explained – AWS. (n.d.). Amazon Web Services, Inc. Retrieved November 6, 2025, from https://aws.amazon.com/what-is/nlp/

Zhang, S., Frey, B., & Bansal, M. (2022). How can NLP Help Revitalize Endangered Languages? A Case Study and Roadmap for the Cherokee Language. In S. Muresan, P. Nakov, & A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1529–1541). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.108

 


Unsupervised Thematic Clustering for Genre Classification in Literary Texts

May 4, 2025 by Wing Kiu Lau '26

Book genres (Chapterly, 2022)

Summary

In the last decade, computational literary studies have expanded, yet computational thematics remains less explored than areas like stylometry, which focuses on identifying stylistic similarities between texts. A 2024 study by researchers from the Max Planck Institute and the Polish Academy of Sciences investigated the most effective computational methods for measuring thematic similarity in literary texts, aiming to improve automated genre clustering.

Key Findings and Assumptions

  • Key Assumptions: 
    • Text pre-processing to emphasize thematic content over stylistic features could improve genre clustering. 
    • Unsupervised clustering would offer a more scalable and objective approach to genre categorization than manual tagging by humans.
    • Four genres were selected (detective, fantasy, romance, science fiction) because they are comparably broad categories.
    • If the genres are truly distinct in terms of themes, computers should be able to separate them into clusters.
  • Best Performance: The best algorithms were 66-70% accurate at grouping books by genre, showing that unsupervised genre clustering is feasible despite the complexity of literary texts.
  • Text Pre-Processing: Medium and strong levels of text pre-processing significantly improved clustering, while weak pre-processing performed poorly.
  • Which methods worked best: Doc2vec, a method that captures word meaning and context, performed the best overall, followed by LDA (Latent Dirichlet Allocation), which finds major topics in texts. Even the simpler bag-of-words method, which just counts how often words appear, gave solid results.
  • Best way to compare genres: Jensen-Shannon divergence, which compares probability distributions, was the most effective metric, while simpler metrics like Euclidean distance performed poorly for genre clustering.

Methodology 

Sample Selection

The researchers selected canonical books from each of the four genres, ensuring they were from the same time period to control for language consistency.

Sample Pre-Processing and Analysis 

The researchers analyzed all 291 combinations of the techniques in each of the three stages: text pre-processing, feature extraction, and measuring text similarity. 

Stage 1: Different Levels of Text Pre-Processing  

  • The extent to which the text is simplified and cleaned up.
    • Weak → lemmatizing (reducing words to their base or dictionary form, e.g., “running” to “run”) and removing the 100 Most Frequent Words
    • Medium → lemmatizing, using only nouns, adjectives, verbs, and adverbs, removing character names
    • Strong → Same as medium, but also replaced complex words with simpler versions.

Stage 2: Identifying Key Text Features through Extraction Methods

  • Transforming pre-processed texts into feature lists.
    • Bag-of-Words → Counts how often each word appears.
    • Latent Dirichlet Allocation (LDA) → Tries to discover dominant topics across books.
    • Weighted Gene Co-expression Network Analysis (WGCNA) → A method borrowed from genetics to find clusters of related words.
    • Document-Level Embeddings (doc2vec) → Captures semantic relationships between words based on their meanings (e.g., “dog” and “cat”) for similarity assessment.
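
As a brief sketch of two of the feature types above (illustrative only, not the study's pipeline), bag-of-words counts and LDA topic distributions can be built in a few lines with scikit-learn; the toy "books" are invented one-line stand-ins.

```python
# Illustrative only: bag-of-words and LDA features for four toy "books" with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

books = [
    "the detective examined the body and questioned every suspect in the manor",
    "the wizard raised his staff as the dragon circled the ancient tower",
    "she wondered whether he loved her as they danced at the ball",
    "the crew charted a course through the asteroid field toward the colony",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(books)              # bag-of-words: word counts per book

lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topics = lda.fit_transform(counts)                # LDA: topic mixture per book

print(counts.shape)        # (number of books, vocabulary size)
print(doc_topics[0])       # topic proportions of the first book, summing to ~1
```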

Stage 3: Distance metric (Measuring Text Similarity)

  • Quantifying similarity between feature lists. Six metrics were chosen: 
    • Euclidean, Manhattan, Delta, Cosine Delta, Cosine, Jensen-Shannon divergence 

To minimize the influence of individual books on the clustering results, rather than analyzing the full corpus at once, the researchers used multiple smaller samples. Each sample consisted of 30 books per genre (120 books total), and this sampling process was repeated 100 times for each combination. Additionally, models requiring training (LDA, WGCNA, and doc2vec) were retrained for each sample to reduce potential biases.

Clustering and Validation

The researchers applied Ward’s clustering algorithm on the distances, grouping texts into four clusters based on genre similarity. They then checked how well these clusters matched the actual genres of the books. To do this, they used a scoring system called the Adjusted Rand Index (ARI), which scores 1 for a perfect match and roughly 0 for chance-level agreement (values can dip slightly below 0 for worse-than-chance groupings).
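
Below is a hedged end-to-end sketch of this evaluation step, using fabricated toy data rather than the study's corpus: pairwise Jensen-Shannon distances between document-topic distributions, Ward's hierarchical clustering into four groups, and the Adjusted Rand Index against the true genre labels.

```python
# Toy sketch of the evaluation step: Jensen-Shannon distances -> Ward clustering -> ARI.
# The synthetic topic distributions below stand in for real per-book features.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
true_genres = np.repeat([0, 1, 2, 3], 30)                  # 30 books per genre, 120 in total
doc_topics = rng.dirichlet(alpha=np.ones(4), size=120)     # fake document-topic distributions
doc_topics[np.arange(120), true_genres] += 1.0             # each genre leans toward its own topic
doc_topics /= doc_topics.sum(axis=1, keepdims=True)

distances = pdist(doc_topics, metric="jensenshannon")      # condensed pairwise distance matrix
tree = linkage(distances, method="ward")                   # Ward's hierarchical clustering
predicted = fcluster(tree, t=4, criterion="maxclust")      # cut the tree into four clusters

print(adjusted_rand_score(true_genres, predicted))         # 1.0 = perfect, ~0 = chance
```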

The results were visualized using a map projection, grouping similar books closer together, and revealing the underlying thematic structures and relationships among the novels.

Core Findings and Figures  

Results  

The best algorithms grouped literary texts with 66-70% accuracy, demonstrating that unsupervised clustering of fiction genres is feasible despite text complexity. Successful methods consistently used strong text pre-processing, emphasizing the importance of text cleaning and simplification to focus more on a book’s themes rather than its writing style.

Among the top features, six of the ten were based on LDA topics, demonstrating its effectiveness in genre classification. Additionally, eight of the best distance metrics used Jensen–Shannon divergence, suggesting it is highly effective for genre differentiation.

Generalizability  

To assess generalizability, five statistical tests were used to analyze interactions between text pre-processing, feature extraction methods, distance metrics, and other factors. These models provided insights into the broader effectiveness of various methods for thematic analysis.

Text Pre-Processing and Genre Clustering  

Text pre-processing improves genre clustering, with weak pre-processing performing the worst across all feature types. Medium and strong pre-processing showed similar results, suggesting that replacing complex words with simpler words offers minimal additional improvement in genre recognition.

The benefits of strong text pre-processing for document embeddings, LDA, and bag-of-words were minimal and inconsistent. The figure below suggests that ARI increases both with the number of Most Frequent Words used as features and with the degree of text pre-processing. This demonstrates that how we prepare texts matters just as much as which algorithms we use. Moreover, because medium and strong pre-processing perform similarly, researchers can save time by skipping the step of replacing complex words with simpler ones.

Fig 1. The influence of the number of Most Frequent Words, used as text features, on the model’s ability to detect themes, measured with ARI (Sobchuk and Šeļa, 2024, Figure 6).

Feature Types and Their Performance  

Doc2vec, which looks at how words relate to each other in meaning, performed best on average, followed by LDA, which remained stable across various settings, such as topic numbers and the number of Most Frequent Words. Perhaps researchers can use this method without excessive parameter tuning. The simple bag-of-words approach performed well despite its low computational cost, perhaps suggesting even basic approaches can compete with more complex models. WGCNA performed the worst on average, suggesting methods from other fields need careful adaptation before use.

LDA Performance and Parameter Sensitivity  

The performance of LDA did not significantly depend on the number of topics or the number of Most Frequent Words being tracked. The key factor influencing thematic classification was text pre-processing, with weak pre-processing significantly reducing ARI scores. Hence, this underscores the need for further research on text pre-processing, given its key role in the effectiveness of LDA and overall genre classification.  

Bag-of-Words Optimization

The effectiveness of Bag-of-Words depended on a balance between text pre-processing and how many Most Frequent Words are tracked. While increases in Most Frequent Words from 1,000 to 5,000 and medium text pre-processing significantly improved accuracy scores, further increases provided minimal gains. This ‘sweet spot’ means projects can achieve good results without maxing out computational resources, making computational thematics more accessible to smaller research teams and institutions.

Best and Worst Distance Metrics for Genre Recognition  

Jensen–Shannon divergence, which compares probability distributions, was the best choice for grouping similar genres, especially when used with LDA and bag-of-words. The Delta and Manhattan methods also worked reasonably well. Euclidean was the worst performer across LDA, bag-of-words, and WGCNA despite its widespread use in text analysis, suggesting further research is needed to replace industry-standard metrics. Cosine distance, while effective for authorship attribution, was not ideal for measuring LDA topic distances. Doc2vec was less affected by the choice of comparison metric.

Fig 2. The influence of distance metrics on ARI scores for each feature type (Sobchuk and Šeļa, 2024, Figure 3).

Main Findings  

Unsupervised learning can detect thematic similarities, though performance varies. Methods like cosine distance, used in authorship attribution, are less effective for thematic analysis when used with minimal preprocessing and a small number of Most Frequent Words.

Reliable thematic analysis can help address large-scale problems such as inconsistent manual genre tagging in digital libraries and the identification of unclassified or undiscovered genres. Additionally, it can enhance book recommendation systems by enabling content-based similarity detection instead of relying solely on user behavior, much like how Spotify suggests songs based on acoustic features.

Conclusion  

This study demonstrates the value of computational methods in literary analysis, showing how thematic clustering can enhance genre classification and the study of literary evolution. It establishes a foundation for future large-scale literary studies.

Limitations  

Key limitations include the simplification inherent in clustering: although it reduces complex literary relationships to more manageable structures, it may not work the same way with different settings or capture every important textual feature.

The study also did not separate thematic content from elements like narrative perspective. Additionally, genre classification remains subjective and ambiguous, and future work could explore alternative approaches, such as user-generated tags from sites like Goodreads.

Implications and Future Research  

This research provides a computational framework for thematic analysis, offering the potential for improving genre classification and book recommendation systems. Future work should incorporate techniques like BERTopic and Top2Vec, test these methods on larger and more diverse datasets, and further explore text simplification and clustering strategies.

Bibliography 

Sobchuk, O., Šeļa, A. Computational thematics: comparing algorithms for clustering the genres of literary fiction. Humanit Soc Sci Commun 11, 438 (2024). https://doi.org/10.1057/s41599-024-02933-6

Book genres. (2022). Chapterly. Retrieved May 4, 2025, from https://www.chapterly.com/blog/popular-and-lucrative-book-genres-for-authors.


Computer Vision Ethics

May 4, 2025 by Madina Sotvoldieva '28

Computer vision (CV) is a field of computer science that allows computers to “see” or, in more technical terms, recognize, analyze, and respond to visual data, such as videos and images. CV is widely used in our daily lives, from something as simple as recognizing handwritten text to something as complex as analyzing and interpreting MRI scans. With the advent of AI in the last few years, CV has also been improving rapidly. However, just like any subfield of AI nowadays, CV has its own set of ethical, social, and political implications, especially when used to analyze people’s visual data.

Although CV has been around for some time, there is limited work on its ethical limitations in the general AI field. In the existing literature, authors have categorized six ethical themes: espionage, identity theft, malicious attacks, copyright infringement, discrimination, and misinformation [1]. As seen in Figure 1, one of the main CV applications is face recognition, which can also lead to issues of error, function creep (the expansion of technology beyond its original purposes), and privacy [2].

Figure 1: Specific applications of CV that could be used for Identity Theft.

To discuss CV’s ethics, the authors of the article take a critical approach to evaluating the implications through the framework of power dynamics. The three types of power that are analyzed are dispositional, episodic, and systemic powers [3]. 

Dispositional Power

Dispositional power is defined as the ability to bring about a significant outcome [4]. When people gain that power, they feel empowered to explore new opportunities, and their scope of agency increases (they become more independent in their actions) [5]. However, CV can threaten this dispositional power in several ways, ultimately reducing people’s autonomy.

One way CV disempowers people is by limiting their control over information. Since CV works with both pre-existing and real-time camera footage, people are often unaware that they are being recorded and often cannot avoid it. This means the technology makes it hard for people to control the data gathered about them, and protecting their personal information may require measures as extreme as hiding their faces.

Apart from being limited in controlling what data is gathered about them, advanced technologies make it extremely difficult for an average person to know what specific information can be retrieved from visual data. CV might also keep people from following their own judgment by communicating who they are for them (automatically inferring people’s race, gender, and mood), creating a forced moral environment (where people act out of fear of being watched rather than from their own intentions), and potentially leading to over-dependence on computers (e.g., relying on face recognition for emotion interpretation).

In all these and other ways, CV undermines the foundation of dispositional power by limiting people’s ability to control their information, make independent decisions, express themselves, and act freely.

Episodic Power

Episodic power, often referred to as power-over, is the direct exercise of power by one individual or group over another. CV can both create new forms of power and make existing ones more efficient [6]. While this isn’t always a bad thing (for example, parents watching over children), problems arise when CV makes that control too invasive or one-sided—especially in ways that limit people’s freedom to act independently.

With CV taking security cameras to the next level, opportunities such as baby-room monitoring or fall detection for elderly people open up. However, it also raises the issue of surveillance automation, which can lead to over-enforcement at scales ranging from private individuals to larger organizations (workplaces, insurance companies, etc.). Other shifts in power dynamics also need to be considered, for example when smart doorbells capture far more than the person at the door and violate a neighbor’s privacy by creating peer-to-peer surveillance.

These examples show that while CV may offer convenience or safety, it can also tip power balances in ways that reduce personal freedom and undermine one’s autonomy.

Systemic Power

Systemic power is not viewed as an individual exercise of power, but rather as a set of societal norms and practices that affect people’s autonomy by determining what opportunities people have, what values they hold, and what choices they make. CV can strengthen systemic power by making law enforcement more efficient through smart cameras and by increasing businesses’ profits through business intelligence tools.

However, CV can also reinforce pre-existing systemic societal injustices. One example is flawed facial recognition, where algorithms are more accurate at recognizing White and male faces [7], which has led to a number of false arrests. This can result in people receiving unequal opportunities (when biased systems are used in hiring) or harm to their self-worth (when they are falsely identified as criminals).

Another matter of systematic power is the environmental cost of CV. AI systems rely on vast amounts of data, which requires intensive energy for processing and storage. As societies become increasingly dependent on AI technologies like CV, those trying to protect the environment have little ability to resist or reshape these damaging practices. The power lies with tech companies and industries, leaving citizens without the means to challenge the system. When the system becomes harder to challenge or change, that’s when the ethical concerns regarding CV arise.

Conclusion

Computer Vision is a powerful tool that keeps evolving each year. We already see numerous applications of it in our daily lives, from self-checkouts in stores and smart doorbells to autonomous vehicles and tumor detection. Alongside the potential that CV holds for improving our lives and making them safer, there are a number of ethical limitations that should be considered. We need to critically examine how CV affects people’s autonomy, creates one-sided power dynamics, and reinforces societal prejudices. As we rapidly transition into an AI-driven world, there is more to come in the field of computer vision. However, in the pursuit of innovation, we should ensure that progress does not come at the cost of our ethical values.

References:

[1] Lauronen, M.: Ethical issues in topical computer vision applications. Information Systems, Master’s Thesis. University of Jyväskylä. (2017). https://jyx.jyu.fi/bitstream/handle/123456789/55806/URN%3aNBN%3afi%3ajyu-201711084167.pdf?sequence=1&isAllowed=y

[2] Brey, P.: Ethical aspects of facial recognition systems in public places. J. Inf. Commun. Ethics Soc. 2(2), 97–109 (2004). https://doi.org/10.1108/14779960480000246

[3] Haugaard, M.: Power: a “family resemblance concept.” Eur. J. Cult. Stud. 13(4), 419–438 (2010)

[4] Morriss, P.: Power: a philosophical analysis. Manchester University Press, Manchester, New York (2002)

[5] Morriss, P.: Power: a philosophical analysis. Manchester University Press, Manchester, New York (2002)

[6] Brey, P.: Ethical aspects of facial recognition systems in public places. J. Inf. Commun. Ethics Soc. 2(2), 97–109 (2004). https://doi.org/10.1108/14779960480000246

[7] Buolamwini, J., Gebru, T.: Gender shades: intersectional accuracy disparities in commercial gender classification. Conference on Fairness, Accountability, and Transparency, pp. 77–91 (2018)

Coeckelbergh, M.: AI ethics. MIT Press (2020)

