{"id":1982,"date":"2025-05-04T12:46:02","date_gmt":"2025-05-04T16:46:02","guid":{"rendered":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/?p=1982"},"modified":"2025-05-04T12:46:02","modified_gmt":"2025-05-04T16:46:02","slug":"unsupervised-thematic-clustering-for-genre-classification-in-literary-texts","status":"publish","type":"post","link":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/csci-tech\/unsupervised-thematic-clustering-for-genre-classification-in-literary-texts\/","title":{"rendered":"Unsupervised Thematic Clustering for Genre Classification in Literary Texts"},"content":{"rendered":"<figure id=\"attachment_2019\" aria-describedby=\"caption-attachment-2019\" style=\"width: 378px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-2019\" src=\"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/lucrative-genres-for-authors-300x157.png\" alt=\"Book genres\" width=\"378\" height=\"198\" srcset=\"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/lucrative-genres-for-authors-300x157.png 300w, https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/lucrative-genres-for-authors-1024x535.png 1024w, https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/lucrative-genres-for-authors-768x401.png 768w, https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/lucrative-genres-for-authors.png 1200w\" sizes=\"auto, (max-width: 378px) 100vw, 378px\" \/><figcaption id=\"caption-attachment-2019\" class=\"wp-caption-text\">(Chapterly 2022)<\/figcaption><\/figure>\n<h1><span style=\"font-weight: 400\">Summary<\/span><\/h1>\n<p><span style=\"font-weight: 400\">In the last decade, computational literary studies have expanded, yet computational thematics remains less explored than areas like 
stylometry, which focuses on identifying stylistic similarities between texts. A 2024 study by researchers from the Max Planck Institute and the Polish Academy of Sciences investigated the most effective computational methods for measuring thematic similarity in literary texts, aiming to improve automated genre clustering.<\/span><\/p>\n<h2><span style=\"font-weight: 400\">Key Findings and Assumptions<\/span><\/h2>\n<ul>\n<li style=\"font-weight: 400\"><b>Key Assumptions:<\/b><span style=\"font-weight: 400\">\u00a0<\/span>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Text pre-processing to emphasize thematic content over stylistic features could improve genre clustering.\u00a0<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Unsupervised clustering would offer a more scalable and objective approach to genre categorization than manual tagging by humans.<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Four genres were selected (detective, fantasy, romance, science fiction) because they are comparably broad categories.<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">If the genres are truly distinct in terms of themes, computers should be able to separate them into clusters.<\/span><\/li>\n<\/ul>\n<\/li>\n<li style=\"font-weight: 400\"><b>Best Performance:<\/b><span style=\"font-weight: 400\"> The best algorithms were 66-70% accurate at grouping books by genre. 
This shows that unsupervised genre clustering is feasible despite the complexity of literary texts.<\/span><\/li>\n<li style=\"font-weight: 400\"><b>Text Pre-Processing:<\/b><span style=\"font-weight: 400\"> Medium and strong levels of text pre-processing significantly improved clustering, while weak pre-processing performed poorly.<\/span><\/li>\n<li style=\"font-weight: 400\"><b>Which methods worked best:<\/b><span style=\"font-weight: 400\"> Doc2vec, a method that captures word meaning and context, performed the best overall, followed by LDA (Latent Dirichlet Allocation), which finds major topics in texts. Even the simpler bag-of-words method, which just counts how often words appear, gave solid results.<\/span><\/li>\n<li style=\"font-weight: 400\"><b>Best way to compare genres:<\/b><span style=\"font-weight: 400\"> Jensen-Shannon divergence, which compares probability distributions, was the most effective metric, while simpler metrics like Euclidean distance performed poorly for genre clustering.<\/span><\/li>\n<\/ul>\n<h1><span style=\"font-weight: 400\">Methodology\u00a0<\/span><\/h1>\n<h2><span style=\"font-weight: 400\">Sample Selection<\/span><\/h2>\n<p><span style=\"font-weight: 400\">The researchers selected canonical books from each of the four genres, ensuring they were from the same time period to control for language consistency.<\/span><\/p>\n<h2><span style=\"font-weight: 400\">Sample Pre-Processing and Analysis\u00a0<\/span><\/h2>\n<p><span style=\"font-weight: 400\">The researchers analyzed 291 combinations of techniques across three stages: text pre-processing, feature extraction, and measuring text similarity.\u00a0<\/span><\/p>\n<p><b>Stage 1: Different Levels of Text Pre-Processing\u00a0\u00a0<\/b><\/p>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">The extent to which the text is simplified and cleaned up.<\/span>\n<ul>\n<li style=\"font-weight: 400\"><b>Weak \u2192 <\/b><span style=\"font-weight: 
400\">lemmatizing (reducing words to their base or dictionary form, e.g., &#8220;running&#8221; to &#8220;run&#8221;) and removing the 100 Most Frequent Words<\/span><\/li>\n<li style=\"font-weight: 400\"><b>Medium<\/b><span style=\"font-weight: 400\"> \u2192 lemmatizing, using only nouns, adjectives, verbs, and adverbs, removing character names<\/span><\/li>\n<li style=\"font-weight: 400\"><b>Strong<\/b><span style=\"font-weight: 400\"> \u2192 same as medium, but also replacing complex words with simpler versions<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><b>Stage 2: Identifying Key Text Features through Extraction Methods<\/b><\/p>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Transforming pre-processed texts into feature lists.<\/span>\n<ul>\n<li style=\"font-weight: 400\"><b>Bag-of-Words <\/b><span style=\"font-weight: 400\">\u2192 Counts how often each word appears.<\/span><\/li>\n<li style=\"font-weight: 400\"><b>Latent Dirichlet Allocation (LDA) <\/b><span style=\"font-weight: 400\">\u2192 Tries to discover dominant topics across books.<\/span><\/li>\n<li style=\"font-weight: 400\"><b>Weighted Gene Co-expression Network Analysis (WGCNA) <\/b><span style=\"font-weight: 400\">\u2192 A method borrowed from genetics to find clusters of related words.<\/span><\/li>\n<li style=\"font-weight: 400\"><b>Document-Level Embeddings (doc2vec)<\/b><span style=\"font-weight: 400\"> \u2192 Captures semantic relationships (connections between words based on their meanings, e.g., &#8220;dog&#8221; and &#8220;cat&#8221;) for similarity assessment.<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><b>Stage 3: Distance metric (Measuring Text Similarity)<\/b><\/p>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Quantifying similarity with metrics. 
Six metrics were chosen:\u00a0<\/span>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Euclidean, Manhattan, Delta, Cosine Delta, Cosine, Jensen-Shannon divergence\u00a0<\/span><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><span style=\"font-weight: 400\">To minimize the influence of individual books on the clustering results, the researchers analyzed multiple smaller samples rather than the full corpus at once. Each sample consisted of 30 books per genre (120 books total), and this sampling process was repeated 100 times for each combination. Additionally, models requiring training (LDA, WGCNA, and doc2vec) were retrained for each sample to reduce potential biases.<\/span><\/p>\n<h2><span style=\"font-weight: 400\">Clustering and Validation<\/span><\/h2>\n<p><span style=\"font-weight: 400\">The researchers applied Ward\u2019s clustering algorithm to the distances, grouping texts into four clusters. They then checked how well these clusters matched the actual genres of the books. To do this, they used a scoring system called the Adjusted Rand Index (ARI), on which values near 0 indicate chance-level grouping and 1 indicates a perfect match.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">The results were visualized using a map projection, grouping similar books closer together, and revealing the underlying thematic structures and relationships among the novels.<\/span><\/p>\n<h1><span style=\"font-weight: 400\">Core Findings and Figures\u00a0\u00a0<\/span><\/h1>\n<h2><span style=\"font-weight: 400\">Results\u00a0\u00a0<\/span><\/h2>\n<p><span style=\"font-weight: 400\">The best algorithms grouped literary texts with 66-70% accuracy, demonstrating that unsupervised clustering of fiction genres is feasible despite text complexity. 
Successful methods consistently used strong text pre-processing, emphasizing the importance of text cleaning and simplification to focus more on a book\u2019s themes rather than its writing style.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Of the ten best-performing combinations, six used features based on LDA topics, indicating LDA\u2019s effectiveness in genre classification. Additionally, eight of them used Jensen\u2013Shannon divergence as the distance metric, suggesting it is highly effective for genre differentiation.<\/span><\/p>\n<h2><span style=\"font-weight: 400\">Generalizability\u00a0\u00a0<\/span><\/h2>\n<p><span style=\"font-weight: 400\">To assess generalizability, five statistical tests were used to analyze interactions between text pre-processing, feature extraction methods, distance metrics, and other factors. These analyses provided insights into the broader effectiveness of various methods for thematic analysis.<\/span><\/p>\n<h2><span style=\"font-weight: 400\">Text Pre-Processing and Genre Clustering\u00a0\u00a0<\/span><\/h2>\n<p><span style=\"font-weight: 400\">Text pre-processing improves genre clustering, with weak pre-processing performing the worst across all feature types. Medium and strong pre-processing showed similar results, suggesting that replacing complex words with simpler words offers minimal improvements in genre recognition.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400\">The benefits of strong text pre-processing for document embeddings, LDA, and bag-of-words were minimal and inconsistent. The figure below suggests that ARI is positively correlated with both the number of Most Frequent Words and the degree of text pre-processing. This suggests that how we prepare texts matters just as much as what algorithms we use. 
Moreover, researchers can save time by avoiding replacing complex words with simpler words if medium and strong pre-processing show similar results.\u00a0<\/span><\/p>\n<figure id=\"attachment_2002\" aria-describedby=\"caption-attachment-2002\" style=\"width: 563px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-2002\" src=\"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/41599_2024_2933_Fig6_HTML-300x226.png\" alt=\"Figure depicting the influence of the number of Most Frequent Words, used as text features, on the model\u2019s ability to detect themes, measured with ARI.\" width=\"563\" height=\"424\" srcset=\"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/41599_2024_2933_Fig6_HTML-300x226.png 300w, https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/41599_2024_2933_Fig6_HTML-1024x772.png 1024w, https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/41599_2024_2933_Fig6_HTML-768x579.png 768w, https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/41599_2024_2933_Fig6_HTML.png 1500w\" sizes=\"auto, (max-width: 563px) 100vw, 563px\" \/><figcaption id=\"caption-attachment-2002\" class=\"wp-caption-text\">Fig 1. The influence of the number of Most Frequent Words, used as text features, on the model\u2019s ability to detect themes, measured with ARI (Sobchuk and \u0160e\u013ca, 2024, Figure 6).<\/figcaption><\/figure>\n<h2><span style=\"font-weight: 400\">Feature Types and Their Performance\u00a0\u00a0<\/span><\/h2>\n<p><span style=\"font-weight: 400\">Doc2vec, which looks at how words relate to each other in meaning, performed best on average, followed by LDA, which remained stable across various settings, such as topic numbers and the number of Most Frequent Words. 
Perhaps researchers can use this method without excessive parameter tuning. The simple bag-of-words approach performed well despite its low computational cost, perhaps suggesting even basic approaches can compete with more complex models. WGCNA performed the worst on average, suggesting methods from other fields need careful adaptation before use.<\/span><\/p>\n<h2><span style=\"font-weight: 400\">LDA Performance and Parameter Sensitivity\u00a0\u00a0<\/span><\/h2>\n<p><span style=\"font-weight: 400\">The performance of LDA did not significantly depend on the number of topics or the number of Most Frequent Words being tracked. The key factor influencing thematic classification was text pre-processing, with weak pre-processing significantly reducing ARI scores. Hence, this underscores the need for further research on text pre-processing, given its key role in the effectiveness of LDA and overall genre classification.\u00a0\u00a0<\/span><\/p>\n<h2><span style=\"font-weight: 400\">Bag-of-Words Optimization<\/span><\/h2>\n<p><span style=\"font-weight: 400\">The effectiveness of Bag-of-Words depended on a balance between text pre-processing and how many Most Frequent Words are tracked. While increases in Most Frequent Words from 1,000 to 5,000 and medium text pre-processing significantly improved accuracy scores, further increases provided minimal gains. This \u2018sweet spot\u2019 means projects can achieve good results without maxing out computational resources, making computational thematics more accessible to smaller research teams and institutions.<\/span><\/p>\n<h2><span style=\"font-weight: 400\">Best and Worst Distance Metrics for Genre Recognition\u00a0\u00a0<\/span><\/h2>\n<p><span style=\"font-weight: 400\">Jensen\u2013Shannon divergence, which compares probability distributions, was the best choice for grouping similar genres, especially when used with LDA and bag-of-words. The Delta and Manhattan methods also worked reasonably well. 
Euclidean was the worst performer across LDA, bag-of-words, and WGCNA despite its widespread use in text analysis, suggesting further research is needed to replace industry-standard metrics. Cosine distance, while effective for authorship attribution, was not ideal for measuring LDA topic distances. Doc2vec is less affected by the comparison method used.\u00a0<\/span><\/p>\n<figure id=\"attachment_2001\" aria-describedby=\"caption-attachment-2001\" style=\"width: 507px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-2001\" src=\"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/41599_2024_2933_Fig3_HTML-300x226.png\" alt=\"Figure depicting the influence of distance metrics on ARI scores for each feature type.\" width=\"507\" height=\"382\" srcset=\"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/41599_2024_2933_Fig3_HTML-300x226.png 300w, https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/41599_2024_2933_Fig3_HTML-1024x771.png 1024w, https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/41599_2024_2933_Fig3_HTML-768x578.png 768w, https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-content\/uploads\/sites\/35\/2025\/05\/41599_2024_2933_Fig3_HTML.png 1500w\" sizes=\"auto, (max-width: 507px) 100vw, 507px\" \/><figcaption id=\"caption-attachment-2001\" class=\"wp-caption-text\">Fig 2. The influence of distance metrics on ARI scores for each feature type (Sobchuk and \u0160e\u013ca, 2024, Figure 3).<\/figcaption><\/figure>\n<h2><span style=\"font-weight: 400\">Main Findings\u00a0\u00a0<\/span><\/h2>\n<p><span style=\"font-weight: 400\">Unsupervised learning can detect thematic similarities, though performance varies. 
Methods like cosine distance, used in authorship attribution, are less effective for thematic analysis when used with minimal pre-processing and a small number of Most Frequent Words.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Reliable thematic analysis can help address large-scale problems such as inconsistent manual genre tagging in digital libraries, and it can help identify unclassified or undiscovered genres. Additionally, it can enhance book recommendation systems by enabling content-based similarity detection instead of solely relying on user behavior, much as Spotify suggests songs based on acoustic features.<\/span><\/p>\n<h1><span style=\"font-weight: 400\">Conclusion\u00a0\u00a0<\/span><\/h1>\n<p><span style=\"font-weight: 400\">This study demonstrates the value of computational methods in literary analysis, showing how thematic clustering can enhance genre classification and the study of literary evolution. It establishes a foundation for future large-scale literary studies.<\/span><\/p>\n<h2><span style=\"font-weight: 400\">Limitations\u00a0\u00a0<\/span><\/h2>\n<p><span style=\"font-weight: 400\">A key limitation is that clustering simplifies complex literary relationships. Although this reduces them to more manageable structures, the approach may not work the same way with different settings or capture every important textual feature.<\/span><\/p>\n<p><span style=\"font-weight: 400\">The study also did not separate thematic content from elements like narrative perspective. 
Additionally, genre classification remains subjective and ambiguous, and future work could explore alternative approaches, such as user-generated tags from sites like Goodreads.<\/span><\/p>\n<h2><span style=\"font-weight: 400\">Implications and Future Research\u00a0\u00a0<\/span><\/h2>\n<p><span style=\"font-weight: 400\">This research provides a computational framework for thematic analysis, offering the potential for improving genre classification and book recommendation systems. Future work should incorporate techniques like BERTopic and Top2Vec, test these methods on larger and more diverse datasets, and further explore text simplification and clustering strategies.<\/span><\/p>\n<h1><span style=\"font-weight: 400\">Bibliography\u00a0<\/span><\/h1>\n<p><span style=\"font-weight: 400\">Sobchuk, O., \u0160e\u013ca, A. Computational thematics: comparing algorithms for clustering the genres of literary fiction. Humanit Soc Sci Commun 11, 438 (2024). <\/span><a href=\"https:\/\/doi.org\/10.1057\/s41599-024-02933-6\"><span style=\"font-weight: 400\">https:\/\/doi.org\/10.1057\/s41599-024-02933-6<\/span><\/a><\/p>\n<p><i>Book genres<\/i>. (2022). Chapterly. Retrieved May 4, 2025, from <a href=\"https:\/\/www.chapterly.com\/blog\/popular-and-lucrative-book-genres-for-authors.\">https:\/\/www.chapterly.com\/blog\/popular-and-lucrative-book-genres-for-authors.<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Summary In the last decade, computational literary studies have expanded, yet computational thematics remains less explored than areas like stylometry, which focuses on identifying stylistic similarities between texts. 
A 2024 study by researchers from the Max Planck Institute and the Polish Academy of Sciences investigated the most effective computational methods for measuring thematic similarity in [&hellip;]<\/p>\n","protected":false},"author":744,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_genesis_hide_title":false,"_genesis_hide_breadcrumbs":false,"_genesis_hide_singular_image":false,"_genesis_hide_footer_widgets":false,"_genesis_custom_body_class":"","_genesis_custom_post_class":"","_genesis_layout":"","footnotes":""},"categories":[65],"tags":[250,249,60,251],"class_list":{"0":"post-1982","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-csci-tech","7":"tag-computational-analysis","8":"tag-computer-science","9":"tag-csci-tech","10":"tag-machine-learning","11":"entry","12":"has-post-thumbnail"},"featured_image_src":null,"featured_image_src_square":null,"author_info":{"display_name":"Wing Kiu 
Lau","author_link":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/author\/wlau\/"},"_links":{"self":[{"href":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-json\/wp\/v2\/posts\/1982","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-json\/wp\/v2\/users\/744"}],"replies":[{"embeddable":true,"href":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-json\/wp\/v2\/comments?post=1982"}],"version-history":[{"count":0,"href":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-json\/wp\/v2\/posts\/1982\/revisions"}],"wp:attachment":[{"href":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-json\/wp\/v2\/media?parent=1982"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-json\/wp\/v2\/categories?post=1982"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/students.bowdoin.edu\/bowdoin-science-journal\/wp-json\/wp\/v2\/tags?post=1982"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}