Computer Science
2019 Project Proposal

Computational Models of Literary Variation

Jonathan Gordon (Computer Science)

This project will investigate what kinds of analysis of literary corpora computational methods can support. In particular, we will use unsupervised topic modeling and modern data visualization methods to produce a thematic map of a large corpus of literature. If this succeeds, it will allow a rough empirical estimate of the importance of different themes (e.g., domestic life, travel, war, Orientalism, racism) to a collection of literature. By restricting the corpus to particular periods and places, this research can supply data for questions of thematic analysis. For example, is there significant topical overlap between English literature depicting the First World War and earlier stories of the Grand Tour, or do stories of war have more in common with those about political or sports competitions? Additional questions to investigate include: What individual words, phrases, or themes are characteristic of a particular city, region, or country? How do these change over time? Is literary vocabulary becoming more uniform across space or is it diverging?
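As a rough illustration of one of these questions (which words are characteristic of a particular place), the sketch below ranks words by the log ratio of their add-one-smoothed relative frequencies in a target sub-corpus versus a background sub-corpus. It is only a minimal sketch, not the project's method: the function names, the tokenizer, the choice of scoring statistic, and the two tiny "London" and "Paris" samples are all invented for illustration, and the project itself is expected to rely on topic modeling over a much larger corpus.

```python
import math
import re
from collections import Counter

# Toy sub-corpora, invented purely for illustration (not real project data).
LONDON_DOCS = [
    "fog on the thames and the fog in the streets",
    "the thames ran grey",
]
PARIS_DOCS = [
    "the seine and the boulevard",
    "a cafe on the boulevard by the seine",
]


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def characteristic_words(target_docs, background_docs, top_n=5, alpha=1.0):
    """Return the top_n words most over-represented in target_docs.

    Each word is scored by log(p_target / p_background), where both
    probabilities are relative frequencies smoothed by adding alpha
    to every count over the shared vocabulary.
    """
    target = Counter(tok for doc in target_docs for tok in tokenize(doc))
    background = Counter(tok for doc in background_docs for tok in tokenize(doc))
    vocab = set(target) | set(background)

    target_total = sum(target.values()) + alpha * len(vocab)
    background_total = sum(background.values()) + alpha * len(vocab)

    scores = {}
    for word in vocab:
        p_target = (target[word] + alpha) / target_total
        p_background = (background[word] + alpha) / background_total
        scores[word] = math.log(p_target / p_background)

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


if __name__ == "__main__":
    print(characteristic_words(LONDON_DOCS, PARIS_DOCS, top_n=2))
    print(characteristic_words(PARIS_DOCS, LONDON_DOCS, top_n=2))
```

Words shared by both samples, such as "the", score near zero and drop out of the ranking, which is why a ratio is used here rather than raw counts. Repeating the comparison on sub-corpora from different decades would be one simple way to look at change over time.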

Required: CMPU 101; good programming skills; interest in natural language processing, digital humanities, or data science.
Preferred: CMPU 102, CMPU 145, CMPU 203. Experience with Python, Linux, NLP toolkits, visualization.

How should students express interest in this project?
Interested students should contact me by email to arrange a brief meeting to discuss the project.