Vector space explorations of literary language

A new article (open access, peer-reviewed) resulting from The Riddle of Literary Quality has appeared in the journal Language Resources and Evaluation. Previous work already showed that literary novels can be recognized successfully from textual features. This article shows that literature can be recognized even from short fragments (2-3 pages), and it also considers judgments of quality and novels that respondents had not read. The textual features are automatically learned document representations that require no feature engineering and are based only on word frequencies.
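To give a feel for what "based only on word frequencies" can mean, here is a minimal sketch in Python. It is not the article's method: the article uses learned representations, whereas this sketch uses plain relative word frequencies, the simplest frequency-based representation, and compares fragments with cosine similarity. The function names and the toy fragments are hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Map each lowercased word in `text` to its relative frequency."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy fragments, invented for illustration only.
fragment_a = "The rain fell softly on the silent, grey canal."
fragment_b = "The rain fell hard on the grey streets."
fragment_c = "He grabbed the gun and ran to the car."

vec_a, vec_b, vec_c = map(word_frequencies, (fragment_a, fragment_b, fragment_c))

# Fragments that share more vocabulary get a higher similarity score.
print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

In this toy example the first two fragments share more words and therefore score higher than the first and third. A predictive model would take such vectors, or learned ones, as input instead of hand-crafted features.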

In addition, the article tests the hypothesis that literary novels are more complex than non-literary novels. Measuring the similarity and variety of topics in the novels shows that literary novels stand out more than non-literary novels.
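One plausible way to put a number on the "variety of topics" in a novel is the Shannon entropy of its topic distribution. The sketch below assumes this measure purely for illustration; it is not necessarily the one used in the article, and the function name and topic weights are hypothetical.

```python
import math


def topic_entropy(weights):
    """Shannon entropy (in bits) of a topic distribution.

    `weights` are non-negative topic weights for one novel; they are
    normalized to sum to 1 before the entropy is computed.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must contain a positive value")
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical topic weights, invented for illustration only.
focused_novel = [0.85, 0.05, 0.05, 0.05]  # dominated by one topic
varied_novel = [0.25, 0.25, 0.25, 0.25]   # spread evenly over four topics

print(topic_entropy(focused_novel))
print(topic_entropy(varied_novel))
```

A novel dominated by a single topic gets a low entropy, while one that spreads its attention evenly over many topics gets a high one.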

A keyword analysis uncovers some of the stylistic markers that explain the success of the predictive models. It also highlights certain biases related to genre and gender, in both the data and the models. Nevertheless, we find that most of the factors affecting judgments of literariness can be explained in terms of word frequencies, even in short text fragments and among novels with higher literary ratings.

van Cranenburgh, A., van Dalen-Oskam, K. & van Zundert, J. (2019). Vector space explorations of literary language. Language Resources and Evaluation. https://doi.org/10.1007/s10579-018-09442-4