The Riddle goes to Atlanta | Literary Quality

The 2013 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL2013) and the 2013 Workshop for Computational Linguistics for Literature (CLFL2013)

Corina Koolen, Kim Jautze and Andreas van Cranenburgh

For a brief while, Atlanta (Georgia) was flooded with computational linguists from all over the world, to the extent that border security sighed with boredom when we announced our reason for entering the USA. Machine translation, summarization, sentiment analysis and more: many specialisms were represented. But we came with prose fiction in mind.

Metaphors, cliches and tree fragment selection
In the workshop on multiword expressions a paper was presented on detecting how cliched a text is, with a surprisingly simple and effective method. Instead of trying to define and identify cliches directly, the method relies on a large collection of word co-occurrence statistics (the Google n-gram corpus), and shows that the number of cliches in a text is reflected in the skew of the frequency distribution of its n-grams. Specifically, cliched texts contain a larger proportion of n-grams that are frequent in the reference corpus.
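
As a rough illustration of the idea (not the paper's actual implementation), the proportion of a text's n-grams that are frequent in a reference corpus could be computed as below. The function names, the tiny `reference_counts` table and the `threshold` value are all assumptions made for this sketch; a real experiment would take its counts from a large resource such as the Google n-gram corpus.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngram_proportion(tokens, reference_counts, n=2, threshold=100):
    """Fraction of the text's n-grams whose reference count >= threshold.

    reference_counts maps n-gram tuples to corpus frequencies; here it is a
    hypothetical stand-in for a large reference corpus.
    """
    grams = ngrams(tokens, n)
    if not grams:
        return 0.0
    frequent = sum(1 for g in grams if reference_counts.get(g, 0) >= threshold)
    return frequent / len(grams)


# Toy reference counts, invented for illustration only.
reference_counts = {
    ("at", "the"): 5000,
    ("the", "end"): 3000,
    ("end", "of"): 2800,
    ("of", "the"): 9000,
    ("the", "day"): 2500,
}

cliched = "at the end of the day".split()
fresh = "at the violet end of noon".split()

cliched_score = frequent_ngram_proportion(cliched, reference_counts)
fresh_score = frequent_ngram_proportion(fresh, reference_counts)
print(cliched_score, fresh_score)
```

With these toy counts the stock phrase scores higher than the unusual one, which is the kind of skew the method exploits.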

Another interesting paper attempts to identify metaphorical word use through tree kernels. Tree kernels find recurring substructures in parse trees, and are here applied to distinguish, for example, “sweet tea” from “a sweet person,” by detecting that in the metaphorical case the combination of words is semantically anomalous.
Lastly, a conference paper on native language detection with tree-substitution grammars presents interesting techniques for making sense of large collections of tree fragments.

In the paper on literary authorship attribution, which was the Riddle’s contribution to this conference last year, two problems are encountered:

redundancy: removing fragments that are trivial, or variants of others
relevancy: which fragments are interesting, that is, most predictive of an author’s style?

This paper on native language detection addresses these problems with common statistical methods. It will be interesting to see whether they work with literature as well.

Prose fiction versus newswire
Kathleen McKeown, who gave one of the invited talks, rightly argues that we need to pay attention to written language other than newswire. How can we computationally analyze a corpus of literary texts when the tools that we use are trained solely on the Wall Street Journal? With David Elson, one of her former PhD students, she has extracted social networks from 19th century literary fiction and finds no evidence for Bakhtin’s notion of people in rural novels having fewer but more intense social contacts. Such work is scintillating, but difficult to perform with current tools, and that is why workshops such as Computational Linguistics for Literature (CLFL) are important.

Workshop Computational Linguistics for Literature
The workshop on Computational Linguistics for Literature was held on the last day of the conference. Livia Polanyi started the morning with a reflection on her ideas of what constitutes verbal art. Her presentation was a piece of art in itself, a poem rather than a lecture. Several times she raised the question of what it is about verbal art that ‘makes strange.’ Or in other words: what is it about literature that defamiliarizes the reader? This question, which she asked herself repeatedly as if creating a mantra, is related to the question we ask in the Riddle project. Her answer is very structuralist, however (and somewhat poetic). Polanyi argues that poetry makes language strange and that prose makes the world strange. In our project we will not focus on the foregrounding elements in the novels that make the language strange, but her presentation was nonetheless overwhelming, with some very interesting insights into how art can be created in various ways through language.

Over the last decades the focus of literary scholars who study the textual features of literature has been mainly on poetry. Prose as a form of art is often considered harder to study, because of a lack of easily identifiable foregrounding elements. Currently, however, more attention is being paid to novels. But poetry still seems to be a beloved object of study. This is reflected in the share of presentations during the CLFL workshop: half of them are about poetry; the other half, including ours, deal with prose. Interestingly enough, the presenters are not exclusively computational linguists. Most of them are “traditional” humanists who are pioneers in their field of study by addressing literary questions with computational methods. The most popular topic among those who study prose is identifying and classifying the speech of the characters, either in an attempt to disambiguate free indirect discourse (as do Hammond, Brooke and Hirst for Virginia Woolf’s To the Lighthouse) or in order to build a social network of all the characters (which, after Elson et al. (2010), is also done by He, Barbosa and Kondrak).

In our paper as well, we find that even in order to analyze the syntactic complexity of two genres, it is important to distinguish dialogue (implicit or explicit) from narrative and descriptions. In future research we aim to identify dialogue more effectively, to see whether there are differences in the language authors use in dialogue and narrative among different genres. Eventually we will apply these techniques to get closer to the answer to our Riddle: can we find differences between novels of different appreciation?
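
A very rough first approximation of the explicit case, assuming dialogue is marked with double quotation marks, could look as follows. The function name and the regular expression are assumptions for this sketch and not the procedure from our paper; implicit dialogue, such as free indirect discourse, is not handled at all.

```python
import re

# Matches text inside straight or curly double quotes (explicit dialogue only).
QUOTED = re.compile(r'["“]([^"“”]*)["”]')


def split_dialogue(text):
    """Return (dialogue_segments, narrative_text) for quote-marked dialogue."""
    dialogue = [m.group(1).strip() for m in QUOTED.finditer(text)]
    # Remove the quoted spans, then normalize the remaining whitespace.
    narrative = QUOTED.sub(" ", text)
    narrative = re.sub(r"\s+", " ", narrative).strip()
    return dialogue, narrative


sample = '“It is late,” she said. He nodded and replied, “Then we should go.”'
dialogue, narrative = split_dialogue(sample)
print(dialogue)
print(narrative)
```

Measures such as sentence length or parse-tree depth could then be computed separately over the two parts, although quotation conventions differ between languages and publishers, so the pattern would need adapting per corpus.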

The proceedings of NAACL can be found at http://naacl2013.naacl.org/
The proceedings of the CLFL Workshop: http://aclweb.org/anthology/W/W13/#1400

All mentioned papers can be found on these websites, except for:
Elson, David K., Nicholas Dames, and Kathleen R. McKeown. “Extracting Social Networks from Literary Fiction.” In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 138–147. Association for Computational Linguistics, 2010.