Research into the quality of literary texts is very rare (Von Heydebrand & Winko 1996, Albers 2007, Van Peer 2008, Louwerse et al. 2008). The project builds on stylistic research that is usually related to authorship attribution, but redirects its methods and techniques towards quality discrimination, which is new. The basic assumption underlying this research is that literary quality is decided not only by social and cultural factors, but also by formal characteristics of the texts being evaluated. It is this formal part of quality that the research will deal with, applying both low-level and high-level pattern recognition. By low-level patterns we mean all features that are directly observable in texts, such as word type frequency, average sentence length and vocabulary distribution. By high-level patterns we mean all features that are not directly observable in texts, such as syntactic structure, semantic meaning, motifs and narrative structure.
Low-level patterns in literary texts have been studied relatively often in the context of authorship attribution and occasionally also for stylistic research (Hoover 2010). High-level pattern recognition has hardly been applied to literary texts, which is why we will focus on a closer description of this method in the proposal. Raghavan, Kovashka and Mooney (Raghavan et al. 2010) have recently shown that both high-level syntactic structure and low-level lexical information are useful in capturing an author’s overall writing style. They used a probabilistic context-free grammar (PCFG) to model syntactic information for the problem of authorship attribution. It therefore seems promising to investigate further the usefulness of syntactic structure and other high-level patterns for the problem of style and literary quality in general. We propose to use not only PCFGs for this problem, but also an extension of PCFGs that can include richer syntactic context: probabilistic tree-substitution grammars (PTSGs), also known as data-oriented parsing (DOP) models (Bod et al. 2003, Bod 2009). These models operate with productive units that go beyond the limited context of the rules of PCFGs: the units of PTSGs are subtrees of (in principle) arbitrary size and thus include PCFGs as special cases. Still, there exist efficient algorithms to parse with PTSGs. Bansal and Klein (2010) have recently demonstrated that PTSGs have a large number of advantages over the lexically-insensitive PCFGs. The VICI-group led by Rens Bod at the ILLC of the University of Amsterdam has long-standing expertise with PTSGs (among others Scha 1990, Bod 1992, 2009, Sima’an 1996, Zuidema 2007, Sangati et al. 2010). The goal of this project is therefore to explore structural, syntactic and narrative properties of style and literary quality and to integrate these with the existing low-level stylistic features. The first experiments by Raghavan et al. (2010) have shown that success in this area can be achieved, which suggests that extensions towards richer structure, such as tree-substitution grammars, may enhance the prediction of style and quality.
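To make the relation between PCFGs and PTSGs concrete, the sketch below reads the depth-one fragments (the context-free rules) off a few bracketed trees and estimates their probabilities by relative frequency. It is only an illustration: the toy trees, the Penn-style bracketing and the function names `parse_bracketed`, `rules` and `estimate_pcfg` are assumptions made here and are not taken from the DOP tools or from Raghavan et al. (2010). A PTSG would additionally keep larger subtrees as units.

```python
from collections import Counter

def parse_bracketed(s):
    """Parse a bracketing such as '(S (NP (N dogs)) (VP (V bark)))'
    into nested (label, children) tuples; a leaf word is a plain string."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # skip the closing bracket
        return (label, children)

    return read()

def rules(tree):
    """Yield every depth-one fragment (CFG rule) of a tree as (lhs, rhs)."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)

def estimate_pcfg(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    counts = Counter(r for t in trees for r in rules(t))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Hypothetical two-sentence treebank, for illustration only.
treebank = [
    parse_bracketed("(S (NP (D the) (N dog)) (VP (V barks)))"),
    parse_bracketed("(S (NP (N dogs)) (VP (V bark)))"),
]
pcfg = estimate_pcfg(treebank)
```

In this toy treebank the rule S -> NP VP is used in both trees, while NP is expanded in two different ways, so the two NP rules share the probability mass for NP.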
Apart from finding an answer to the literary research question, the main challenge of the proposal is to find out to what degree the analysis of different pattern levels contributes to stylistics in general and to insight into literary quality in particular, and to which other possibilities the results might lead.
The research is closely linked to the project Authorship Attribution and Stylistics at the Huygens Instituut for the History of the Netherlands (project leader Karina van Dalen-Oskam) and to the research collaboration with David L. Hoover, 2010 visiting professor at the Huygens Instituut, who has agreed to be part of the Think Tank for this project (Hoover 1999, 2010, Van Dalen-Oskam & Van Zundert 2007, 2008, Kestemont & Van Dalen-Oskam 2009). The project fits the research focus point for literature as formulated in the Fryske Akademy Master Plan. Related research at the Huygens Instituut for the History of the Netherlands is being done by Peter Boot, who is studying the social processes of validation of literature in preparation for a large grant proposal on this topic.
The project The Riddle of Literary Quality will research whether there are similarities in formal characteristics of texts considered to be ‘literary’ or ‘non-literary’. For a long time, the consensus among literary scholars was that literary quality was attributed in social communication. Recently, however, wanting to extend the cultural and historical explanations, they have taken the question of additional formal features into account again (Harris 1995, McDonald 2007, Vaessens 2009). The general audience usually equates ‘literary’ with ‘good’ and ‘non-literary’ with ‘bad’. Literary researchers are of a different opinion: they would rather make a distinction between types of readers or reader roles, arguing that the kind of texts read by these readers also come in categories of ‘good’ and ‘bad’ and anything in between (and that readers may change roles whenever they like). The distinction used in this project is the one drawn up by Von Heydebrand & Winko (1996, the most recent and a very thorough monograph on literary evaluation) between two types of reader roles: autonomous, when a reader focuses on formal and aesthetical characteristics of the text, and heteronomous, when a reader focuses on the content and relates this to his or her own personal experiences. Based on the analysis of a reader survey which will be conducted in the first phase of the project a training corpus will be created. This corpus will consist of texts clearly preferred by autonomous readers and texts clearly preferred by heteronomous readers. The texts will also be categorized as being highly preferred or least preferred in comparison to other texts in the same group.
In the analysis of formal elements of the fictional texts in the training corpus, the expectation is that some characteristics will show up more in texts preferred by autonomous readers than in texts preferred by heteronomous readers (or the other way round). This result would agree with the view of the general public; it will lead to a list of formal tendencies in ‘literary’ or ‘non-literary’ texts. There is also another possible outcome, namely that some formal characteristics will turn up significantly more in the most popular texts of both types and significantly less in the least successful of them (or the other way round). This would favour the opinion of literary scholars that the quality of texts is independent of the types of readers. What we expect to find in this project, however, is a combination of these: a placement of each text somewhere on two axes, one running from autonomous to heteronomous and the other from high preference to low preference.
Some predictions are:
- Fictional texts preferred by heteronomous readers tend to have a smaller vocabulary than those preferred by autonomous readers.
- Fictional texts preferred by autonomous readers tend to have a more complex syntactic structure than those preferred by heteronomous readers.
- Fictional texts that are least preferred by both kinds of readers have a more complex syntactic structure than texts most preferred by both kinds of readers.
These predictions may seem contradictory as to the role of syntactic complexity. The expectation is, however, that the combination of formal characteristics will be different every time. The co-occurrence of syntactic complexity with high vocabulary richness could therefore be an indication of a text highly preferred by autonomous readers.
The texts in the training corpus are all contemporary Dutch novels. In the second phase of the project, the tools measuring the formal characteristics will be applied to a much larger corpus of Modern Dutch fictional texts and each of the analysed texts will be placed on the two axes mentioned above: autonomous – heteronomous versus high preference – low preference. The resulting placements will be analysed; this is expected to lead to a refinement of the tools and a new iteration of measurements, and eventually to a synthesis in publications and hypotheses for further research and development.
One of the main further research questions is clearly whether the formal analysis can help us look back at earlier fictional texts and find out more about the formation of and changes in the canon of literary works through time. We expect that the preferences of readers change through time and differ between cultural environments. Many of the characteristics to be measured are usually assumed to be language independent. Could changing preferences be reflected in the formal characteristics of the fictional texts that were being published in the past? Can we make a distinction between preferred and less-preferred, autonomous and heteronomous, without being able to conduct a survey of readers from the past? We do not expect to answer all these questions. We do want to map the problems and possibilities based on first experiments and draw up a project plan for follow-up research which will go into diachronic, longitudinal research and cross-language comparability.
We will therefore apply the tools to a set of older Dutch novels as well as to sets of novels written in other languages, limiting the choice of languages for now to two of the closest relatives of Dutch: Frisian and English. We will first make predictions about the placement of the texts to be analysed on the two axes, seen from the point of view of contemporary reviewers (19th-century reviewers of Dutch or American-English novels) and from the point of view of current scholars. Preferably we will select texts whose evaluation is known to have changed over time: much preferred then and now forgotten, or the other way round. By testing this on different languages as well as on a different time period we can keep an eye on changing patterns through time and language. This is expected to yield the necessary information to draw up new research plans.
Deliverables: (1) A list of formal characteristics and their distribution in a training corpus of differently valued Modern Dutch novels (publications); (2) An evaluation of other Modern Dutch novels based on the results of the training corpus (publications); (3) Results of first experiments of the application of the same measurements on novels from another time period and language (project plan for a new research program to adapt the tools for diachronic and cross-language application); (4) Texts (if possible, see the “Data paragraph”) and tools available online, including the documentation of the new computational techniques.
Work process: The tasks in the project will be executed iteratively where next steps are based on previous results, e.g. when it proves that the tool applied needs fine-tuning and re-application. This holds for most of the tasks described below.
Main tasks (Q = Period of three months)
1. Establishing the contents of the Modern Dutch training corpus (Q1)
The training corpus is of major importance for the success of the project. It will consist of two sets of novels, based on the distinction Von Heydebrand & Winko (1996) make between two types of reader roles, autonomous and heteronomous, and a marking of high or low preference compared to other texts in the same group. The procedure for text selection will be informed by empirical psychology (Bortolussi & Dixon 2003, Bryant & Vorderer 2006) by means of surveys of readers from different backgrounds, making use of online survey possibilities (web-based crowdsourcing will be considered). Readers will be asked to answer many questions which will help us to select texts for the different groups. It is essential that the survey eliminates possible confounds, which is why we consider it necessary to include an experimental psychologist in the Think Tank, who will not only guide the preparation of the survey, but also co-author project publications on its more wide-ranging results.
2. Data curation and preparation of the texts of the training corpus (Q1-2)
The survey mentioned under Task 1 will list a.o. those novels that are already available digitally at Huygens Instituut or as raw OCR at the University of Tilburg where they are being considered for the SoNaR project. SoNaR is a project initiated by the STEVIN program for language and speech technology (http://lands.let.ru.nl/projects/SoNaR/description.html) and aims to create a 500 million word reference corpus of Modern Dutch written sources. Literary texts will be part of the corpus as well. Martin Reynaert (ILK – UvT), the coordinator of the work package Corpus Building in SoNaR, has suggested and agreed to the following task division: Huygens Instituut will digitize needed novels for the project The Riddle of Literary Quality (in house, no extra budget needed) and correct the OCR of relevant novels already scanned by Reynaert. Reynaert will cover the legal aspects of the electronic texts: the SoNaR team will put in time to arrange permission of inclusion of those digital texts into SoNaR. This means the texts can be legally used for any kind of research (which is especially relevant for research into literary quality, assuming that authors or publishers may not all be equally willing to give permission for their texts to be researched from the perspective of this topic). Methods for establishing text quality in general will be looked into.
3. Selection of low-level tools (measures and algorithms to apply) (Q1-3)
The relevant measures and patterns will be selected. The list will include such features as mean word length, mean number of syllables per word, mean sentence length, mean paragraph length, lexical richness (measuring how many different words (dictionary entries) a text has), use and distribution of parts of speech, use of named entities, dialogue versus narrative, and any others that are suggested in the literature and are, for example, available in the JGAAP set of tools (cf. Task 4, 2b and http://evllabs.com/jgaap/w/index.php/Documentation and Juola 2006a, 2006b). Tools specifically developed for Dutch by the Induction of Linguistic Knowledge group at Tilburg University will also be used. They are available under an open source license (GPL) and will also be part of the webservices to be provided by CLARIN-NL (cf. http://www.clarin.nl). We will not go into this in detail now, but focus our description on the most innovative task in the project, Task 4.
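As an illustration of the kind of low-level measures meant in this task, the following minimal sketch computes three of them for a plain-text fragment. It is not one of the JGAAP or ILK tools: the function name `low_level_features`, the regular expressions used for tokenisation and sentence splitting, and the use of a type-token ratio over word forms (rather than dictionary entries, which would require lemmatisation) are all simplifying assumptions.

```python
import re

def low_level_features(text):
    """Return a few directly observable features of a text.

    Assumptions of this sketch: sentences end in '.', '!' or '?'
    followed by whitespace, and a word is a run of letters that may
    contain internal apostrophes or hyphens.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*", text.lower())
    if not words or not sentences:
        raise ValueError("text contains no words")
    return {
        # mean number of characters per word token
        "mean_word_length": sum(len(w) for w in words) / len(words),
        # mean number of word tokens per sentence
        "mean_sentence_length": len(words) / len(sentences),
        # distinct word forms divided by word tokens (a crude richness measure)
        "type_token_ratio": len(set(words)) / len(words),
    }

features = low_level_features("The dog barks. The dog sleeps!")
```

Note that the type-token ratio depends strongly on text length, so in practice texts would have to be compared on samples of equal size or with a length-corrected richness measure.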
4. Selection and start of the development of the high-level tools (measures and algorithms to apply) (Q1-3 start, continuing throughout the whole project)
This task is the computationally most innovative part of the project, which is why we will describe it in more detail than the other tasks. The iterative aspect of the work process will be most prominent in this task. Work is started right at the beginning of the project and will continue through the whole four years of the project. Most of the work of the Postdoc and developer will go into this task.
The high-level tools will be based on Probabilistic Tree-Substitution Grammars (PTSGs) as developed within the Data-Oriented Parsing (DOP) framework (cf. Bod et al. 2003). PTSGs subsume Probabilistic Context-Free Grammars (PCFGs) when the syntactic dependencies are restricted to one level of constituent structure. The tools for learning PTSGs are available via the DOP homepage: http://staff.science.uva.nl/~rens/dop.html. PTSGs and PCFGs can be derived both in a supervised manner and in an unsupervised manner using flat, unannotated data (the latter known as U-DOP, see Bod 2007, 2009, Smets 2010). Because enough flat data is available for Dutch, unsupervised parsing (U-DOP) is preferable: it avoids the need to annotate training sets manually, while still being able to induce trees with syntactic categories (cf. Smets 2010). Both approaches will be tested, however. Given the success of the simplest PTSGs for style recognition (i.e. PCFGs tested by Raghavan et al. 2010), it is likely that models richer than PCFGs will enhance the prediction of style. To apply these tools and algorithms to the analysis of literary quality, the following extensions will be developed and tested:
(1) Develop and test the usefulness of various (novel) notions of syntactic complexity (as motivated above) for literary style/quality:
a. Define and test syntactic complexity as the number of nodes in a syntactic tree of an utterance normalized by sentence length: the more nodes per sentence-length, the more complex the syntax of the sentence. (This is a very simple metric, but will be useful as a baseline).
b. Define and test syntactic complexity as the number of subtrees by which a tree is generated by the induced PTSG: the fewer subtrees that are needed, the more repetitive a sentence or a text, and the less complex. This can be tested both internally and externally: if the PTSG is induced from external (i.e. held-out) data, the text can be tested on ‘originality’ versus ‘repetitiveness’ using this measure. Also the amount of conventional phrases can be detected in this way.
c. Parse ‘high literature’ (autonomous) with a PTSG trained on ‘low literature’ (heteronomous) and vice versa. The percentage of high-literary sentences that can be parsed by PTSGs trained on low literature can be seen as a measure for literary complexity. Also the opposite may be interesting to test: can low literary sentences always be parsed by PTSGs that are trained on high literary sentences? If so, we can conclude that low literary sentences are ‘included’ by high literary sentences. If not, low literary sentences are to some extent ‘different’ from high literary sentences (which can be quantified by the percentage of parsed sentences). This will also be tested on and compared to a small set of novels for children and adolescents.
(2) Adapt PTSGs (DOP and U-DOP) to literary parsing.
a. During the last few years, both unsupervised and supervised PTSGs have been tested on large-scale text corpora in Bod’s VICI group (see e.g. Bod 2009; Sangati et al. 2010, Smets 2010). However, training on literary texts may raise interesting new problems. The two available servers in Bod’s VICI group each have 256 GB of internal memory (RAM) and can (incrementally) induce PTSGs for over 100 million words. Once the PTSGs are induced, they can run efficiently on smaller platforms (with 48 GB internal memory). It is expected that the UvA-algorithms can be straightforwardly applied to induce PTSGs for literary texts (made available by Huygens Instituut), such that they can directly be used for the definition of syntactic complexity above.
b. Integrate the syntactic property of complexity as an additional feature in the JGAAP framework (Juola 2006). Although syntactic complexity can be tested separately, as proposed under (1), we also want to integrate these as (weighted) features in the general JGAAP model for style recognition. This may result in a fine-grained comparison between a wide range of different textual properties and patterns, leading to the first stylistic analysis that integrates low-level and high-level patterns.
c. Investigate an extension towards story grammars. Recent work in the ILLC by Löwe et al. (2009) on story grammars shows that discourse structure can reveal underlying narrative building blocks (‘narratemes’) as well as motifs and plots. Although full-fledged story analysis is still in its infancy, it may contribute to finding even higher level patterns, such as plot lines of narratives. This part of the project does not guarantee success, but it is exciting enough to be tested. It should be stressed that application of the tools above will already result in a successful project, and that the use of more sophisticated state-of-the-art techniques makes the project computationally and intellectually more fascinating.
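The baseline metric proposed under (1a), the number of nodes in a syntactic tree normalized by sentence length, can be sketched as follows. The sketch assumes that parses are available as Penn-style bracketed strings in which every word sits under its own part-of-speech node, and it counts only labelled nodes (phrasal categories and part-of-speech tags), not the words themselves. Both choices, and the function name `syntactic_complexity`, are assumptions made for illustration.

```python
import re

def syntactic_complexity(bracketed):
    """Labelled nodes per word for one bracketed parse tree.

    Every labelled node opens exactly one bracket, so the node count
    is the number of '(' characters. A word is the token inside a
    '(TAG word)' pair. Brackets occurring as words are not handled.
    """
    n_nodes = bracketed.count("(")
    words = re.findall(r"\(\s*[^\s()]+\s+([^\s()]+)\s*\)", bracketed)
    if not words:
        raise ValueError("no words found in bracketed tree")
    return n_nodes / len(words)

# Two hypothetical analyses of the same three-word sentence.
nested = syntactic_complexity("(S (NP (D the) (N dog)) (VP (V barks)))")
flat = syntactic_complexity("(S (D the) (N dog) (V barks))")
```

With this definition the nested analysis scores higher than the flat one for the same words, which is the behaviour the baseline is meant to capture; a text-level score could then be taken as the mean over all sentences.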
5. Integrating tools in a simple user-interface (Q2-5 and continuing throughout the project)
To make the application of the tools usable and testable by the literary scholars linked to the project, a simple online interface will be created in a form and environment that will be decided on by the criteria of online availability and efficiency on different platforms. A promising candidate is CLAM (Computational Linguistics Application Mediator) developed by ILK at Tilburg University within CLARIN-NL projects with the specific aim of having a flexible yet thorough solution for turning existing applications into fully-fledged web applications/services. The tools will be integrated as soon as they are available, starting with the low-level pattern recognition tools as described under Task 3. The high-level tools under Task 4 will be added as soon as they are developed.
6. Applying the tools to the Modern Dutch training corpus (Q3-6)
The tools will first be tested on the two training sets, mostly those measuring low-level patterns (cf. Task 3) and the first available high-level pattern recognition tools as described in Task 4. The results will be analysed, the tools fine-tuned, and trends and weighting factors looked for.
7. Description of the preliminary results and registration of predictions for other texts (Q5-6)
Intermediate results will be discussed in a closed workshop by the project team with, among others, the associated external scholars. Predictions of results for other texts will be listed, thus planning the next steps of the research.
8. Applying the fine-tuned tools to other Modern Dutch texts (Q6-9, and continuing throughout the project)
The fine-tuned tools as described in Task 5 will be applied to texts or text groups (e.g. ‘literary thrillers’) from the larger text corpus established in collaboration with SoNaR to find out if the patterns this will yield agree with the predictions made under Task 7. Results will be presented in papers at conferences; competing for the Academic Year Prize will be considered.
9. Applying the tools to Modern Frisian and to 19th-century Dutch and American-English novels (Q9-12, and continuing throughout the project)
To research their across-language applicability, the tools will be applied to a corpus of Modern Frisian novels. The diachronic applicability will be tested on a corpus of American novels from the 19th and the early 20th century as collected by professor David L. Hoover (New York University) and on available 19th-century Dutch novels. We do not expect to be able to comprehensively address this topic, which we therefore approach as an experiment which can show where the possibilities and the problems are to be expected. The experiments, which will probably have an iterative nature, will result in a project plan for follow-up research into canon formation and cross-language comparison of quality characteristics. The Frisian texts will be provided by the Frisian Language Database and by Tresoar (provincial library of Friesland).
10. Overall analysis of the results (articles, volume of papers, dissertation) (Q11-16)
A combination of a closed and an open workshop will be organized for the presentation and discussion of further results. Articles will be published in e.g. a volume of papers. Other publications and the dissertation will be written. New plans for follow-up research will be written in collaboration with the different partners.
11. Release of tools for the wider scholarly community on the website of Huygens Institute for the History of the Netherlands (Q15-16)
The tools will be made available online through a server of Huygens Instituut (also acting as a CLARIN centre), with documentation and the possibility of instruction sessions for interested scholars. The analysed texts will be made available in the SoNaR corpus.
Added Data paragraph
The tools to be developed will be archived according to the procedure which is being established by the cooperating five CLARIN centres (a.o. Huygens ING and DANS).
Copyright is a major obstacle to open access to the novels which will be digitized especially for this project. These will become freely available in the SoNaR corpus if Martin Reynaert can arrange legal permission. It will be investigated whether the data resulting from the measurements can be made available without copyright infringement, to overcome some of the drawbacks of not being able to provide researchers with all digitized novels.
The aim is to also make the tools to be developed available open access through the CLARIN center infrastructure, in the architecture that Huygens ING is currently developing for webservices and according to the standards to be decided on in the CLARIN environment.
All metadata are expected to become available open access.
Albers 2007 Sabine Albers, ‘Top or Flop: characteristics of bestsellers’. In: Lesley Jeffries, Dan McIntyre and Derek Bousfield (Eds.), Stylistics and social cognition. Amsterdam / New York: Rodopi, 2007 (Poetics and Linguistics Association (PALA): 4), p. 205-215
Bansal and Klein 2010 M. Bansal & D. Klein, ‘Simple, Accurate Parsing with an All-Fragments Grammar’. In: Proceedings ACL 2010, Stroudsburg: Association for Computational Linguistics, p. 110-117
Bod 1992 R. Bod, ‘A Computational Model of Language Performance: Data-Oriented Parsing’. In: Proceedings COLING 1992, Stroudsburg: Association for Computational Linguistics, p. 855-859
Bod et al. 2003 R. Bod, R. Scha & K. Sima’an (Eds.), Data-Oriented Parsing. Stanford: CSLI Publications, 2003
Bod 2007 R. Bod, ‘Is the End of Supervised Parsing in Sight?’ In: Proceedings ACL 2007, Stroudsburg: Association for Computational Linguistics, p. 400-407
Bod 2009 R. Bod. ‘From Exemplar to Grammar: A Probabilistic Analogy-Based Model of Language Learning’, In: Cognitive Science, 33 (2009), 5, p. 752-793
Bortolussi & Dixon 2003 Marisa Bortolussi & Peter Dixon, Psychonarratology. Foundations for the empirical study of literary response. Cambridge University Press, 2003
Bryant & Vorderer 2006 Jennings Bryant & Peter Vorderer (eds.), Psychology of entertainment. (Routledge Communication Series) Routledge, 2006
Harris 1995 Wendell V. Harris, Literary meaning. Reclaiming the study of literature. New York: Palgrave Macmillan, 1995
Hoover 1999 David L. Hoover, Language and style in The Inheritors. Lanham etc.: University Press of America, 1999
Hoover 2010 David L. Hoover, ‘Authorial Style’. In: Dan McIntyre and Beatrix Busse (eds.), Language and Style: Essays in Honour of Mick Short, New York: Palgrave, 2010, p. 250-271
Juola 2006a Patrick Juola, John Sofko & Patrick Brennan, ‘A Prototype for Authorship Attribution Studies’. In: Literary and Linguistic Computing 21: 169-178
Juola 2006b Patrick Juola, ‘Authorship Attribution’. In: Foundations and Trends in Information Retrieval 1 (2006), 3, p. 233-334; http://dx.doi.org/10.1561/1500000005
Kestemont & Van Dalen-Oskam 2009 Mike Kestemont & Karina van Dalen-Oskam, Predicting the Past: Memory-Based Copyist and Author Discrimination in Medieval Epics. In Proceedings of the twenty-first Benelux Conference on Artificial Intelligence (BNAIC 2009). Eindhoven, p. 121-128.
Louwerse et al. 2008 Max Louwerse, Nick Benesh & Bin Zhang, ‘Computationally discriminating literary from non-literary texts’. In: S. Zyngier, M. Bortolussi, A. Chesnokova, J. Auracher (Eds.), Directions in empirical literary studies, Amsterdam: Benjamins, 2008, p.175-192
Löwe et al. 2009 Benedikt Löwe, Eric Pacuit and Sanchit Saraf, ‘Identifying the Structure of a Narrative via an Agent-based Logic of Preferences and Beliefs: Formalizations of Episodes from CSI: Crime Scene Investigation’. In: Michael Duvigneau and Daniel Moldt (eds.), MOCA’09, Fifth International Workshop on Modelling of Objects, Components, and Agents, Hamburg, 2009.
McDonald 2007 Ronan McDonald, The death of the critic. New York/London: Continuum, 2009 (1st ed. 2007)
Raghavan et al. 2010 Sindhu Raghavan, Adriana Kovashka, Raymond Mooney, ‘Authorship Attribution Using Probabilistic Context-Free Grammars’. In: Proceedings ACL 2010, http://www.aclweb.org/anthology/P/P10/P10-2008.pdf
Sangati et al. 2010 F. Sangati, W. Zuidema & R. Bod, ‘Efficiently extract recurring tree fragments from large treebanks’. In: Proceedings LREC10, Malta
Scha 1990 R. Scha, ‘Taaltheorie en Taaltechnologie; Competence en Performance’. In: Q. de Kort & G. Leerdam (Eds.), Computertoepassingen in de Neerlandistiek. Almere: Landelijke Vereniging van Neerlandici, 1990, p. 7-22
Sima’an 1996 K. Sima’an, ‘Computational complexity of probabilistic disambiguation by means of tree grammars’. In: Proceedings COLING 1996 Stroudsburg: Association for Computational Linguistics, p. 1175-1180
Smets 2010 Margaux Smets, ‘A DOP-inspired approach to syntactic category induction’, Paper presented at CLIN 2010 (currently under submission)
Vaessens 2009 Thomas Vaessens, De revanche van de roman. Literatuur, autoriteit en engagement. Nijmegen: Uitgeverij Vantilt, 2009
Van Dalen-Oskam & Van Zundert 2007 Karina van Dalen-Oskam & Joris van Zundert, Delta for Middle Dutch – Author and Copyist Distinction in Walewein. In Literary and Linguistic Computing 22 (2007), p. 345-362
Van Dalen-Oskam & Van Zundert 2008 Karina van Dalen-Oskam & Joris van Zundert, The Quest for Uniqueness: Author and Copyist Distinction in Middle Dutch Arthurian Romances based on Computer-assisted Lexicon Analysis. In Mooijaart, M., van der Wal, M. (eds.) Yesterday’s words: contemporary, current and future lexicography. [Proceedings of the Third International Conference on Historical Lexicography and Lexicology (ICHLL), 21-23 June 2006, Leiden]. Cambridge: Cambridge Scholars Publishing, p. 292-304
Van Peer 2008 Willie van Peer (ed.), The Quality of Literature. Linguistic studies in literary evaluation. Amsterdam: John Benjamins, 2008 (Linguistic Approaches to Literature 4)
Von Heydebrand & Winko 1996 Renate von Heydebrand & Simone Winko, Einfuehrung in die Wertung von Literatur. Systematik – Geschichte – Legitimation. Paderborn etc.: Ferdinand Schoeningh, 1996
Zuidema 2007 W. Zuidema, ‘Parsimonious data-oriented parsing’. In: Proceedings EMNLP 2007, Stroudsburg: Association for Computational Linguistics, p. 551-560