About the Corpus application

The corpus application is developed by the Dutch Language Institute (INT). Its backend is BlackLab, a Lucene-based search engine developed for corpora with token-based annotation (http://inl.github.io/BlackLab/). The web-based frontend is a further development of the corpus-frontend application developed by the INT (https://github.com/INL/corpus-frontend) in CLARIN and CLARIAH projects. Its design is inspired by the first version of the OpenSoNaR user interface by Tilburg and Radboud University (https://github.com/Taalmonsters/WhiteLab2.0).

About the Corpus Gysseling

The Corpus Gysseling is the collection of all thirteenth-century texts that served as source material for the Vroegmiddelnederlands Woordenboek (VMNW; Dictionary of Early Middle Dutch). It is the digital edition, enriched with part-of-speech and lemma annotation, of the thirteenth-century material from the Corpus van Middelnederlandse teksten (tot en met het jaar 1300) - Corpus of Middle Dutch texts (up to and including the year 1300) - published in the period 1977-1987 by the Ghent linguist Maurits Gysseling.

A first online version of the corpus was launched on 25 April 2012.

In 2018, a new version of the corpus was integrated into the Nederlab portal, with corrections to the linguistic annotation and additional metadata for the texts in the corpus.

In this new version, several corrections have been made to the metadata added to the corpus.

The Corpus Gysseling, consisting of fifteen volumes, is divided into two series. The first series - CG I - contains official documents and has nine volumes; the second series - CG II - contains literary manuscripts and has six volumes. Note that the Old Dutch texts from the printed Corpus Gysseling are not included in this corpus application, as can be seen in the table below. These Old Dutch texts have been integrated in the Corpus Oudnederlands.

        | Corpus Gysseling (book)                              | Corpus Gysseling (corpus application)
  ------+------------------------------------------------------+------------------------------------------------------
  CG I  | official documents                                   | official documents
  CG II | literary texts and artes texts (13th century),       | literary texts and artes texts (13th century),
        | including Old Dutch                                  | excluding Old Dutch

All texts were tokenised, tagged with part of speech, lemmatised and annotated with extensive metadata. All annotations of the Corpus Gysseling were verified manually. The corpus data are available for researchers; see https://ivdnt.org/onderzoek-a-onderwijs/corpora-lexica/corpus-gysseling.

Lemmatization

All Early Middle Dutch word forms have a modern Dutch lemma. For words no longer used in modern Dutch, a modern lemma has been constructed following the same linguistic principles that apply to words still in use.

More information about the lemmatization principles used can be found in Marijke Mooijaart, Het lemma in the GiGaNT lexicon.

Part of speech tagging

The part of speech tagging of the Corpus Gysseling was originally done using the same annotation principles and tagset as for the Corpus Van Reenen Mulder, which is based on a three-digit numerical code. This numerical encoding can be used to search the online corpus.

A detailed description can be found here.

In the context of the CLARIAH+ project, a tagset and tagging principles for the annotation of diachronic corpora of historical Dutch have been developed. This annotation layer has been added to the corpus and can also be used to search the online corpus.

A detailed description can be found here.
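As an illustration of how an annotation layer might be used in a search, the sketch below builds a single-token pattern in Corpus Query Language (CQL), a query syntax supported by BlackLab. The annotation names `lemma` and `pos`, the helper `token_query`, and the example values (including the three-digit code `000`) are assumptions made for illustration only; the annotation names and tag values actually used in this corpus should be taken from the tagset descriptions.

```python
def token_query(**constraints: str) -> str:
    """Build a single-token CQL pattern, e.g. [lemma="x" & pos="y"].

    Keyword names are treated as annotation names (hypothetical here),
    and values as the strings to match on that annotation.
    """
    parts = []
    for name, value in sorted(constraints.items()):
        # Escape backslashes and double quotes so the value stays inside the quotes.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    return "[" + " & ".join(parts) + "]"


# Hypothetical values: a modern Dutch lemma and a placeholder three-digit tag.
print(token_query(lemma="koning"))             # [lemma="koning"]
print(token_query(lemma="koning", pos="000"))  # [lemma="koning" & pos="000"]
```

Constraints are sorted by annotation name so that the same set of constraints always produces the same query string.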

Credits

When referring to the present website, please use the following reference:

Corpus Gysseling (July 2020) [Online service]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a2-r8

The complete corpus data are available. When using the data, please use the following reference: Corpus Gysseling (Version 1.0) (1990) [Data set]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a2-j4

For BlackLab:

Software available at: https://github.com/INL/BlackLab

Does, Jesse de, Jan Niestadt and Katrien Depuydt (2017), Creating research environments with BlackLab. In: Jan Odijk and Arjan van Hessen (eds.), CLARIN in the Low Countries, pp. 151-165. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi

For the corpus frontend:

Software available at: https://github.com/INL/corpus-frontend