Parla-CLARIN: TEI guidelines for corpora of parliamentary proceedings
1Jožef Stefan Institute, Slovenia; 2Institute of Contemporary History, Slovenia
Parliamentary proceedings (PP) are a rich source of data used by scholars in, e.g., historiography, sociology, political science, linguistics, economics, and economic history. Unlike the sources of most other language corpora, PP are not subject to copyright or personal privacy protections and are typically available online, which makes them ideal for compilation into corpora and open distribution. For these reasons many countries have already produced PP corpora, but each typically uses its own encoding, which limits their comparability and their use in a multilingual setting.
The talk will give an overview of current approaches to encoding PP, with a focus on TEI and TEI-like encoding and on Akoma Ntoso, a standard specifically designed for encoding PP and other legislative documents. It will also survey the encoding of existing PP corpora, gathered mostly from the CLARIN survey of such corpora (cf. https://www.clarin.eu/resource-families/parliamentary-corpora) and the contributions to the LREC 2018 ParlaCLARIN Workshop (cf. https://www.clarin.eu/ParlaCLARIN).
We then motivate and propose a TEI ODD (i.e. a schema parametrisation together with guidelines) for such corpora, based on the TEI module for Transcriptions of Speech. The empirical basis for the proposal is our encoding of the recently compiled 200-million-token Corpus of parliamentary debates of the National Assembly of the Republic of Slovenia 1990-2018 (cf. http://hdl.handle.net/11356/1236). We distinguish the encoding of metadata, including the speakers and their metadata; the structure of the corpus; the encoding of the speeches and notes; and linguistic annotation. This proposal will be further fleshed out at the CLARIN Workshop on Encoding Parliamentary Corpora, which takes place at the end of May in Utrecht, and in the follow-up work on producing a CLARIN recommendation for the encoding of parliamentary corpora.
Challenges in encoding parliamentary data: between applause and interjections
Austrian Academy of Sciences, Austria
Parliamentary data has always been of great interest to researchers in the social sciences and the humanities. There are many initiatives at European and national levels for compiling digital collections of parliamentary data. However, these initiatives use different encoding schemes to represent parliamentary data, ranging from ad hoc ones, through specific standards for representing legislation (e.g. Akoma Ntoso), to TEI. Akoma Ntoso was created to make the structural and semantic components of digital parliamentary documents fully accessible to machine-driven processes. However, this encoding standard was not designed to include linguistic annotation. In this regard TEI is more suitable. The Austrian parliamentary record corpus, ParlAT (Wissik & Pirker 2018), which was previously available only in a vertical format suitable for analysis in a corpus query system, is now being encoded in TEI. In this contribution, we will present the challenges we encountered, which are often related to the fact that the Austrian parliamentary records are edited shorthand records of the parliamentary sessions and not transcripts of recordings. The Austrian parliamentary records include many comments that are set in parentheses and formatted differently from the rest of the text, namely in italics. These comments range from indications of applause or laughter to interjections. We decided to encode these comments in different ways using elements from the Transcriptions of Speech module: <incident> for comments indicating applause or laughter, and <u> (utterance) for comments indicating interjections. In the contribution, we will discuss the solutions for encoding such comments, both for the Austrian case and in relation to a more general scheme proposed by Tomaž Erjavec and Andrej Pančur, the teiParla.
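The distinction between <incident> and <u> described above can be illustrated with a minimal Python sketch. It is not the ParlAT workflow: the function name encode_speech, the cue-word list, the "Name: words" interjection pattern, the way @who values are derived, and the English sample text are all assumptions made for illustration only. Parenthesised comments containing a cue word become <incident> with a <desc>, other comments of the form "Name: words" become a <u> attributed to that speaker, and anything else falls back to <note>.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical cue words; a real project would need a curated,
# language-specific list (the Austrian records are in German).
INCIDENT_CUES = ("applause", "laughter")


def encode_speech(text, speaker):
    """Split a speech on parenthesised comments and build TEI-like elements."""
    div = ET.Element("div", {"type": "speech"})
    # A capturing group makes re.split keep the comments at odd indices.
    parts = re.split(r"\(([^()]*)\)", text)
    for i, part in enumerate(parts):
        part = part.strip()
        if not part:
            continue
        if i % 2 == 0:
            # Running text of the main speaker.
            ET.SubElement(div, "u", {"who": speaker}).text = part
        elif any(cue in part.lower() for cue in INCIDENT_CUES):
            # Applause, laughter and similar events.
            incident = ET.SubElement(div, "incident")
            ET.SubElement(incident, "desc").text = part
        else:
            # Assumed interjection pattern: "Name: words".
            name, sep, words = part.partition(":")
            if sep:
                who = "#" + name.strip().replace(" ", "_")
                ET.SubElement(div, "u", {"who": who}).text = words.strip()
            else:
                ET.SubElement(div, "note").text = part
    return div


sample = ("Thank you, Mr President. (Applause.) We must act now. "
          "(Member Huber: Never!) I will continue.")
div = encode_speech(sample, "#Speaker_A")
print(ET.tostring(div, encoding="unicode"))
```

A rule-based split like this would only be a first pass; ambiguous comments would still need manual review.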
A TEI customization for the description of paper and watermarks
University of Iceland, Iceland
TEI currently offers a large set of tools to describe material features of written documents. However, so far, these tools are not sufficient to produce comprehensive, structured descriptions of paper, and do not reflect the standards of paper historians. The present contribution will introduce a set of custom TEI-P5 modules for the description of paper and watermarks. They are modeled on the international standard for the description of paper, watermarks and paper molds in relational databases (IPHN 2.1.1, 2013). They combine new elements based on the parameters listed in the IPHN standard with the official TEI modules msdescription, which is suitable for the physical description of all text-bearing objects, and namesdates, which provides the necessary elements for the description of persons, places and organizations, and can therefore be used to enter information about papermakers and paper mills. This customization allows TEI users to make standardized descriptions of paper at different levels of precision. The data included in these descriptions can be mined and/or compared with the contents of other relational databases that follow the IPHN standard in order to identify the provenance of paper and thus better understand the conditions of production and the circulation of documents.
How we tripled our encoding speed in the Digital Victorian Periodical Poetry project
University of Victoria, Canada
The Digital Victorian Periodical Poetry (DVPP) project is a SSHRC-funded digital humanities project based at the University of Victoria. With the guidance of principal investigator Dr. Alison Chapman, the DVPP team is creating a digital index of British periodical poetry from the long nineteenth century. In addition to uncovering periodical poems, writing descriptive metadata, and compiling prosopographical research, we are currently using TEI and CSS to encode a statistically representative sample of indexed poems, looking for quantitative evidence of literary change over time. Such an endeavour requires a large, robust dataset covering a range of periodicals throughout the period.
At the time of writing, there are more than 12,000 poems in the database, and we expect that total to reach 20,000. Of these, around 2,000 will be encoded, focusing on the decade years (1820, 1830, 1840, and so on).
In this presentation, we will showcase the various strategies and tools we have used to speed up our encoding process. We combine simple tricks like keyboard shortcuts with more sophisticated processes to minimize drudgery and increase accuracy. Among the more interesting techniques are:
- Auto-tagging of a complete poem in lines and linegroups using a Schematron QuickFix;
- Use of advanced CSS selectors in the rendition/@selector attribute to reduce encoding clutter in the poem itself;
- A keyboard shortcut to tag rhymes which detects whether the tagged text is a masculine or feminine rhyme and provides the appropriate attribute value;
- Auto-detection of cases where a new line-end rhymes with a previously-encoded rhyme, and should, therefore, be labelled to match it, leveraging our growing dataset of nearly 30,000 rhymes;
- Instant access to a rendering of the poem which provides a visualization of the rhyme structure, auto-detection of anaphora, epistrophe and other refrain-like forms, and other diagnostic feedback.
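As a rough illustration of the first technique in the list, the following Python sketch wraps a plain-text poem in line groups and lines. It is not the project's Schematron QuickFix: Python stands in for it here, and the function name tag_poem, the @type values, and the rule that a blank line separates stanzas are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def tag_poem(raw):
    """Wrap blank-line-separated stanzas in <lg> and each line in <l>."""
    poem = ET.Element("lg", {"type": "poem"})
    stanza = None
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current stanza.
            stanza = None
            continue
        if stanza is None:
            stanza = ET.SubElement(poem, "lg", {"type": "stanza"})
        ET.SubElement(stanza, "l").text = line
    return poem


# Placeholder text, not a real poem from the dataset.
raw_text = """first line of stanza one
second line of stanza one

first line of stanza two
"""
print(ET.tostring(tag_poem(raw_text), encoding="unicode"))
```

Doing the same thing inside the XML editor, as the QuickFix does, has the advantage that the encoder never leaves the document being encoded.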
Manuscripta - The editor from past to future
1National Library of Sweden, Sweden; 2Språkbanken, University of Gothenburg, Sweden
In this paper we will introduce a web-based editor for TEI-encoded manuscript descriptions in Manuscripta - A Digital Catalogue of Manuscripts in Sweden (https://www.manuscripta.se). Cataloguing is done using an interface which does not require any knowledge of TEI and therefore simplifies the cataloguing process and reduces the time it requires. Previously, it was necessary to use an XML editor, which had a steep learning curve and was time-consuming as well as error-prone, even with schema validation and detailed cataloguing guidelines.
While implementations like TEI Publisher cover the TEI Processing Model, with complex text transformations for outputting different media types, navigation, pagination, search, and facsimile display by means of web components, the Manuscripta editor also covers other workflow tasks such as authority database lookups, advanced templating, editor sign-off, and Schematron rule validation, in addition to schema-based validation. Sharing many goals with TEI Publisher, the Manuscripta editor is all about standards, modularity, reusability, and sustainability!