Wed 4b: TEI across corpora, languages, and cultures II
3:30pm - 5:00pm
Session Chair: Martin David Holmes, University of Victoria
Location: Lecture Hall HS 15.12, RESOWI building, section C, first floor
Advantages and challenges of tokenized TEI
Universidade de Coimbra, Portugal
TEI offers the option to split a text into words/tokens. However, existing tokenized corpora in TEI, such as the TEI version of the BNC corpus, rarely make full use of TEI; rather, they are TEI-based renderings of traditional verticalized texts, with no elaborate TEI markup. Combining full-fledged TEI documents with tokenization raises several issues, but also brings advantages. In this paper, I will discuss the solutions implemented in TEITOK, a TEI-based corpus tool that combines a searchable CWB corpus with editable TEI/XML files.
One set of problems is that elements such as <hi> can break tokens. This can be solved by splitting such elements into segments aligned with token boundaries, where the new elements resulting from the split are explicitly marked as repetitions.
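A minimal sketch of what such a split might look like in a tokenized document, assuming TEITOK-style <tok> elements (the use of @corresp to mark the second segment as a repetition of the first is an illustrative assumption, not necessarily TEITOK's actual mechanism):

```xml
<!-- Untokenized: the highlighting crosses a token boundary -->
<p>a g<hi rend="italic">reat ex</hi>ample</p>

<!-- Tokenized: <hi> is split at the token boundary, and the second
     segment is explicitly marked as a repetition of the first -->
<p>a <tok>g<hi xml:id="h1" rend="italic">reat</hi></tok>
   <tok><hi corresp="#h1" rend="italic">ex</hi>ample</tok></p>
```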
Another set of problems has to do with unary XML elements: as pointed out by Krause et al. (2013), for instance, TEI does not directly indicate which page a token belongs to, since <pb> is a unary (empty) element. However, this is easily overcome in an indexed corpus by attaching each token to the last preceding <pb>.
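The attachment step can be sketched in a few lines: walk the document in order, remember the most recent page break, and assign that page to each token encountered. This is a minimal illustration using Python's standard XML library, with a hypothetical tokenized fragment (the <tok> element name follows TEITOK; the id/n attribute names are illustrative):

```python
import xml.etree.ElementTree as ET

# Hypothetical tokenized TEI fragment: <pb/> is unary, so page
# membership must be recovered from document order.
xml = """<text>
  <pb n="1"/>
  <tok id="t1">first</tok>
  <tok id="t2">page</tok>
  <pb n="2"/>
  <tok id="t3">second</tok>
</text>"""

root = ET.fromstring(xml)
page = None
tok_page = {}
for el in root.iter():       # iter() walks elements in document order
    if el.tag == "pb":
        page = el.get("n")   # remember the last page break seen
    elif el.tag == "tok":
        tok_page[el.get("id")] = page

print(tok_page)  # {'t1': '1', 't2': '1', 't3': '2'}
```

In an indexed corpus such as CWB this becomes a structural attribute on the token, so "which page is this token on?" is answered at query time without re-walking the XML.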
There are also advantages to tokenized TEI. For instance, there is no need for @break="no", since word-breaking <lb> elements are simply those inside a token. Tokens can also be associated with several forms, which makes it possible to avoid <ex> and <choice>, resulting in cleaner TEI.
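Two minimal sketches of these advantages, again assuming TEITOK-style <tok> elements (the @nform attribute name for the normalized form is an illustrative assumption):

```xml
<!-- A word split across a line break: the <lb/> sits inside the
     token, so no break="no" attribute is needed -->
<tok>over<lb/>due</tok>

<!-- An abbreviated form carrying its expansion as a token attribute,
     avoiding <choice>/<ex> inside the transcription -->
<tok nform="domini">dni</tok>
```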
An inline-tokenized TEI document with linguistic annotation can be turned into a fully searchable corpus, and the growing number of corpora in TEITOK shows the usefulness of this approach. However, I will briefly review some elements of the TEI header that are less than ideal for a searchable corpus, mostly having to do with metadata that the TEI header conceives of in a human-readable fashion.
Modelling linguistic knowledge in TEI: the case of the Vienna Corpus of Arabic Varieties
K. Mörth, D. Schopper
Austrian Academy of Sciences, Austria
VICAV’s main objective has been to collect and make available digital material concerning contemporary spoken Arabic varieties, including both linguistically relevant data and methodological information on data and tools that can be applied in digitally enabled dialectology. Despite its name, VICAV has been working on a number of quite divergent types of digital language resources, such as language profiles, linguistic feature lists, sample texts, bibliographies, dictionaries, and documentation of digital tools and workflows. Situated at the crossroads between diatopic linguistic approaches and research-driven text technology, the project has served quite diverse aims: teaching spoken varieties of modern Arabic, teaching comparative Arabic linguistics, teaching text encoding by means of the TEI, as well as experimenting with new technologies.
VICAV was conceived as a ‘research lab’ allowing researchers to work on new tools and on methodological aspects of data creation and visualisation. One of the results of the project is an easily deployable and maintainable environment which, in its most recent version, relies entirely on X-technologies: data is stored in a BaseX database and retrieved directly via REST, with the logic implemented in XQuery, XSLT and XPath. The current interface is characterised by a dual approach to data representation, allowing data to be accessed both through interactive maps and through traditional query interfaces. Results are visualised in specialised viewers that enable researchers and students to study the data by juxtaposing, and thus comparing, them.
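As a hedged sketch of what such an XQuery retrieval over a BaseX collection might look like (the collection name "vicav", the idno-based lookup, and the element paths are illustrative assumptions, not VICAV's actual code):

```xquery
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare variable $profile-id external;

(: Fetch the body of one language profile from the collection;
   documents are TEI P5, so the query is plain XPath over tei:* :)
for $doc in collection("vicav")//tei:TEI
where $doc//tei:publicationStmt/tei:idno = $profile-id
return $doc//tei:text/tei:body
```

In a setup like the one described, such a query would be exposed through BaseX's RESTXQ layer, so the interface retrieves rendered results over plain HTTP.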
One of the challenges of the project was the integration of the heterogeneous materials into a harmonised system allowing for flexible extensions. The multi-purpose research environment is entirely based on TEI P5 covering many different types of text, ranging from dictionary entries and linguistically annotated texts and corpora to georeferenced bibliographical records and a subject-specific taxonomy.
A sign of the times: medieval punctuation, its encoding and its rendition in modern times
E. Cugliana1, G. Barabucci2
1Università Ca' Foscari Venezia, Italy / Universität zu Köln, Germany; 2Universität zu Köln, Germany
Digitally managing punctuation in editions of medieval manuscripts is one of those issues that initially look like minor details, but later reveal themselves to be a tangled web of problems, spanning from computer science (how should punctuation signs be represented?) to philology (what types of signs exist?) through epistemology (is the processing of punctuation a mere technical transformation, or a valuable part of the scholarship?). The aim of this paper is to address the theoretical aspects of these questions and their practical implications, providing a couple of solutions that fit the paradigms and technologies of the TEI.
The debate on how to deal with medieval punctuation is a long and still open one. Following Contini (1992), the interpretative edition of a manuscript is a translation of one historically attested system into another. Accordingly, the philologist should recognize the punctuation system of the manuscript and convert it into a modern one. There are, however, no established universal methods for doing so; most of this work is left to the experience (and taste) of the scholar. In practice, editors often substitute the original punctuation with a modern one. This improves the readability of the text, but leads to an intolerable loss of textual information.
In this paper we discuss the approaches and methods used to encode, record, process and transform the punctuation of some German manuscripts of Marco Polo’s travel account. In addition to showing the TEI encoding of the signs, we address the topic of the transformation of a single original source into different transcriptions: from a “hyperdiplomatic” edition to an interpretative one, going through a spectrum of intermediate levels of normalization. We also reflect on the separation between transcription and analysis, as well as on the role of the editor when the edition is the output of a semi-automated process.
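One way such signs can be recorded in TEI, using the standard <pc> (punctuation character) element together with <choice> to keep both the diplomatic sign and a modern rendition, is sketched below (the @type value and the use of U+2E4E PUNCTUS ELEVATUS MARK are illustrative assumptions; the paper's actual markup may differ):

```xml
<!-- A punctus elevatus transcribed diplomatically, with a modern
     comma offered as its normalized counterpart -->
<choice>
  <orig><pc type="punctus-elevatus">&#x2E4E;</pc></orig>
  <reg><pc>,</pc></reg>
</choice>
```

Keeping both readings in the source is what allows a single transcription to feed everything from a hyperdiplomatic rendering (showing <orig>) to an interpretative one (showing <reg>).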