Research Data Management: Challenges in a Changing World
E-Science-Tage 2025 · 12 to 14 March 2025 in Heidelberg and online
Conference programme
An overview of all sessions of this event.
Presentations B5: Software & Data Science
Session topics: Institutional facility, RDM initiative, Data law, data protection, data ethics and data security, Data archiving, reproducibility and reuse, NFDI relevance, Artificial intelligence, Life sciences, Natural sciences, Not applicable/Cross-disciplinary
Presentations
Institutional data science services at FDZ UB Mannheim: Enhancing research data management
Universitätsbibliothek, Universität Mannheim

Data science services have rapidly emerged as essential support mechanisms for research data management (RDM) at universities worldwide. Notable examples include services at Harvard University, the University of Utah, Purdue University, NC State University, and the University of Groningen. In Germany, several initiatives have demonstrated how data science services can drive research data management. These include the Data Science Center at the University of Bremen, the Bielefeld Center for Data Science, and recent discussions on establishing data science centers at higher education institutions.

In alignment with these developments, the research data center (FDZ) at the Mannheim University Library (UB Mannheim) has established institutional data science services at the University of Mannheim. Our goal is to enhance RDM, promote open science, and contribute to research reproducibility. We aim to empower researchers to undertake data science tasks with modern research data management practices. We support researchers throughout the entire data science pipeline, from data collection and processing to analysis, visualization, modeling, and reporting. Our services include not only expert consulting, RDM-focused training, and community engagement, but also implementing data science pipelines and writing data papers together with researchers.

We begin by advising on the data science components of funding proposals, ensuring the feasibility of the planned data science pipelines. We assist with or perform data acquisition using techniques such as web scraping, API calls, Optical Character Recognition (OCR), audio and video transcription, and data extraction from diverse sources (a minimal sketch follows this abstract). Once data is collected, we provide support for or perform data cleaning, exploratory analysis, and modeling using Python and R, with a strong emphasis on open science and reproducibility.

We guide researchers in writing open-source code, organizing their repositories on GitHub, archiving their code, models, data, and documentation in data repositories, ensuring adherence to the FAIR (Findable, Accessible, Interoperable, Reusable) principles, and writing data papers (a second sketch below illustrates a minimal metadata record). We support deploying customized AI systems (chatbots) and using free cloud and institutional infrastructures. Recognizing that many researchers, due to their educational background, may have little to no programming experience, we offer guidance on low-code and no-code tools, empowering them to perform complex analyses without extensive programming skills. Our services enhance the publication of research data by assisting researchers in presenting their data in accessible formats such as knowledge graphs, interactive web applications, and digital editions.

To foster collaboration and community engagement, we connect researchers with potential partners and actively participate in workshops and conferences hosted by our researchers, such as the data science meetups organized by the Mannheim Center for Data Science and the GESS (Graduate School of Economic and Social Sciences) Research Day. Our training sessions are part of the well-established “Research Skills” series, covering data science topics in RDM events such as “Data Literacy Essentials” and “RDM Seminars”.

This presentation will detail the development of data science services at the research data center of the Mannheim University Library. We will share our experiences in building these services, the challenges we faced, and the positive impact these services have had on RDM at our institution.
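As a minimal sketch of the data-acquisition step mentioned in the abstract: the Python snippet below fetches records from a web API and stores the raw payload with a timestamp. The endpoint URL, query parameters, and output path are placeholders introduced for illustration; they are not part of the FDZ services described above, and a real pipeline would add authentication, paging, and logging.

```python
"""Minimal data-acquisition sketch: call a (hypothetical) web API and store the
raw payload as a timestamped JSON file, so the unprocessed state stays
reproducible. Endpoint, parameters, and paths are placeholders."""

import json
from datetime import datetime, timezone
from pathlib import Path

import requests  # widely used third-party HTTP client

API_URL = "https://example.org/api/records"       # placeholder endpoint
PARAMS = {"query": "research data", "size": 100}  # placeholder query


def fetch_records(url, params):
    """Call the API once and return the decoded JSON payload."""
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    return response.json()


def store_raw(records, out_dir=Path("data/raw")):
    """Write the unmodified payload to disk under a UTC timestamped name."""
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_file = out_dir / f"records_{stamp}.json"
    out_file.write_text(json.dumps(records, indent=2, ensure_ascii=False))
    return out_file


if __name__ == "__main__":
    records = fetch_records(API_URL, PARAMS)
    path = store_raw(records)
    print(f"Stored {len(records)} records in {path}")
```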
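And as a sketch of the archiving step: a small, machine-readable metadata record placed next to a dataset supports the Findable and Reusable parts of FAIR before deposit in a repository. The field names, values, and directory below are illustrative assumptions, not an official FDZ or repository schema.

```python
"""Write a minimal metadata record next to a dataset before archiving it.
Field names follow common repository deposit forms (title, creators, license,
keywords); they are illustrative, not a prescribed schema."""

import json
from pathlib import Path


def write_metadata(dataset_dir: Path) -> Path:
    """Create metadata.json describing the files in dataset_dir."""
    metadata = {
        "title": "Example survey dataset",  # placeholder title
        "creators": [{"name": "Doe, Jane", "affiliation": "University of Mannheim"}],
        "description": "Cleaned survey responses, one row per participant.",
        "license": "CC-BY-4.0",  # pick the license that actually applies
        "keywords": ["survey", "research data management"],
        "files": sorted(p.name for p in dataset_dir.iterdir() if p.is_file()),
    }
    out_file = dataset_dir / "metadata.json"
    out_file.write_text(json.dumps(metadata, indent=2, ensure_ascii=False))
    return out_file


if __name__ == "__main__":
    target = Path("data/processed")  # placeholder dataset directory
    target.mkdir(parents=True, exist_ok=True)
    print(f"Wrote {write_metadata(target)}")
```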
Resilient Hosting of Research Data Management Services
High Performance and Cloud Computing Group at IT Center, Eberhard Karls Universität Tübingen

Modern research relies heavily on electronic data services, so it is crucial for an infrastructure provider to offer a reliable and highly available environment. Users of research data tools and services expect uninterrupted access to web-based platforms and protection from data loss or corruption over extended periods of time. Unreliable services lead to delays in the research process, problems in collaborations, user dissatisfaction, and eventually abandonment of the resource. An infrastructure provider can meet this expectation with a high-availability (HA) environment for services such as data analysis tools, data management resources, and even chat tools like Element.

The primary objectives of an HA environment are to minimize downtime in the event of a failure, prevent data loss, and facilitate automatic failover to ensure service stability. Another goal is to establish a flexible HA environment cost-effectively and without the threat of technology lock-in. Consequently, the approach presented here relies on open-source components and a cluster of virtual machines (VMs). The key components are Pacemaker and Distributed Replicated Block Device (DRBD).

DRBD uses real-time data replication to present the same data on two or more servers: copies of the main server's data are kept on all backup servers. Pacemaker acts as the cluster manager and performs automatic failover in case of issues within the server cluster. More importantly, Pacemaker has a fencing mechanism that prevents several servers from writing to the same portion of data concurrently (a toy illustration follows this abstract). This minimizes the possibility of data corruption and ensures data integrity. Combined with DRBD, Pacemaker ensures that if one server fails, another can immediately take over and provide access to the applications and data.

Scalability is an additional advantage of this setup. As research projects grow, the system can be scaled up by adding client servers or adjusting configurations. This HA setup offers the stability and reliability required in research environments, whether it is handling increasing numbers of users, larger datasets, or more complex research workflows.

Conclusion
Combining Pacemaker and DRBD provides a reliable high-availability solution for hosting applications such as GitLab in research institutions. Data is continuously replicated in the background and protected by fencing techniques, guaranteeing availability and protection from loss or corruption throughout ongoing research processes.
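To make the fencing-before-failover order described above more concrete, here is a small Python toy model of a two-node cluster. It is not Pacemaker or DRBD code; under simplified assumptions it only illustrates why the standby node is promoted to primary after the failed node has been fenced, so that two nodes never write to the replicated data at the same time.

```python
"""Toy model of fencing before failover in a two-node HA cluster.
Illustration only, not Pacemaker/DRBD code: the standby is promoted to
primary only after the old primary is fenced, which rules out two
concurrent writers on the same replicated data (split brain)."""

from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str = "standby"  # "primary" or "standby"
    healthy: bool = True
    fenced: bool = False

    @property
    def can_write(self) -> bool:
        # Only an unfenced primary may write to the replicated volume.
        return self.role == "primary" and not self.fenced


class Cluster:
    def __init__(self, primary: Node, standby: Node):
        primary.role = "primary"
        self.nodes = [primary, standby]

    def fence(self, node: Node) -> None:
        """Cut the node off from shared resources; in reality it would be
        rebooted or isolated, here we simply mark and demote it."""
        node.fenced = True
        node.role = "standby"

    def handle_failure(self, failed: Node) -> None:
        """Failover: fence the failed primary first, then promote the standby."""
        failed.healthy = False
        if failed.role != "primary":
            return  # losing a standby does not require promotion
        self.fence(failed)  # step 1: make sure the old primary cannot write
        survivor = next(n for n in self.nodes if n is not failed)
        survivor.role = "primary"  # step 2: promote the surviving node
        writers = [n.name for n in self.nodes if n.can_write]
        assert writers == [survivor.name], "at most one writer at any time"


if __name__ == "__main__":
    a, b = Node("node-a"), Node("node-b")
    cluster = Cluster(primary=a, standby=b)
    cluster.handle_failure(a)  # node-a crashes while it is primary
    print([(n.name, n.role, n.fenced) for n in cluster.nodes])
    # node-a ends up fenced and demoted, node-b is the only possible writer
```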
