Preliminary Conference Agenda

Overview and details of the sessions of this conference.

This agenda is preliminary and subject to change.

Session Overview
Session: DII 1: Digital Information Infrastructures 1
Time: Tuesday, 28/Mar/2023, 3:30pm - 5:00pm
Location: Room 4


Presentations
3:30pm - 4:00pm

A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents

N. Meuschke1, A. Jagdale2, T. Spinde1, J. Mitrović2,3, B. Gipp1

1University of Göttingen, Germany; 2University of Passau, Germany; 3The Institute for Artificial Intelligence R&D of Serbia

Extracting information from academic PDF documents is crucial for numerous indexing, retrieval, and analysis use cases. Choosing the best tool to extract specific content elements is difficult because many, technically diverse tools are available, but recent performance benchmarks are rare. Moreover, such benchmarks typically cover only a few content elements like header metadata or bibliographic references and use smaller datasets from specific academic disciplines. We provide a large and diverse evaluation framework that supports more extraction tasks than most related datasets. Our framework builds upon DocBank, a multi-domain dataset of 1.5M annotated content elements extracted from 500K pages of research papers on arXiv. Using the new framework, we benchmark ten freely available tools in extracting document metadata, bibliographic references, tables, and other content elements from academic PDF documents. GROBID achieves the best metadata and reference extraction results, followed by CERMINE and Science Parse. For table extraction, Adobe Extract outperforms other tools, even though the performance is much lower than for other content elements. All tools struggle to extract lists, footers, and equations. We conclude that more research on improving and combining tools is necessary to achieve satisfactory extraction quality for most content elements. Evaluation datasets and frameworks like the one we present support this line of research. We make our data and code publicly available to contribute toward this goal.
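
For orientation, the sketch below shows how header metadata might be requested from one of the benchmarked tools, GROBID, via its REST service. This is a minimal illustration only; the localhost URL, port, and file name are assumptions and are not taken from the paper.

```python
# Minimal sketch: requesting header metadata (title, authors, abstract) for a
# PDF from a locally running GROBID service. The service URL/port and the
# input file name are assumptions for illustration.
import requests

GROBID_HEADER_ENDPOINT = "http://localhost:8070/api/processHeaderDocument"

def extract_header_metadata(pdf_path: str) -> str:
    """Send a PDF to GROBID and return the TEI-XML header metadata."""
    with open(pdf_path, "rb") as pdf_file:
        response = requests.post(
            GROBID_HEADER_ENDPOINT,
            files={"input": pdf_file},
            timeout=60,
        )
    response.raise_for_status()
    return response.text  # TEI XML describing title, authors, abstract, etc.

if __name__ == "__main__":
    print(extract_header_metadata("example_paper.pdf")[:500])
```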



4:00pm - 4:30pm

Time lag analysis of adding scholarly references to English Wikipedia: How rapidly are they added and how fresh are they?

J. Kikkawa, M. Takaku, F. Yoshikane

University of Tsukuba, Japan

Referencing scholarly documents as information sources on Wikipedia is important because they complement and improve the quality of Wikipedia content. However, little is known about these references, such as how rapidly they are added and how fresh they are. To answer these questions, we conduct a time-series analysis of the addition of scholarly references to the English Wikipedia as of October 2021. We detect no tendency for recently created Wikipedia articles to cite fresher references, because the time lag between the publication of a scholarly article and the addition of its reference to Wikipedia articles has remained generally constant over the years. In contrast, the time lag between the creation of a Wikipedia article and the addition of its first scholarly reference shows a tendency to decrease over time. The percentage of cases in which scholarly references were added at the same time as the Wikipedia article was created has increased over the years, particularly since 2007-2008. This trend can be seen as a response by a broad range of editors to the policy changes of the Wikipedia community at that time, rather than the result of massive activity by a small number of editors.
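
As a rough illustration of the two time-lag measures discussed above, the sketch below computes (a) the lag between a scholarly article's publication and the addition of its reference to a Wikipedia article, and (b) the lag between a Wikipedia article's creation and its first scholarly reference. The CSV file and column names are hypothetical stand-ins, not the paper's dataset.

```python
# Hypothetical input: one row per (Wikipedia article, scholarly reference) pair
# with publication, reference-addition, and article-creation timestamps.
import pandas as pd

df = pd.read_csv(
    "wikipedia_references.csv",
    parse_dates=["publication_date", "reference_added_date",
                 "wikipedia_article_created_date"],
)

# Lag 1: scholarly article publication -> reference added to Wikipedia
df["pub_to_added_days"] = (
    df["reference_added_date"] - df["publication_date"]
).dt.days

# Lag 2: Wikipedia article creation -> first scholarly reference added
first_ref = (
    df.sort_values("reference_added_date")
      .groupby("wikipedia_article_id", as_index=False)
      .first()
)
first_ref["created_to_first_ref_days"] = (
    first_ref["reference_added_date"]
    - first_ref["wikipedia_article_created_date"]
).dt.days

# Yearly medians show whether each lag shrinks or stays constant over time.
print(df.groupby(df["reference_added_date"].dt.year)["pub_to_added_days"].median())
print(first_ref.groupby(first_ref["reference_added_date"].dt.year)
      ["created_to_first_ref_days"].median())
```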



4:30pm - 5:00pm

Is there a scientific digital divide? Information seeking in the international context of astronomy research

G. R. Stahlman

Rutgers University, United States of America

Access to informational research resources is critical to successful scientific work across disciplines. This study leverages a previously conducted survey of corresponding authors of a sample of astronomy journal articles to investigate the existence and nature of a global “scientific digital divide”. Variables from the survey are operationalized, including the GDP of the respondent's country, whether the paper was produced through international collaboration, whether the author collected original observational data, and whether the author located data through accessing the literature. For exploratory purposes, Pearson’s r and Spearman’s rank correlation coefficients were calculated to test possible relationships between variables, and some preliminary evidence is presented in support of a scientific digital divide in astronomy. International collaboration is more common for respondents in lower-GDP countries; collecting observational data is more common with international collaboration; paper citation is impacted for respondents who do not collaborate internationally; and respondents from lower-GDP countries do not discover data through the scholarly literature less frequently. The study concludes that collaborative networks may be key to mitigating information-seeking challenges in astronomy. These dynamics should be investigated through further research.
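
As a minimal illustration of the exploratory correlation analysis described above, the sketch below computes Pearson's r and Spearman's rank correlation for a few variable pairs with SciPy. The file name, column names, and variable coding are hypothetical and not taken from the study.

```python
# Minimal sketch of exploratory correlation tests (Pearson's r and Spearman's
# rho) between operationalized survey variables. The CSV file and column
# names are hypothetical placeholders for the study's variables.
import pandas as pd
from scipy.stats import pearsonr, spearmanr

survey = pd.read_csv("astronomy_survey.csv")

pairs = [
    ("country_gdp", "international_collaboration"),
    ("international_collaboration", "collected_original_data"),
    ("country_gdp", "found_data_via_literature"),
]

for x, y in pairs:
    data = survey[[x, y]].dropna()          # drop incomplete responses
    r, r_p = pearsonr(data[x], data[y])
    rho, rho_p = spearmanr(data[x], data[y])
    print(f"{x} vs {y}: Pearson r = {r:.2f} (p = {r_p:.3f}), "
          f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
```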