1Chiba University, Japan; 2University of Tsukuba, Japan; 3International Institute for Digital Humanities, Japan
Human-assisted OCR is a common approach for transcribing books and has been used in many digital library projects. This paper reports on our project to transcribe the book collections of the National Diet Library using this approach. Our project is unique in two ways. First, we try to extend the human-assisted OCR approach by distributing microtasks in many ways other than just showing tasks on a specific Web page on PC screens. Second, we deal with Japanese books, which contain thousands of characters, some of which look similar to each other. This paper shows that we can expect high-quality results even when we transcribe Japanese texts with microtasks, and that we can expect the number of performed microtasks to be stable if we distribute microtasks to equipment with which workers perform microtasks in their daily lives.
CORA: A Platform to Support Citation Context Analysis
Bei Yu, Yatish Hegde, Yingya Li
Syracuse University, United States of America
In scholarly communication, researchers not only express their evaluative opinions towards peer work in citation statements, but also retrieve and summarize these opinions by following citation links. Aggregated citation opinions are critical for determining the validity of scientific claims as well as for identifying the best methods to be applied, which can provide valuable input for comprehensive literature reviews and a reference for research impact assessment. Current efforts in automated citation context analysis have focused more on algorithm design and much less on designing an open-access platform to support sustainable research. We designed CORA, an open-access platform to support citation context analysis. This paper describes the CORA infrastructure, which supports major tasks in citation context analysis, including free-text human annotation, management of large text corpora with layers of annotations and markups, citation context retrieval, developer APIs for algorithm design and implementation, and user evaluation.
Evaluating the CC-IDF citation-weighting scheme: How effectively can ‘Inverse Document Frequency’ (IDF) be applied to references?
Joeran Beel1, Corinna Breitinger2, Stefan Langer3
1Trinity College Dublin, School of Computer Science & Statistics, ADAPT Centre, Ireland; 2University of Konstanz, Department of Computer and Information Science, Germany; 3Otto-von-Guericke University Magdeburg, Department of Computer Science, Germany
In the domain of academic search engines and research-paper recommender systems, CC-IDF is a common citation-weighting scheme. CC-IDF adopts the principles of the term-weighting scheme TF-IDF and assumes that if a rare citation is shared by two documents, then this occurrence should receive a higher weight than if a citation is shared among a large number of documents. Although CC-IDF is in common use, we found no empirical evaluation and comparison of CC-IDF with plain citation weight (CC-Only). Therefore, we conducted such an evaluation and present the results in this paper. The effectiveness of CC-IDF and CC-Only was measured using click-through rate (CTR). For 238,681 delivered recommendations, CC-IDF had about the same effectiveness as CC-Only (CTR of 6.15% vs. 6.23%). In other words, CC-IDF was not more effective than CC-Only, which is a surprising result. We provide a number of potential reasons and suggest conducting further research to understand the principles of CC-IDF in more detail.
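The difference between the two weighting schemes can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the function names `cc_only` and `cc_idf`, the toy corpus, and the choice of a natural-log IDF summed over shared references are all assumptions made for illustration.

```python
import math
from typing import Dict, Set


def cc_only(refs_a: Set[str], refs_b: Set[str]) -> float:
    """Plain citation weight: every shared reference counts as 1."""
    return float(len(refs_a & refs_b))


def cc_idf(refs_a: Set[str], refs_b: Set[str], corpus: Dict[str, Set[str]]) -> float:
    """IDF-weighted citation overlap (assumed form: sum of log(N / n_ref)).

    N is the number of documents in the corpus and n_ref is the number of
    documents that cite the shared reference, so rarely cited references
    contribute more than references cited by almost every document.
    """
    n_docs = len(corpus)
    score = 0.0
    for ref in refs_a & refs_b:
        n_citing = sum(1 for refs in corpus.values() if ref in refs)
        if n_citing > 0:
            score += math.log(n_docs / n_citing)
    return score


# Hypothetical toy corpus: document id -> set of cited reference ids.
corpus = {
    "d1": {"r1", "r2"},
    "d2": {"r1", "r3"},
    "d3": {"r1", "r2"},
    "d4": {"r1"},
}

plain = cc_only(corpus["d1"], corpus["d3"])
weighted = cc_idf(corpus["d1"], corpus["d3"], corpus)
print(plain, weighted)
```

In this toy corpus, `r1` is cited by every document and therefore contributes nothing under the IDF weighting, while the rarer `r2` does; CC-Only counts both shared references equally.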
TF-IDuF: A Novel Term-Weighting Scheme for User Modeling based on Users’ Personal Document Collections
Joeran Beel1, Stefan Langer2, Bela Gipp3
1Trinity College Dublin, School of Computer Science & Statistics, ADAPT Centre, Ireland; 2Otto-von-Guericke University Magdeburg, Department of Computer Science, Germany; 3University of Konstanz, Department of Computer and Information Science, Germany
TF-IDF is a popular term-weighting scheme, but with regard to recommender systems we see two shortcomings. First, calculating IDF requires access to the document corpus from which recommendations are made. Such access is not always available. Second, TF-IDF ignores information from a user’s personal document collection, which could, so we hypothesize, enhance the user modeling process. We introduce TF-IDuF as a term-weighting scheme that does not require access to the document corpus and that considers information from the users’ personal document collections. We evaluated the effectiveness of TF-IDuF compared to TF-IDF and TF-Only and found that TF-IDF and TF-IDuF perform similarly, with click-through rates (CTR) of 5.09% vs. 5.14%. Hence, both are around 25% more effective than TF-Only (CTR of 4.06%). We conclude that TF-IDuF is a promising term-weighting scheme. It is also notable that TF-IDuF and TF-IDF are not mutually exclusive, so both metrics may be combined into a more effective term-weighting scheme.
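One plausible reading of this idea is sketched below. The abstract does not give the formula, so the exact form here is an assumption: term frequency is counted over the documents chosen for user modeling, and the inverse document frequency is computed over the user's personal collection instead of the recommendation corpus. The function name `tf_iduf` and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_iduf(model_docs: List[List[str]],
            user_collection: List[List[str]]) -> Dict[str, float]:
    """Weight terms by tf * log(N_u / n_u(t)) (assumed form).

    tf is the term count over the documents used for user modeling,
    N_u is the size of the user's personal collection, and n_u(t) is the
    number of documents in that collection containing term t. No access
    to the recommendation corpus is needed.
    """
    tf = Counter(term for doc in model_docs for term in doc)
    n_user = len(user_collection)
    weights: Dict[str, float] = {}
    for term, freq in tf.items():
        n_with_term = sum(1 for doc in user_collection if term in doc)
        if n_with_term == 0:
            continue  # term never occurs in the personal collection
        weights[term] = freq * math.log(n_user / n_with_term)
    return weights


# Hypothetical personal collection: each document is a list of terms.
collection = [
    ["citation", "ranking"],
    ["citation", "ocr"],
    ["citation", "ranking", "ranking"],
]

# Model the user from the most recently added document only (an assumption).
weights = tf_iduf(model_docs=[collection[-1]], user_collection=collection)
print(weights)
```

Here "citation" appears in every document of the personal collection and receives a weight of zero, while "ranking" is both frequent in the modeled document and less common across the collection, so it dominates the user model.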