CEC afternoon 01
ID: 252 / CEC afternoon 01: 1
Topics: Technology Uptake
Keywords: PubMed, Text and data mining, Jupyter notebooks, Reproducibility, Open Data
Mining PubMed metadata with Pandas and Jupyter Notebooks
1Scientific Information Division, University Library of Geneva, Coordination Unit (CODIS). Rue du Général-Dufour 24, 1211 Geneva - Switzerland; 2Scientific Information Division, University Library of Geneva, Medicine and Pharmaceutical Sciences Unit (CMU). Rue Michel-Servet 1, 1211 Geneva - Switzerland
PubMed is the main bibliographic database for life sciences research. Mastering it is a challenge for libraries, given the richness and variety of its content as well as the large volume and rapid growth of its metadata. New text and data mining tools, together with the computing power of recent computers, now make it possible to meet this challenge.
In this course, you will be introduced to Pandas and NLTK, two libraries for the Python programming language that provide powerful, easy-to-use functions for data structure manipulation, statistics and natural language analysis.
Participants will be able to select and extract relevant PubMed XML metadata and combine it with other data sources, such as their own library's journal collections, institutional repository (IR) references, Open Access information (Unpaywall or DOAJ) or Wikipedia.
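As an illustration of this extract-and-combine step, here is a minimal sketch, not the course material itself. It assumes a PubMed XML export using the usual `PubmedArticleSet` / `PubmedArticle` / `MedlineCitation` layout. The sample records, the `holdings` table and the `pubmed_to_frame` helper are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hand-written sample in the PubMed XML layout (invented records, not real data).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <Journal><ISSN>1111-1111</ISSN><Title>Journal A</Title></Journal>
        <ArticleTitle>First example title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <Journal><ISSN>2222-2222</ISSN><Title>Journal B</Title></Journal>
        <ArticleTitle>Second example title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def pubmed_to_frame(xml_text):
    """Keep only a few chosen fields from each PubmedArticle element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for article in root.iter("PubmedArticle"):
        rows.append({
            "pmid": article.findtext("MedlineCitation/PMID"),
            "issn": article.findtext("MedlineCitation/Article/Journal/ISSN"),
            "journal": article.findtext("MedlineCitation/Article/Journal/Title"),
            "title": article.findtext("MedlineCitation/Article/ArticleTitle"),
        })
    return pd.DataFrame(rows)


# Hypothetical list of journals held by the library, keyed by ISSN.
holdings = pd.DataFrame({"issn": ["1111-1111"], "package": ["Example package"]})

articles = pubmed_to_frame(SAMPLE_XML)

# A left join keeps every article; the indicator column shows which ones matched.
combined = articles.merge(holdings, on="issn", how="left", indicator=True)
combined["in_collection"] = combined["_merge"] == "both"

print(combined[["pmid", "journal", "in_collection"]])
```

The same join pattern would apply to any other source that shares an identifier with PubMed, for example a DOI for Open Access information.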
Each participant will freely choose a project in advance from a list of proposals. For example: extracting author affiliations from PubMed to identify publications from one's institution using regular expressions and Levenshtein distance, then comparing these candidates with the records in the IR using title similarity and other metadata matching methods. Participants can then go home with usable data to complete and enrich their IR.
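The example project could be sketched as follows. This is only an assumption of how the two techniques might be combined, not the method used in the course. The `INSTITUTION` pattern, the `normalize`, `is_local` and `title_similarity` helpers and the sample strings are all hypothetical. The edit distance is written out in plain Python to keep the sketch self-contained; NLTK also provides an `edit_distance` function that could be used instead.

```python
import re


def levenshtein(a, b):
    """Dynamic-programming edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text):
    """Lowercase and collapse punctuation so that cosmetic differences count less."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


# Hypothetical pattern for one institution, covering English and French spellings.
INSTITUTION = re.compile(r"univ\w*\.?\s+(of\s+|de\s+)?gen[eè]v[ea]", re.IGNORECASE)


def is_local(affiliation):
    """True when the affiliation string appears to mention the institution."""
    return INSTITUTION.search(affiliation) is not None


def title_similarity(a, b):
    """Similarity between 0 and 1 derived from the normalized edit distance."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


affiliations = [
    "Faculty of Medicine, University of Geneva, Switzerland",
    "Université de Genève, Suisse",
    "University of Lausanne, Switzerland",
]
print([is_local(a) for a in affiliations])
print(title_similarity("Mining PubMed metadata", "Mining Pubmed meta-data."))
```

In practice, the similarity threshold above which a PubMed record and an IR record are treated as the same publication would have to be tuned against a manually checked sample.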
At the same time, participants will learn how to integrate their code and write the accompanying documentation in a Jupyter Notebook. This tool will allow them to create rich documents with text, mathematical formulas, graphics, images, even animations and videos, and also to execute computer code directly from the notebook. The combination of these free and open source tools makes it possible to work comfortably on large volumes of data while documenting the successive stages of research, thus respecting the principles of scientific reproducibility and achieving a high degree of transparency about research methods and results.
Learning Outcomes : Analyze PubMed data and understand its structure. Learn how to manipulate metadata in different formats (XML, JSON, CSV) and extract the relevant parts. Discover and evaluate open data sources that can be aggregated, and learn to combine different datasets to produce new knowledge. Learn how to make simple statistical calculations and create graphs to visualize the results. Learn how to create notebooks by combining computer code, generated figures and documentation. The course will also allow you to put yourself in the shoes of a researcher and help you understand the difficulties researchers may face in the context of ever-increasing transparency and reproducibility requirements.
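To give an idea of what combining formats and making a simple statistical calculation can look like in Pandas, here is a small sketch under stated assumptions: the CSV and JSON snippets are invented, and the column names (`pmid`, `year`, `doi`, `is_oa`) are hypothetical rather than taken from a real export.

```python
import io

import pandas as pd

# Invented references in CSV form, as they might be exported from a repository.
CSV_TEXT = "pmid,year,doi\n1,2017,10.1000/a\n2,2017,10.1000/b\n3,2018,10.1000/c\n"

# Invented Open Access status records in JSON form, keyed by DOI.
JSON_TEXT = (
    '[{"doi": "10.1000/a", "is_oa": true},'
    ' {"doi": "10.1000/b", "is_oa": false},'
    ' {"doi": "10.1000/c", "is_oa": true}]'
)

refs = pd.read_csv(io.StringIO(CSV_TEXT))
oa = pd.read_json(io.StringIO(JSON_TEXT))

# Combine the two datasets on their shared identifier.
merged = refs.merge(oa, on="doi", how="left")

# Simple statistic: share of Open Access references per publication year.
oa_share = merged.groupby("year")["is_oa"].apply(lambda s: s.astype(int).mean())
print(oa_share)

# In a notebook, a bar chart could follow (requires matplotlib):
# oa_share.plot(kind="bar")
```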
Level : Intermediate
Target audience : Librarians involved in research support missions, system librarians, IT professionals working in biomedical libraries, and any other information and documentation specialists who wish to acquire skills in text and data mining and use tools to extract information and manipulate large volumes of structured and semi-structured data
Preparation for the session : Yes
Biography and Bibliography
Pablo Iriarte is the information technology coordinator at the University Library of Geneva, Switzerland. He is also a part-time teacher in the Information Science department of the Geneva School of Business Administration. Previously, he worked for many years as an IT librarian specialist at the Lausanne University Medical Library and as a research data librarian and webmaster at the Data and Documentation unit of the Institute of Social and Preventive Medicine in Lausanne. His research interests are open science, research data, the Semantic Web and the development of open source software for academic libraries.
Floriane Muller works as an open access and research data librarian at the medical and pharmaceutical unit of the University of Geneva Library. She also collaborates with colleagues on teaching sessions and collection management. She has a master's degree in Information Science from the University of Applied Sciences Western Switzerland.