Conference Agenda

Session Overview
Session: CEC afternoon 01
Time: Monday, 17/Jun/2019, 1:30pm - 4:30pm

Location: Room 103

Presentations
ID: 252 / CEC afternoon 01: 1
CEC session
Topics: Technology Uptake
Keywords: PubMed, Text and data mining, Jupyter notebooks, Reproducibility, Open Data

Mining PubMed metadata with Pandas and Jupyter Notebooks

Pablo Iriarte (1), Floriane Muller (2)

(1) Scientific Information Division, University Library of Geneva, Coordination Unit (CODIS). Rue du Général-Dufour 24, 1211 Geneva - Switzerland; (2) Scientific Information Division, University Library of Geneva, Medicine and Pharmaceutical Sciences Unit (CMU). Rue Michel-Servet 1, 1211 Geneva - Switzerland

PubMed is the main bibliographic database for life sciences research. Mastering it is a challenge for libraries, given the richness and variety of its content as well as the large volume and rapid growth of its metadata. New text and data mining tools, together with the computing power of recent computers, now make it possible to meet this challenge.

In this course, you will be introduced to Pandas and NLTK, two libraries for the Python programming language that provide powerful, easy-to-use functions for data structure manipulation, statistics and natural language analysis.

Participants will be able to select and extract relevant PubMed XML metadata and combine it with other data sources, such as their own library's journal collections, institutional repository (IR) references, Open Access information (Unpaywall or DOAJ) or Wikipedia.
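
As a rough illustration of this extract-and-combine step, the sketch below pulls three fields out of a PubMed-style XML fragment and attaches a holdings status by ISSN. It uses only the Python standard library. The sample records, the HOLDINGS table and the function names extract_records and add_holdings are invented for this example and are not course material.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like a PubMed XML export (invented records).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>10000001</PMID>
      <Article>
        <Journal><ISSN>1111-1111</ISSN></Journal>
        <ArticleTitle>First invented title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>10000002</PMID>
      <Article>
        <Journal><ISSN>2222-2222</ISSN></Journal>
        <ArticleTitle>Second invented title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

# Hypothetical second data source: journals held by the library, keyed by ISSN.
HOLDINGS = {"1111-1111": "subscribed"}


def extract_records(xml_text):
    """Return one dict per PubmedArticle with PMID, ISSN and title."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for article in root.iter("PubmedArticle"):
        records.append({
            "pmid": article.findtext("MedlineCitation/PMID"),
            "issn": article.findtext("MedlineCitation/Article/Journal/ISSN"),
            "title": article.findtext("MedlineCitation/Article/ArticleTitle"),
        })
    return records


def add_holdings(records, holdings):
    """Attach a holdings status to each record (a left join on ISSN)."""
    return [dict(r, holding=holdings.get(r["issn"], "not held")) for r in records]


if __name__ == "__main__":
    for row in add_holdings(extract_records(SAMPLE_XML), HOLDINGS):
        print(row)
```

With Pandas installed, the same list of dicts can be passed to pandas.DataFrame and joined to a holdings table with DataFrame.merge(on="issn", how="left"), which is closer to the approach the course describes.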

Each participant will choose a project in advance from a list of proposals. For example: extracting author affiliations from PubMed to identify publications from their institution using regular expressions and Levenshtein distance, then comparing these candidates with the records in their IR using title similarity and other metadata matching methods, so that they return home with usable data to complete and enrich their IR.
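
The example project could be sketched roughly as follows, assuming a hand-made affiliation pattern and a plain dynamic-programming Levenshtein distance. The pattern AFFILIATION_PATTERN, the normalization rule and the function names are illustrative assumptions, not the presenters' actual method. Any similarity threshold would need tuning on real data.

```python
import re

# Hypothetical pattern for one institution; real affiliation strings vary far more.
AFFILIATION_PATTERN = re.compile(
    r"univ\w*\.?\s+(?:of\s+|de\s+)?gen[eè]v[ea]", re.IGNORECASE
)


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalize(title):
    """Lowercase and replace every run of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def title_similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


if __name__ == "__main__":
    affiliation = "Faculty of Medicine, University of Geneva, Switzerland"
    print(bool(AFFILIATION_PATTERN.search(affiliation)))
    print(title_similarity("Mining PubMed metadata.", "Mining Pubmed meta-data"))
```

Normalizing titles before measuring distance keeps differences in case and punctuation from counting as edits, so the score reflects wording rather than formatting.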

At the same time, participants will learn how to put their code and its accompanying documentation in a Jupyter Notebook. This tool lets them create rich documents with text, mathematical formulas, graphics, images, even animations and videos, and also execute computer code directly from the notebook. Combining these free and open source tools makes it possible to work comfortably on large volumes of data while documenting the successive stages of research, thus respecting the principles of scientific reproducibility and achieving a high degree of transparency about research methods and results.

Learning Outcomes: Analyze PubMed data and understand its structure. Learn how to manipulate metadata in different formats (XML, JSON, CSV) and extract the parts of interest. Discover and evaluate open data sources that can be aggregated, and learn to combine different datasets to produce new knowledge. Learn how to make simple statistical calculations and create graphs to visualize the results. Learn how to create notebooks by combining computer code, generated figures and documentation. The course will also let you put yourself in the shoes of a researcher and help you understand the difficulties researchers may face in the context of ever-increasing transparency and reproducibility requirements.
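
To give a feel for the format handling and simple statistics listed above, here is a minimal standard-library sketch that serializes a few records to JSON and CSV and counts publications per year. The RECORDS list and the function names are invented for illustration. In the course itself these steps would more likely be done with Pandas.

```python
import csv
import io
import json
from collections import Counter

# Invented records standing in for metadata already extracted from PubMed XML.
RECORDS = [
    {"pmid": "10000001", "year": 2017, "journal": "Journal A"},
    {"pmid": "10000002", "year": 2018, "journal": "Journal B"},
    {"pmid": "10000003", "year": 2018, "journal": "Journal A"},
]


def to_json(records):
    """Serialize the records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["pmid", "year", "journal"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def publications_per_year(records):
    """Count how many records fall in each publication year."""
    return Counter(record["year"] for record in records)


if __name__ == "__main__":
    print(to_csv(RECORDS))
    print(publications_per_year(RECORDS))
```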

Level: Intermediate

Target audience: Librarians involved in research support, system librarians, IT professionals working in biomedical libraries, and any other information and documentation specialists who wish to acquire skills in text and data mining and to use tools for extracting information and manipulating large volumes of structured and semi-structured data.

Preparation for the session: Yes

Biography and Bibliography
Pablo Iriarte is the information technology coordinator at the University Library of Geneva, Switzerland. He is also a part-time teacher in the Information Science department of the Geneva School of Business Administration. Previously, he worked for many years as an IT librarian specialist at the Lausanne University Medical Library and as a research data librarian and webmaster at the Data and Documentation unit of the Institute of Social and Preventive Medicine in Lausanne. His research interests are open science, research data, the Semantic Web and the development of open source software for academic libraries.

Floriane Muller works as an open access and research data librarian at the medical and pharmaceutical unit of the University of Geneva Library. She also collaborates with colleagues on teaching sessions and collection management. She has a master's degree in Information Science from the University of Applied Sciences Western Switzerland.


 
Conference: EAHIL Workshop 2019