PubMed is the main bibliographic database in life sciences research. Mastering its content is a challenge for libraries, given the richness and variety of its records and the large volume and rapid growth of its metadata. New text and data mining tools, together with the computing power of current machines, now make it possible to meet this challenge.
In this course, you will be introduced to Pandas and NLTK, two Python libraries that provide powerful, easy-to-use data structures along with functions for data manipulation, statistics and natural language analysis.
Participants will be able to select and extract relevant PubMed XML metadata and combine it with other data sources, such as their own library's journal collections, institutional repository (IR) references, Open Access information (Unpaywall or DOAJ) or Wikipedia.
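The extraction and combination step described above can be sketched as follows. This is a minimal illustration, not the course material: it uses only the Python standard library so that it runs without installing anything, the XML sample is made up (it only mimics the `PubmedArticle` / `MedlineCitation` layout of PubMed exports), and `OPEN_ACCESS_ISSNS` is a hypothetical stand-in for a DOAJ or journal-collection export. With Pandas, the same join would typically be written as a `DataFrame.merge` on the ISSN column.

```python
import xml.etree.ElementTree as ET

# Made-up sample that mimics the PubMed XML layout; these are not real records.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <Journal><ISSN>1111-1111</ISSN></Journal>
        <ArticleTitle>First example title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article>
        <Journal><ISSN>2222-2222</ISSN></Journal>
        <ArticleTitle>Second example title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

# Hypothetical lookup set standing in for an Open Access journal list.
OPEN_ACCESS_ISSNS = {"1111-1111"}


def extract_records(xml_text):
    """Return one dict per PubmedArticle with a few selected fields."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for article in root.iter("PubmedArticle"):
        records.append({
            "pmid": article.findtext("MedlineCitation/PMID"),
            "issn": article.findtext("MedlineCitation/Article/Journal/ISSN"),
            "title": article.findtext("MedlineCitation/Article/ArticleTitle"),
        })
    return records


def flag_open_access(records, oa_issns):
    """Add an 'open_access' flag to each record, based on its journal ISSN."""
    return [dict(rec, open_access=rec["issn"] in oa_issns) for rec in records]


records = flag_open_access(extract_records(SAMPLE_XML), OPEN_ACCESS_ISSNS)
for rec in records:
    print(rec["pmid"], rec["issn"], rec["open_access"])
```

Keeping the extraction and the enrichment in separate functions makes it easy to swap the lookup set for another source, such as an IR export, without touching the XML parsing.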
Each participant will choose a project in advance from a list of proposals. For example: extracting author affiliations from PubMed to identify publications from the participant's institution using regular expressions and Levenshtein distance, then comparing these candidates with the records in the institution's IR using title similarity and other metadata matching methods. Participants will thus go home with usable data to complete and enrich their IR.
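As a rough sketch of the example project above, the code below matches affiliations with a regular expression and compares titles with a Levenshtein distance. Everything specific in it is an assumption made for illustration: the institution name "Exampleville", the `INSTITUTION` pattern, the helper names and the `max_ratio` threshold are all hypothetical. The edit distance is implemented by hand to keep the block self-contained; NLTK also provides an `edit_distance` function that could be used instead.

```python
import re


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def normalize(text):
    """Lowercase and collapse punctuation and whitespace before comparing."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


# Hypothetical pattern for a fictional institution; a real one would need
# to cover the name variants actually found in the affiliation strings.
INSTITUTION = re.compile(r"\buniv(ersity|\.)?\s+of\s+exampleville\b", re.IGNORECASE)


def is_local(affiliation):
    """True if the affiliation string mentions the (fictional) institution."""
    return INSTITUTION.search(affiliation) is not None


def similar_titles(title_a, title_b, max_ratio=0.1):
    """True if the normalized titles differ by at most max_ratio of their length."""
    a, b = normalize(title_a), normalize(title_b)
    longest = max(len(a), len(b)) or 1
    return levenshtein(a, b) / longest <= max_ratio


print(is_local("Dept. of Biology, Univ. of Exampleville, Nowhere"))
print(similar_titles("Deep learning for PubMed indexing.",
                     "Deep Learning for Pubmed Indexing"))
```

Dividing the distance by the length of the longer title keeps the threshold comparable between short and long titles; the value 0.1 is arbitrary and would need tuning against real data.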
At the same time, participants will learn how to present their code and write the accompanying documentation in a Jupyter Notebook. This tool lets them create rich documents containing text, mathematical formulas, graphics, images, even animations and videos, and also execute code directly from the notebook. Combining these free and open source tools makes it possible to work comfortably on large volumes of data while documenting the successive stages of the research, in keeping with the principles of scientific reproducibility and with a high degree of transparency about research methods and results.
Learning Outcomes : Analyze PubMed data and understand its structure. Learn how to manipulate metadata in different formats (XML, JSON, CSV) and extract the relevant parts. Discover and evaluate open data sources that can be aggregated, and learn to combine different datasets to produce new knowledge. Learn how to perform simple statistical calculations and create graphs to visualize the results. Learn how to create notebooks that combine code, generated figures and documentation. The course will also put you in the shoes of a researcher and help you understand the difficulties researchers may face given ever-increasing transparency and reproducibility requirements.
Level : Intermediate
Target audience : Librarians involved in research support missions, system librarians, IT professionals working in biomedical libraries, and any other information and documentation specialists who wish to acquire skills in text and data mining and use tools to extract information and manipulate large volumes of structured and semi-structured data.
Preparation for the session : Yes