Scientific literature is a rich source of potential knowledge, but it is largely unstructured and not readily accessible to computers. Extracting and structuring information from the text and associated metadata of previously published work can produce new insight and augment other datasets. In this talk, I will present two research projects that aim to semi-automatically extract useful information from publications in support of existing data curation efforts in biodiversity and astronomy. For each project, I will discuss related work, provide an overview of new data collection and analysis, and highlight some outcomes and implications of this research.
First, a 10-week pilot study, conducted between May and August 2018 through the LEADS-4-NDP fellowship program, produced recommendations on methods and workflows for parsing geographic references in the text of the diverse Biodiversity Heritage Library (BHL) collections. The study began with a survey of the literature and explored possible techniques and software for document annotation, entity recognition, and map visualization. We concluded that geoparsing efforts would be more efficient if clustered by similarity of document topic, document type, and language, and that community input should be gathered to prioritize the sectors of the corpus and the information extraction techniques most likely to yield valuable new species occurrence data (Stahlman & Sheffield, Under Review). Citizen science is a promising method for ensuring the quality of extracted data.
Second, my dissertation project, “Exploring the Long Tail of Astronomy: A Mixed-Methods Approach to Searching for Dark Data,” is developing a stepwise methodology that captures disciplinary expertise and insights into the research practices, institutional influences, and data infrastructures of astronomers. These insights inform the development of heuristics for locating indicators of uncurated or at-risk astronomical data in the text and metadata of scholarly articles. This project builds on the “Astrolabe” cyberinfrastructure project, a collaboration between the University of Arizona and the American Astronomical Society, which has found that much “dark data” exist in astronomy for reasons such as lack of resources, changes in computational technology over time, and disciplinary norms (Heidorn, Stahlman & Steffen, 2018).
Heidorn, P. B., Stahlman, G. R., & Steffen, J. (2018). Astrolabe: Curating, Linking, and Computing Astronomy’s Dark Data. The Astrophysical Journal Supplement Series, 236(1), 3.
Stahlman, G. R., & Sheffield, C. (Under Review). Geoparsing Biodiversity Heritage Library collections: A preliminary exploration. Extended poster abstract submitted for publication (iConference 2019).
Gretchen Stahlman is a PhD candidate in the University of Arizona School of Information. She holds a Master of Science degree in Library Science from Clarion University of Pennsylvania. As a doctoral student, Gretchen has participated in several projects exploring cyberinfrastructure strategies for “Long Tail” datasets in biology and astronomy, including assisting in development of the Astrolabe repository and computing environment for astronomy data. Gretchen also served as a Research Development Fellow in the UA Office for Research, Discovery & Innovation during the 2015-16 academic year. Most recently, Gretchen was selected as a 2018 LEADS Fellow through the IMLS-funded LEADS-4-NDP program, working with the Smithsonian Biodiversity Heritage Library (BHL) to produce recommendations for extracting georeferenced data from the text of BHL publications. Gretchen’s dissertation research focuses on identifying indicators of uncurated and at-risk “dark data” in the text and metadata of the scholarly literature in astronomy.