Developing Scripts to Scale Out Open Access E-book Acquisitions at the Library of Congress

The Digital Content Management section (DCM) of the Library of Congress (LC) has incrementally improved upon piloted workflows to identify, describe, process, preserve, and make open access e-books accessible to Library users. In 2018 and 2019, DCM staff led several pilot projects to test technical methods for obtaining e-books from various sources, transforming descriptive metadata, adding rights metadata, and processing the content for presentation on loc.gov. In 2022, DCM staff worked to meet a recent LC objective to “Routinize processes for acquiring and making available open access & openly available e-books”. This effort focused on routinizing the acquisition of titles from the Directory of Open Access Books (DOAB), which offers over 61,000 open access titles. Routinizing the content acquisition and processing workflows for e-books from DOAB required scalable data processing methods, which led to the development of 11 new Python scripts. These scripts identified eligible titles for inclusion in the Library’s permanent collection, sorted titles into queues by type of required processing work, and acquired content and representative images at scale. As a result, DCM preserved and provided access to an additional 1,700 open access e-books. This presentation will highlight the workflow development and refinement, challenges, and next steps for repeating this now-routinized work.

Speaker(s)

Lauren Seroka

Kristy Darby

March 16th

4:25 PM

15 minutes