Duplicates in the repository: remediation and reconciliation in three systems, including DataCite
Academic Commons provides long-term open access to digital scholarship produced by Columbia University affiliates. Content may be added by authors through a self-deposit form, by library staff through the cataloging backend (Hyacinth), and via SWORD deposit from sources such as library-hosted OJS and journal publishers. As one might expect, after fifteen years of additions through these various channels, duplication happens! When faced with a corpus of nearly 40,000 records that must be reviewed, with duplicates remediated in three separate systems, how does one even start? This poster illustrates our approach to defining and scoping the problem, as well as the project workflows and technical solutions we used to remediate approximately 300 duplicate item records and 600 associated asset records.
Technologies: Fedora, Solr, Rails, Python, DataCite
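As an illustration of how a first pass over a corpus of this size might surface candidate duplicates for human review, the following minimal Python sketch groups records by a normalized title. It is not the project's actual method: the record structure, the field names (id, title), the sample data, and the choice of title matching are all assumptions made for the example.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """Lowercase a title, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-z0-9]+", " ", no_accents.lower())
    return cleaned.strip()


def find_candidate_duplicates(records):
    """Return {normalized title: [record ids]} for titles shared by 2+ records."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_title(record["title"])].append(record["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical sample records; a real run would read an export of the corpus.
sample_records = [
    {"id": "rec-1", "title": "Open Access in Practice"},
    {"id": "rec-2", "title": "Open access in practice."},
    {"id": "rec-3", "title": "A Different Article"},
]

candidates = find_candidate_duplicates(sample_records)
for normalized, ids in candidates.items():
    print(normalized, ids)
```

Matches from a pass like this are only candidates: records with identical titles can be distinct works, so each group would still need manual review before any record is merged or withdrawn.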