SHARIAsource & OpenITI Workshop on Arabic OCR

Posted on February 25, 2020

The Open Islamicate Text Initiative and SHARIAsource at Harvard’s Program in Islamic Law co-convened a two-day workshop on developments in Arabic OCR at the University of Maryland last month. Funded by the Mellon Foundation, the workshop brought together humanists and computer scientists to explore over a dozen approaches to building a robust and reliable tool for Arabic OCR of historical texts. Participants tackled all sides of an OCR job. How can librarians best digitize Arabic, Persian, and other Islamicate texts? How do researchers create a “ground truth” of transcribed texts from which an Arabic OCR tool can learn to convert page images automatically through Natural Language Processing and Machine Learning? How do we design the best UI/UX (user interface and user experience) so that using the tool is seamless?

Achieving consistently high OCR accuracy is not only a matter of easing humanities research, nor only one of access to primary sources. Accurate OCR can lessen the limitations in historical and legal research that often obstruct thorough analysis of a question. Too many primary sources to analyze, too few primary sources to provide evidence, patterns too vague to yield anything substantive: accurate OCR turns these problems into more manageable aspects of research. The hope is that, with greater access to more evidence through text mining, text reuse tracking, and natural language processing, we can find new answers to old questions and discover new questions and deeper insights.

Updates were live-tweeted under the hashtag #OpenITIAOCP, where presentation summaries and software resources can also be found.