SHARIAsource & OpenITI Workshop on Arabic OCR

The Open Islamicate Texts Initiative (OpenITI) and SHARIAsource at Harvard’s Program in Islamic Law co-convened a two-day workshop on developments in Arabic OCR at the University of Maryland last month. Funded by the Mellon Foundation, the workshop brought together humanists and computer scientists to explore over a dozen approaches to building a robust and reliable OCR tool for historical Arabic texts. Participants tackled all sides of an OCR job. How can librarians best digitize Arabic, Persian, and other Islamicate texts? How do researchers create a “ground truth” of transcribed texts from which an Arabic OCR tool, using natural language processing and machine learning, can learn to transcribe new texts automatically? How do we design the user interface and user experience (UI/UX) so that working with the tool is seamless?

Achieving consistently high OCR accuracy is not only a matter of easing humanities research, nor is it only about access to primary sources. Accurate OCR can lessen limitations in historical and legal research that often obstruct thorough analysis of a question. Too many primary sources to analyze, too few primary sources to provide evidence, patterns too vague to yield anything substantive: accurate OCR turns these problems into more manageable aspects of research. The hope is that, with greater access to more evidence through text mining, text reuse tracking, and natural language processing, we can find new answers to old questions and discover new questions and deeper insights.

Updates were live-tweeted under the hashtag #OpenITIAOCP, where presentation summaries and software resources can also be found.

SHARIAsource Partners at UMD & OpenITI Receive $800k Mellon Grant to Create Arabic OCR Tool

The Mellon Foundation awarded SHARIAsource partners at UMD & OpenITI a grant of $800,000 to continue work on CorpusBuilder, the first Arabic OCR tool for historical texts and an integral tool for research on Islamic law. SHARIAsource provided significant support in building the initial infrastructure for CorpusBuilder 1.0 and will play a lead role in developing CorpusBuilder 2.0 along with the new OpenITI Project.

Excerpt from OpenITI announcement

With generous funding from The Andrew W. Mellon Foundation, OpenITI AOCP will create a new digital text production pipeline for Persian and Arabic texts.

In June 2019 The Andrew W. Mellon Foundation generously awarded the University of Maryland, College Park (UMD) an $800,000 grant for the Open Islamicate Texts Initiative’s Arabic-script Optical Character Recognition Project (OpenITI AOCP).

The project is led by Matthew Thomas Miller (Roshan Institute for Persian Studies at UMD), Maxim Romanov (University of Vienna), Sarah Bowen Savant (Aga Khan University), David Smith (Northeastern University), and Raffaele Viglianti (Maryland Institute for Technology in the Humanities at UMD). SHARIAsource, a project of the Program in Islamic Law (PIL) at Harvard Law School (both led by Intisar Rabb), provided significant support for the initial technical infrastructure upon which this project will build (i.e., CorpusBuilder 1.0) and will also play a leading role in the technical development portion of OpenITI AOCP.

OpenITI AOCP will catalyze the digitization of the Persian and Arabic written traditions by addressing the central technical and organizational impediments stymying the development of improved OCR for Arabic-script languages. Through a unique interdisciplinary collaboration between humanities scholars, computer scientists, developers, library scientists, and digital humanists, OpenITI AOCP will forge CorpusBuilder 1.0 — an OCR pipeline and post-correction interface — into a user-friendly digital text production pipeline with a wide range of new OCR enhancements and expanded text export functionality. The project will also include a series of workshops, a full corpus development pilot, and a Persian and Arabic typeface inventory, all of which will inform the development of the technical components in important ways.

Read full details. Image credit: Open Islamicate Texts Initiative (OpenITI)