SHARIAsource Partners at UMD & OpenITI Receive $800k Mellon Grant to Create Arabic OCR Tool

The Mellon Foundation awarded SHARIAsource partners at UMD & OpenITI a grant of $800,000 to continue work on CorpusBuilder, the first Arabic OCR tool for historical texts and an integral tool for research on Islamic law. SHARIAsource provided significant support in building the initial infrastructure for CorpusBuilder 1.0 and will play a lead role in developing CorpusBuilder 2.0 along with the new OpenITI Project.

Excerpt from OpenITI announcement

With generous funding from The Andrew W. Mellon Foundation, OpenITI AOCP will create a new digital text production pipeline for Persian and Arabic texts.

In June 2019, The Andrew W. Mellon Foundation generously awarded the University of Maryland, College Park (UMD) an $800,000 grant for the Open Islamicate Texts Initiative’s Arabic-script Optical Character Recognition Project (OpenITI AOCP).

The project is led by Matthew Thomas Miller (Roshan Institute for Persian Studies at UMD), Maxim Romanov (University of Vienna), Sarah Bowen Savant (Aga Khan University), David Smith (Northeastern University), and Raffaele Viglianti (Maryland Institute for Technology in the Humanities at UMD). SHARIAsource, a project of the Program in Islamic Law (PIL) at Harvard Law School (both led by Intisar Rabb), provided significant support for the initial technical infrastructure upon which this project will build (i.e., CorpusBuilder 1.0) and they will also play a leading role in the technical development portion of OpenITI AOCP.

OpenITI AOCP will catalyze the digitization of the Persian and Arabic written traditions by addressing the central technical and organizational impediments stymying the development of improved OCR for Arabic-script languages. Through a unique interdisciplinary collaboration between humanities scholars, computer scientists, developers, library scientists, and digital humanists, OpenITI AOCP will forge CorpusBuilder 1.0 — an OCR pipeline and post-correction interface — into a user-friendly digital text production pipeline with a wide range of new OCR enhancements and expanded text export functionality. The project will also include a series of workshops, a full corpus development pilot, and a Persian and Arabic typeface inventory, all of which will inform the development of the technical components in important ways.
