Domain-specific software

We are working on CLaaS, CLARIAH as a Service. It will provide Domain Services, like Data & Models (Audio, Video, Text, Images, Structured Data), Transformation (Workflow, Provenance, Curation and Evaluation) and Interaction (Workspace, Execution, CMS, UI & UX).

The domain-specific software we develop is specialised enough to be useful for specific groups of researchers, yet generic enough to support a viable number of users. Here are some examples.

OCR and HTR software

We develop and customise software to improve the OCR output for historical newspapers. The typesetting in those newspapers may look like calligraphy, the ink may be fading, the column layout may pose problems, and advertisements may be misidentified as articles because in those days they had no illustrations. Medieval manuscripts and early modern documents pose similar challenges: we adapt Handwritten Text Recognition (HTR) software to recognise each character despite the unique detailing every individual clerk adds.

Extracting and linking entities

We link entities end to end. First we extract them, using customised NLP tools such as Named Entity Recognition. Then, to link these named entities correctly, we develop tools for name disambiguation and word-sense disambiguation in close cooperation with our linguists at the Meertens Institute and digital humanities researchers at DHLab.

Geo-toolkit and fuzzy matching

We also develop a geo-toolkit for all disciplines of history at every geographic level. Last but not least, we apply fuzzy matching when linking our data, so that correspondences between segments of a text and entries in a database can be found even when the match is less than 100% perfect.
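
To illustrate the idea of fuzzy matching, here is a minimal sketch in Python. It is not the matching method used in our tools: it relies on the standard library's difflib module as a stand-in, and the names similarity, best_match and PLACE_NAMES, the sample entries, and the 0.8 threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical database entries, chosen only for illustration.
PLACE_NAMES = ["Amsterdam", "Rotterdam", "Utrecht", "Leiden"]


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def best_match(segment: str, entries: list[str], threshold: float = 0.8):
    """Return (entry, score) for the closest entry, or None.

    None is returned when there are no entries or when even the best
    score stays below the threshold (an assumed, tunable cut-off).
    """
    if not entries:
        return None
    scored = [(entry, similarity(segment, entry)) for entry in entries]
    entry, score = max(scored, key=lambda pair: pair[1])
    return (entry, score) if score >= threshold else None


if __name__ == "__main__":
    # A variant spelling still links to the intended entry.
    print(best_match("Amsteldam", PLACE_NAMES))
    # A name with no close counterpart yields no link at all.
    print(best_match("Parijs", PLACE_NAMES))
```

The threshold decides how imperfect a match may be before it is rejected. A lower value finds more correspondences but also more false links, so in practice it would be tuned per collection.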