Text

We support textual research on four levels: close reading and manual annotation; advanced search of annotated text collections; statistical methods (machine learning, text mining); and deriving structured data from annotated text. Our Team Lead is Hennie Brugman.

We work on software that presents, searches, annotates and analyses text. Our goal is to create a common pipeline that enables scholars and engineers to ingest, process and publish textual data – be it XML, raw text, or scans – in a distributed and modular infrastructure that can be adapted to the different scholarly domains at the HuC and outside.

Text analysis is a large part of our team's responsibilities. The focus lies on tools for linguistic, syntactic and semantic analysis, named entity recognition (NER), and other information extraction algorithms. Taken together, the tooling should automatically generate a context of metadata and annotations around a text, and enable users to confirm, reject or correct these annotations. The team's work in this field interacts closely with more experimental development in R&D, at the DHLab and in various research groups in computational science and computational linguistics. The team can adopt prototypes from these groups if they can be brought to a sufficient level of maturity.

We are also responsible for packaging products and product components into interactive environments that are optimised for the specific needs of researchers or research projects.

Team

  • Gijsjan Brouwer
  • Hennie Brugman (Team Lead) (Publications)
  • Hayco de Jong
  • Bas Leenknegt

Product groups

We work on products in the following product groups:

  • Generic but flexible front end solutions: we try to exploit shared functionality between user interfaces while maintaining the flexibility to meet special project requirements.
  • Tooling for traditional and more innovative ways to annotate text and to exploit these annotations.
  • Searching in large text collections in ways that list, summarize, organize or visualize large numbers of hits.
  • A state-of-the-art text repository that supports the full life cycle of text documents in all their forms and versions, and in relation to all associated annotations and enrichments.
  • Pipelines and tools for analysing, enriching and processing text documents.

Sample Projects

  • CLARIAH Plus creates a national digital infrastructure for the humanities.
  • EviDENce studies how eyewitnesses have reported on violence, and how this may have changed over time.
  • HiTimeP is an annotation interface for entity linking.
  • REPUBLIC will provide access to the resolutions of the States General (1576-1796): more than half a million pages with handwritten and printed political information.

Code repository

GitHub

Contact

hennie.brugman@di.huc.knaw.nl

Stories