Transkribus
What it is
Transkribus is a platform for text recognition: it turns scanned handwritten or printed documents into machine-readable, searchable text, using handwritten-text recognition (HTR) and optical character recognition (OCR). It is widely used to digitize archives and legacy materials, including by communities bringing old manuscripts, field notes, and records into a usable digital form. It is operated by READ-COOP SCE, a member-owned European cooperative based in Innsbruck, Austria. A free tier (a monthly credit allowance) and paid credit packages are available.
For language work, Transkribus most often shows up as the digitization step, paired with a separate system that stores and governs the results (for example, feeding a Mukurtu CMS archive).
Much of what is worth saying about Transkribus for community use can be read directly off its terms and privacy policy, which are unusually clear. They are worth reading before uploading anything sensitive, because the cloud platform's default arrangement involves real trade-offs that are easy to miss.
Upsides
- You keep ownership. The terms state that your uploaded material, the processed results, and any models you train "remain yours": you retain ownership and intellectual-property rights (§6.2.1). This matters for Data sovereignty: the community's documents and transcriptions are not signed over to the operator.
- Run by a cooperative, under EU data protection. READ-COOP is member-owned, and platform data is stored on servers in Austria under the GDPR, with a "right to erasure." That is a meaningfully different governance and legal posture from that of a US Big-Tech cloud.
- A way to avoid the cloud entirely. The Transkribus Expert client is released as open source (GPLv3), and an on-premises option exists, so it is possible to keep processing closer to home rather than sending everything to the hosted service.
Downsides
- Uploading grants a license to "improve" their models. By using the cloud service you grant READ-COOP a worldwide license to use your content "for the purpose of providing and improving our Products and Services" (§6.2.2), which explicitly permits using uploaded material to enhance their recognition models. Even though you own the material, your community's documents can, by default, feed a shared commercial system. This is the central Data sovereignty tension here.
- Sensitive and personal material is largely off-limits without extra paperwork. The terms prohibit uploading personal data without a signed Data Processing Addendum (§6.2.3) and forbid "special category" data under Article 9 GDPR entirely (§10.4). Article 9 covers information such as ethnic origin, religious belief, and health, so the ban could plausibly extend to culturally sensitive or community-identifying materials. The legal responsibility for getting this right is placed on the user (§10.3), and a community may not realize that its materials fall under these restrictions.
- The most capable version is the proprietary cloud one. The open-source client and on-premises options exist, but the fullest functionality lives in the proprietary hosted platform.