Glosbe

What it is

Glosbe is a free, crowd-sourced online dictionary and translation memory that spans a huge number of languages, including many low-resource ones. Anyone can contribute entries and translations, and the data is openly available, which makes it a common source for people assembling language datasets.

Context

Because it is open and broad, Glosbe often gets pulled in as a convenient source of data, and that is where both its value and its risks show up.

One workshop participant shared a story. They gave a student a dataset to work with, and the resulting model behaved poorly for reasons no one could pin down. The cause turned out to be that the student had augmented the dataset with Glosbe data, assuming that more data would improve performance. Instead, the crowd-sourced entries were almost all wrong and dragged the results down. Crowd-sourced data is easy to access, but hard to vet, and "more data" is not the same as "better data."
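One way to make a problem like this easier to trace is to record where each entry came from before sources are mixed. The sketch below is illustrative only: the `Entry` type, the `origin` labels, and the helper names are hypothetical, and the placeholder strings stand in for real entries. It does not use any Glosbe API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    source_text: str
    target_text: str
    origin: str  # hypothetical provenance label, e.g. "curated" or "glosbe"


def merge_with_provenance(*batches):
    """Concatenate batches of entries, keeping each entry's origin label."""
    merged = []
    for batch in batches:
        merged.extend(batch)
    return merged


def count_by_origin(entries):
    """Report how many entries each source contributed."""
    return Counter(entry.origin for entry in entries)


def drop_origin(entries, origin):
    """Return the entries that did not come from the given source."""
    return [entry for entry in entries if entry.origin != origin]


# Placeholder data: these strings are stand-ins, not real dictionary entries.
curated = [
    Entry("src-1", "tgt-1", "curated"),
    Entry("src-2", "tgt-2", "curated"),
]
crowd = [Entry("src-3", "tgt-3", "glosbe")]

merged = merge_with_provenance(curated, crowd)
without_crowd = drop_origin(merged, "glosbe")
```

With origin labels in place, a model trained on `merged` can be compared against one trained on `without_crowd`, so a drop in quality can be attributed to a specific source rather than left unexplained.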

On the positive side, Glosbe is clearly documented. Its help pages describe how the system works, what its limitations are, what it is intended for, and how to contact the developers.

Upsides

  • Community-sourced. Entries come from people who actually use the languages, which can reach languages and senses that top-down resources miss.
  • Transparency. Glosbe's help pages clearly document how it works, its limitations, its intended uses, and how to contact the developers. This openness lets researchers and communities evaluate the tool, unlike Ojibwe Chat.

Downsides

  • The data can be wrong. Crowd-sourced entries are uneven and there is no built-in vetting. Bad data can quietly degrade whatever is built on top of it.
  • Contributions can violate data sovereignty, with no recourse. Unofficial representatives or non-community members can contribute data about a community they have no permission to speak for, and there is no mechanism for that community to have the data corrected or removed. This is a direct data sovereignty problem: consent may never have been given, and deletion is not possible.
Supported By the National Science Foundation Award 2542375.