Yaduha (LLM-RBMT)

What it is

Yaduha is a framework for building machine translators for extremely low-resource languages. It has mostly been applied to Owens Valley Paiute (OVP), a critically endangered Indigenous language of California with fewer than ten fluent speakers. The system translates English into OVP, but the way it does so is the point of this example.

A language model never freely generates Paiute. Instead, the model's job is to read an English sentence and fill in a structured form: a set of grammar "slots" (subject, verb, object, tense, number, and so on) defined by community-curated rules. A separate, deterministic program then renders that filled-in form into the actual OVP sentence. The model handles the messy work of understanding English, while the hand-built grammar ensures that the rendered sentence is grammatical. This division of labor is what the "LLM-RBMT" in the title refers to: LLM-assisted rule-based machine translation.
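The form-filling step can be pictured with a small sketch. This is not Yaduha's actual schema or API: the class name SentenceForm, the function parse_form, the slot names, and the allowed tense and number values are all hypothetical, chosen only to show how a model's output might be checked against a fixed set of slots before any rendering happens.

```python
import json
from dataclasses import dataclass, fields

# Hypothetical value sets; the real inventory comes from the curated grammar rules.
ALLOWED_TENSES = {"past", "present", "future"}
ALLOWED_NUMBERS = {"singular", "plural"}


@dataclass(frozen=True)
class SentenceForm:
    """An illustrative filled-in form: one field per grammar slot."""
    subject: str
    verb: str
    obj: str
    tense: str
    number: str


def parse_form(model_output: str) -> SentenceForm:
    """Validate the model's JSON output and return a SentenceForm.

    Raises ValueError if the output is not exactly the expected set of slots
    or if a slot holds a value outside the allowed set.
    """
    data = json.loads(model_output)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")

    expected = {f.name for f in fields(SentenceForm)}
    if set(data) != expected:
        raise ValueError(f"expected slots {sorted(expected)}, got {sorted(data)}")
    if not all(isinstance(value, str) for value in data.values()):
        raise ValueError("every slot value must be a string")
    if data["tense"] not in ALLOWED_TENSES:
        raise ValueError(f"unknown tense: {data['tense']!r}")
    if data["number"] not in ALLOWED_NUMBERS:
        raise ValueError(f"unknown number: {data['number']!r}")

    return SentenceForm(**data)


# What a model might return for "Tom caught a fish" under this toy schema.
raw = '{"subject": "Tom", "verb": "catch", "obj": "fish", "tense": "past", "number": "singular"}'
form = parse_form(raw)
print(form)
```

The design point is that the model's freedom ends at the form: anything that does not fit the slots is rejected before it can reach the renderer.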

Early versions relied on large proprietary models for this step, but recent work has shown that the same approach works with small, open-weight models (two fine-tuned Qwen2.5-3B adapters) that run locally on a single consumer GPU, with no proprietary API at either training or translation time. That matters because it can help address concerns around environmental impact and data sovereignty, which look different from one community to the next.

What makes this example worth contrasting with something like Ojibwe Chat is what happens when the system reaches the edge of what it can reliably say. Ojibwe Chat, a thin wrapper around a general-purpose model, answers every prompt with the same fluent confidence even when it is wrong. Yaduha is built so that its uncertainty is visible. When an input word falls outside the curated vocabulary, the renderer emits a placeholder token (for "Tom caught a fish" it produces [Tom]-ii pagwi-neika a-[catch]-ku, keeping [Tom] and [catch] as visible gaps) rather than inventing a fluent-sounding substitute. The structured form is available for anyone to audit, and a back-translation lets the system say, in effect, "I can't say that exactly in OVP, but here is a set of simple sentences that approximates it." In a revitalization setting, where outputs may end up in teaching materials and the pool of speakers able to catch an error is tiny, that difference matters a lot.
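The placeholder behavior can be sketched in a few lines. This is a toy illustration, not the project's renderer: the one-word lexicon and the affixes are copied from the example output quoted above, the grammatical labels attached to those affixes are assumptions made for readability, and the function names lookup and render are hypothetical.

```python
# Toy lexicon: the only entry is the one OVP stem visible in the quoted example.
LEXICON = {"fish": "pagwi"}

# Affixes copied from the quoted output; the labels are assumptions.
SUBJECT_SUFFIX = "ii"
OBJECT_SUFFIX = "neika"
OBJECT_PREFIX = "a"
TENSE_SUFFIX = "ku"


def lookup(word):
    """Return (stem, known). An out-of-vocabulary word becomes a [word] placeholder."""
    if word in LEXICON:
        return LEXICON[word], True
    return f"[{word}]", False


def render(subject, verb, obj):
    """Deterministically render filled slots, returning the sentence and its gaps."""
    stems = {}
    gaps = []
    for slot, word in (("subject", subject), ("verb", verb), ("object", obj)):
        stem, known = lookup(word)
        stems[slot] = stem
        if not known:
            gaps.append(word)  # record the gap instead of guessing a translation

    sentence = (
        f"{stems['subject']}-{SUBJECT_SUFFIX} "
        f"{stems['object']}-{OBJECT_SUFFIX} "
        f"{OBJECT_PREFIX}-{stems['verb']}-{TENSE_SUFFIX}"
    )
    return sentence, gaps


sentence, gaps = render("Tom", "catch", "fish")
print(sentence)  # [Tom]-ii pagwi-neika a-[catch]-ku
print(gaps)      # ['Tom', 'catch']
```

Returning the list of gaps alongside the sentence is one way to keep the uncertainty machine-readable as well as visible, so a downstream interface could warn a reader before the output is reused.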

Upsides

  • Careful about what data it uses. To compare different approaches, the research uses AI in several ways, and in doing so it sends OVP data to LLMs. That data is limited to public, non-culturally-sensitive material: basic vocabulary and grammar. It includes no sentences elicited from speakers and nothing drawn from stories or private archives.
  • Aimed at sovereignty-respecting systems by design. The point of the project is to move toward translation systems that take data sovereignty and environmental impact seriously from the start. One concrete step is showing the approach can run on local, open-weight models with no language data sent to an external service at all.

Downsides

  • Helping users interpret the output is an open problem. Building the system to show its gaps is only half the work; making sure users understand what the outputs are, and aren't, before they rely on them is its own task that the project has not yet solved.
  • The limits of the approach aren't yet clear. How well it actually serves learners, what it does to or for the pedagogy, and where the method breaks down are open questions, not settled ones.

Supported by National Science Foundation Award 2542375.