Putting Data Online

What it is

This example is not a single tool but a recurring decision: whether and how to put a community's language materials online. The materials might be audio recordings, transcripts, dictionaries, or video lessons. The access might be fully public or gated behind some form of access control. The decision sits at the center of a hard tradeoff, and reasonable people in the same community might have different opinions about the right choice.

Context

Many language programs, including ones represented at the workshop, sit on large archives of recordings that very few community members ever access. The same barriers that protect culturally sensitive material are also barriers to the community itself. Even when someone is technically allowed access, getting to the material often means showing up in person, during set hours, with time to spare. Regular free time is a luxury that many people in Indigenous communities, stretched across jobs and family, don't feel they can afford. Material that is protected but unreachable does not do the community much good.

Putting it online lowers those barriers. It also introduces risks that are genuinely hard to reason about, because consent given today has to hold up against uses no one can foresee.

We have already watched this play out. People posted photos to sites like Flickr years ago, often under permissive licenses, with no way to foresee how AI systems would later use them. In 2019 IBM scraped roughly a million of them to build a facial-recognition training set, without notifying the photographers or the people pictured (NBC News, 2019). The same is now happening with books, at a scale that's hard to fathom. To train its Llama models, Meta downloaded LibGen, a pirated "shadow library" of more than 7.5 million books. It chose piracy over licensing because licensing was too slow and expensive, with signoff reportedly going all the way up to Mark Zuckerberg, even as employees discussed how to mask what they had done (The Atlantic, 2025).

Right now, policy is largely failing to keep up: the central legal question (whether training on copyrighted work without a license counts as "fair use") is still unresolved in the courts, so in practice companies have trained on whatever they can reach (The Atlantic, 2024). That vacuum pushes communities toward technical protections (access control, non-redistribution, formats that resist bulk download, etc.). But technical measures are also leaky: anything a person can reach, a scraper usually can too. The uncomfortable conclusion is that neither side is sufficient alone, and the policy side that should carry the weight is, for now, the one that's broken.

Upsides

  • Access for the community. Online access removes the time, travel, and scheduling barriers that keep most members away from material they're entitled to.
  • Access for research and revitalization work. Documented materials become usable for teaching, for revitalization projects, and for research that can feed back into the community instead of sitting inert in an archive.

Downsides

  • Culturally sensitive material loses its protection. Some material is not meant for general circulation, and "online" (even access-controlled) weakens the social and physical barriers that previously governed who saw it and when.
  • Consent becomes a promise about an unknowable future. The risk isn't only misuse today; it's that posting now may constitute de facto consent to future uses (training data, resale, derivative tools) that don't exist yet and can't be described in any release form signed today. Policy is currently failing to protect against this, and technical measures are leaky, so the risk is real and hard to mitigate.
Supported by National Science Foundation Award 2542375.