frontmatter

title: Vision language model
aliases: [vision language model, vision-language model, VLM]
tags: [term]
category: AI & NLP
summary: A model that takes images as well as text, letting one general-purpose system read documents, describe pictures, and answer questions about what it sees.

Vision language model

A language model that can also take images as input, not just text. The same system can be shown a photo or a scanned page and asked to describe it, pull out the text, answer questions about it, or summarize it. VLMs are sometimes called multimodal models, although that term also covers models that handle audio or video. The image-understanding abilities of tools like ChatGPT come from this kind of model.

For language work, VLMs are interesting because one general-purpose tool can do jobs that used to need separate specialized software. Text recognition is one example: a VLM can often read a handwritten page or a non-standard orthography without being specially trained for it. That flexibility is genuinely useful for messy archival material.

But the same probabilistic caution that applies to any language model applies here, and more sharply for documents. A VLM does not strictly transcribe; it predicts a plausible reading, so it can silently "correct," skip, or invent text rather than flagging that it is unsure. The most capable VLMs are also typically large closed-weight services, which raises the usual questions about who sees the uploaded documents and whether a community can run the tool on its own terms.
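
One practical way to catch silent changes is to spot-check a sample: have a person transcribe a few passages carefully, then compare those trusted readings against the VLM output word by word. The sketch below is only an illustration of that idea, using Python's standard `difflib` module. The function name `flag_discrepancies` and the sample strings are hypothetical, and the labels "changed", "skipped", and "added" are an assumed mapping onto the "correct, skip, or invent" failure modes described above.

```python
from difflib import SequenceMatcher


def flag_discrepancies(reference: str, vlm_output: str) -> list[tuple[str, str, str]]:
    """Compare a trusted transcription with a VLM transcription, word by word.

    Returns (kind, reference_words, vlm_words) tuples, where kind is
    'changed', 'skipped', or 'added'. Hypothetical helper for illustration.
    """
    ref_words = reference.split()
    vlm_words = vlm_output.split()

    # Map difflib's edit operations onto the failure modes named in the text.
    labels = {"replace": "changed", "delete": "skipped", "insert": "added"}

    # autojunk=False keeps difflib from ignoring very frequent words on long pages.
    matcher = SequenceMatcher(a=ref_words, b=vlm_words, autojunk=False)

    findings = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # the two readings agree on this stretch
        findings.append(
            (labels[tag], " ".join(ref_words[i1:i2]), " ".join(vlm_words[j1:j2]))
        )
    return findings


# Made-up example: the second string normalizes the spelling and adds a word.
for finding in flag_discrepancies(
    "ye olde shoppe was closed", "the old shop was closed today"
):
    print(finding)
```

A check like this only finds differences where a trusted reading already exists, so it estimates how often a given VLM alters this kind of material rather than verifying every page.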

Created Jun 19, 2026 · Updated Jun 19, 2026