You Deserve The Cluster
The case for running your historical collections through research computing, and why it costs less than you think.
I have sensed a quiet assumption in the humanities that high-performance computing belongs to someone else: to the physicists modeling gravitational waves, the geneticists sequencing entire populations, or the economists running enormous simulations. These daunting clusters, with their GPUs, their job schedulers, and their cryptic scripts, feel like infrastructure built for disciplines that trade in numbers, not narratives.
As a historian and archivist, I spent years thinking the same thing. But my projects—and yours, fellow historian—might just belong on that cluster too. Consider the handwritten records you’ve been photographing on research trips; or the correspondence you’ve been transcribing in the reading room; or those thousand-page minute books, the land surveys, the immigration registers rich with details of space and place. These collections are exactly the kind of work that research computing was built to support. And the tools available right now make the case almost embarrassingly easy.
What changed?
A generation of powerful, open-source vision-language models arrived. These are models that can look at an image of a faded, ink-stained, overlapping-cursive index card and return structured, searchable data. Not perfect data. But useful data: the kind that turns a microfilm scroll into a queryable database. They handle handwriting, mixed layouts, rubber stamps, marginal annotations; the visual noise and contextual complexity that defeated traditional OCR for decades. And because they are open-source, they can run on hardware your institution might already own.
Two projects, one architecture
My colleagues and I have been testing this across two projects that sit at very different ends of the archival spectrum.
The first involves over a million government administrative index cards held at a federal archive: cards dense with classification codes, agency stamps, and idiosyncratic cursive spanning decades of bureaucratic correspondence. The second involves handwritten court records from the eighteenth century: daily administrative logs of civil litigation, five to fifteen case entries per page, produced by clerks whose penmanship was optimized for speed, not legibility.
Even though these records come from different centuries, different hands, and different archives, the pipelines we built to process them are effectively the same. We take scanned images, feed them to a large vision-language model with carefully designed prompts, and get back structured output in the form of JSON records with fields like dates, agencies, parties, classifications, and body text. The model and pipeline handle the reading and the data extraction. And critically, all of this runs not on an expensive commercial API but on university research computing nodes equipped with GPUs, using open-source models we downloaded and deployed ourselves.
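To make the "structured output" step concrete, here is a minimal sketch of what happens after the model responds: the raw text is parsed as JSON and normalized into a record with a fixed set of fields. This is an illustration, not the code from either project. The field names and the `parse_record` helper are hypothetical, chosen to mirror the fields mentioned above, and the sketch assumes the model was prompted to return a single JSON object per image.

```python
import json

# Hypothetical field names, loosely mirroring the fields described in the post.
EXPECTED_FIELDS = ("date", "agency", "parties", "classification", "body_text")

# Built this way so the snippet can be pasted inside a Markdown code fence.
FENCE = "`" * 3


def parse_record(raw_output: str) -> dict:
    """Turn one raw model response into a record with a fixed set of fields."""
    text = raw_output.strip()

    # Models sometimes wrap JSON in a Markdown code fence; drop the fence lines.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)

    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        data = None

    if not isinstance(data, dict):
        # Keep the raw text so a person can review the failure later.
        record = {field: None for field in EXPECTED_FIELDS}
        record.update({"ok": False, "raw": raw_output})
        return record

    # Missing fields become None rather than raising, since cards vary.
    record = {field: data.get(field) for field in EXPECTED_FIELDS}
    record.update({"ok": True, "raw": raw_output})
    return record
```

Keeping the raw response next to the parsed fields is a deliberate choice in this sketch: when a page defeats the model, you want the failure flagged and reviewable rather than silently dropped.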
You probably already have access
The barrier to entry is lower than you think. This is the part I want historians to hear most clearly. If you are affiliated with a research university, a library consortium, a national lab, or any institution with a research computing group, you almost certainly have access to GPU-equipped clusters. These machines sit in basements and server rooms running jobs for chemists and engineers, and they have capacity. Research computing groups are, in my experience, genuinely eager to support humanities projects. They want the diversity of use cases and they want to demonstrate broad impact.
For the index card project, we run a 72-billion-parameter open-source vision model on NVIDIA A100 GPUs through a Slurm-managed pipeline. It processes thousands of cards per hour, around the clock, for the cost of allocated compute time, which, at a university, often means free or close to it. No per-token API fees. No data leaving your institution’s network. No licensing negotiations. Open-weight models like Qwen, LLaMA, and Mistral can be downloaded and deployed on institutional hardware with nothing more than a few configuration files and some experimentation.
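For readers who have never submitted a batch job, the core idea behind a Slurm-managed pipeline is simple: split the images across many tasks and let each task work through its own share. The sketch below shows one common way to do that with a Slurm job array, and it is an assumption about how such a pipeline could be organized rather than a description of ours. `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` are environment variables that Slurm sets for array jobs; the `select_shard` helper and the zero-based array indices (for example `--array=0-9`) are assumptions made for illustration.

```python
import os


def select_shard(paths, task_id, task_count):
    """Return the subset of paths that one array task should process.

    Paths are sorted first so every task sees the same order, then dealt
    out round-robin: task 0 gets items 0, N, 2N, ...; task 1 gets 1, N+1, ...
    """
    if task_count < 1 or not 0 <= task_id < task_count:
        raise ValueError("task_id must be in the range [0, task_count)")
    return sorted(paths)[task_id::task_count]


def current_task():
    """Read this task's position from Slurm, defaulting to a single task.

    The defaults let the same script run on a laptop outside of Slurm.
    Assumes the job array was submitted with indices starting at 0.
    """
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))
    return task_id, task_count
```

In use, each task would call `current_task()`, pass the result to `select_shard` along with the full list of image files, and run the model only over the paths it gets back. Because the split is deterministic, a failed task can be resubmitted on its own without touching the rest of the collection.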
Compare that to running the same work through a commercial API. At current pricing, processing over a million images through a cloud vision-language model would cost tens of thousands of dollars. On a university cluster, it costs compute time that was already budgeted, and even where it wasn't, the cost is a fraction of the commercial alternative.
Privacy by default
There is another benefit that matters enormously for the kinds of records historians work with. When you run models on institutional infrastructure, your data never leaves the building. No images uploaded to a third-party server, no extracted text passing through someone else’s cloud. For projects involving records tied to individual people, families, and communities — personnel files, medical histories, correspondence, community administrative records — this is not a minor convenience but an ethical requirement. Running on local compute gives you that by default.
The real barrier is permission: your own
The honest obstacle for most historians is not entirely technical. It is the feeling that their project is too small, too niche, too humanistic to justify claiming space on a shared computing resource. Surely, the thinking goes, processing a few thousand court documents does not warrant the same infrastructure that simulates protein folding.
But consider what is actually happening when you run a model over a collection of historical documents. You are converting unstructured, inaccessible primary sources into structured, searchable, analyzable data. You are building something that other researchers, genealogists, and communities can potentially use. You are doing exactly what research infrastructure exists to support.
Remember, as well, that the scales vary. Not every project is a million images. Sometimes it is four hundred pages of minute books, or a single box of correspondence. The pipeline scales down just as well as it scales up. A few hundred images can process in an afternoon. And the experience of building that first pipeline — writing your first extraction prompt, submitting your first batch job, seeing structured records come back from documents you thought were illegible — changes how you think about what is possible for every collection you encounter afterward.
Your project is neither too small nor too messy. The resources are closer than you think. And the learning curve is shorter than the one you already climbed to get into the archive.
GitHub: https://github.com/uvalawlibrary/hpc-vlm-starter




Great post. The only thing I would add is we can do a lot with older generation A100 GPUs and these are often idle as the demand is focused on newer H100s for frontier work. I’m able to run large OCR and NER jobs most days without waiting.
This is awesome! I just applied to use these resources at my university. Loren, which open source models do you use? I’ve been indexing early modern manuscript records with Codex.