On The Virtue of Small AI
Ollama and Hugging Face
A lot of folks are understandably worried about giving money to OpenAI, and to a lesser extent, Anthropic. The reasons vary from expense to privacy to environmentalism to copyright to even fears of the paperclip problem. Whatever the reason, cloud AI has become a lightning rod for a new kind of resistance to Big Tech.
But what if I told you there was another way to use AI?
For the most technically savvy historians, the answer is obvious. However, I have found in my many conversations that the alternative—free, open-source, local—is little known. And these little AIs, strangely enough, often work best for our historical purposes.
Free AI on Your Mac
Now, for whatever reason, the Mac has become the beloved computing device of people who would prefer to overlook Apple’s rather suspect labor history. I myself now own a MacBook, and I own it for one reason: it can run AI.
LLMs run best on GPUs with large amounts of RAM. Until the last few years, GPUs were mostly used for gaming (which required lots of fast calculations to render animations) and for Bitcoin (which required fast calculations to undermine the global currency regime). It turned out, unexpectedly, that these same GPUs were good for doing the calculations necessary for LLMs to run.
Unlike most PCs, which keep the CPU’s memory separate from the GPU’s, Apple Silicon Macs have unified memory shared by both. So even a PC with lots of RAM generally can’t run very large LLMs unless it also has a graphics card with plenty of memory of its own. On Macs, the RAM that you have runs everything. So on my Mac (a luxe M4 Max with 128 GB of RAM) I can run nearly any model available. Even lower-end Macs can run meaningfully sized models. 8 GB is too little; 16 GB lets you run small models (3–8B parameters) usefully. 32 GB and up opens the door to the genuinely smart ones. Models also eat disk space quickly (each one can be tens of GB), so be careful.
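The sizing rule of thumb above comes down to simple arithmetic: parameters times bytes per parameter, plus some headroom. Here is a rough sketch in Python. The 4-bit default reflects the kind of quantization commonly used for local models, and the 1.2 overhead factor is purely my assumption, so treat the numbers as ballpark figures rather than guarantees.

```python
def estimate_model_memory_gb(
    params_billions: float,
    bits_per_weight: int = 4,   # assumption: a 4-bit quantized model
    overhead: float = 1.2,      # assumption: ~20% headroom for context and runtime
) -> float:
    """Very rough memory needed to load a model, in gigabytes."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters x bytes per parameter = gigabytes of weights
    return params_billions * bytes_per_weight * overhead


def fits_in_ram(params_billions: float, ram_gb: float, **kwargs) -> bool:
    """True if the estimated footprint is below the machine's total RAM."""
    return estimate_model_memory_gb(params_billions, **kwargs) < ram_gb


# Print a small table of estimates for a few illustrative model sizes.
for size in (3, 8, 32, 70):
    print(f"{size}B parameters: ~{estimate_model_memory_gb(size):.1f} GB")
```

Remember that the operating system and your other apps need memory too, which is why a model that fits on paper can still feel cramped on a smaller machine.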
Local AI
A few years ago, when I started playing with LLMs, you needed to get seriously under the hood. Long was the night when I monkeyed around with models I downloaded from Hugging Face and tried to get their idiosyncratic details running in Python. I did this almost exclusively on the big computing cluster at Hopkins, and while I could get it working—and even did some cool research on OCR—it was a gigantic hassle.
Hugging Face has got to go down as the silliest name in economic history. And it will go down in economic history because it is a vast repository of free, open-weight models. I want to use a historical analogy, but at no point were steam engines or assembly lines freely given away. You can download the “weights” of LLMs (which are the important parts) and do whatever you want with them.[1] End of story. It is pretty amazing. These models are not exactly the cutting-edge “frontier” models of OpenAI and Anthropic, but they are pretty dang close.
That said, it can be hard to learn how to use these models. The documentation, while extensive, is pretty alienating.
Here I offer a brief aside (and an apology for the intended pun): the Hugging Face founders apparently named it after this emoji 🤗, which depicts a face with a hug. Aww. Of course, I assumed it was named after the semi-larval monster from Alien, the facehugger. Future literary scholars will make a lot of this slippage, I think.
The larger point is that this repository, while amazing for the technically inclined, can be daunting.
A novice user would be better served using Ollama. Ollama started as a way to use the free models released by Meta/Facebook called Llama, but it has since expanded. If you look at its library, you can see many easy-to-use models. You can download the app and it runs like a chatbot on your computer.
For the more advanced user, Ollama has a great feature: the Ollama server. You start up an Ollama server on your computer and you can interact with it locally like you would with ChatGPT or Claude.[2] Because the server exposes an API, it is very easy to call Ollama from your own Python code.
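To make the API idea concrete, here is a minimal sketch of calling a local Ollama server from Python using only the standard library. It assumes the server is running at Ollama's default address (port 11434) and that you have already pulled a model; the tag "llama3.2" and the function names are just illustrative choices, so swap in whatever you have installed.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Package a prompt as a POST request for the local Ollama server."""
    # "stream": False asks for one complete JSON reply instead of chunks.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.2", timeout: float = 300) -> str:
    """Send the prompt and return the model's reply as plain text."""
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        body = json.loads(reply.read().decode("utf-8"))
    return body["response"]
```

With the server running, something like ask("Summarize this letter in one sentence: ...") returns the reply as an ordinary string, which you can then save, parse, or pass along to the next step of your code.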
Not having to understand how to run the model is the entire point of ChatGPT or Claude—and you can do it on your computer for free.
The models you can run are pretty good at a range of tasks. More importantly, unless you are in a rush, they are an easy way to scale up your projects. You can let Llama do OCR for you over a week if you want. You can run Claude Code with Qwen, no paid tokens needed. In the last few months, moreover, we have seen an explosion of models specialized for the Mac that use “MLX” (Apple’s machine-learning framework for Apple Silicon) and have really increased token-generation speed.
Ollama addresses many of the key concerns that critics have. Your data stays local. Apple Silicon is remarkably power efficient next to the NVIDIA GPUs that run most of the cloud. You aren’t handing over money to potential monopolists. You aren’t supporting a potential robot uprising.
And it is just cool to have it right there on your computer.
The Virtue of Small Models
When I am writing code to accomplish history tasks (OCR, OCR correction, fact-checking, and the like), I always run experiments. The assumption I used to have was that the biggest, newest model was the best model to use.
That is incorrect.
Instead, what I often do is break my processes into steps. For instance, I often want to read documents and pull out structured information. I used to do this in one step with an expensive API call to OpenAI. Splitting it up works better, and it is cheaper. Each step (OCR, OCR correction, named-entity recognition, JSON cleaning) uses a different model. I use a big, smart model to check the results (as well as spot-checking the results myself), but I often find that older, smaller models (like qwen2.5:7b) do a better job than the recent big boys.
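The step-by-step idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual pipeline described above: the step names, prompts, and most model tags are placeholders, the pairing of qwen2.5:7b with OCR correction is only an example, and the OCR step itself is left out so the sketch starts from raw OCR text. The run_model argument stands in for whatever function you use to send a prompt to a model.

```python
from typing import Callable, Dict, List, Tuple

# (step name, model tag, prompt template). Tags and prompts are placeholders;
# the point is only that each step can be handed to a different model.
STEPS: List[Tuple[str, str, str]] = [
    ("ocr_correction", "qwen2.5:7b",
     "Fix the OCR errors in this text. Return only the corrected text.\n\n{text}"),
    ("entities", "placeholder-ner-model",
     "List the people, places, and dates in this text as JSON.\n\n{text}"),
    ("json_cleaning", "placeholder-json-model",
     "Return only valid JSON, with no commentary.\n\n{text}"),
]


def run_pipeline(
    text: str,
    run_model: Callable[[str, str], str],  # (model tag, prompt) -> reply
    steps: List[Tuple[str, str, str]] = STEPS,
) -> Dict[str, str]:
    """Feed each step's output into the next and keep every intermediate result."""
    results: Dict[str, str] = {}
    for name, model, template in steps:
        text = run_model(model, template.format(text=text))
        results[name] = text  # saved so each step can be spot-checked later
    return results
```

Keeping the intermediate results is what makes experimenting cheap: you can swap the model tag for a single step, rerun, and compare that step's output without touching the rest.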
As you integrate Ollama into your workflow, experiment with what works. It is free.
Next Steps
Get a Mac if you don’t have one. Download Ollama.
Use ChatGPT or Claude one last time to explain how to get it running, and don’t look back.
[1] Weights are the trained parameters of a model: the numerical values that determine how it responds to input.
[2] For the technical reader: running Ollama in a Docker or Singularity container on a high-performance cluster provides a very useful level of abstraction, especially on locked-down HPCs that don’t allow you to install software.


