A Camera, Not an Engine II
Further thoughts on photography in latent space, now with agents!
I hadn’t originally planned on writing a sequel to my December 2023 essay, A Camera, Not an Engine. It seemed like a mostly complete thesis when I first wrote it. But it’s become increasingly clear in the 2.5 years since that the thesis is both bigger and more incomplete than I thought, and also getting truer by the day.
The basic idea of the essay was that generative AIs are primarily instruments for seeing in latent space, not engines of utilitarian production, despite the adjective. The title was a reference to Donald MacKenzie’s book, An Engine, Not a Camera, which made the opposite argument about the economy. In both cases, the argument was about flipping the view of what the thing was.
This theory has felt ever more right since I first proposed it, but I’ve also felt it’s missing some pieces. One obvious missing piece is a proper camera-theoretic account of agentic AI, which at first sight seems more engine-like. We’ll sort that out after laying some groundwork.
One critical piece was supplied by Sreeram Kannan, who offered a definition of intelligence in a recent conversation:
Intelligence is a unit of information driving a unit of energy.
This is a deceptively simple definition, one that immediately cuts to the computational heart of the matter. I suspect some rigorous version of this will eventually be enshrined alongside ideas like Landauer’s principle. It’s not an idle thought. Sreeram is the founder of Eigenlabs, which is pushing the boundaries of AI in some of the most interesting ways today. They’re betting their technology roadmap on this definition in some ways.
Now, what could this definition mean? How does it help develop the camera/engine frame further? Let’s start with something that came before the camera — the telescope.
***
My introduction to astronomy came via an antique brass telescope in 1986, when my school astronomy club organized a Halley’s Comet viewing event. That event changed the course of my life in many ways, but I want to talk about that telescope.
For several years, I revered that telescope. It was a big, beautiful, heavy brass refractor on a heavy equatorial mount, with finely engraved brass setting circles. An instrument of the sort they don’t make anymore. So heavy, it had to be bolted to a wooden frame to allow two of us to carry it to the rooftop to place on its mount. Our school had inherited it as a hand-me-down from some American school.
But I also wanted my own telescope, and eventually I got one — a cheap, locally made Newtonian reflector. The tube was PVC. The eyepieces were cheap plastic. The mount was a simple no-frills altazimuth mount. The thing had absolutely no gravitas. I could lift it with one hand.
And it was radically better than that old brass beast.
So long as the old brass telescope was the only one I knew, it was something of a sacred object. Once I looked through a better one, everything changed, and I saw it for the obsolete museum piece it was. The antique didn’t have a properly achromatic lens. The equatorial mount had jammed at the declination of some random North American latitude, so the setting circles were useless, and you had to point it by navigating using the constellations. The views were blurry and chromatically fringed.
The astronomical telescope is an instrument with one job: to find, track, and show you things in the sky clearly. It is a rudimentary sort of intelligence too, using units of information (location and time information) to drive units of energy (the effort to point the telescope in a given direction by slewing on the two axes). When it does its job, this rudimentary intelligence loop anchors a bigger one — the information flowing from the skies into your eyes, shaping your thoughts, and then the energy driving any actions that follow those thoughts. In my case, that outer loop was a very consequential one. I almost went to grad school for astrophysics (at IUCAA in Pune, an astrophysics research institute) but turned it down to go to grad school in the US instead, where my PhD ended up being about the engineering side of interferometric space telescopes. Life-changing you might say. And since I run at about 100 watts, that early experience is still probably driving perhaps 2 watts of my average energy output.
My telescope arc reached its zenith around 2021, when I got a chance to spend a night at the Mount Wilson observatory and look through both the legendary 100” reflector and the 60” Hale reflector. These beasts take powerful motors to steer, and are among the last major frontier telescopes to be equipped with eyepieces for humans. Since then, all research telescopes have essentially been cameras. There’s no point in looking through them. Even serious amateur astronomy has gone that route. A high-school astronomy club friend I recently got back in touch with (a different, unrelated Kannan as it happens) has turned into an accomplished astrophotographer now. Our old shared experiences are probably driving 4 watts for him now. Information can drive energy over really long periods.
Looking through eyepieces is mostly for poets now. Not serious astronomers, whether professional or amateur.
My own astronomy adventures are now down to occasionally lugging out a modest hand-me-down telescope (thanks Ralph Witherell) on rare nights of clear seeing. I didn’t get far in my own astrophotography experiments. It requires more patience than I possess, and I kinda have an anachronistic attachment to actually looking through eyepieces rather than at photographs.
It might not be an entirely irrational impulse. Mark just sent me this really lovely essay on how the gamut of the digital camera differs from that of the human eye (takes me back to my Color Science 101 days at Xerox), so maybe there really is a bit of difference between looking through the telescope and looking at photographs taken with one.
But let’s get back to the brass telescope, and make up a parable about it.
***
Imagine a blind astronomer who has a deep passion for the stars and planets. What could this astronomer do to pursue their passion?
There are two possibilities.
First, they could work on the images streaming out of the cameras with various analytical tools, doing all kinds of technical analysis. This is in fact what modern astronomers do. The human eye hasn’t been particularly relevant to astronomy in decades. You could do bleeding edge astronomy work, and I mean empirical, observational astronomy, not theoretical astrophysics, without ever looking through a telescope or even at the photographs.
Of course, few actual astronomers are that soulless, I imagine. I suspect most still look up often from their spectral charts and pages of math at the skies, and occasionally take an anachronistic and sentimental peek through a poet’s telescope that still features a vestigial eyepiece.
This possibility is not particularly interesting. It’s a reasonable and pragmatic way to pursue an interest in observational astronomy as a blind person. Or a sighted person for that matter.
The second possibility is, our blind astronomer could fetishize the instrument itself. The brass finger pointing at the moon. This is the interesting possibility.
Imagine a blind young astronomer in my position in the late 1980s, faced with a stark choice between a beautiful but functionally crappy antique telescope and a utilitarian but functionally superior one. Imagine further that our blind young astronomer has spent years caring for the ancient brass instrument, polishing the brass, cleaning the lenses, carefully taking eyepieces in and out of antique velvet-lined cases, becoming intimately familiar with every groove and curve.
Now he touches the new telescope — warm PVC, light plastic eyepieces in a cardboard box.
We can imagine a certain possibility — the sentimental attachment to the old instrument as object, rather than as a medium for seeing, is too overwhelming. Our blind astronomer retreats into a curious place: Insisting that the antique brass telescope is the superior instrument.
There is, of course, a third possibility: The blind astronomer abandons astronomy, and transforms his sentimental attachment to the brass telescope into an antiquarian-historian interest in telescopes. But then, he’s no longer an astronomer, and exits our parable.
***
Let’s turn to AI now, and consider the nature of words in human society in light of the parable of the blind astronomer and the brass telescope.
Like telescopes, words are both instruments of seeing, and objects deserving of attention in their own right, as embodiments of the wordsmith’s craft.
In 2011, I wrote what became one of my most popular Quora answers, in response to the question, What are some tips for advanced writers? My answer made a distinction between two kinds of writing: Writing to think, and writing to write. The key bit:
The divide between thinkers and writers is more important than the one between fiction and non-fiction writers. You could divide the world of advanced writers into a 2x2, based on whether they are prioritizing developing their thinking or their writing, and whether they are focusing on fiction or non-fiction.
My hypothesis (I haven't yet gotten to a stage where I can check this) is that it is easier to cross the fiction/non-fiction divide than it is to cross the writing-first/thinking first divide.
Now, 15 years later, this is no longer a hypothesis. I can claim with some confidence that this divide is radically hard to cross, and I can’t actually think of a single person I personally know who has crossed it. I certainly haven’t.
There are people who exhibit some degree of ambidexterity, but everybody seems to land on one side or the other, net.
If you’re unsure where you land, one tell is how you react to editors. No serious writer enjoys being edited, so the signal is what sorts of editing you grudgingly accept as valuable anyway, and what kinds you absolutely refuse to countenance.
Those who write to think typically resist any attempt to change the content of what they’re saying, but generally don’t care about style, verbal precision, tightening, and pragmatic cutting suggestions to hit word-count limits.
Those who write to write are typically attached to every word and comma, but can be surprisingly indifferent to substantial content edits and highly open to saying entirely different things than they originally set out to.
Writing to think, and writing to write. Or in the language of our brass telescope parable, the sighted, attached to looking through words, and the blind, attached to looking at words. Beautiful, heavy brass words.
Both kinds of writers face a moment of crisis today, perhaps similar to the moment in history when eyepieces began to be replaced by cameras in telescopes (though from my understanding of that history, astronomers generally didn’t have the strong attachments to their instruments that writers do, and for the most part eagerly jumped into photographic astronomy).
Those who write to think face one sort of crisis of the psyche — writing is no longer the only general way to think, and rarely the best way, and they must either adapt to newer tools of thought or abandon the frontiers of thinkability and retreat to the shrinking number of niches where old tools work better.
Those who write to write face another sort of crisis of the psyche — they must choose between becoming antiquarians of words and defenders of a thesis of the necessary superiority of hand-wrought brass words.
***
I’m obviously in the adaptive thinker tribe, and I’m content to leave the other three tribes to their devices. It’s not even much fun anymore to troll the defenders of brass words. When I hand-write these days (this essay is an example), it’s because I don’t yet have the skill to wield generative AI to do the job. The shortcomings are as much mine as in the evolving tech. I’d be happy to let AI write essays like this for me the minute their capabilities, and my skills at wielding them, allow it.
At this point, the main question that interests me is how to think with AI, and what role, if any, words ought to play in emerging modes of AI-assisted thinking. The more words become unnecessary for thinking, the more I discover I’m not primarily a writer. I write primarily when that’s the laziest mode of thought available. AI offers lazier modes that yield as good, and increasingly, better thoughts.
Let’s start with a characterization of natural language that will allow us to apply Sreeram’s definition of intelligence as a unit of information driving a unit of energy.
First, natural language has now clearly become a compile target for pre-verbal thoughts for at least the write-to-think types among us. The prompts I write to produce a generated essay aren’t actually the thoughts I want to think through. They are more like telescope steering actions — looking up the coordinates of objects in the sky, punching them in, and getting the telescope to point in the right direction. Prompting is pointing at things.
Second, natural language has equally clearly become a programming language for automatically triggered post-verbal behaviors. This is one of the new developments since I wrote part 1. The output of a prompt is not necessarily text you read. It can be text for computers to read (rendering moot the question of whether humans could enjoy it), or code that runs and does something. Prompting is programming behaviors.
In sequence, prompting as pointing at things, and prompting as programming behaviors, represent the feedforward path of AI use. We’ll talk about the feedback path in a minute — the camera/engine distinction rests on that.
In feedforward mode though, increasingly, natural language feels like a hidden layer in thinking, mimicking the structure of the systems that you’re thinking with.
The input layer is pre-verbal or partly-verbal ideas, at least for me. A good deal of my pre-writing thinking is visual, affective or even somatic (vague finger-tip or gut feelings). Thought-forms that offer just enough verbal purchase to express as prompts to point the AIs.
The administrative layer, the only natural language you touch, steers the camera to point in the right part of latent space corresponding to those ideas.
Intermediate output layers might be close to human natural language (markdown files) or distant (JSON, code, binary…), but the point is, they’re usually not meant for human consumption at all. Intermediate output is for AI talking to itself. It may or may not stay close to human natural language as it evolves.
The final output layer is where some sort of energy flow is shaped to create intelligent behaviors. Today, this is mostly compute energy. Your prompt might end up as a piece of code that then runs persistently on a server, consuming watts. Increasingly though, it is generalized forms of robotic energy and other kinds of physical-intelligence energy.
When I consider the thinking I’ve done in all my vibe-coding projects over the last few months, it is startling how little of it is in natural language that I produced or consumed, and how little of that is part of the content of the thoughts as opposed to the administration of the thinking.
In a very literal sense, my thinking has become increasingly post-verbal. Only a small part of it is verbal, and it’s dominated by the administrative steering part.
To be clear, that’s demanding thinking. Executive managerial attention deployed with the steady intensity of maker attention, rather than in the spiky way we’re used to expending it. I forget who pointed this out, but Paul Graham’s idea of manager time increasingly looks like his idea of maker time. Managerial energy and attention is increasingly expended through maker-like 4-hour vibe-coding blocks rather than 1-hour meeting blocks. It’s manager time nevertheless.
This managerial work takes the form of natural language communication, but is vastly more exhausting, because every word might unleash a thousand more, and those thousand words might govern computers and drones. Administrative natural language in thinking with AI is increasingly acquiring a speech-act-like character, like the words of judges when they pronounce verdicts. Or Rameses in The Ten Commandments declaring “so let it be written, so let it be done.” We are all pharaohs now.
A lot of the verbal thinking goes meta-verbal in the process, where you have to think legalistically about large piles of words. For example, in constructing RAG bots to work with a corpus of text, you have to understand the contours of that corpus and how to navigate it semantically and mechanically. Meaning and form both matter, and both must be shoveled around by the thousand. It’s very artisanal work, but not wordsmith work.
***
It’s not that hard to play the current evolutionary trajectory out a decade or so. It will become possible to do an increasing proportion of your internal thinking in non-verbal ways, and have AIs conform to the visible surfaces of that thinking through increasingly rich interfaces. And on the output end, it will be increasingly possible for the shaped energy to take just about any form that can be actuated.
Here’s a simple speculative example. It is 2032. I have a vague desire to experience a fantasy story. I say to my AI system — “give me a fantasy story.”
The AI begins by retrieving my history of story consumption, and flashing various storyboard elements at me — dragons, knights, damsels, magic potions — and tracking my facial responses. Perhaps I’m wearing an MRI helmet too, and it’s monitoring my fitness tracker.
Purely by monitoring my somatic and non-verbal neural responses, and using a kind of idea-diffusion computational approach, it begins to converge on the elements and vibes of the sort of story I want to experience. These then turn into motifs, sequences, and plotlines. Fragments of dialogue begin to appear, as do leitmotifs and world-building elements. I’m presented with more or less verbalized forms of the story: dialogue-heavy vs. image- and affect-heavy. Once it has fingerprinted my subconscious desires sufficiently, it presents me with a series of trailer comps and fictional book reviews/back cover blurbs. Again it monitors my reactions, figuring out the medium I want, and the type of story. Every narrative option is on the table: book, comic book, movie, musical album, theme-park ride, video game.
Eventually, it locks in, and produces something it figures will scratch my itch. The more preferences can be revealed, the less necessary it becomes to state them.
Literally nothing in this speculation is science-fictional. We possess all the pieces required to do a rudimentary version of this today. It would be janky as hell, and painful both to prompt and to sit through, but it is already possible to hit, say, the quality levels of Hallmark Christmas movies.
All that remains to be done is refine all the pieces, integrate them better, and of course, keep improving the models they rely on and the hardware those models run on.
Besides my rather on-the-nose technology assumptions, this speculative example rests on a more oblique assumption: that “story” thinking is not actually the same as “verbal” thinking. They’ve just been historically coupled because we’ve lacked the technology to separate them.
This is not a radical assumption. In fact, you’ll find some version of this assumption in many fiction writing guides. Skill at story and skill with words are entirely different things (as a simple example, consider Ikea manuals featuring the famous Ikea man). The reason this distinction matters now is that there are many kinds of non-verbal thinking that happen to be tied to verbal thinking today in seemingly inseparable ways.
The more AI advances, the more different kinds of thinking become separable from verbal thinking, deprogramming centuries of Gutenberg-head in decades.
***
To bring the story back to the camera-engine frame, consider now the feedback path in agentic behaviors.
Any sort of agent, conceptually, is a very simple feedback loop. It sees, thinks, and does in a feedforward path, and in a feedback path, it registers the difference between expected and experienced outcomes. Elementary feedback control — an error signal drives further action.
This error signal is the raw information entering the system over time, and it sets the rate at which raw context actually expands.
If you always see exactly what you expect, up to the limit of your indifference, the effective error is zero, and the feedback loop is superfluous. There is no misregistration between your expectations and outcomes to worry about. Cheap toasters and psychotic one-shotters work that way. Normally though, in real domains, there is a non-zero error signal that must be dealt with iteratively, and driven to zero. Whether you do so mindfully or brutally is what determines the nature and quality of your intelligence.
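The elementary loop described above can be sketched in a few lines. This is only a toy, under assumptions that are not in the essay: a single numeric state, a fixed proportional `gain`, and an `indifference` threshold standing in for the limit of indifference. The function name `run_feedback_loop` and all its parameters are hypothetical.

```python
def run_feedback_loop(target, state=0.0, gain=0.5, indifference=0.01, max_steps=100):
    """Toy agent loop: act until outcomes match expectations closely enough.

    All names and default values here are illustrative assumptions.
    """
    errors = []
    for _ in range(max_steps):
        error = target - state           # misregistration between expected and experienced
        errors.append(error)
        if abs(error) <= indifference:   # within the limit of indifference: stop correcting
            break
        state += gain * error            # the error signal drives further action
    return state, errors


# A domain with real misregistration: the loop has to iterate.
final_state, errors = run_feedback_loop(target=10.0)

# A domain where you see exactly what you expect: the loop is superfluous.
_, no_surprises = run_feedback_loop(target=0.0)

print(len(errors), len(no_surprises))
```

In the second call the very first error is already zero, so the loop exits without taking any action, which is the toaster case. In the first call each pass shrinks the error, and the list of errors is the information that entered the system along the way.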
The question now is, when is such an agent a camera, and when is it an engine? Given that there is both sensing and acting in the loop, it’s tempting to answer why not both?
This, I assert, is the wrong answer.
The thing is, the seeing can outrun the doing. This is camera mode. And the doing can outrun the seeing. This is engine mode. One drives errors to zero mindfully, the other brutally.
There can be a lot of information shaping very little energy, and very little information shaping a lot of energy. Intelligence is when the balance between mindfulness and brutality is right for the context. You can overthink and underthink, relative to the resolution of actions required (or equivalently the indifferences in outcome preferences).
In some situations, indifference is high enough that very coarse action regulation suffices. In other situations, very precise action regulation is required. One calls for little to no feedback (one-shotting being the extreme case), the other calls for a great deal of feedback.
Agentic loops that are camera-like produce a surplus of information via rich feedback. Ones that are engine-like produce a surplus of externalities via impoverished feedback. Unintended consequences that you may be indifferent to, but others might not be.
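One rough way to see the difference is to hold the action rate fixed and vary only how often the loop looks. The sketch below is a toy under loud assumptions: `run_loop`, `observe_every`, and the use of accumulated deviation from the target as a crude proxy for externalities are all hypothetical choices made for illustration, not anything defined in the essay.

```python
def run_loop(observe_every, steps=60, gain=0.5, target=10.0):
    """Act on every step, but refresh the error estimate only every `observe_every` steps."""
    state = 0.0
    believed_error = 0.0
    observations = 0
    accumulated_deviation = 0.0   # assumed proxy for externalities nobody in the loop is watching
    for t in range(steps):
        if t % observe_every == 0:
            believed_error = target - state   # fresh feedback enters the loop
            observations += 1
        state += gain * believed_error        # action, possibly driven by stale information
        accumulated_deviation += abs(target - state)
    return observations, accumulated_deviation


camera_like = run_loop(observe_every=1)    # seeing keeps pace with doing
engine_like = run_loop(observe_every=10)   # doing outruns seeing

print("camera-like:", camera_like)
print("engine-like:", engine_like)
```

With these particular numbers, the sparse-feedback run overshoots the target and then swings back and forth with growing amplitude, because each burst of action is driven by an error reading that went stale after the first step. The dense-feedback run gathers ten times as many observations and settles quietly.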
To date, agentic AI has seemed very engine-like mainly because it has been applied in highly playable domains where two things are true: The misregistration between expectations and outcomes tends to be low, and is a function of mistakes rather than ignorance or information deficits. In chess, for example, if a move causes play to unfold in unexpected ways it’s because you don’t understand the mechanics of the game deeply enough and haven’t computed far enough. Not because you misread the board position or because new pieces suddenly appeared on the board. You do not really need feedback as such. Players can play a game by sending messages with moves back and forth, and the game states they maintain will stay synchronized, error-correctable, rewindable, and replayable. AI is teaching us that the universe is unreasonably playable in this sense, but still, few real domains are as playable as chess.
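The playability property described here, states that stay synchronized from messages alone, can be sketched with a deterministic transition function and a move log. As a stated simplification, the example uses a 3x3 grid as a stand-in for a chess position; `apply_move`, `replay`, and the player names are hypothetical.

```python
from functools import reduce

EMPTY_BOARD = (" ",) * 9   # toy 3x3 board standing in for a chess position (assumption)


def apply_move(board, move):
    """Pure transition: the next state depends only on the current state and the move."""
    player, cell = move
    if board[cell] != " ":
        raise ValueError(f"cell {cell} is already taken")
    return board[:cell] + (player,) + board[cell + 1:]


def replay(moves, start=EMPTY_BOARD):
    """Rebuild a position from nothing but the message log."""
    return reduce(apply_move, moves, start)


messages = [("X", 4), ("O", 0), ("X", 8)]   # the moves sent back and forth

alice_board = replay(messages)   # each player keeps a private copy of the state
bob_board = replay(messages)
rewound = replay(messages[:-1])  # rewinding is just replaying a shorter log

print(alice_board == bob_board)
```

Neither player ever has to look at the other's board. Because nothing enters the state except the moves themselves, the two copies cannot drift apart, and any earlier position can be recovered by replaying a prefix of the log. Open domains lack exactly this property, which is why they need real sensing.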
This view suggests an interesting reading of AI psychosis of the agentic variety. When you operate in a domain that can frictionlessly absorb enormous amounts of cognitive energy without doing any real damage, you experience a positive feedback loop that can turn psychotic. There is no error signal regulating your thinking.
The more you operate in open, low-playability domains though, domains with friction, ontological openness, and real noise, the more you must choose between generating a surplus of information through feedback, or causing invisible unintended consequences. Consequences that may provoke hostile responses to your psychotic tendencies down the road.
One way to cash out the difference between the two modes is that old pair of terms, exploration and exploitation.
Cameras explore. They produce more information than they consume in regulating their own actions. Accumulating context outruns the action, which tends towards maximally mindful.
Engines exploit. They unleash more energy than they can control, based on a slow-growing store of information driven by minimal feedback. Action outruns context, and tends towards maximally brutal. There’s a reason we describe human engine-like behaviors as oblivious or tone-deaf.
To some degree, engines are necessarily stupid, malicious, or indifferent to consequences. In highly playable domains, they can enter extended psychotic regimes of exponential “productivity” where they do no work because they encounter no resistance. But they also generate no value.
As many organizations are finding out, this kind of atomized psychotic “productivity” in the high-playability pockets of an organization, far removed from layers of contextual feedback signals, does nothing for the bottom line or operational effectiveness. It is sound and fury signifying nothing at best, and an engine tearing itself to pieces at worst, in a shriek of explosive token bills.
Without the appropriate feedback loops at all levels, keeping context growing faster than action, behaviors just get dumber and more damaging and head towards runaway meltdown conditions.
Which suggests a very interesting reading of our historical moment. Are we going to turn the most powerful camera ever built towards new frontiers of exploration, or are we going to let it drive an epidemic of psychotic meltdowns masquerading as productivity leaps?

