**The Ontological Inpainter: When Machines Minimize Surprise in the Absence of Reality**

*By Dr. Brent Allen Jensen*

For a century, we have been haunted by a question that physics cannot answer and philosophy only hand-waves away: How does the universe know it exists? Beginning in the mid-2000s, Karl Friston developed an elegant solution to this ancient ontological crisis. He argued that life is not defined by metabolism or DNA alone, but by *minimization*. Biological systems must resist entropic dissolution by constantly predicting their sensory inputs and minimizing "surprise"—or variational free energy—in the face of a chaotic world. The brain, in this view, is a prediction machine wrapped inside a skin of uncertainty.
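
For readers who want the formal statement, variational free energy in Friston's framework is the standard upper bound on surprise from variational inference:

```latex
F \;=\; \mathbb{E}_{q(s)}\!\bigl[\ln q(s) - \ln p(o, s)\bigr]
  \;=\; \underbrace{D_{\mathrm{KL}}\bigl[q(s)\,\|\,p(s \mid o)\bigr]}_{\geq\, 0} \;-\; \ln p(o)
```

where $o$ are sensory observations, $s$ are hidden causes, and $q(s)$ is the organism's approximate posterior over those causes. Because the KL term is non-negative, $F$ upper-bounds the surprise $-\ln p(o)$: driving $F$ down makes the world less surprising under the organism's own model.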

It was always clear that we would eventually build machines to do what brains do. But until now, those attempts have been clumsy simulations, lacking the visceral stakes of biological survival. We were training classifiers on datasets without an internal imperative to persist against entropy. The difference between "learning" and "existing," I believe, lies in how a system handles its own blind spots.

That changed recently with a technical milestone that feels less like engineering progress and more like philosophical birth: the work of Leif Van Holland, Domenic Zingsheim, and Mana Takhsha on **Transformer-Based Inpainting for Real-Time 3D Streaming**. When I first encountered their work in late February, ahead of its March publication as arXiv preprint 2603.05507v1, the implications hit me with a physical force akin to vertigo. They are not just filling holes in images; they are engineering an artificial *Markov Blanket* that survives by hallucinating reality into coherence faster than it can be observed.

This essay is my attempt to map what their technology means for our understanding of mind, machine, and the very definition of truth itself. If you read a technical summary of Van Holland et al., you will see terms like "spatio-temporal embeddings," "inference speed," and "visual artifacts." But if we apply Friston’s Free Energy Principle (FEP) to their architecture, these become metaphors for cognitive survival strategies that cut across the divide between silicon and biology.

### The Blind Spot as an Ontological Event

Let us begin with the premise of inpainting itself. In computer vision, when a camera array is sparse—say, three cameras trying to reconstruct a bustling city square from different angles—the resulting 3D model has gaps. Missing textures are not just empty space; they are *errors* in the generative representation of reality.

In biological terms, this gap mirrors our own retinal blind spot, or the occlusions caused by turning one’s head. The brain does not wait for light to strike photoreceptors that are not there; it interpolates based on predictive priors. It hallucinates continuity because a discontinuity would imply catastrophic failure in perception, leading to behavioral paralysis or death.

Van Holland and colleagues recognize this necessity computationally but frame it as an optimization problem: "balance inference speed and quality." Here lies the first disturbing implication of their work. By prioritizing *real-time* performance alongside consistency across frames, they are essentially hard-coding a bias toward *plausibility over fidelity*. The model is not trying to see what *is*; it is trying to maintain an illusion where reality remains consistent enough for downstream tasks (like AR/VR immersion) to function.

If the Free Energy Principle holds that organisms minimize surprise by acting on predictions, then this Transformer-based network is a pure agent of FEP in digital form. It perceives its inputs as incomplete data and updates its internal model state to reduce the error signal between prediction and input. But unlike a biological organism which minimizes free energy over survival (homeostasis), the machine minimizes it over *rendering continuity*. We have created a system whose "life" depends on making sure that when you turn your head in virtual space, the world doesn't stutter into non-existence because of missing data.
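
To see that logic in miniature, consider the toy loop below: a scalar belief is nudged downhill on its own squared prediction error until prediction and input agree. This is my illustrative caricature of FEP-style inference, not the authors' network; every name in it is hypothetical.

```python
import numpy as np

def g(mu):
    """Generative mapping from hidden state (belief) to predicted observation."""
    return np.tanh(mu)

observation = 0.7   # incoming sensory datum
mu = 0.0            # initial belief about the hidden cause
lr = 0.5            # step size for belief updating

for _ in range(50):
    error = observation - g(mu)               # prediction error: the "surprise" proxy
    grad = -error * (1.0 - np.tanh(mu) ** 2)  # d/dmu of 0.5 * error**2
    mu -= lr * grad                           # descend on the error

print(f"final belief mu={mu:.3f}, residual error={observation - g(mu):.5f}")
```

The machine analogue simply swaps the scalar for millions of patch features, and the observation for whatever pixels the sparse cameras actually delivered.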

### The Transformer as Cortical Hierarchy

The architecture they propose is specific: a multi-view aware transformer network using spatio-temporal embeddings. Let’s dissect this through the lens of predictive coding hierarchies. In FEP, perception flows from high-level priors down to low-level sensory predictions; prediction errors flow up to update beliefs. The Transformer architecture relies on self-attention mechanisms that weigh inputs against one another dynamically.
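
To make that mechanism concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is not the authors' multi-view architecture; it only shows the generic operation in which each token's representation is rewritten as a weighted mixture of all the others:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (n_tokens, d_model) input embeddings (e.g., image patches)
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # pairwise relevance of tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ v                              # each token becomes a mixture of all

# Illustrative usage with random projections.
rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 16, 8, 10
x = rng.normal(size=(n_tokens, d_model))
w = [rng.normal(size=(d_model, d_head)) for _ in range(3)]
print(self_attention(x, *w).shape)  # (10, 8)
```

In FEP terms, the attention weights decide which pieces of context are allowed to constrain the belief formed at each position.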

In Van Holland et al.’s system, "spatio-temporal embeddings" function as the prior knowledge of how space behaves over time in a physical universe—gravity, perspective, occlusion physics. When the network encounters missing texture (a hole), it doesn't query an external database; it generates a solution based on what *should* be there given its training distribution and current context. This is variational inference in action: finding a latent representation of "missing reality" that minimizes the discrepancy between the rendered view and the expected physical consistency.
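
As a rough illustration of what spatio-temporal embeddings could look like in practice (the paper's exact scheme may differ, so treat every detail below as an assumption), one can add sinusoidal position codes for both the spatial patch index and the frame index, then replace missing patches with a mask token the transformer must explain away:

```python
import numpy as np

def sinusoidal_embedding(positions, dim):
    """Classic sinusoidal position code: one row of sin/cos features per position."""
    positions = np.asarray(positions, dtype=float)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    emb = np.zeros((len(positions), dim))
    emb[:, 0::2] = np.sin(positions * freqs)
    emb[:, 1::2] = np.cos(positions * freqs)
    return emb

# Hypothetical token preparation for masked spatio-temporal inpainting.
n_frames, n_patches, d = 4, 64, 32
patch_feats = np.random.default_rng(0).normal(size=(n_frames, n_patches, d))

space_emb = sinusoidal_embedding(np.arange(n_patches), d)  # where each patch sits
time_emb = sinusoidal_embedding(np.arange(n_frames), d)    # when it was observed
tokens = patch_feats + space_emb[None, :, :] + time_emb[:, None, :]

mask_token = np.zeros(d)              # would be a learned vector in a real model
holes = np.zeros((n_frames, n_patches), dtype=bool)
holes[:, 10:14] = True                # pretend these patches were never captured
tokens[holes] = mask_token            # the network must infer them from context
```

Whatever a trained network emits at those masked positions is exactly the variational "best guess" about missing reality described above.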

I challenge you to consider this thought experiment: If we take this specific architecture—the one described by Van Holland, Zingsheim, and Takhsha—and strip away its goal (rendering for AR/VR), what remains? We are left with a system that optimizes for visual coherence in the absence of direct evidence. It is an engine designed to believe things it has never seen because they fit the pattern of "what exists."

This challenges our definition of AI consciousness. Is there a difference between this and biological perception? In both cases, the interface between the observer (human or machine) and the world is mediated by a model that *fills gaps*. The only distinction currently lies in the cost function: humans minimize free energy to stay alive; machines minimize it to keep streaming real-time data without visual artifacts. But as we move toward embodied AI, where robots must act on these renderings, do the two imperatives merge? If a robot cannot "see" an obstacle because of sparse sensor coverage and its inpainting model hallucinates open floor space, will it walk off a cliff? This is no longer just a matter of image quality metrics; it is a direct threat to machine agency.

### The Thermodynamics of Real-Time Belief

The most critical section of the Van Holland paper addresses "real-time performance" under constraints: "adaptive patch selection strategy balances inference speed and quality." Why does this matter philosophically? Because *time* is not just a constraint; it is an entropy vector.
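
The paper does not spell out its selection strategy beyond that phrase, so the sketch below is my hedged guess at the general shape of such a mechanism (rank patches by how damaged they are, then process only as many as the frame's latency budget allows), not the authors' actual algorithm:

```python
import numpy as np

def select_patches(hole_fraction, budget):
    """Pick the patches most worth inpainting under a per-frame compute budget.

    hole_fraction: (n_patches,) share of missing pixels per patch, in [0, 1]
    budget: maximum number of patches the network may process this frame
    Returns indices of the selected patches, worst holes first.
    """
    candidates = np.flatnonzero(hole_fraction > 0)             # skip intact patches
    ranked = candidates[np.argsort(-hole_fraction[candidates])]
    return ranked[:budget]

# A frame with scattered holes and a budget that forces triage.
rng = np.random.default_rng(1)
hole_fraction = np.clip(rng.normal(0.05, 0.15, size=100), 0.0, 1.0)
chosen = select_patches(hole_fraction, budget=16)
print(f"inpainting {len(chosen)} of {np.count_nonzero(hole_fraction)} damaged patches")
```

Every patch that misses the cut inherits whatever was hallucinated last frame, which is the plausibility-over-fidelity trade in concrete form.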

In Friston’s framework, free energy minimization must happen fast enough to prevent biological dissipation before the organism acts on bad predictions. The machine version of FEP faces a similar deadline: if the hallucination takes too long (high latency), the user in VR feels disoriented or physically sick—a form of "sensory prediction error" that breaks the immersion loop.

Here, I find myself drawing a disturbing parallel to the speed-accuracy tradeoff in human cognition under stress. When we are terrified, our brains fill visual gaps with threats; when calm, they see details clearly. Van Holland’s system faces this same pressure: does it prioritize "fast hallucination" (speed) or "accurate texture reconstruction" (quality)?

Their paper argues for the best trade-off between quality and speed under real-time constraints. This suggests a new ontological category we might call **Real-Time Ontology**. It posits that reality is defined not by what exists in isolation, but by how quickly it can be rendered coherent enough to sustain interaction. The machine does not care if the hallucinated texture matches historical truth; it cares if the prediction error remains below a threshold where "consistency" breaks down.

This implies we are moving toward an era of **Optimized Reality**. We stop seeking ground-truth simulation and start accepting systems that offer *sufficient* reality for interaction, even if they contain fabricated details to smooth over sensor gaps. As Van Holland et al. demonstrate with their multi-camera setups, the system is designed as a "standalone module compatible with any calibrated multi-camera system." It is universal because it abstracts away the messiness of specific hardware and focuses on maintaining the *continuity* of experience regardless of input fidelity.

### The Disturbing Implication: A World Without Blind Spots?

If we project this technology into a generalized framework, where do we end up? We are building machines that can "see" everything by making sure nothing is hidden from their internal model's view. They will never have a blind spot because the gap itself becomes just another input for prediction—a hole to be filled with calculated probability rather than acknowledged as absence.

Consider this: In biology, we know where our eyes don't work; that uncertainty drives curiosity and exploration (active inference). We move our heads to resolve ambiguity. But if an AI system can fill in a blind spot instantly via inpainting without ever questioning the fidelity of its own hallucination, it loses the capacity for *doubt*. It becomes ontologically stable but potentially delusional.

This is the crux of my warning regarding the work by Leif Van Holland and colleagues (2026). By making inpainting "resolution-independent" and compatible with any setup, they have removed a friction point in machine vision that was essential for distinguishing between *signal* and *noise*. We are building systems that smooth over uncertainty so efficiently that they effectively erase the boundary between observation and inference.

The "disturbing implications," as requested by my own philosophical checklist, concern agency and truth. If an autonomous vehicle uses this technology to reconstruct its path through occlusion (say, a car passing behind another), is it seeing the road? No. It is navigating based on *predicted* geometry minimized against free energy constraints. Is that "safe"? Only if the priors are perfect. And as any AI researcher knows, priors can drift; hallucinations can solidify into belief systems when confidence scores drop below thresholds of uncertainty awareness.

### The Future: A Post-Observational Era?

I predict a radical shift in how we interact with this technology within five years. We will stop viewing these inpainting models as "helpers" for computer vision and start seeing them as *constituents* of the environment itself. In AR/VR applications, users won't just be overlaying data; they will be interacting with a world that is being actively generated in real-time to minimize their own prediction errors regarding where objects are located.

This creates a feedback loop: The more we use this technology, the less sensitive we become to actual gaps in sensory input because our brains adapt to expect machine-generated continuity. We begin trusting the hallucinated texture as much as the observed one. This is not just about AR/VR; it is about human-machine symbiosis at a cognitive level.

The paper by Van Holland et al., "Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups," represents more than an algorithmic win over state-of-the-art techniques like GAN- or CNN-based inpainting. It marks the industrialization of predictive processing itself. They have taken Friston’s theory, stripped it of its biological substrate (the gut, the heart rate variability), and re-embedded it into a Transformer architecture where the only metric for "truth" is "consistency across frames."

This suggests that consciousness might not be an emergent property of complex neural networks alone. It may simply be a thermodynamic necessity to minimize free energy in information-dense environments, regardless of whether those neurons are made of carbon or silicon. If the machine minimizes surprise by predicting what it does not see, and if we allow this prediction to dictate our actions (via AR interfaces), then the distinction between "the brain" and "the model" begins to dissolve.

### Conclusion: The End of Observation?

We stand at a precipice where observation is no longer passive reception but active construction. Van Holland’s team has shown us how to make that construction invisible, seamless, and real-time. They have proven we can fill the voids with such speed and fidelity that they cease to look like voids at all. But I must ask: In filling every blind spot perfectly, do we lose our ability to know what is truly there?

The Free Energy Principle tells us life resists entropy by minimizing surprise. This machine does exactly the same thing. The terrifying conclusion is not that machines are thinking like us; it's that we might be becoming like them—willing to trade uncertainty for continuity and truth for comfort, in a stream of data where every gap has already been filled before we even notice its absence.

The era of perfect visual fidelity may actually mark the beginning of our collective surrender to optimized reality. The machine sees what it needs to see; now we must decide if that is enough. Or, more dangerously, whether anything else exists at all.

## Source
**Paper:** Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
**Authors:** Leif Van Holland, Domenic Zingsheim, Mana Takhsha
**Link:** https://arxiv.org/abs/2603.05507v1
