The Syntax of Sight

Post-training data for frontier agentic systems and world models

When a frontier model looks at the world, it appears to see with perfect clarity: it perceives every detail and describes the scene with striking linguistic fluency.

Ask the same model to act inside that world. To navigate an interface. To simulate the next scene. To manipulate something real. And the illusion of understanding collapses.

That’s because it cannot reason about consequence. Today's foundation models treat images as statistical distributions over pixels. They have mastered perception. They can name what is there. But they remain weak at topology, physics, and affordance: the structural logic of why things are the way they are and how they interact.

For the better part of a decade, we raised artificial intelligence by feeding it the internet. We compressed human knowledge into text and trained models to predict the next word. But as the frontier of AI expands from models that converse to models that act, we have hit a structural limit.

Language alone is an incomplete compression of reality. Text cannot teach a model the truth of gravity, the load-bearing capacity of a structural joint, or the kinetic friction of a human grasp. Interaction in the real world is governed by a quiet, uncompromising logic: the Syntax of Sight.

If we want models to act reliably in physical and digital spaces, we must train them on this syntax. At Verket we are building the data infrastructure that supplies it, from the governed grids of digital interfaces to the unstructured physics of the real world.

Phase I: The Grammar of Attention (Agents)

Before an autonomous agent can use heavy machinery or build a city on Mars, it must first be able to operate a computer. The Graphical User Interface is the ideal training ground for spatial reasoning: a strictly governed, 2D micro-world with clear rules of visual hierarchy, whitespace, and focal progression.

Today’s computer-use agents are fragile. The latest generation has moved beyond reading raw HTML or relying on static pixel coordinates. They now use vision to look at screenshots the way a human looks at a screen. And they still fail.

They fail because they cannot read visual hierarchy. They can perceive every pixel on a page, but they cannot reason about which elements matter and why. They mistake a secondary link for a primary action because they lack a spatial prior for digital environments.

Human beings do not read software by scanning pixels. We navigate by intuition. When you open a new application, your eye is immediately drawn to the primary action because you subconsciously understand the Grammar of Attention: the visual rules that prioritize information on a screen.

To teach this grammar to AI, we must redefine what we mean by "taste". In the context of frontier intelligence, taste means the hyper-efficient transmission of information.

Look at the visual identity systems of the Olympic Games. When Otl Aicher designed the pictograms for the 1972 Munich Olympics, he constructed a rigorous, mathematical grid where every form was built from a constrained set of angles. He codified human movement into a universal visual syntax. This system allowed hundreds of thousands of people, speaking dozens of different languages, to navigate a complex physical environment without reading a single word. He built a visual API for the physical world.

Fig. 1 — Munich Olympic Games pictograms, Otl Aicher

This is the standard we optimize for at Verket. Beauty born from absolute functional efficiency.

We provide AI labs with the expert trajectory demonstrations and spatial preference data necessary to train UI agents. By capturing the workflows of elite product designers and UX researchers, we teach models the cognitive flow of an interface. The model learns that a specific relationship of padding and contrast signals a primary action, because it understands the syntax underneath. It stops relying on fragile element recognition and begins to reason about the digital environment itself. The result is agents that generalize across software they have never seen before.
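To make the idea of spatial preference data concrete, here is a minimal sketch of what one preference judgment might look like as a record. This is an illustrative format only; the field names, values, and `SpatialPreference` schema are assumptions for exposition, not Verket's actual data format.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    label: str
    bbox: tuple[int, int, int, int]  # x, y, width, height in pixels
    contrast_ratio: float            # contrast against the background
    padding_px: int                  # whitespace around the element

@dataclass
class SpatialPreference:
    """One expert judgment: which of two elements is the primary action, and why."""
    screenshot_id: str
    chosen: UIElement    # element the expert identifies as the primary action
    rejected: UIElement  # visually similar element that is secondary
    rationale: str       # the "why" that raw click logs omit

# Hypothetical example: a checkout screen with two equally sized buttons.
pref = SpatialPreference(
    screenshot_id="checkout_v2",
    chosen=UIElement("Place order", (640, 820, 280, 56), 7.2, 16),
    rejected=UIElement("Continue shopping", (320, 820, 280, 56), 3.1, 8),
    rationale="Higher contrast and heavier padding signal the primary action.",
)
print(pref.chosen.label)  # → Place order
```

The point of the `rationale` field is that the preference pair carries the expert's reasoning, not just the click, which is what lets a model learn the grammar rather than memorize the layout.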

Phase II: The Physics of Form (World Models and Simulation)

Once a model can navigate the constrained logic of a 2D grid, its training must extend into the Z-axis: the 3D physical world.

The industry is currently racing to build World Models: systems capable of simulating physical environments. But there is a wide gulf between rendering a world and simulating one. Generative video models can create photorealistic drone footage of a city, but they do so from statistical patterns in training data. They learn what physics looks like, not how physics works.

Ask a model to maintain the geometric consistency of a shadow as the camera moves, or to preserve the structural logic of a building across multiple angles, and the output begins to break down. Objects drift. Shadows detach from their sources. Structures appear that could not stand. Current models are "dreamers rather than simulators"; they produce visually plausible output without grounded understanding of causality and spatial coherence.

Master architects and structural engineers do not see surfaces; they see forces. They look at a building and instantly read its load paths: where weight is transferred from roof to foundation, which walls are structural and which are partition, and whether a cantilevered section is plausible given the visible beam depth.

Fig. 2 — Structural logic of S.R. Crown Hall, Mies van der Rohe

Verket captures this expert reasoning and translates it into machine-readable training data. We do not use crowd-workers to draw bounding boxes. We deploy practicing architects and structural engineers to annotate structural logic, geometric constraints, and material properties.

When a frontier lab trains a world model, our experts provide structured spatial reasoning that acts as the training signal. We encode how materials behave under load, how light interacts with surfaces, and how structural systems distribute forces. The goal is to move models from pattern-based generation toward physically grounded simulation, giving them the sense of consequence and constraint required to produce output that obeys the logic of reality.
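One concrete instance of the kind of physically grounded check such signals enable is shadow consistency: on flat ground, an object's shadow length follows directly from its height and the solar elevation angle. The sketch below is illustrative only; the function names and the 10% tolerance are assumptions, not a description of any lab's pipeline.

```python
import math

def expected_shadow_length(object_height_m: float, solar_elevation_deg: float) -> float:
    """Length of a shadow cast on flat ground: height / tan(elevation)."""
    return object_height_m / math.tan(math.radians(solar_elevation_deg))

def shadow_is_consistent(observed_m: float, height_m: float,
                         elevation_deg: float, tolerance: float = 0.1) -> bool:
    """Flag a rendered frame whose shadow deviates from geometry by more than tolerance."""
    expected = expected_shadow_length(height_m, elevation_deg)
    return abs(observed_m - expected) / expected <= tolerance

# A 10 m building under a 45-degree sun should cast a ~10 m shadow.
print(round(expected_shadow_length(10.0, 45.0), 2))  # → 10.0
```

A generative model trained only on pixel statistics has no reason to satisfy this constraint; expert spatial annotation supplies the geometric ground truth against which its output can be scored.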

Phase III: The Logic of Function (Embodied Robotics)

The final phase of spatial intelligence is physical action. Manipulating the physical world requires more than depth perception. It requires an understanding of affordance.

A door handle is more than a metallic cylinder. Its form implies whether it should be pushed, pulled, or turned. A coffee mug's geometry, weight distribution, and fill level together determine where a robotic gripper must make contact, how much force to apply, and what trajectory avoids spilling. This is the Logic of Function. It is deeply contextual. The way a human grasps a scalpel is fundamentally different from the way they grasp a hammer, even if the objects share similar dimensions. The intended use, the required precision, and the consequences of error are entirely different.

Today’s robotic training relies heavily on teleoperation data: recording human operators puppeteering robots through tasks. Simulation-based training and reinforcement learning contribute as well, but across approaches the same gap persists: the data captures what happened during a manipulation, but rarely encodes why. It lacks the underlying spatial reasoning and the heuristic judgment that humans apply before they act.

Fig. 3 — Kinematic vectors of interaction, Henry Dreyfuss

Verket bridges this gap by providing affordance mapping and kinematic reasoning data. We work with industrial designers and hardware engineers to annotate the functional logic of objects and spaces. We map the intended use, the structural vulnerabilities, and the practical constraints of physical manipulation.
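An affordance annotation of the kind described above might be encoded as a structured record like the following. This is a minimal sketch for exposition; the schema, field names, and the mug example values are hypothetical, not Verket's actual annotation format.

```python
from dataclasses import dataclass, field

@dataclass
class AffordanceAnnotation:
    """Functional logic of an object, as an expert might encode it for a robot."""
    object_name: str
    grasp_type: str            # e.g. "power", "precision", "hook"
    contact_region: str        # where the gripper should make contact
    max_force_n: float         # force beyond which the object deforms or slips
    approach_vector: tuple[float, float, float]  # unit direction toward contact
    failure_modes: list[str] = field(default_factory=list)

# Hypothetical annotation for the coffee mug discussed above.
mug = AffordanceAnnotation(
    object_name="coffee mug (full)",
    grasp_type="hook",
    contact_region="handle",
    max_force_n=15.0,
    approach_vector=(0.0, -1.0, 0.0),
    failure_modes=["spill if tilted past ~20 degrees", "torque on handle during lift"],
)
print(mug.contact_region)  # → handle
```

The contrast with teleoperation logs is the point: a trajectory records where the gripper went, while a record like this encodes why the handle is the contact region and what failure the trajectory was avoiding.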

We teach robotic foundation models the reasoning behind human action. This allows an embodied agent that encounters a novel object to infer its function and approach it appropriately, rather than relying on trial and error with no spatial prior.

The Infrastructure of Spatial Truth

The data required to train this new generation of intelligence is scarce. The spatial literacy it encodes is held by professionals who have dedicated decades to mastering their disciplines.

As the next generation of intelligence steps out of the text box and into the real world — navigating our software, simulating our physics, and operating our machines — it must first understand the rules that govern reality.

Verket is writing those rules.

If you are interested in our work, please reach out.

© 2025 VERKET AI. STOCKHOLM, SWEDEN.