Post-training data for frontier models and agentic systems
We built our world by understanding it. A chair tells you it will hold, a corridor tells you where to go, a shadow tells you what is real. Laws written into proportion, contrast, balance, weight, and light. The grammar of attention. The physics of form. The logic that makes something usable, and ultimately true.
When a frontier model looks at the world, it sees with perfect clarity. It perceives every detail with fidelity and describes the scene with linguistic fluency. But ask the same model to act inside that world, to navigate an interface, simulate the next scene, or manipulate something real, and the illusion of understanding collapses.
That’s because it cannot reason about consequence. Today's foundation models are passive observers. They can name what is there, but not why it works. Everything they have learned comes from the language, images, and videos they have swallowed from the internet.
But language alone is an incomplete compression of reality. Text cannot teach a model the truth of gravity, the load-bearing capacity of a structural joint, or the kinetic friction of a human grasp. Images and videos fare no better. Pixels capture the surface of the world but not the underlying truth of why things are the way they are and how they interact.
As the frontier of AI expands from models that converse to models that act, we have hit a structural limit. Interaction in the real world is governed by a quiet, uncompromising logic: the Syntax of Sight.
If we want models to act reliably in physical and digital spaces, we must train them on this syntax. At Verket we are building the data infrastructure that supplies it, from the governed grids of digital environments to the unstructured physics of the real world.
Before an autonomous agent can use heavy machinery or build a city on Mars, it must first be able to operate a computer. The Graphical User Interface is the ideal training ground for spatial reasoning: a strictly governed, 2D micro-world with clear rules of visual hierarchy, whitespace, and focal progression.
Today’s computer-use agents are fragile. The frontier has shifted from parsing raw markup to processing pure vision, training agents to look at screenshots the way a human looks at a screen. And they still fail.
They fail because they cannot read the hierarchy. They perceive every pixel on a page, but they cannot reason about which elements matter and why. Without a spatial prior, a critical control and a decorative shadow carry exactly the same cognitive weight. They mistake proximity for purpose. They perceive the interface, but they are blind to the intent.
Human beings do not read software by scanning pixels. We navigate by intuition. When you open a new application, your eye is immediately drawn to the primary action because you subconsciously understand the Grammar of Attention: the visual rules that prioritize information on a screen.
To teach this grammar to AI, we must redefine what we mean by "taste". In the context of frontier intelligence, taste means the hyper-efficient transmission of information.
Look at the visual identity systems of the Olympic Games. When Otl Aicher designed the pictograms for the 1972 Munich Olympics, he constructed a rigorous, mathematical grid where every form was built from a constrained set of angles. He codified human movement into a universal visual syntax. This system allowed hundreds of thousands of people, speaking dozens of different languages, to navigate a complex physical environment without reading a single word. He built a visual API for the physical world.
Fig. 1 — Munich Olympic Games pictograms, Otl Aicher
This is the standard we optimize for at Verket: beauty born from absolute functional efficiency.
We provide AI labs with the expert trajectory demonstrations and spatial preference data necessary to train agents. By capturing the workflows of elite web designers and UX researchers, we teach models the cognitive flow of an interface. The model learns that a specific relationship of padding and contrast signals a primary action, because it understands the syntax underneath. It stops relying on fragile element recognition and begins to reason about the digital environment itself. The result is agents that generalize across software they have never seen before.
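To make the shape of this data concrete, here is a minimal sketch of a single spatial preference record, the kind of pairwise judgment an expert designer might supply. Every field name below is a hypothetical illustration, not our production schema.

```python
from dataclasses import dataclass

@dataclass
class SpatialPreferenceRecord:
    """One expert judgment comparing two candidate UI targets.

    Hypothetical schema for illustration only; field names and values
    are assumptions, not a real data format.
    """
    screenshot_id: str   # reference to the rendered UI frame
    task: str            # the goal the agent is pursuing
    candidate_a: dict    # bounding box and visual features of option A
    candidate_b: dict    # bounding box and visual features of option B
    preferred: str       # "a" or "b", as chosen by the expert
    rationale: str       # the grammar-of-attention cue behind the choice

# The expert prefers the high-contrast, well-padded element: the visual
# signature of a primary action.
record = SpatialPreferenceRecord(
    screenshot_id="frame_0042",
    task="submit the signup form",
    candidate_a={"bbox": [620, 540, 760, 584], "contrast": 0.91, "padding": 16},
    candidate_b={"bbox": [40, 540, 180, 584], "contrast": 0.32, "padding": 4},
    preferred="a",
    rationale="High contrast and generous padding signal the primary action.",
)
```

Trained on many such comparisons, a model learns to rank elements by intent rather than by pixel similarity.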
Once a model can navigate the constrained logic of a 2D grid, its training must extend into the Z-axis: the 3D physical world.
The industry is currently racing to build World Models: systems capable of simulating physical environments. But there is a wide gulf between rendering a world and simulating one. Generative video models can create photorealistic drone footage of a city, but they do so from statistical patterns in training data. They learn what physics looks like, not how physics works.
Ask a model to maintain the geometric consistency of a shadow as the camera moves, or to preserve the structural logic of a building across multiple angles, and the output begins to break down. Objects drift. Shadows detach from their sources. Structures appear that could not stand. Current models are "dreamers rather than simulators"; they produce visually plausible output without grounded understanding of causality and spatial coherence.
Master architects and structural engineers do not see surfaces; they see forces. They look at a building and instantly read its load paths: where weight is transferred from roof to foundation, which walls are structural and which are partition, and whether a cantilevered section is plausible given the visible beam depth.
Fig. 2 — Structural logic of S.R. Crown Hall, Mies van der Rohe
Verket captures this expert reasoning and translates it into mathematical reward signals. We do not use crowdworkers to draw bounding boxes. We deploy practicing architects and structural engineers to annotate structural logic, geometric constraints, and material properties.
When a frontier lab trains a world model, our experts provide structured spatial reasoning that acts as the training signal. We encode how materials behave under load, how light interacts with surfaces, and how structural systems distribute forces. The goal is to move models from pattern-based generation toward physically grounded simulation, giving them the consequence and constraint required to produce output that obeys the logic of reality.
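As one illustration of how a geometric constraint becomes a scalar signal, consider shadow consistency: under a single fixed light source, the direction from an object to its shadow should not drift between frames. The sketch below is a simplified stand-in for such a reward, not our actual signal, and its inputs and thresholds are assumptions.

```python
import numpy as np

def shadow_consistency_reward(object_pos: np.ndarray,
                              shadow_tip: np.ndarray) -> float:
    """Score the geometric consistency of shadows across frames.

    Under a fixed light source, the object-to-shadow direction should be
    identical in every frame. Inputs are (T, 2) float arrays of
    ground-plane positions over T frames. A simplified illustration,
    not a production reward.
    """
    directions = shadow_tip - object_pos                # per-frame shadow vectors
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    mean_dir = directions.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir)
    # Cosine similarity of each frame's direction to the mean direction:
    # 1.0 means the shadow never drifts; lower means it detaches or wanders.
    return float((directions @ mean_dir).mean())

# A generated clip whose shadow points the same way in both frames
# scores the maximum of 1.0.
obj = np.array([[0.0, 0.0], [1.0, 0.0]])
tip = np.array([[2.0, 1.0], [3.0, 1.0]])
print(shadow_consistency_reward(obj, tip))  # 1.0
```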
The ultimate phase of artificial intelligence is physical action. Manipulating the physical world requires more than depth perception; it requires an understanding of affordance.
A door handle is more than a metallic cylinder. Its form implies whether it should be pushed, pulled, or turned. A coffee mug's geometry, weight distribution, and fill level together determine where a robotic gripper must make contact, how much force to apply, and what trajectory avoids spilling. This is the Logic of Function. It is deeply contextual. The way a human grasps a scalpel is fundamentally different from the way they grasp a hammer, even if the objects share similar dimensions. The intended use, the required precision, and the consequences of error are entirely different.
Today’s robotic training relies heavily on teleoperation data: recording human operators puppeteering robots through tasks. Simulation-based training and reinforcement learning contribute as well, but across approaches the same gap persists: the data captures what happened during a manipulation, but rarely encodes why. It lacks the underlying spatial reasoning and the heuristic judgment that humans apply before they act.
Fig. 3 — Kinematic vectors of interaction, Henry Dreyfuss
Verket bridges this gap by providing affordance mapping and kinematic reasoning data. We work with industrial designers and hardware engineers to annotate the functional logic of objects and spaces. We map the intended use, the structural vulnerabilities, and the practical constraints of physical manipulation.
We teach robotic foundation models the reasoning behind human action. This allows an embodied agent that encounters a novel object to infer its function and approach it appropriately, rather than relying on trial and error with no spatial prior.
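For a sense of what such an annotation might contain, here is a minimal sketch, assuming hypothetical field names chosen purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class AffordanceAnnotation:
    """Expert annotation of an object's functional logic.

    A hypothetical schema for illustration; every field name and value
    is an assumption, not a real data format.
    """
    object_id: str
    intended_use: str            # what the object is for
    grasp_regions: list          # where a gripper should make contact, and how
    force_limit_newtons: float   # beyond this, the object may deform or break
    failure_modes: list          # what goes wrong when the logic is ignored

# The mug's handle, not its rim, affords a stable lateral grasp.
mug = AffordanceAnnotation(
    object_id="ceramic_mug_03",
    intended_use="carry hot liquid without spilling",
    grasp_regions=[{"part": "handle", "approach": "lateral pinch"}],
    force_limit_newtons=25.0,
    failure_modes=["tips if lifted above its center of mass while full"],
)
```

Paired with teleoperation trajectories, annotations like this supply the missing why behind the recorded what.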
The data required to train this new generation of intelligence is scarce. The spatial literacy it demands is held by professionals who have dedicated decades to mastering their disciplines.
As the next generation of intelligence steps out of the text box and into the real world — navigating our software, simulating our physics, and operating our machines — it must first understand the rules that govern reality.
Verket is encoding those rules.
If you are interested in our work, please reach out.